NCI: New Corpus for Ireland
The New Corpus for Ireland (NCI) is a language corpus developed as part of the set-up phase of a project for a new English-to-Irish Dictionary (NEID). The project is under the direction of Foras na Gaeilge, a public body responsible for the promotion of the Irish language.
The corpus was collected in three main ways:
- incorporating existing corpora
- contacting publishers, authors, newspaper companies, etc. to request permission to use their texts
- collecting data from the web.
In Sketch Engine, the project is composed of two separate corpora:
- 30-million-word corpus of Irish
- 200-million-word corpus of English including Hiberno-English (the variety of English that is spoken in Ireland)
Part-of-speech tagset
The Irish part of the NCI was processed by the morphological analyzer/generator for Irish (Uí Dhonnchadha), which annotates the text with a dedicated Irish POS tagset. The English part of the NCI was tagged by TreeTagger using the Penn Treebank tagset.
Tools to work with the New Corpus for Ireland
A complete set of Sketch Engine tools is available for working with the NCI corpus. They generate:
- word sketch – Irish collocations categorized by grammatical relations
- thesaurus – synonyms and similar words for every word
- word lists – lists of Irish nouns, verbs, adjectives etc. organized by frequency
- n-grams – frequency list of multi-word units
- concordance – examples in context
- keywords – terminology extraction of one-word units
- text type analysis – statistics of metadata in the corpus
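To make the outputs listed above more concrete, here is a minimal Python sketch of three of them: a frequency word list, an n-gram frequency list, and a keyword-in-context concordance. It is an illustration only and does not use Sketch Engine or its API. The function names (`tokenize`, `ngram_counts`, `concordance`), the whitespace tokenizer, and the toy English sentence are all assumptions made for this example, and a real corpus such as the NCI relies on language-aware tokenization and tagging instead.

```python
from collections import Counter


def tokenize(text):
    # Naive tokenizer (assumption): lowercase and split on whitespace.
    return text.lower().split()


def ngram_counts(tokens, n):
    # Count every contiguous run of n tokens.
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(grams)


def concordance(tokens, keyword, width=2):
    # Collect (left context, keyword, right context) for each occurrence.
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines


# Toy text standing in for a corpus (hypothetical data).
sample = "the cat sat on the mat and the cat slept"
tokens = tokenize(sample)

word_list = Counter(tokens).most_common(2)  # most frequent words
bigrams = ngram_counts(tokens, 2)           # two-word units with counts
hits = concordance(tokens, "cat")           # keyword shown in context

for left, kw, right in hits:
    print(f"{left:>10} [{kw}] {right}")
```

The sketch counts surface forms only. Word sketches and the thesaurus additionally depend on the POS annotation described earlier, so they are not covered here.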
Bibliography
Kilgarriff, Adam, Michael Rundell, and Elaine Uí Dhonnchadha. "Efficient corpus development for lexicography: building the New Corpus for Ireland." Language Resources and Evaluation 40.2 (2006): 127-152.