Icelandic Gigaword Corpus 2017
The Icelandic Gigaword Corpus 2017 is an Icelandic corpus made up of texts collected from the Internet. The texts include official documents (e.g. parliamentary speeches, law texts), news media texts and texts from other sources.
Each text is accompanied by metadata (author, document title, publication date etc.), which can be viewed using the text type analysis tool. The corpus is intended for linguistic research and for use in language technology projects.
The official documentation is available at: https://clarin.is/en/resources/gigaword/
Note: According to the official website, the corpus is divided into two parts – IGC1 and IGC2. Only the second part, IGC2, is freely available and accessible in Sketch Engine.
Part-of-speech tagset and lemmatization
The Icelandic Gigaword Corpus 2017 was part-of-speech tagged by the IceNLP toolkit using the IFD tagset.
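IFD tags are positional strings in which each character encodes one grammatical feature. The sketch below shows how a noun tag might be decoded. It assumes the common IFD noun layout (word class, gender, number, case, optional suffixed article); the function name and the mapping tables are illustrative and are not taken from this page, so they should be checked against the tagset documentation before use.

```python
# Illustrative decoder for IFD-style noun tags.
# Assumption: noun tags follow the layout
#   position 1: word class ("n" = noun)
#   position 2: gender, position 3: number, position 4: case
#   position 5: "g" if the noun carries the suffixed definite article
# Proper-noun markers and all other word classes are deliberately ignored.

GENDER = {"k": "masculine", "v": "feminine", "h": "neuter", "x": "unspecified"}
NUMBER = {"e": "singular", "f": "plural"}
CASE = {"n": "nominative", "o": "accusative", "þ": "dative", "e": "genitive"}


def decode_noun_tag(tag: str) -> dict:
    """Split a noun tag into named features (hypothetical helper)."""
    if len(tag) < 4 or tag[0] != "n":
        raise ValueError(f"not a noun tag: {tag!r}")
    return {
        "word_class": "noun",
        "gender": GENDER.get(tag[1], "unknown"),
        "number": NUMBER.get(tag[2], "unknown"),
        "case": CASE.get(tag[3], "unknown"),
        "definite": len(tag) > 4 and tag[4] == "g",
    }


if __name__ == "__main__":
    print(decode_noun_tag("nkeng"))
```

Because the features are positional, a regular expression over the tag attribute can select, for example, all nouns in a particular case without decoding every tag.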
Icelandic Gigaword Corpus 2017 corpus sizes
Unit | Frequency |
Tokens | 600,301,903 |
Words | 532,028,866 |
Sentences | 27,252,906 |
Documents | 1,550,779 |
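The counts above can be combined into a few derived figures, such as average sentence and document length. The following is a minimal sketch using only the numbers from the table; the variable names are illustrative, and the reading of "words" as tokens that are not punctuation or similar non-word items is an assumption about how Sketch Engine counts them.

```python
# Corpus size figures copied from the table above.
counts = {
    "tokens": 600_301_903,
    "words": 532_028_866,
    "sentences": 27_252_906,
    "documents": 1_550_779,
}

# Average number of tokens per sentence and per document.
tokens_per_sentence = counts["tokens"] / counts["sentences"]
tokens_per_document = counts["tokens"] / counts["documents"]

# Share of tokens that are counted as words
# (assumption: the remainder is punctuation and other non-word tokens).
word_share = counts["words"] / counts["tokens"]

if __name__ == "__main__":
    print(f"Tokens per sentence: {tokens_per_sentence:.1f}")
    print(f"Tokens per document: {tokens_per_document:.1f}")
    print(f"Words as a share of tokens: {word_share:.1%}")
```

These ratios give a rough sense of the corpus: sentences average a little over 20 tokens, and documents average a few hundred tokens each.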
Search the Icelandic Gigaword Corpus 2017
Sketch Engine offers a range of tools to work with this Icelandic corpus.
Tools to work with the Icelandic corpus
A set of Sketch Engine tools is available to work with this Icelandic corpus to generate:
- word sketch – Icelandic collocations categorized by grammatical relations
- thesaurus – synonyms and similar words for every word
- keywords – terminology extraction of one-word units
- word lists – lists of Icelandic nouns, verbs, adjectives etc. organized by frequency
- n-grams – frequency list of multi-word units
- concordance – examples in context
- text type analysis – statistics of metadata in the corpus
Changelog
Icelandic Gigaword Corpus 2017 (icelandic_gigaword17)
version icelandic_gigaword17 (February 2024)
Bibliography
Steingrímsson, Steinþór, Sigrún Helgadóttir, Eiríkur Rögnvaldsson, Starkaður Barkarson and Jón Guðnason. 2018. Risamálheild: A Very Large Icelandic Text Corpus. Proceedings of LREC 2018, pp. 4361-4366. Miyazaki, Japan.