DAGW: Danish Gigaword Corpus
The Danish Gigaword Corpus (DAGW) is a 964-million-word Danish corpus made up of texts collected from the Internet. The texts come from a variety of sources, such as the European Parliament, OPUS and Wikipedia. The corpus was created by Leon Derczynski and Manuel R. Ciosici and is freely distributed with attribution. The Sketch Engine version is smaller than the original Danish Gigaword Corpus (by approx. 80 million words) because the General Discussions and Parliament Elections sections were not included.
For further information, visit the homepage of the Danish Gigaword Project (https://gigaword.dk).
Part-of-speech tagset
The Danish Gigaword Corpus was tagged by Sketch Engine using TreeTagger with a Danish model that was trained on the ePAROLE corpus and follows the ePos tagset.
Copyright
Texts in the corpus are provided under Creative Commons Attribution 4.0 International (CC BY 4.0).
Sample attributions
In a press release:
The model is pre-trained using the Danish Gigaword Corpus (https://gigaword.dk), developed at the IT University of Copenhagen.
In academic writing:
Derczynski, L., Ciosici, M. R., et al. (2021). The Danish Gigaword Corpus. In Proceedings of the 23rd Nordic Conference on Computational Linguistics (NoDaLiDa 2021).
Danish Gigaword corpus in detail
Basic statistics of the corpus
Unit | Frequency |
Tokens | 1,197,941,586 |
Words | 964,617,784 |
Sentences | 56,979,231 |
Documents | 511,160 |
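A few derived figures, such as the average sentence length, follow directly from the counts above. The snippet below is a minimal Python sketch of that arithmetic. The variable names are illustrative, and it assumes that "tokens" covers words plus non-word tokens such as punctuation, which this page does not state explicitly.

```python
# Counts copied from the basic statistics table above.
TOKENS = 1_197_941_586
WORDS = 964_617_784
SENTENCES = 56_979_231
DOCUMENTS = 511_160

# Average length of a sentence, measured in tokens.
tokens_per_sentence = TOKENS / SENTENCES

# Average length of a document, measured in words.
words_per_document = WORDS / DOCUMENTS

# Assumption: every token that is not a word is punctuation or a similar item.
non_word_tokens = TOKENS - WORDS
word_share = WORDS / TOKENS

print(f"Tokens per sentence: {tokens_per_sentence:.1f}")
print(f"Words per document:  {words_per_document:.0f}")
print(f"Non-word tokens:     {non_word_tokens:,}")
print(f"Share of words:      {word_share:.1%}")
```

Under these assumptions, an average sentence has roughly 21 tokens, and about four out of five tokens are words.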
Further information about texts in the corpus
A list of subcorpora
Subcorpus name | Sources | Size (in tokens) | % of the whole corpus |
Conversation | Movie subtitles, Debates, Conversation, Speeches | 329,037,536 | 27.5 |
Legal | Laws, Tax code, Court cases | 333,236,660 | 27.8 |
News | News | 44,472,637 | 3.7 |
Other | Other, Sønderjysk | 1,409,439 | 0.1 |
Social Media | Forum | 257,051,120 | 21.5 |
Web | Web | 118,757,859 | 9.9 |
Wiki & Books | Encyclopaedic, Literature, Manuals, JVJ’s works, Religious | 113,976,335 | 9.5 |
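The percentage column can be recomputed from the token counts as a consistency check. The following Python sketch does that, using only the numbers from the table; the names `SUBCORPORA` and `shares` are illustrative and not part of any Sketch Engine interface.

```python
# Token counts per subcorpus, copied from the table above.
SUBCORPORA = {
    "Conversation": 329_037_536,
    "Legal": 333_236_660,
    "News": 44_472_637,
    "Other": 1_409_439,
    "Social Media": 257_051_120,
    "Web": 118_757_859,
    "Wiki & Books": 113_976_335,
}

# Total token count from the basic statistics table.
TOTAL_TOKENS = 1_197_941_586


def shares(counts: dict[str, int]) -> dict[str, float]:
    """Return each subcorpus size as a percentage of the summed counts."""
    total = sum(counts.values())
    return {name: round(100 * size / total, 1) for name, size in counts.items()}


if __name__ == "__main__":
    # The subcorpora should add up to the whole corpus.
    print("Sum matches total:", sum(SUBCORPORA.values()) == TOTAL_TOKENS)
    for name, pct in shares(SUBCORPORA).items():
        print(f"{name:<13} {pct:>5.1f} %")
```

The subcorpus sizes sum exactly to the total token count, and the rounded shares agree with the percentages listed in the table.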
A list of text types
Dialect – the Danish dialect of the text
Section – a single source of text
Publication date – the publication date of the source document
Year of publication – the year (CE) in which the source document was published
Document ID – corresponds to the original filename
Form – the form of the text: written or spoken
Detailed information on text types available in the Danish Gigaword corpus can be found at http://www.derczynski.com/papers/dagw.pdf
Tools to work with the Danish Gigaword Corpus
Sketch Engine provides a complete set of tools for working with this Danish corpus. They generate:
- word sketch – Danish collocations categorized by grammatical relations
- thesaurus – synonyms and similar words for every word
- keywords – terminology extraction of one-word and multi-word units
- word lists – lists of Danish nouns, verbs, adjectives, etc. organized by frequency
- n-grams – frequency list of multi-word units
- concordance – examples in context
- trends – diachronic analysis automatically identifies neologisms and changes in use
- text type analysis – statistics of metadata in the corpus
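To make two of the outputs listed above more concrete, the Python sketch below builds an n-gram frequency list and a simple concordance (keyword in context) from a made-up Danish sentence. It only illustrates the concepts on toy data; it is not Sketch Engine's implementation, and the function names `ngram_counts` and `concordance` are hypothetical.

```python
from collections import Counter

# Toy example text, invented for illustration (not taken from the corpus).
TEXT = "det er en god dag og det er en god bog"
TOKENS = TEXT.split()


def ngram_counts(tokens: list[str], n: int) -> Counter:
    """Count every run of n consecutive tokens."""
    grams = (tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return Counter(grams)


def concordance(tokens: list[str], keyword: str, window: int = 2):
    """Return (left context, keyword, right context) for each hit."""
    hits = []
    for i, token in enumerate(tokens):
        if token == keyword:
            left = tokens[max(0, i - window):i]
            right = tokens[i + 1:i + 1 + window]
            hits.append((" ".join(left), token, " ".join(right)))
    return hits


if __name__ == "__main__":
    # Frequency list of bigrams, most frequent first.
    for gram, freq in ngram_counts(TOKENS, 2).most_common(3):
        print(" ".join(gram), freq)
    # Each occurrence of "god" with two tokens of context on either side.
    for left, kw, right in concordance(TOKENS, "god"):
        print(f"{left:>10} [{kw}] {right}")
```

On a real corpus the same ideas are applied to tokenized and tagged text, and the frequency lists can be restricted by metadata such as the text types described earlier.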
Bibliography
Derczynski, L., Ciosici, M. R., et al. (2021). The Danish Gigaword Corpus. In Proceedings of the 23rd Nordic Conference on Computational Linguistics (NoDaLiDa 2021).
Website: https://gigaword.dk/