Corpus of Estonian Web sentences
The Corpus of Estonian Web sentences is a large collection of mainly web-based texts drawn from various sources, such as the Estonian National Corpus, the Estonian Trends corpus and the Estonian Collocations Dictionary 2019. The corpus consists of sentences only, i.e. it does not contain whole documents. Sentences were selected by their GDEx score (which reflects text quality): only those scoring higher than 0.500 were included. Finally, all sentences were sorted so that those with the highest scores come first in the corpus.
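The selection and ordering described above can be illustrated with a minimal sketch. It assumes each candidate sentence already carries a GDEx score; the `ScoredSentence` type, the `select_sentences` function, and the sample sentences and scores are hypothetical and are not part of Sketch Engine or of the actual corpus build.

```python
from typing import Iterable, List, NamedTuple


class ScoredSentence(NamedTuple):
    """Hypothetical container: a sentence with a precomputed GDEx score."""
    text: str
    gdex: float


def select_sentences(candidates: Iterable[ScoredSentence],
                     threshold: float = 0.5) -> List[ScoredSentence]:
    """Keep sentences scoring strictly above the threshold, best first."""
    kept = [s for s in candidates if s.gdex > threshold]
    # Sort in descending score order so top-scoring sentences come first.
    return sorted(kept, key=lambda s: s.gdex, reverse=True)


if __name__ == "__main__":
    # Invented example data; the scores are illustrative only.
    sample = [
        ScoredSentence("Täna on ilus ilm.", 0.91),
        ScoredSentence("klikka siia !!!", 0.30),
        ScoredSentence("Ta luges raamatut.", 0.75),
        ScoredSentence("Vaata lisaks ...", 0.50),
    ]
    for sentence in select_sentences(sample):
        print(f"{sentence.gdex:.3f}\t{sentence.text}")
```

Note that a sentence scoring exactly 0.500 is dropped in this sketch, matching the "higher than 0.500" wording above.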
Part-of-speech tagset
The corpora of Estonian Web sentences are morphologically annotated with the tagging tool EstNLTK v1.6, using the part-of-speech tagset described below. The corpus texts are also lemmatized, i.e. each word form in the corpus is assigned its base form (lemma).
There are two types of POS tag attributes:
- the abbreviated tag contains only basic information about part of speech (see this overview),
- the longtag contains detailed information, including other categories for particular parts of speech.
Overview of corpus versions
The Corpus of Estonian Web sentences comprises the following corpus versions:
- Corpus of Estonian Web sentences 2021 – 473 million words, comprised of 9 sources: Estonian National Corpus 2021, Estonian Collocations Dictionary 2019 (ECD), Estonian Trends corpus, Estonian Wikipedia, articles from Directory of Open Access Journals (DOAJ), various web sources, fiction, various timestamped sources; fine-tuned Estonian GDEx configuration version 1.4 with minor changes
- Corpus of Estonian Web sentences 2020 – 280 million words, comprised of 3 sources: Estonian National Corpus 2019, Estonian Collocations Dictionary 2019 (ECD) [only sentences with a score above 0.9], and Estonian Trends corpus; Estonian GDEx configuration version 1.4
Tools to work with Corpus of Estonian Web sentences
A complete set of Sketch Engine tools is available to work with these Estonian corpora to generate:
- word sketch – Estonian collocations categorized by grammatical relations
- thesaurus – synonyms and similar words for every word
- keywords – terminology extraction of one-word and multi-word units
- word lists – lists of Estonian nouns, verbs, adjectives etc. organized by frequency
- n-grams – frequency list of multi-word units
- concordance – examples in context
- text type analysis – statistics of metadata in the corpus
Changelog
Estonian National Corpus 2021 (Estonian NC 2021)
- 558 million tokens
Estonian National Corpus 2019 (Estonian NC 2019)
- 331 million tokens
Search the Corpus of Estonian Web sentences
Sketch Engine offers a range of tools to work with this Estonian corpus.
Use Sketch Engine in minutes
Generating collocations, frequency lists, examples in context and n-grams, or extracting terms, is easy with Sketch Engine. Use our Quick Start Guide to learn it in minutes.