Corpus of Estonian Web sentences

The Corpus of Estonian Web Sentences is a large collection of mainly web-based texts from various sources, such as the Estonian National Corpus, the Estonian Trends corpus and the Estonian Collocations Dictionary 2019. The corpus consists of sentences only, i.e. it does not contain whole documents. Sentences were selected by their GDEx score (which reflects text quality): only those scoring higher than 0.500 were included. Finally, the sentences are sorted so that those with the highest scores come first in the corpus.
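The selection and ordering step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the pipeline actually used to build the corpus: the `ScoredSentence` type, the `select_sentences` function and the sample scores are hypothetical, and the GDEx scores are assumed to have been computed beforehand.

```python
from typing import Iterable, List, NamedTuple


class ScoredSentence(NamedTuple):
    """A candidate sentence with a precomputed GDEx score (hypothetical type)."""
    text: str
    gdex: float


# Threshold stated in the corpus description: only scores above 0.500 are kept.
GDEX_THRESHOLD = 0.5


def select_sentences(candidates: Iterable[ScoredSentence],
                     threshold: float = GDEX_THRESHOLD) -> List[ScoredSentence]:
    """Keep sentences scoring strictly above the threshold, best first."""
    kept = [s for s in candidates if s.gdex > threshold]
    # Sort in descending score order so the best sentences open the corpus.
    return sorted(kept, key=lambda s: s.gdex, reverse=True)


# Invented placeholder data, for illustration only.
candidates = [
    ScoredSentence("sentence A", 0.50),
    ScoredSentence("sentence B", 0.91),
    ScoredSentence("sentence C", 0.62),
    ScoredSentence("sentence D", 0.30),
]
selected = select_sentences(candidates)
```

With the placeholder data, sentences B and C are kept, in that order. Sentence A is dropped because the comparison is strict: a score of exactly 0.500 is not "higher than 0.500".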

Part-of-speech tagset

The Corpus of Estonian Web sentences is morphologically annotated with the tagging tool EstNLTK v1.6; see the part-of-speech tagset summary. The corpus is also lemmatized: each word form is assigned its base form (lemma).

There are two types of POS tag attributes:

the abbreviated tag contains only basic information about part of speech (see this overview),
the longtag contains detailed information, including other categories for particular parts of speech.

Overview of corpus versions

The Corpus of Estonian Web sentences comprises the following corpus versions:

Corpus of Estonian Web sentences 2021 – 473 million words, comprised of 9 sources: Estonian National Corpus 2021, Estonian Collocations Dictionary 2019 (ECD), Estonian Trends corpus, Estonian Wikipedia, articles from Directory of Open Access Journals (DOAJ), various web sources, fiction, various timestamped sources; fine-tuned Estonian GDEx configuration version 1.4 with minor changes
Corpus of Estonian Web sentences 2020 – 280 million words, comprised of 3 sources: Estonian National Corpus 2019, Estonian Collocations Dictionary 2019 (ECD) [only sentences with a score above 0.9], and Estonian Trends corpus; Estonian GDEx configuration version 1.4

Tools to work with Corpus of Estonian Web sentences

A complete set of Sketch Engine tools is available to work with these Estonian corpora to generate:

word sketch – Estonian collocations categorized by grammatical relations
thesaurus – synonyms and similar words for every word
keywords – terminology extraction of one-word and multi-word units
word lists – lists of Estonian nouns, verbs, adjectives etc. organized by frequency
n-grams – frequency list of multi-word units
concordance – examples in context
text type analysis – statistics of metadata in the corpus

Changelog

Estonian National Corpus 2021 (Estonian NC 2021)

558 million tokens

Estonian National Corpus 2019 (Estonian NC 2019)

331 million tokens

Bibliography

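Koppel, Kristina (2020). Näitelausete korpuspõhine automaattuvastus eesti keele õppesõnastikele [Corpus-based automatic detection of example sentences for Estonian learner's dictionaries]. (Doctoral thesis, University of Tartu). Tartu: Tartu Ülikooli Kirjastus.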

Koppel, Kristina; Kallas, Jelena; Khokhlova, Maria; Suchomel, Vít; Baisa, Vít; Michelfeit, Jan (2019). SkELL corpora as a part of the language portal Sõnaveeb: problems and perspectives. In: Kosem, I., Zingano Kuhn, T., Correia, M., Ferreria, J. P., Jansen, M., Pereira, I., Kallas, J., Jakubíček, M., Krek, S. & Tiberius, C. (Eds.). Electronic lexicography in the 21st century. Proceedings of the eLex 2019 conference. 1-3 October 2019, Sintra, Portugal. (763−782). Brno: Lexical Computing CZ, s.r.o.

Koppel, Kristina; Tavast, Arvi; Langemets, Margit; Kallas, Jelena (2019). Aggregating dictionaries into the language portal Sõnaveeb: issues with and without a solution. In: Kosem, I., Zingano Kuhn, T., Correia, M., Ferreria, J. P., Jansen, M., Pereira, I., Kallas, J., Jakubíček, M., Krek, S. & Tiberius, C. (Eds.). Electronic lexicography in the 21st century. Proceedings of the eLex 2019 conference. 1-3 October 2019, Sintra, Portugal. (434−452). Brno: Lexical Computing CZ, s.r.o.
