etTenTen – Estonian corpus from the web

etTenTen: Corpus of the Estonian Web

The Estonian Web Corpus (etTenTen) is an Estonian corpus made up of texts collected from the Internet. The corpus belongs to the TenTen corpus family, a set of web corpora built using the same method with a target size of 10+ billion words. Sketch Engine currently provides access to TenTen corpora in more than 40 languages.

The Estonian Web 2023 corpus was crawled by the SpiderLing web spider from March 2023 to September 2023. The final corpus contains 1.5 billion words. It also contains semi-automatically detected genres and topics such as academic, blogs, discussion, education, sports, …

Detailed information about TenTen corpora is available on the separate page Common TenTen corpora attributes.

Part-of-speech tagset

The etTenTen corpus was annotated by the Estonian NLTK tagger using the Estonian Filosoft tagset.

Overview of Estonian TenTen corpora

These web corpora have been crawled and processed repeatedly over the years:

Estonian Web corpus 2023 (etTenTen23) – 1.5 billion words (March 2023 – September 2023); semi-automatically detected genre annotation and topic classification
Estonian Web corpus 2021 (etTenTen21) – 725 million words (May 2021 – September 2021); semi-automatically detected genre annotation and topic classification
Estonian Web corpus 2019 (etTenTen19) – 508 million words (September 2019 – January 2020); semi-automatically detected text types
Estonian Web corpus 2017 (etTenTen17) – 658 million words (July–November 2017)
Estonian Web corpus 2013 (etTenTen13) – 260 million words
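
To make the size differences between editions easier to compare, the word counts listed above can be put side by side. The following is a minimal sketch, not part of Sketch Engine: the dictionary simply restates the rounded figures from the list (in millions of words), and the names `WORDS_MILLIONS` and `relative_change` are illustrative.

```python
# Word counts in millions, copied from the overview list (rounded figures).
WORDS_MILLIONS = {
    2013: 260,
    2017: 658,
    2019: 508,
    2021: 725,
    2023: 1500,
}


def relative_change(sizes):
    """Map each edition year to its size divided by the previous edition's size."""
    years = sorted(sizes)
    return {cur: sizes[cur] / sizes[prev] for prev, cur in zip(years, years[1:])}


changes = relative_change(WORDS_MILLIONS)
for year, ratio in changes.items():
    print(f"{year}: {ratio:.2f}x the size of the previous edition")
```

A ratio below 1 means the edition is smaller than its predecessor, as is the case for the 2019 crawl compared with 2017.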

Estonian Web 2023 corpus sizes

	Frequency
Tokens	1.8+ billion
Words	1.5+ billion
Sentences	100+ million
Web pages	5+ million
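
The figures in this table allow a few rough averages to be derived. The sketch below assumes the stated lower bounds ("1.8+ billion" is treated as 1.8 billion, and so on), so the results are only approximate; the names `SIZES` and `ratio` are illustrative and not part of any Sketch Engine interface.

```python
# Lower bounds taken from the table; the real counts are somewhat higher.
SIZES = {
    "tokens": 1_800_000_000,
    "words": 1_500_000_000,
    "sentences": 100_000_000,
    "web_pages": 5_000_000,
}


def ratio(numerator, denominator, sizes=SIZES):
    """Divide one corpus count by another, e.g. tokens per sentence."""
    return sizes[numerator] / sizes[denominator]


tokens_per_word = ratio("tokens", "words")
tokens_per_sentence = ratio("tokens", "sentences")
words_per_page = ratio("words", "web_pages")

print(f"tokens per word:     {tokens_per_word:.2f}")
print(f"tokens per sentence: {tokens_per_sentence:.1f}")
print(f"words per web page:  {words_per_page:.0f}")
```

The token count exceeds the word count because tokens also include items such as punctuation, which are not counted as words.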

Genre annotation and topic classification

A part of the Estonian Web 2023 corpus contains genre annotation and topic classification. These can be displayed as corpus structures in Concordance or in the Text type Analysis tool.

The charts show the distribution of genres and topics in the Estonian Web corpus 2023. The corpus is classified into 5 genres and 23 topics.

Tools to work with the Estonian corpora

A complete set of Sketch Engine tools is available to work with these Estonian corpora to generate:

word sketch – Estonian collocations categorized by grammatical relations
thesaurus – synonyms and similar words for every word
keywords – terminology extraction of one-word and multi-word units
word lists – lists of Estonian nouns, verbs, adjectives etc. organized by frequency
n-grams – frequency list of multi-word units
concordance – examples in context
text type analysis – statistics of metadata in the corpus
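
As an illustration of what an n-gram frequency list contains, the toy sketch below counts bigrams in a single made-up Estonian sentence. It is not how Sketch Engine computes n-grams: tokenization here is a plain whitespace split, and the sample sentence and the function name `ngram_counts` are assumptions chosen only for the example.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count every run of n consecutive tokens in the token list."""
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(" ".join(gram) for gram in grams)


# Invented sample: "this is a good book and this is a good film".
sample = "see on hea raamat ja see on hea film"
tokens = sample.split()  # naive whitespace tokenization (assumption)

bigrams = ngram_counts(tokens, 2)
for gram, freq in bigrams.most_common(3):
    print(freq, gram)
```

On a real corpus the same idea is applied to billions of tokens, which is why the result is presented as a frequency-sorted list.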

Changelog

etTenTen 2023 (March 2025)

tagged with the Filosoft tagset (Estonian NLTK + Filosoft pipeline version 4)
semi-manually detected genres and topics
bad content removal

etTenTen 2021 (May 2022)

tagged with the Filosoft tagset (Estonian NLTK + Filosoft pipeline version 3)
semi-manually detected genres and topics
new word sketch grammar (version 2), new term grammar (version 2)

etTenTen 2017 (February 2021)

tagged with the Filosoft tagset (Estonian NLTK + Filosoft pipeline v.2)

etTenTen 2019 (January 2021)

crawled by SpiderLing from September 2019 to January 2020
622 million tokens
tagged with the Filosoft tagset (Estonian NLTK + Filosoft pipeline v.2)
semi-automatically detected Text types

etTenTen 2017 (February 2018)

crawled by SpiderLing from July to November 2017
807 million tokens

etTenTen 2013 (May 2017)

new word sketches

etTenTen 2013 (May 2014)

tagging & word sketches
tagged with the Filosoft tagset (Estonian NLTK + Filosoft pipeline v.1)

etTenTen 2013 (March 2013)

obtained from the web in January 2013
260 million words
no tagging

Bibliography

TenTen corpora

Jakubíček, M., Kilgarriff, A., Kovář, V., Rychlý, P., & Suchomel, V. (2013, July). The TenTen corpus family. In 7th International Corpus Linguistics Conference CL (pp. 125-127).

Suchomel, V., & Pomikálek, J. (2012). Efficient web crawling for large text corpora. In Proceedings of the seventh Web as Corpus Workshop (WAC7) (pp. 39-43).
