TenTen Corpus Family
The TenTen Corpus Family (TenTen corpora) is a family of text corpora created from the Web. All TenTen corpora are prepared according to the same criteria and can be regarded as comparable corpora. The corpora are built using technology specialized in collecting only linguistically valuable web content.
The name TenTen refers to the target corpus size of 10+ billion words per language. TenTen corpora are currently available in 50+ languages, including English, Spanish, Japanese, Chinese, Greek, Estonian, Arabic and Russian.
TenTen corpora available in Sketch Engine
A complete list of the TenTen corpora available in Sketch Engine.
Search the TenTen corpora
Sketch Engine offers a range of tools to work with the TenTen corpora.
arTenTen (Arabic web corpus) | beTenTen (Belarusian web corpus) | bgTenTen (Bulgarian web corpus) |
bnTenTen (Bengali web corpus) | caTenTen (Catalan web corpus) | cebTenTen (Cebuano web corpus) |
csTenTen (Czech web corpus) | daTenTen (Danish web corpus) | deTenTen (German web corpus) |
elTenTen (Greek web corpus) | enTenTen (English web corpus) | esTenTen (Spanish web corpus with European/American Spanish subcorpora) |
etTenTen (Estonian web corpus) | fiTenTen (Finnish web corpus) | frTenTen (French web corpus) |
gaTenTen (Irish web corpus) | guTenTen (Gujarati web corpus) | heTenTen (Hebrew web corpus) |
hiTenTen (Hindi web corpus) | huTenTen (Hungarian web corpus) | idTenTen (Indonesian web corpus) |
isTenTen (Icelandic web corpus) | itTenTen (Italian web corpus) | jaTenTen (Japanese web corpus) |
kmTenTen (Khmer web corpus) | koTenTen (Korean web corpus) | loTenTen (Lao & Isan web corpus) |
ltTenTen (Lithuanian web corpus) | lvTenTen (Latvian web corpus) | miTenTen (Māori web corpus) |
msTenTen (Malay web corpus) | myTenTen (Burmese web corpus) | nlTenTen (Dutch web corpus) |
noTenTen (Norwegian web corpus) | plTenTen (Polish web corpus) | pnbTenTen (Western Punjabi web corpus) |
ptTenTen (Portuguese web corpus) | roTenTen (Romanian web corpus) | ruTenTen (Russian web corpus) |
skTenTen (Slovak web corpus) | slTenTen (Slovenian web corpus) | sqTenTen (Albanian web corpus) |
svTenTen (Swedish web corpus) | taTenTen (Tamil web corpus) | teTenTen (Telugu web corpus) |
thTenTen (Thai web corpus) | tlTenTen (Tagalog web corpus) | trTenTen (Turkish web corpus) |
ukTenTen (Ukrainian web corpus) | urTenTen (Urdu web corpus) | zhTenTen (Chinese Simplified and Traditional characters web corpora) |
How are TenTen corpora built?
- Texts are crawled from the Internet by SpiderLing, a web spider designed for linguistic purposes.
- Texts are cleaned by jusText, which removes undesirable content such as navigation links, advertisements, headers and footers.
- Texts are tokenized, i.e. separated into individual positions (tokens).
- A language filter identifies the language of each text and removes longer texts in other languages; foreign words and phrases are kept (e.g. sentences with movie titles).
- The onion tool performs deduplication on the paragraph level.
- Sample texts from the biggest web domains, which account for 55%–95% of all corpus texts, are checked (by combining manual techniques with our standard automatic methods), and poor-quality text and spam are removed.
- Corpora are recompiled without the poor-quality texts.
- The largest web domains are classified by genre (referring to writing style) and topic (inspired by the categories used by https://curlie.org/).
- Corpus texts are lemmatized and part-of-speech tagged for languages for which tagger and lemmatizer tools are available.
- The corpora undergo a final check in the interface.
- The corpora are published.
Detailed information about the tools mentioned above is available on the corpus.tools website, and the building of TenTen corpora is described in the bibliography (below). You can also read more about building corpora from the web on our blog.
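To make the idea of paragraph-level deduplication more concrete, here is a minimal Python sketch. It is not the onion tool and does not reproduce how onion works internally: it only drops paragraphs that match an earlier paragraph exactly after case and whitespace normalization. The function name `deduplicate_paragraphs` and the sample texts are assumptions made for illustration.

```python
import hashlib


def deduplicate_paragraphs(documents):
    """Drop paragraphs whose normalized text was already seen earlier.

    `documents` is a list of documents, each a list of paragraph strings.
    Documents that end up with no paragraphs are dropped entirely.
    """
    seen = set()
    result = []
    for paragraphs in documents:
        kept = []
        for paragraph in paragraphs:
            # Normalize case and whitespace so trivial variants compare equal.
            normalized = " ".join(paragraph.lower().split())
            digest = hashlib.sha1(normalized.encode("utf-8")).hexdigest()
            if digest not in seen:
                seen.add(digest)
                kept.append(paragraph)
        if kept:
            result.append(kept)
    return result


# Hypothetical sample data: the first paragraph of the second document
# repeats the first paragraph of the first one.
docs = [
    ["Welcome to our site.", "Corpora are built from web texts."],
    ["Welcome to  our SITE.", "Each corpus is lemmatized."],
]
deduplicated = deduplicate_paragraphs(docs)
```

Storing hashes instead of full paragraph texts keeps memory use lower on large collections; a production tool would also need to handle near-duplicates, which this sketch ignores.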
Corpus metadata
A list of corpus metadata (structural attributes in corpus linguistics) shared by all TenTen corpora.
Document structures
- Crawl date – the date when a particular site was downloaded, e.g. 2019-11-30 11:10
- Crawl year – the year when a particular site was downloaded, e.g. 2020
- Genre – the writing style, e.g. blog, discussion, legal (read more about our genre and topic classification)
- Length – the length of the document in thousands of words, e.g. “0–1k”
- Source – information about the source and the year of a part of the corpus, e.g. web19, wiki20
- Title – the name written in the source code of the site between the <title> tags, e.g. Sports News – ABC News Radio
- Top-level domain – e.g. “com”
- Topic – inspired by categories used by https://curlie.org/ (formerly dmoz.org), e.g. arts, business, health, society, sport, … (read more about our genre and topic classification)
- URL – e.g. “https://en.wikipedia.org/wiki/Wikipedia” (URL of the source document)
- Website – e.g. “wikipedia.org”
- Web domain – e.g. “en.wikipedia.org”
- Wikipedia categories – categories used by Wikipedia (written at the bottom of the page), e.g. English_footballers
The number of structures may vary between the TenTen corpora. The structures written in bold should be present in most contemporary TenTen corpora (since 2020).
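The relationship between the URL, web domain, website and top-level domain attributes can be illustrated with a short Python sketch. This is only an illustration under stated assumptions, not the method used to build the corpora: the function name `url_attributes` and the dictionary keys are hypothetical, and the website is naively taken as the last two labels of the host name, which is wrong for suffixes such as "co.uk".

```python
from urllib.parse import urlsplit


def url_attributes(url):
    """Derive domain-related attributes from a document URL (naive sketch)."""
    host = urlsplit(url).hostname or ""
    labels = host.split(".")
    return {
        "url": url,
        "web_domain": host,
        # Naive assumption: the website is the last two host labels.
        "website": ".".join(labels[-2:]),
        "top_level_domain": labels[-1],
    }


attrs = url_attributes("https://en.wikipedia.org/wiki/Wikipedia")
print(attrs["web_domain"], attrs["website"], attrs["top_level_domain"])
```

For the example URL from the list above, this yields the web domain "en.wikipedia.org", the website "wikipedia.org" and the top-level domain "org".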
Paragraph structure
- heading – “1” marks headline text, “0” marks other text
Attributes specific to particular corpora can be found on the corpus information page.
Tools to work with TenTen Corpora
A complete set of Sketch Engine tools is available to work with TenTen billion-word corpora to generate:
- word sketch – collocations categorized by grammatical relations
- thesaurus – synonyms and similar words for every word
- keywords – terminology extraction of one-word and multi-word units
- word lists – lists of nouns, verbs, adjectives etc. organized by frequency
- n-grams – frequency list of multi-word units
- concordance – examples in context
- text type analysis – statistics of metadata in the corpus
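As a rough illustration of what an n-gram frequency list contains, the following Python sketch counts contiguous token sequences in a tiny sample. It is not Sketch Engine's implementation; the function name `ngram_frequencies`, the whitespace tokenization and the sample sentence are assumptions made for the example.

```python
from collections import Counter


def ngram_frequencies(tokens, n):
    """Count every contiguous sequence of n tokens."""
    # zip over n shifted copies of the token list yields the n-grams.
    ngrams = zip(*(tokens[i:] for i in range(n)))
    return Counter(ngrams)


# Hypothetical sample text, tokenized naively on whitespace.
tokens = "the cat sat on the mat and the cat slept".split()
bigrams = ngram_frequencies(tokens, 2)
top = bigrams.most_common(1)
print(top)
```

In a real corpus the tokens would come from the tokenization step of the corpus pipeline rather than from a whitespace split.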
Changelog
The tools for building new TenTen corpora are under constant development. More information about these tools is available at http://corpus.tools/
Bibliography
TenTen corpora
Suchomel, V. (2020). Better Web Corpora For Corpus Linguistics And NLP. Doctoral thesis. Masaryk University, Faculty of Informatics, Brno. Supervised by Pavel Rychlý. Available also from: https://is.muni.cz/th/u4rmz/.
Jakubíček, M., Kilgarriff, A., Kovář, V., Rychlý, P., & Suchomel, V. (2013, July). The TenTen corpus family. In 7th International Corpus Linguistics Conference CL (pp. 125-127).
Suchomel, V., & Pomikálek, J. (2012). Efficient web crawling for large text corpora. In Proceedings of the seventh Web as Corpus Workshop (WAC7) (pp. 39-43).
Genre annotation
Suchomel, V. (2021). Genre Annotation of Web Corpora: Scheme and Issues. In Kohei Arai, Supriya Kapoor, Rahul Bhatia. Proceedings of the Future Technologies Conference (FTC) 2020, Volume 1 (pp. 738-754). Vancouver, Canada: Springer Nature Switzerland AG. ISBN 978-3-030-63127-7. doi:10.1007/978-3-030-63128-4_55.