Turkic corpora from the Web
The Turkic Web Corpora are a set of corpora made up of texts collected from the Internet. They cover six Turkic languages, each in its own separate corpus:
Azerbaijani corpus | Kazakh corpus | Kyrgyz corpus |
Turkmen corpus | Turkish corpus | Uzbek corpus |
For more information, visit the info page for the particular language.
Overview of the Turkic corpora
LANGUAGE | WORDS | DOCUMENTS | DATA UPDATES |
AZERBAIJANI | 94 million | 365 thousand | Jan 2012 |
KAZAKH | 139 million | 378 thousand | Jan 2012 |
KYRGYZ | 19 million | 67 thousand | Jan 2012 |
TURKISH | 3.38 billion | 12 million | Dec 2011, Jan 2012 |
TURKMEN | 2 million | 5 thousand | Jan 2012 |
UZBEK | 18 million | 57 thousand | Jan 2012 |
Source data
The source texts were crawled by the SpiderLing web spider in December 2011 and January 2012. The crawl was restricted to the top-level Internet domains of the countries where the selected languages are official languages (.az, .kz, .kg, .tr, .tm, .uz), although several exceptions were allowed.
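The domain restriction described above can be illustrated with a minimal sketch. This is not SpiderLing's actual code: the function name `in_allowed_tld` is hypothetical, and the exceptions mentioned in the text are not modelled because the source does not list them.

```python
from urllib.parse import urlsplit

# Top-level domains named in the text above.
ALLOWED_TLDS = {"az", "kz", "kg", "tr", "tm", "uz"}


def in_allowed_tld(url, allowed=ALLOWED_TLDS):
    """Return True if the URL's host name ends in one of the allowed TLDs.

    Hypothetical helper for illustration only.
    """
    host = urlsplit(url).hostname  # None when the URL has no network location
    if not host:
        return False
    # The last dot-separated label of the host name is the top-level domain.
    tld = host.rstrip(".").rsplit(".", 1)[-1].lower()
    return tld in allowed


if __name__ == "__main__":
    for candidate in ("http://example.kz/page", "https://example.com/"):
        print(candidate, in_allowed_tld(candidate))
```

A crawler built this way would need an additional allow-list for any hosts outside these domains that it still wants to visit.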
Tools to work with the Turkic corpora
A complete set of Sketch Engine tools is available to work with these Turkic web corpora to generate:
- word sketch – collocations categorized by grammatical relations
- thesaurus – synonyms and similar words for every word
- word lists – lists of nouns, verbs, adjectives etc. organized by frequency
- n-grams – frequency list of multi-word units
- concordance – examples in context
- keywords – terminology extraction of one-word units
- text type analysis – statistics of metadata in the corpus
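To make the n-gram item in the list above concrete, here is a minimal sketch of a frequency list of multi-word units. It is not how Sketch Engine computes n-grams: the function name `ngram_frequencies` is hypothetical, and splitting on whitespace is an assumed, simplistic tokenisation.

```python
from collections import Counter


def ngram_frequencies(tokens, n=2):
    """Count every run of n consecutive tokens (hypothetical helper)."""
    # Zip the token list with n - 1 shifted copies of itself to form the n-grams.
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(grams)


# Assumed toy input, tokenised naively on whitespace.
tokens = "bir iki bir iki üç".split()
top = ngram_frequencies(tokens, n=2).most_common(1)

if __name__ == "__main__":
    print(top)
```

Sorting the counter with `most_common()` gives the frequency-ordered list; the same counting with `n=1` yields a plain word list.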
Changelog
September 23, 2012
- The Turkish part, crawled from the .tr domain, was renamed to trTenTen [2012]
March 6, 2012
- initial version, 6 languages
- no tagging, no sketches
Bibliography
Turkic Web corpora
Vít Baisa and Vít Suchomel (2012). Large Corpora for Turkic Languages and Unsupervised Morphological Analysis. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC’12), Turkey, May 2012, pp. 28–32.