thTenTen: Corpus of the Thai Web

The Thai web corpus (thTenTen) is a Thai corpus made up of texts collected from the Internet. It belongs to the TenTen corpus family, a set of web corpora built using the same method with a target size of 10+ billion words. Sketch Engine currently provides access to TenTen corpora in more than 30 languages.

Thai, also called Ayutthaya or Siamese, is the official and national language of Thailand. This Thai corpus was crawled by SpiderLing in August and September 2018. Sources included the Thai web and Thai Wikipedia. Texts were tokenised by the SWATH (Smart Word Analysis for THai) segmenter and are not yet part-of-speech tagged.

For detailed information about TenTen corpora, see Common TenTen corpora attributes.

Tools to work with the Thai corpus

A complete set of tools is available to work with this Thai corpus to generate:

keywords – terminology extraction of one-word units
word lists – lists of Thai words organized by frequency
n-grams – frequency list of multi-word units
concordance – examples in context
text type analysis – statistics of metadata in the corpus
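
As a rough illustration of what three of these outputs contain (word lists, n-grams, and concordance), here is a minimal Python sketch. It is not Sketch Engine's implementation or API. The function names and the sample sentence are hypothetical, and the sample is assumed to be already segmented into words, since Thai text is written without spaces and needs a segmenter such as SWATH first.

```python
from collections import Counter


def word_list(tokens):
    """Return (token, count) pairs sorted by descending frequency."""
    return Counter(tokens).most_common()


def ngrams(tokens, n):
    """Count every contiguous sequence of n tokens."""
    return Counter(zip(*(tokens[i:] for i in range(n))))


def concordance(tokens, keyword, width=2):
    """Return (left context, keyword, right context) for each occurrence."""
    hits = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = tokens[max(0, i - width):i]
            right = tokens[i + 1:i + 1 + width]
            hits.append((left, tok, right))
    return hits


# Hypothetical pre-segmented sample: "I eat rice and he eats rice"
tokens = ["ฉัน", "กิน", "ข้าว", "และ", "เขา", "กิน", "ข้าว"]

for word, count in word_list(tokens):
    print(word, count)

print(ngrams(tokens, 2).most_common(1))

for left, kw, right in concordance(tokens, "กิน", width=1):
    print(" ".join(left), "[" + kw + "]", " ".join(right))
```

On a real corpus the same counts are computed over hundreds of millions of tokens, so this sketch only shows the shape of the results, not how they are produced at scale.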

Changelog

Thai Web 2018 (thTenTen18)

crawled in August and September 2018, with an initial size of 695 million tokens
texts are tokenised only, with no part-of-speech tagging

Bibliography

TenTen corpora

Jakubíček, M., Kilgarriff, A., Kovář, V., Rychlý, P., & Suchomel, V. (2013, July). The TenTen corpus family. In 7th International Corpus Linguistics Conference CL (pp. 125-127).

Suchomel, V., & Pomikálek, J. (2012). Efficient web crawling for large text corpora. In Proceedings of the seventh Web as Corpus Workshop (WAC7) (pp. 39-43).