thTenTen: Corpus of the Thai Web
The Thai web corpus (thTenTen) is a Thai corpus made up of texts collected from the Internet. The corpus belongs to the TenTen corpus family, a set of web corpora built using the same method with a target size of 10+ billion words. Sketch Engine currently provides access to TenTen corpora in more than 30 languages.
The Thai language, also called Siamese or Ayutthaya, is the official and national language of Thailand. This Thai corpus was crawled by SpiderLing in August and September 2018. Sources included the Thai web and Thai Wikipedia. Texts were tokenised by the SWATH (Smart Word Analysis for THai) segmenter and are not yet part-of-speech tagged.
For detailed information about TenTen corpora, see Common TenTen corpora attributes.
Tools to work with the Thai corpus
A complete set of tools is available to work with this Thai corpus to generate:
- keywords – terminology extraction of one-word units
- word lists – lists of Thai words organized by frequency
- n-grams – frequency list of multi-word units
- concordance – examples in context
- text type analysis – statistics of metadata in the corpus
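To give a rough idea of what some of these outputs contain, the following minimal Python sketch computes a word list, n-gram counts and concordance lines over a tiny, already segmented sample. It is an illustration only and not how Sketch Engine implements these tools; the function names and the sample tokens are assumptions made for this example.

```python
from collections import Counter

# Hypothetical helper names; they illustrate the idea only and are
# not part of Sketch Engine or its API.

def word_list(tokens):
    """Return (token, count) pairs ordered by descending frequency."""
    return Counter(tokens).most_common()

def ngrams(tokens, n):
    """Count every contiguous sequence of n tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def concordance(tokens, keyword, width=2):
    """Return (left context, keyword, right context) for each hit."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = tokens[max(0, i - width):i]
            right = tokens[i + 1:i + 1 + width]
            lines.append((left, tok, right))
    return lines

# Toy sample, assumed to be already split into words by a segmenter.
tokens = ["ฉัน", "กิน", "ข้าว", "และ", "เขา", "กิน", "ข้าว"]

frequencies = word_list(tokens)
bigrams = ngrams(tokens, 2)
hits = concordance(tokens, "กิน", width=1)

for left, kw, right in hits:
    print(" ".join(left), "[" + kw + "]", " ".join(right))
```

Because the corpus is tokenized but not part-of-speech tagged, the sketch works on plain word forms only.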
Changelog
Thai Web 2018 (thTenTen18)
- crawled in August and September 2018 with an initial size of 695 million tokens
- texts tokenized only, not part-of-speech tagged
Bibliography
TenTen corpora
Jakubíček, M., Kilgarriff, A., Kovář, V., Rychlý, P., & Suchomel, V. (2013, July). The TenTen corpus family. In 7th International Corpus Linguistics Conference CL (pp. 125-127).
Suchomel, V., & Pomikálek, J. (2012). Efficient web crawling for large text corpora. In Proceedings of the seventh Web as Corpus Workshop (WAC7) (pp. 39-43).