loTenTen: Corpus of the Lao Web

The Lao Web Corpus (loTenTen) is a Lao corpus made up of texts collected from the Internet. The corpus belongs to the TenTen corpus family, a set of web corpora built using the same method with a target size of 10+ billion words. Sketch Engine currently provides access to TenTen corpora in more than 30 languages.

The data were crawled by Spiderling in August and September 2018 and 2019 from the following sources: Lao Wikipedia and the Lao web. Texts were tokenized using our in-house segmenter and tagged using the in-house RFTagger model.

For detailed information about TenTen corpora, see Common TenTen corpora attributes.

Part-of-speech tagset

This Lao corpus was tagged using the PAN localization part-of-speech tags.

loTenTen corpus in detail

Basic statistics of the Lao Web Corpus 2019.

Frequency
Tokens 121,266,009
Words 105,018,584
Sentences 5,782,107
Web pages 1,307,516
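
The counts above can be combined into a few derived figures, such as the average sentence length. The following is a minimal sketch that uses only the four numbers from the table; the variable names are illustrative and not part of any Sketch Engine interface.

```python
# Counts copied from the statistics table for the Lao Web Corpus 2019.
tokens = 121_266_009
words = 105_018_584
sentences = 5_782_107
web_pages = 1_307_516

# Derived ratios (illustrative names, not Sketch Engine attributes).
tokens_per_sentence = tokens / sentences
words_per_token = words / tokens
tokens_per_page = tokens / web_pages

print(f"Average tokens per sentence: {tokens_per_sentence:.2f}")
print(f"Share of tokens that are words: {words_per_token:.1%}")
print(f"Average tokens per web page: {tokens_per_page:.1f}")
```

By this calculation a sentence holds about 21 tokens on average, and words make up roughly 87% of all tokens.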

Tools to work with the Lao corpus

A complete set of tools is available to work with this Lao corpus to generate:

  • word sketch – Lao collocations categorized by grammatical relations
  • thesaurus – synonyms and similar words for every word
  • keywords – terminology extraction of one-word units
  • word lists – lists of Lao nouns, verbs, adjectives etc. organized by frequency
  • n-grams – frequency list of multi-word units
  • concordance – examples in context
  • text type analysis – statistics of metadata in the corpus
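
To illustrate what an n-gram frequency list is, the sketch below counts multi-word units in a short token sequence. It is not Sketch Engine's implementation. The function name `ngram_frequencies` and the English sample tokens are hypothetical placeholders, and the input is assumed to be already tokenized, since Lao text must be segmented into words first.

```python
from collections import Counter

def ngram_frequencies(tokens, n):
    """Count every run of n consecutive tokens in a tokenized text."""
    # zip over n shifted copies of the list yields each n-gram as a tuple.
    return Counter(zip(*(tokens[i:] for i in range(n))))

# Placeholder tokens standing in for segmented corpus text.
sample = "the cat sat on the mat and the cat slept".split()
freq = ngram_frequencies(sample, 2)

# List the most frequent bigrams first, as in a frequency list.
for gram, count in freq.most_common(3):
    print(" ".join(gram), count)
```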

Lao Web 2019 (loTenTen19)

6th version (July 2021)

  • semi-automatically revised attributes converted into standard attributes

4th version (June 2020)

  • corpus size 121 million tokens
  • tokenized by in-house segmenter
  • part-of-speech tagged by RFTagger model
  • revised attributes – semi-automatically corrected

Lao Web 2018 (loTenTen18)

1st version (October 2018)

  • crawled data in the size of 17.4 million tokens
  • tokenized, not tagged

TenTen corpora

Jakubíček, M., Kilgarriff, A., Kovář, V., Rychlý, P., & Suchomel, V. (2013, July). The TenTen corpus family. In 7th International Corpus Linguistics Conference CL (pp. 125-127).

Suchomel, V., & Pomikálek, J. (2012). Efficient web crawling for large text corpora. In Proceedings of the seventh Web as Corpus Workshop (WAC7) (pp. 39-43).

Processing Lao data

Baisa, V., Blahuš, M., Cukr, M., Herman, O., Jakubíček, M., Kovář, V., Měchura, M., Medveď, M., Rychlý, P., & Suchomel, V. (2019). Automating dictionary production: a Tagalog-English-Korean dictionary from scratch. In Proceedings of the 6th Biennial Conference on Electronic Lexicography.

Blahuš, M., Cukr, M., Herman, O., Jakubíček, M., Kovář, V., & Medveď, M. (2021). Semi-automatic building of large-scale digital dictionaries. In Proceedings of the 6th Biennial Conference on Electronic Lexicography.

Search the Lao corpus

Sketch Engine offers a range of tools to work with this Laotian corpus from the web.
