cebTenTen — Cebuano corpus from the web

cebTenTen: Corpus of the Cebuano Web

The Cebuano Web Corpus (cebTenTen) is a Cebuano corpus made up of texts collected from the Internet. The corpus belongs to the TenTen corpus family, a set of web corpora built using the same method with a target size of 10+ billion words. Sketch Engine currently provides access to TenTen corpora in more than 30 languages.

Cebuano is an Austronesian language spoken by part of the population of the Philippines. Texts for the corpus were crawled from the web during June–July 2018.

Tools to work with the Cebuano corpus

A complete set of tools is available for working with the cebTenTen Cebuano corpus. They can generate:

keywords – terminology extraction of one-word units
word lists – lists of Cebuano words organized by frequency
n-grams – frequency list of multi-word units
concordance – examples in context
text type analysis – statistics of metadata in the corpus
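
To make the outputs listed above more concrete, the following is a minimal Python sketch of three of them (a frequency word list, n-grams and a concordance) computed over a tiny toy sample. It is an illustration only and not how Sketch Engine implements these tools; the function names, the regex tokenizer and the sample greetings are all assumptions made for this example, and the n-grams here naively run across sentence boundaries.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and extract runs of letters (a simplifying assumption;
    a real corpus pipeline would use a language-aware tokenizer)."""
    return re.findall(r"[^\W\d_]+(?:[-'][^\W\d_]+)*", text.lower())


def word_list(tokens):
    """Return (word, count) pairs sorted from most to least frequent."""
    return Counter(tokens).most_common()


def ngrams(tokens, n):
    """Count every run of n consecutive tokens (ignores sentence boundaries)."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def concordance(tokens, keyword, width=3):
    """Return (left context, keyword, right context) for each hit of keyword."""
    hits = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            hits.append((left, tok, right))
    return hits


# Toy sample of common Cebuano greetings, standing in for real corpus text.
sample = "Maayong buntag. Maayong gabii. Maayong hapon."
tokens = tokenize(sample)

print(word_list(tokens))
print(ngrams(tokens, 2).most_common())
print(concordance(tokens, "gabii", width=2))
```

Keyword extraction and text type analysis are left out of the sketch because they depend on a reference corpus and on document metadata, which a toy sample does not have.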

Changelog

Cebuano Web 2018 (cebTenTen18)

5 million tokens crawled from the web in June–July 2018

Bibliography

TenTen corpora

Jakubíček, M., Kilgarriff, A., Kovář, V., Rychlý, P., & Suchomel, V. (2013, July). The TenTen corpus family. In 7th International Corpus Linguistics Conference CL (pp. 125-127).

Suchomel, V., & Pomikálek, J. (2012). Efficient web crawling for large text corpora. In Proceedings of the seventh Web as Corpus Workshop (WAC7) (pp. 39-43).