tiWaC: Tigrinya web corpus
The Tigrinya web corpus (tiWaC) is a Tigrinya corpus made up of texts collected from the Internet. The corpus was prepared according to the standards described in the paper A Corpus Factory for Many Languages (Kilgarriff et al., LREC 2010).
The data was crawled by the SpiderLing web spider in January 2016 and comprises 2 million words.
Document count – the most frequent web domains and domain size distribution:
| Top-level domain | Documents | Web domain | Documents | Second-level domain size distribution | Domains |
|---|---|---|---|---|---|
| org | 1,023 | *.blogspot.com | 349 | At least 1000 documents | 0 |
| com | 699 | *.jw.org | 307 | At least 500 documents | 0 |
| net | 55 | tewahdo.org | 174 | At least 100 documents | 4 |
| | | harnnet.org | 116 | At least 50 documents | 8 |
| | | eritreantewahdo.org | 97 | At least 10 documents | 28 |
| | | mekaleh-eritra.org | 78 | At least 5 documents | 42 |
| | | mahberemariamisrael.com | 76 | At least 1 document | 129 |
| | | asmarino.com | 76 | | |
| | | fnoteatnatiewos.com | 46 | | |
| | | erena.org | 41 | | |
| | | forumeritrea.org | 38 | | |
| | | dehnet.org | 32 | | |
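
The "at least N documents" rows are cumulative: a domain with 120 documents is counted in the rows for 100, 50, 10, 5 and 1. The following is a minimal sketch of how such a distribution could be computed from a list of document URLs. It is not the procedure used to build tiWaC; the function name, the thresholds argument and the sample URLs are all hypothetical, and the host name is used as a stand-in for the second-level domain (collapsing subdomains such as *.blogspot.com would need extra handling).

```python
from collections import Counter
from urllib.parse import urlsplit


def domain_size_distribution(urls, thresholds=(1000, 500, 100, 50, 10, 5, 1)):
    """Count how many web domains contribute at least N documents, for each N."""
    # Assumption: one URL is one document, and the host name identifies the domain.
    docs_per_domain = Counter(urlsplit(url).hostname for url in urls)
    return {
        n: sum(1 for count in docs_per_domain.values() if count >= n)
        for n in thresholds
    }


# Hypothetical URLs for illustration only; they are not taken from the corpus.
sample_urls = (
    [f"http://example.org/page{i}" for i in range(12)]
    + [f"http://example.com/post{i}" for i in range(5)]
    + ["http://example.net/index"]
)
distribution = domain_size_distribution(sample_urls)
for threshold, domains in distribution.items():
    print(f"At least {threshold} documents: {domains}")
```

Because every threshold is checked independently against the same per-domain counts, the resulting numbers can only grow as the threshold shrinks, which matches the shape of the table above.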
News, politics and religious sites account for a significant share of the corpus sources.
This Tigrinya corpus was created within the framework of the HaBiT project (Harvesting big text data for under-resourced languages); see more at https://habit-project.eu/wiki/TigrinyaCorpus
Part-of-speech tagset
The tiWaC corpus contains POS annotation based on Universal Dependencies, a framework for cross-linguistically consistent annotation that supports multilingual parser development.
Tools to work with the Tigrinya web corpus
A complete set of Sketch Engine tools is available for working with this Tigrinya web corpus to generate:
- word sketch – Tigrinya collocations categorized by grammatical relations
- thesaurus – synonyms and similar words for every word
- word lists – lists of Tigrinya nouns, verbs, adjectives etc. organized by frequency
- n-grams – frequency list of multi-word units
- concordance – examples in context
- keywords – terminology extraction of one-word units
- text type analysis – statistics of metadata in the corpus
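
To make the word list and n-gram items above more concrete, here is a minimal sketch of how frequency lists of single words and of multi-word units can be derived from tokenized text. It does not reflect how Sketch Engine implements these tools; the function name is hypothetical, the tokens are placeholders rather than Tigrinya words, and whitespace tokenization is assumed.

```python
from collections import Counter


def ngram_frequencies(tokens, n):
    """Return a Counter of contiguous n-token sequences in the token list."""
    # zip over n shifted copies of the list yields every window of length n.
    return Counter(zip(*(tokens[i:] for i in range(n))))


# Placeholder tokens for illustration only, assuming whitespace tokenization.
tokens = "a b a b c".split()

word_list = Counter(tokens)             # frequency list of single words
bigrams = ngram_frequencies(tokens, 2)  # frequency list of two-word units

for gram, count in bigrams.most_common():
    print(" ".join(gram), count)
```

Sorting the counts in descending order gives the frequency-ordered lists that the word list and n-gram tools present.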
Bibliography
Corpus factory method
Kilgarriff, A., Reddy, S., Pomikálek, J., & Avinesh, P. V. S. (2010, May). A corpus factory for many languages. In LREC.