yoWaC: Corpus of the Yoruba Web
The Yoruba Web corpus (YorubaWaC) is a Yoruba corpus made up of texts collected from the Internet. The corpus was prepared following the methodology described in A Corpus Factory for Many Languages (Kilgarriff et al., LREC 2010).
The data was crawled in 2012 using the SpiderLing web spider and the WebBootCat tool. The final corpus contains 2.8 million words.
Tools to work with the Yoruba corpus
A complete set of Sketch Engine tools is available to work with this Yoruba Web corpus to generate:
- word lists – lists of Yoruba nouns, verbs, adjectives etc. organized by frequency
- n-grams – frequency lists of multi-word units
- concordance – examples in context
- keywords – terminology extraction of one-word units
- text type analysis – statistics of metadata in the corpus
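To make the first three outputs in the list above concrete, here is a minimal Python sketch of a frequency word list, an n-gram frequency list and a keyword-in-context concordance. It is an illustration only and not how Sketch Engine computes them. The function names (`tokenize`, `word_list`, `ngram_list`, `concordance`), the whitespace tokenizer and the placeholder tokens `w1`, `w2`, `w3` are all assumptions made for this example. Text is normalized to Unicode NFC first, on the assumption that tone-marked Yoruba characters may otherwise be encoded in more than one way.

```python
import unicodedata
from collections import Counter

PUNCTUATION = ".,;:!?\"'()"


def tokenize(text):
    """Naive tokenizer (an assumption): NFC-normalize, lowercase, split on whitespace."""
    text = unicodedata.normalize("NFC", text).lower()
    tokens = (tok.strip(PUNCTUATION) for tok in text.split())
    return [tok for tok in tokens if tok]


def word_list(tokens):
    """Return (word, frequency) pairs, most frequent first."""
    return Counter(tokens).most_common()


def ngram_list(tokens, n):
    """Return (n-gram tuple, frequency) pairs, most frequent first."""
    grams = (tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return Counter(grams).most_common()


def concordance(tokens, keyword, width=3):
    """Return (left context, keyword, right context) for every hit of keyword."""
    hits = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            hits.append((left, tok, right))
    return hits


if __name__ == "__main__":
    # Placeholder tokens stand in for real Yoruba text.
    sample = tokenize("w1 w2 w1 w3 w1 w2.")
    print(word_list(sample))
    print(ngram_list(sample, 2))
    print(concordance(sample, "w3", width=1))
```

Word lists by part of speech, as described in the list, would additionally need the POS tags stored in the corpus, which this sketch does not model.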
Changelog
version 2 (17 January 2012)
- corpus tagged using a new POS tagger (77.63% accuracy), lemmatizer and morphological analyser downloaded from http://sivareddy.in/downloads
Bibliography
Sketch Engine general reference
<code>@article{kilgarriff2014sketch, title={The Sketch Engine: ten years on}, author={Kilgarriff, Adam and Baisa, Vít and Bušta, Jan and Jakubíček, Miloš and Kovář, Vojtěch and Michelfeit, Jan and Rychlý, Pavel and Suchomel, Vít}, journal={Lexicography}, year={2014}, volume={1}, pages={7--36}, publisher={Springer} }</code>
WaC corpora
<code>@article{kilgarriff2010corpus, title={A Corpus Factory for Many Languages}, author={Kilgarriff, Adam and Reddy, Siva and Pomikálek, Jan and PVS, Avinesh}, journal={Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)}, year={2010}, volume={1}, pages={904--910}, publisher={European Language Resources Association (ELRA)} }</code>
<code>@article{kilgarriff2006large, title={Large linguistically-processed web corpora for multiple languages}, author={Kilgarriff, Adam and Baroni, Marco}, journal={Proceedings of the Eleventh Conference of the European Chapter of the Association for Computational Linguistics: Posters \& Demonstrations}, year={2006}, volume={1}, pages={87--90}, publisher={Association for Computational Linguistics} }</code>
Use Sketch Engine in minutes
Generating collocations, frequency lists, examples in context and n-grams, or extracting terms, is easy with Sketch Engine. Use our Quick Start Guide to learn it in minutes.