yoWaC: Corpus of the Yoruba Web
The Yoruba Web corpus (YorubaWaC) is a Yoruba corpus made up of texts collected from the Internet. The corpus was prepared according to the standards described in the paper A Corpus Factory for Many Languages (Kilgarriff et al., LREC 2010).
The data was crawled in 2012 using the SpiderLing web spider and the WebBootCat tool. The final corpus contains 2.8 million words.
Tools to work with the Yoruba corpus
A complete set of Sketch Engine tools is available to work with this Yoruba Web corpus to generate:
- word lists – lists of Yoruba nouns, verbs, adjectives etc. organized by frequency
- n-grams – frequency lists of multi-word units
- concordance – examples in context
- keywords – terminology extraction of one-word units
- text type analysis – statistics of metadata in the corpus
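To make the first three outputs more concrete, here is a minimal Python sketch of what a word list, an n-gram list and a concordance compute. It is an illustration only, not Sketch Engine's implementation: the function names (word_list, ngrams, concordance) are hypothetical, and the token list is English placeholder text, not data from the Yoruba corpus.

```python
from collections import Counter


def word_list(tokens):
    """Return (word, count) pairs sorted by descending frequency."""
    return Counter(tokens).most_common()


def ngrams(tokens, n):
    """Count every contiguous sequence of n tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def concordance(tokens, keyword, width=2):
    """Return (left context, keyword, right context) for each occurrence."""
    hits = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            hits.append((left, tok, right))
    return hits


# Placeholder tokens (assumption: real input would be tokenized corpus text)
tokens = ["the", "cat", "saw", "the", "dog", "and", "the", "cat", "ran"]

for word, count in word_list(tokens)[:2]:
    print(word, count)

print(ngrams(tokens, 2).most_common(1))

for left, kw, right in concordance(tokens, "cat"):
    print(f"{left:>10} [{kw}] {right}")
```

In the real tools these lists are built from the tagged and lemmatized corpus, so they can also be restricted by part of speech or grouped by lemma. The sketch above works on raw tokens only.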
Changelog
version 2 (17 January 2012)
- corpus tagged using a new POS tagger (77.63% accuracy), lemmatizer and morphological analyser downloaded from http://sivareddy.in/downloads
Bibliography
Sketch Engine general reference
@article{kilgarriff2014sketch,
title={The Sketch Engine: ten years on},
author={Kilgarriff, Adam and Baisa, Vít and Bušta, Jan and Jakubíček, Miloš and Kovář, Vojtěch and Michelfeit, Jan and Rychlý, Pavel and Suchomel, Vít},
journal={Lexicography},
year={2014},
volume={1},
pages={7--36},
publisher={Springer}
}
WaC corpora
@article{kilgarriff2010corpus,
title={A Corpus Factory for Many Languages},
author={Kilgarriff, Adam and Reddy, Siva and Pomikálek, Jan and PVS, Avinesh},
journal={Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)},
year={2010},
volume={1},
pages={904--910},
publisher={European Language Resources Association (ELRA)}
}
@article{kilgarriff2006large,
title={Large linguistically-processed web corpora for multiple languages},
author={Kilgarriff, Adam and Baroni, Marco},
journal={Proceedings of the Eleventh Conference of the European Chapter of the Association for Computational Linguistics: Posters \& Demonstrations},
year={2006},
volume={1},
pages={87--90},
publisher={Association for Computational Linguistics}
}