yoWaC: Corpus of the Yoruba Web

The Yoruba Web corpus (yoWaC) is a Yoruba corpus made up of texts collected from the Internet. The corpus was prepared according to the standards described in the paper A Corpus Factory for Many Languages (Kilgarriff et al., LREC 2010).

The data was crawled in 2012 with the SpiderLing web spider and the WebBootCat tool. The final corpus contains 2.8 million words.

Tools to work with the Yoruba corpus

A complete set of Sketch Engine tools is available to work with this Yoruba Web corpus to generate:

  • word lists – lists of Yoruba nouns, verbs, adjectives etc. organized by frequency
  • n-grams – frequency list of multi-word units
  • concordance – examples in context
  • keywords – terminology extraction of one-word units
  • text type analysis – statistics of metadata in the corpus
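
To make the first three outputs in the list above concrete, here is a minimal Python sketch of a frequency-ordered word list, an n-gram frequency list, and a keyword-in-context concordance. It is not how Sketch Engine computes them; the function names (`tokenize`, `word_list`, `ngrams`, `concordance`) and the sample sentence are assumptions made for illustration only, and the sample is an English placeholder rather than text from the corpus.

```python
from collections import Counter
import string


def tokenize(text):
    """Lowercase, split on whitespace, and strip ASCII punctuation.

    Whitespace splitting is used instead of a regex such as \\w+ because
    Yoruba tone marks are often stored as combining characters, which a
    \\w-based pattern may treat as word breaks.
    """
    words = (w.strip(string.punctuation) for w in text.lower().split())
    return [w for w in words if w]


def word_list(tokens):
    """Return (word, count) pairs sorted by descending frequency."""
    return Counter(tokens).most_common()


def ngrams(tokens, n):
    """Count every run of n consecutive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def concordance(tokens, keyword, width=3):
    """Return keyword-in-context lines with up to `width` tokens per side."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{tok}] {right}".strip())
    return lines


# Placeholder sample text (an assumption, not taken from the corpus).
sample = "The cat sat on the mat, and the cat slept."
tokens = tokenize(sample)

print(word_list(tokens)[:2])              # two most frequent words
print(ngrams(tokens, 2).most_common(1))   # most frequent bigram
for line in concordance(tokens, "cat"):
    print(line)
```

The sketch works on raw word forms only. Lists restricted to nouns, verbs, or adjectives would additionally need part-of-speech annotation, which is outside its scope.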

version 2 (17 January 2012)

Sketch Engine general reference

Adam Kilgarriff, Vít Baisa, Jan Bušta, Miloš Jakubíček, Vojtěch Kovář, Jan Michelfeit, Pavel Rychlý, Vít Suchomel. The Sketch Engine: ten years on. Lexicography, 1: 7-36, 2014.
@article{kilgarriff2014sketch,
  title={The Sketch Engine: ten years on},
  author={Kilgarriff, Adam and Baisa, Vít and Bušta, Jan and Jakubíček, Miloš and Kovář, Vojtěch and Michelfeit, Jan and Rychlý, Pavel and Suchomel, Vít},
  journal={Lexicography},
  year={2014},
  volume={1},
  pages={7--36},
  publisher={Springer}
}

WaC corpora

Adam Kilgarriff, Siva Reddy, Jan Pomikálek, Avinesh PVS. A Corpus Factory for Many Languages. Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10), 1: 904-910, 2010.
@inproceedings{kilgarriff2010corpus,
  title={A Corpus Factory for Many Languages},
  author={Kilgarriff, Adam and Reddy, Siva and Pomikálek, Jan and PVS, Avinesh},
  booktitle={Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)},
  year={2010},
  volume={1},
  pages={904--910},
  publisher={European Language Resources Association (ELRA)}
}

Adam Kilgarriff, Marco Baroni. Large linguistically-processed web corpora for multiple languages. Proceedings of the Eleventh Conference of the European Chapter of the Association for Computational Linguistics: Posters & Demonstrations, 1: 87-90, 2006.
@inproceedings{kilgarriff2006large,
  title={Large linguistically-processed web corpora for multiple languages},
  author={Kilgarriff, Adam and Baroni, Marco},
  booktitle={Proceedings of the Eleventh Conference of the European Chapter of the Association for Computational Linguistics: Posters \& Demonstrations},
  year={2006},
  volume={1},
  pages={87--90},
  publisher={Association for Computational Linguistics}
}

Search the Yoruba corpus

Sketch Engine offers a range of tools to work with the Yoruba corpus.

Other text corpora

Sketch Engine offers 800+ language corpora.

Use Sketch Engine in minutes

Generating collocations, frequency lists, examples in context, and n-grams, or extracting terms, is easy with Sketch Engine. Use our Quick Start Guide to learn it in minutes.