ruWaC: Russian web corpus
The Russian web corpus (ruWaC) is a language corpus made up of texts collected from the Internet. The corpus was prepared by Serge Sharoff at the University of Leeds according to the standards described in the paper "A Corpus Factory for Many Languages" (Kilgarriff et al., LREC 2010).
The ruWaC corpus comprises 140 million words and contains word sketches created by Maria Khokhlova.
Part-of-speech tagset
The ruWaC corpus was POS-tagged with TreeTagger, using a Russian model that was also trained by Serge Sharoff. The part-of-speech tagset legend is available here.
Tools to work with the Russian Web corpus
A complete set of Sketch Engine tools is available to work with the ruWaC corpus and generate:
- word sketch – Russian collocations categorized by grammatical relations
- thesaurus – synonyms and similar words for every word
- word lists – lists of Russian nouns, verbs, adjectives, etc. organized by frequency
- n-grams – frequency lists of multi-word units
- concordance – examples in context
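As a rough illustration of what two of the outputs listed above contain, the following minimal Python sketch builds an n-gram frequency list and a keyword-in-context concordance from a list of tokens. It is not Sketch Engine's implementation and does not use its API: the function names `ngram_frequencies` and `concordance` are hypothetical, and the sample sentence is a hand-made toy example, not text from ruWaC.

```python
from collections import Counter


def ngram_frequencies(tokens, n=2):
    """Count every contiguous sequence of n tokens (hypothetical helper)."""
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(" ".join(gram) for gram in grams)


def concordance(tokens, keyword, width=2):
    """Return each hit of keyword with up to `width` tokens of context per side."""
    lines = []
    for i, token in enumerate(tokens):
        if token == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{token}] {right}".strip())
    return lines


# Toy sample written for this sketch; not taken from the corpus.
tokens = "я читаю книгу и я читаю газету".split()

bigrams = ngram_frequencies(tokens, n=2)
hits = concordance(tokens, "читаю", width=1)

# Print bigrams from most to least frequent, then the concordance lines.
for gram, count in bigrams.most_common():
    print(count, gram)
for line in hits:
    print(line)
```

A real corpus tool works on lemmatized, POS-tagged text and on indexes rather than in-memory lists, so this only shows the shape of the results.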
Changelog
version 2 (28th August 2017)
- created lemposes
initial version (2009)
- size 147 million words
Bibliography
Corpus factory method
Kilgarriff, A., Reddy, S., Pomikálek, J., & Avinesh, P. V. S. (2010, May). A corpus factory for many languages. In LREC.
Russian word sketches
Khokhlova, M. (2010). Building Russian Word Sketches as Models of Phrases. Proc. EURALEX 2010, Leeuwarden.