slWaC – Slovenian corpus from the web

The Slovenian web corpus (slWaC) is a Slovenian corpus made up of texts collected from the Internet. The corpus was prepared according to standards described in the document A Corpus Factory for Many Languages (Kilgarriff et al. at LREC 2010). The corpus was created in June 20217 and its total size is 754 million words.

Part-of-speech tagset

The slWaC corpus was PoS tagged with the MULTEXT-East Slovenian part-of-speech tagset version 5, which indicates the part of speech and grammatical categories of each token. The corpus is also lemmatized: each word form in the corpus is assigned its base form (lemma).
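To make the tagging and lemmatization concrete, here is a minimal Python sketch of how a positional MULTEXT-East tag for a noun could be unpacked into readable features. The attribute order (type, gender, number, case) and the value codes are assumptions based on the MULTEXT-East Slovenian specification, and the sample token, the `decode_noun_msd` function and the dictionary layout are hypothetical illustrations, not the corpus's actual storage format.

```python
from typing import Dict

# Positional attributes of a noun tag after the leading "N".
# Assumed from the MULTEXT-East Slovenian specification; check the
# official tagset documentation before relying on these tables.
NOUN_POSITIONS = [
    ("type", {"c": "common", "p": "proper"}),
    ("gender", {"m": "masculine", "f": "feminine", "n": "neuter"}),
    ("number", {"s": "singular", "d": "dual", "p": "plural"}),
    ("case", {"n": "nominative", "g": "genitive", "d": "dative",
              "a": "accusative", "l": "locative", "i": "instrumental"}),
]


def decode_noun_msd(msd: str) -> Dict[str, str]:
    """Turn a noun tag such as 'Ncfsg' into a feature dictionary."""
    if not msd.startswith("N"):
        raise ValueError(f"not a noun tag: {msd!r}")
    features = {"pos": "noun"}
    # Pair each character after "N" with its positional attribute.
    for code, (name, values) in zip(msd[1:], NOUN_POSITIONS):
        if code != "-":  # "-" marks an attribute that does not apply
            features[name] = values.get(code, code)
    return features


# Hypothetical annotated token: word form, tag and lemma.
token = {"word": "hiše", "msd": "Ncfsg", "lemma": "hiša"}
decoded = decode_noun_msd(token["msd"])
print(token["word"], "->", token["lemma"], decoded)
```

The sketch covers nouns only; other parts of speech use different attribute lists and would need their own tables.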

Tools to work with the Slovenian corpus

A complete set of tools is available to work with this Slovenian corpus to generate:

word sketch – Slovenian collocations categorized by grammatical relations
thesaurus – synonyms and similar words for every word
word lists – lists of Slovenian nouns, verbs, adjectives etc. organized by frequency
n-grams – frequency list of multi-word units
concordance – examples in context
keywords – terminology extraction of one-word units
text type analysis – statistics of metadata in the corpus
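
As a rough illustration of what three of the outputs listed above contain (word lists, n-grams and concordances), the following Python sketch computes them for a tiny token list. It is not how Sketch Engine implements these tools; the function names and the sample sentence are hypothetical.

```python
from collections import Counter
from typing import List, Tuple


def word_list(tokens: List[str]) -> List[Tuple[str, int]]:
    """Word forms sorted by descending frequency."""
    return Counter(tokens).most_common()


def ngram_counts(tokens: List[str], n: int) -> Counter:
    """Frequency of every run of n consecutive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def concordance(tokens: List[str], keyword: str,
                window: int = 2) -> List[Tuple[str, str, str]]:
    """Each hit of the keyword with up to `window` tokens of context per side."""
    hits = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            hits.append((left, tok, right))
    return hits


# Hypothetical sample: "this is a house and this is a garden".
tokens = ["to", "je", "hiša", "in", "to", "je", "vrt"]

print(word_list(tokens))
print(ngram_counts(tokens, 2).most_common(1))
for left, kw, right in concordance(tokens, "je"):
    print(f"{left:>10} [{kw}] {right}")
```

On a real corpus the same counts would normally be computed over lemmas or filtered by part-of-speech tag, which is what makes the lemmatization and tagging described earlier useful.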

Overview of Slovenian slWaC corpora

This is a list of Slovenian Web corpora available in Sketch Engine:

Slovenian Web (slWaC 2.1) – 754 million words tagged using the MULTEXT-East tagset version 5
Slovenian Web (slWaC 2.1, TreeTagger version 2) – 755 million words processed with TreeTagger version 2