slWaC: Slovenian corpus from the web

The Slovenian web corpus (slWaC) is a Slovenian corpus made up of texts collected from the Internet. The corpus was prepared according to standards described in the document A Corpus Factory for Many Languages (Kilgarriff et al. at LREC 2010). The corpus was created in June 20217 and its total size is 754 million words.

Part-of-speech tagset

The slWaC corpus was PoS tagged with the MULTEXT-East Slovenian part-of-speech tagset version 5, whose tags indicate the part of speech and the grammatical categories of each word. The corpus is also lemmatized: each word form is assigned its base form (lemma).
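To illustrate how a single positional tag can carry both the part of speech and the grammatical categories, the following minimal sketch decodes a noun tag such as Ncmsn. It rests on several assumptions: the attribute tables are a simplified, noun-only subset of the MULTEXT-East positional scheme with English attribute codes, the function name decode_noun_msd is hypothetical, and none of this is part of Sketch Engine. The tagset documentation remains the authority on the tags actually used in the corpus.

```python
# Simplified, noun-only attribute tables (an assumption for illustration;
# the full MULTEXT-East specification defines many more categories and values).
NOUN_POSITIONS = [
    ("type", {"c": "common", "p": "proper"}),
    ("gender", {"m": "masculine", "f": "feminine", "n": "neuter"}),
    ("number", {"s": "singular", "d": "dual", "p": "plural"}),
    ("case", {"n": "nominative", "g": "genitive", "d": "dative",
              "a": "accusative", "l": "locative", "i": "instrumental"}),
]


def decode_noun_msd(tag):
    """Decode a positional noun tag into a dict of named features.

    Hypothetical helper: the first character is the category (N = noun),
    and each following character fills one attribute slot in order.
    """
    if not tag or tag[0] != "N":
        raise ValueError(f"not a noun tag: {tag!r}")
    features = {"pos": "noun"}
    for (name, values), code in zip(NOUN_POSITIONS, tag[1:]):
        if code == "-":  # a hyphen marks an attribute that does not apply
            continue
        features[name] = values.get(code, f"unknown ({code})")
    return features


print(decode_noun_msd("Ncmsn"))
```

Under these assumptions, Ncmsn reads as a common masculine noun in the nominative singular. Other parts of speech use different attribute slots, so a real decoder would need one table per category.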

Tools to work with the Slovenian corpus

A complete set of tools is available for working with this Slovenian corpus. They generate:

  • word sketch – Slovenian collocations categorized by grammatical relations
  • thesaurus – synonyms and similar words for every word
  • word lists – lists of Slovenian nouns, verbs, adjectives etc. organized by frequency
  • n-grams – frequency list of multi-word units
  • concordance – examples in context
  • keywords – terminology extraction of one-word units
  • text type analysis – statistics of metadata in the corpus
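As a rough illustration of what a word list or an n-gram frequency list contains, the sketch below counts n-grams in a short tokenized sample. It is a toy example under stated assumptions: the function name ngram_frequencies and the sample sentence are made up for illustration, tokenization is plain whitespace splitting, and it does not show how Sketch Engine computes its lists.

```python
from collections import Counter


def ngram_frequencies(tokens, n):
    """Count every run of n consecutive tokens (n=1 gives a plain word list)."""
    # zip over n shifted copies of the token list yields the sliding windows
    windows = zip(*(tokens[i:] for i in range(n)))
    return Counter(" ".join(window) for window in windows)


# Hypothetical sample text, split on whitespace
tokens = "to je test in to je vse".split()

bigrams = ngram_frequencies(tokens, 2)
print(bigrams.most_common(1))  # [('to je', 2)]
```

Calling ngram_frequencies(tokens, 1) on the same sample gives a frequency-ordered word list instead, which mirrors the relationship between the word list and n-gram tools described above.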

Overview of Slovenian slWaC corpora

This is a list of Slovenian Web corpora available in Sketch Engine:

  • Slovenian Web (slWaC 2.1) – 754 million words tagged with the MULTEXT-East tagset version 5
  • Slovenian Web (slWaC 2.1, TreeTagger version 2) – 755 million words processed with TreeTagger version 2

Search the slWaC corpus

Sketch Engine offers a range of tools to work with this Slovenian corpus from the web.

Other text corpora

Sketch Engine offers 800+ language corpora.

Use Sketch Engine in minutes

Generate collocations, frequency lists, examples in context, and n-grams, or extract terms. Use our Quick Start Guide to learn how in minutes.