Lektor: Slovenian learner corpus of proofreading and translations

The Lektor corpus is an error-annotated Slovenian corpus of proofreading corrections made to original (authored) texts and translations. The aim of the corpus is to give insight into the most common linguistic errors in Slovenian and into the proofreading process. The texts were manually tagged and classified. The corpus contains rich metadata, such as the types of corrections, information about the proofreader (gender, age, education: linguistics/non-linguistics, Slovenian/non-Slovenian), and the origin of each text (translation or original text).

Part-of-speech tagset and lemmatization

The Slovenian learner corpus Lektor is part-of-speech tagged with the Slovenian tagset, which indicates the part of speech and grammatical categories of each word. The corpus is also lemmatized: each word form is assigned its base form (lemma).

The following error codes are used in the Slovenian learner corpus Lektor, grouped by category. Each entry gives the Slovenian label of the error type and its code.

SLOG (STYLE)

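Dvojnica/variantni zapis (doublet/variant spelling) S-Dvojnica
Tujka (foreign word) S-Tujka
Kolokacija (collocation) S-Kolokacija
Izbris (deletion) S-Izbris
Dodajanje (addition) S-Dodajanje
Prevzemanje (borrowing) S-Prevzemanje
Vezljivost (valency) S-Vezljivost
Besednovrstna pretvorba (part-of-speech conversion) S-Pretvorba
Koreferenca (coreference) S-Koreferenca
Drugo (other) S-Drugo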

OBLIKA (FORM)

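Pregibanje domačih osebnih poimenovanj (inflection of native personal names) O-DomacaOsebna
Pregibanje tujih osebnih poimenovanj (inflection of foreign personal names) O-TujaOsebna
Pregibanje domačih zemljepisnih imen (inflection of native geographical names) O-DomacaKrajevna
Pregibanje tujih zemljepisnih imen (inflection of foreign geographical names) O-TujaKrajevna
Pregibanje stvarnih lastnih imen/občnih besed (inflection of proper names of things/common words) O-StvarnaObcna
Pregibanje pridevnikov (inflection of adjectives) O-Pridevniki
Pregibanje glagolov (inflection of verbs) O-Glagoli
Pregibanje/zapis števnikov (inflection/spelling of numerals) O-Stevniki
Pregibanje nepregibnih/funkcijskih besed (inflection of uninflected/function words) O-Funkcijska
Pregibanje zaimkov (inflection of pronouns) O-Zaimki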

PRAVOPIS (SPELLING)

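Tipkarska napaka (typing error) P-Tipkarska
Zapis (spelling) P-Zapis
Zapis tvorjenke (spelling of a derived word) P-Tvorjenka
Začetnica pri zapisu stvarnega/občnega poimenovanja (capitalization of a proper name of a thing/common name) P-ZacetnicaStvarnaObcna
Začetnica pri zapisu imen bitij (capitalization of names of beings) P-ZacetnicaBitja
Začetnica pri zapisu zemljepisnega imena (capitalization of a geographical name) P-ZacetnicaKrajevno
Začetnica pri zapisu pridevnika (capitalization of an adjective) P-ZacetnicaPridevnik
Stavčna začetnica (sentence-initial capitalization) P-ZacetnicaStavcna
Stava ločila (placement of a punctuation mark) P-LociloStava
Zamenjava ločila (replacement of a punctuation mark) P-LociloZamenjava
Pisanje skupaj/narazen (writing together/apart) P-SkupajNarazen
Sprememba izrazne oblike (change of expression form) P-Izraz
Krajšava (abbreviation) P-Krajsava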

SKLADNJA (SYNTAX)

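Razvezava stavkov (splitting of clauses) Sk-Razvezava
Združitev stavkov (merging of clauses) Sk-Zdruzitev
Zamenjava veznika (replacement of a conjunction) Sk-Veznik
Pretvorba skladenjskega razmerja (conversion of a syntactic relation) Sk-SkladenjskaPretvorba
Besedni red (word order) Sk-BesedniRed
Pretvorba neosebne/brezosebne oblike v tvorno obliko (conversion of a non-personal/impersonal form into an active form) Sk-PretvorbaTvorno
Pretvorba v neosebno/brezosebno obliko (conversion into a non-personal/impersonal form) Sk-PretvorbaNeosebno
Vezava (case government) Sk-Vezava
Stavčno ujemanje/ujemanje naslonskih oblik (clause agreement/agreement of clitic forms) Sk-Ujemanje
Predlog (preposition) Sk-Predlog
Drugo (other) Sk-Drugo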

PRAGMATIKA (PRAGMATICS)

Prevajalska napaka (translation error) Pr-Prevajalska
Pomen (meaning) Pr-Pomen
Faktografija (factual accuracy) Pr-Faktografija
Komentar (comment) Pr-Komentar
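
The prefix of each code identifies its top-level category (S, O, P, Sk, Pr), so error annotations exported from the corpus can be grouped with a few lines of Python. The following is a minimal sketch and not part of Sketch Engine or the corpus tooling: the function names `split_error_code` and `count_by_category` are hypothetical, and it assumes the codes arrive as plain strings in the form listed above.

```python
from __future__ import annotations

from collections import Counter

# Top-level categories, taken from the headings of the error-code list.
CATEGORY_BY_PREFIX = {
    "S": "SLOG (STYLE)",
    "O": "OBLIKA (FORM)",
    "P": "PRAVOPIS (SPELLING)",
    "Sk": "SKLADNJA (SYNTAX)",
    "Pr": "PRAGMATIKA (PRAGMATICS)",
}


def split_error_code(code: str) -> tuple[str, str]:
    """Split a code such as 'Sk-BesedniRed' into (category, subtype)."""
    # Split at the first hyphen only: the prefix is everything before it.
    prefix, sep, subtype = code.partition("-")
    if not sep or not subtype or prefix not in CATEGORY_BY_PREFIX:
        raise ValueError(f"unrecognized error code: {code!r}")
    return CATEGORY_BY_PREFIX[prefix], subtype


def count_by_category(codes: list[str]) -> Counter:
    """Count how many of the given codes fall into each top-level category."""
    return Counter(split_error_code(code)[0] for code in codes)
```

For example, `count_by_category(["S-Tujka", "S-Izbris", "Pr-Pomen"])` counts two STYLE corrections and one PRAGMATICS correction.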

Overview of Lektor corpus versions

The following versions of the Slovenian learner corpus Lektor are available in Sketch Engine:

    • Slovenian Web (slWaC 2.1)
    • Slovenian Web (slWaC 2.1, TreeTagger version 2) – the corpus version processed with the TreeTagger pipeline version 2

Search the Slovenian Lektor corpus

Sketch Engine offers a range of tools to work with this Slovenian learner corpus.

Tools to work with the Slovenian learner corpus Lektor

A complete set of Sketch Engine tools is available to work with this Slovenian learner corpus of proofreading and translations:

  • word sketch – Slovenian collocations categorized by grammatical relations
  • thesaurus – synonyms and similar words for every word
  • keywords – terminology extraction of one-word and multi-word units
  • word lists – lists of Slovenian nouns, verbs, adjectives etc. organized by frequency
  • n-grams – frequency list of multi-word units
  • concordance – examples in context
  • text type analysis – statistics of metadata in the corpus
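
To make the n-grams entry above concrete, here is a minimal sketch in plain Python of what a frequency list of multi-word units amounts to. It does not use Sketch Engine and is not a description of how Sketch Engine computes its lists. The function name `ngram_frequencies`, the whitespace tokenization, and the sample sentence are all assumptions made for illustration.

```python
from collections import Counter


def ngram_frequencies(tokens, n):
    """Count every run of n consecutive tokens in a list of tokens."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # zip over n shifted copies of the list yields each window of n tokens.
    windows = zip(*(tokens[i:] for i in range(n)))
    return Counter(windows)


# Illustrative sentence, split on whitespace (a simplifying assumption).
tokens = "to je test in to je vse".split()
bigrams = ngram_frequencies(tokens, 2)

# Print the bigrams from most to least frequent.
for gram, count in bigrams.most_common():
    print(" ".join(gram), count)
```

In this sample the bigram "to je" occurs twice and every other bigram once, so it comes first in the frequency list.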
