deSKELL: German corpus for SKELL
The German corpus for SKELL (deSKELL) is made up of texts collected from the Internet. The texts were selected from the deTenTen 2013 corpus by Egon W. Stemle of Eurac Research. The corpus was built specifically to provide the best example sentences.
SKELL
SKELL is an abbreviation of Sketch Engine for Language Learning. It is a freely available web interface suitable for learners of German.
Good sentence examples
The corpus consists only of individual sentences (adjoining sentences do not have to relate to each other), which were sorted by text quality. This quality is computed by the GDEX system, which assigns a score to each sentence. The score is based mainly on sentence length (minimum and maximum length) and on the frequency of the words that occur in the sentence. The sentences are sorted so that those with the highest scores are displayed as the first results of a concordance.
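The scoring idea described above can be illustrated with a minimal sketch. This is not the actual GDEX formula: the length limits, the frequency threshold, the penalty factor and the function names `score_sentence` and `rank_sentences` are all assumptions chosen for illustration only.

```python
from typing import Dict, List

# Hypothetical settings; the real GDEX configuration is not given in this text.
MIN_LEN = 5            # assumed minimum sentence length in tokens
MAX_LEN = 20           # assumed maximum sentence length in tokens
COMMON_THRESHOLD = 2   # assumed corpus frequency at which a word counts as common


def score_sentence(sentence: str, freq: Dict[str, int]) -> float:
    """Return a score in [0, 1]: the share of common words,
    halved when the sentence length falls outside the allowed range."""
    tokens = sentence.lower().split()
    if not tokens:
        return 0.0
    # Strip trailing punctuation before looking a token up in the frequency list.
    common = sum(
        1 for t in tokens if freq.get(t.strip(".,;:!?"), 0) >= COMMON_THRESHOLD
    )
    score = common / len(tokens)
    if not (MIN_LEN <= len(tokens) <= MAX_LEN):
        score *= 0.5
    return score


def rank_sentences(sentences: List[str], freq: Dict[str, int]) -> List[str]:
    """Sort sentences so that the highest-scoring ones come first."""
    return sorted(sentences, key=lambda s: score_sentence(s, freq), reverse=True)
```

Given a frequency list for the corpus, `rank_sentences` returns the sentences in the order in which a concordance would show them under these assumed rules: sentences of moderate length made up of frequent words come first.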
Part-of-speech tagset
The deSKELL corpus was tagged by RFTagger using this POS tagset.
Tools to work with the German corpus
A complete set of tools is available to work with this German corpus to generate:
- word sketch – German collocations categorized by grammatical relations
- thesaurus – synonyms and similar words for every word
- keywords – terminology extraction of one-word and multi-word units
- word lists – lists of German nouns, verbs, adjectives etc. organized by frequency
- n-grams – frequency list of multi-word units
- concordance – examples in context
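As a rough illustration of what the word list and n-gram tools produce, the sketch below counts contiguous token sequences and orders them by frequency. It is not Sketch Engine's implementation; the function name `ngram_frequencies` and the whitespace tokenization in the usage note are assumptions.

```python
from collections import Counter
from typing import List, Tuple


def ngram_frequencies(tokens: List[str], n: int) -> List[Tuple[Tuple[str, ...], int]]:
    """Count contiguous n-token sequences and return them, most frequent first."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # Slide a window of width n over the tokens; too-short input yields no n-grams.
    grams = (tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return Counter(grams).most_common()
```

With `n=1` the result is a plain word frequency list; with `n=2` or more it is a frequency list of multi-word units. For example, `ngram_frequencies("guten morgen und guten morgen".split(), 2)` puts the bigram `("guten", "morgen")` first because it occurs twice.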
Changelog
English Web 2015 (enTenTen15)
- initial size 28 billion words
v2 (spring 2017)
- 15 billion words
- genre classification
- in-depth analysis and removal of spam, including documents that are too short
English Web 2013 (enTenTen13)
- 19 billion words
English Web 2012 (enTenTen12)
version 1 (14 June 2012)
- sample of corpus – 3.7 billion words
- crawled by SpiderLing in May 2012
- encoded in UTF-8
version 2 (2012)
- full corpus – 11 billion words
English Web 2008 (enTenTen08)
version 1 (15 November 2010)
- initial version – 3.3 billion tokens
- crawled by Heritrix in 2008
- encoded in Latin1
Bibliography
SkELL corpus
BAISA, Vít and Vít SUCHOMEL. SkELL – Web Interface for English Language Learning. In Eighth Workshop on Recent Advances in Slavonic Natural Language Processing. Brno: Tribun EU, 2014, pp. 63-70. ISSN 2336-4289.
References to SkELL and versioning
From time to time, the underlying corpus data may change (cleaning, refining, etc.). To refer to particular results (for example, using bookmarked URLs), also cite the particular version. The web interface may also change occasionally. Each SkELL page shows a version via the “Terms” link in the bottom-left corner, e.g. VERSION1-VERSION2. These refer to the version of the interface and the version of the corpus data, respectively.