GerManC: Corpus of German Newspapers

Historical Corpus of German Newspapers 1650–1800

The GerManC corpus is a representative historical corpus of German newspapers from the period 1650–1800, distributed by the University of Oxford Text Archive.

The corpus consists of short text samples of some 200 words each from German newspapers of the early modern period 1650–1800. The corpus metainformation contains full bibliographic details of the original texts, e.g. region, genre, year of publication, author and title. Texts are divided into three fifty-year subperiods (1650–1700, 1701–1750 and 1751–1800).
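As an illustration of the three subperiods described above, the following minimal sketch maps a publication year to its subperiod. The function name `subperiod` and the returned labels are assumptions made for this example; they are not taken from the corpus metadata.

```python
def subperiod(year: int) -> str:
    """Return the fifty-year subperiod that a publication year falls into.

    The boundaries follow the corpus description: 1650-1700, 1701-1750
    and 1751-1800. The label strings are hypothetical, chosen only for
    this sketch.
    """
    if 1650 <= year <= 1700:
        return "1650-1700"
    if 1701 <= year <= 1750:
        return "1701-1750"
    if 1751 <= year <= 1800:
        return "1751-1800"
    # Years outside the corpus range are rejected rather than guessed.
    raise ValueError(f"year {year} is outside the corpus period 1650-1800")
```

Such a helper could be used to group texts by subperiod when comparing frequencies across time, provided the year of publication is read from the text metadata.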

Conversion process

The GerManC corpus in Sketch Engine is based on its LING-GATE version, which contains both linguistic and structural annotations. All annotations, except those annotating subparts of certain tokens, were preserved. Certain phrases (such as headings, acts, speakers, etc.), however, were not annotated within GerManC, so the values of the corresponding attributes were left blank.

Finally, the whole corpus was retagged with the standard TreeTagger analyzer to provide word sketches, which make it possible to explore the grammatical behavior of German in the early modern period.

Part-of-speech tagset

The GerManC POS tagging scheme is based on the STTS tagset for German, with a number of modifications to account for differences between modern and Early Modern German. The POS annotations in GerManC were produced by a re-trained version of the TreeTagger tool. See the STTS tagset for German.

Attributes

Attributes available in the corpus

For all tokens:

word – original word form
tag – TreeTagger output (see the tagset summary)
lempos – lemma+part_of_speech (based on TreeTagger output)

Based on original tagging (partially unavailable):

lemma – base lemma (in its modern form)
norm – normalized word form
lc – lowercase normalized word form
morph – morphological information
tag2 – part-of-speech (original tagger output)
ptag – syntactic category (original tagger output)
kind – token type (word, number, punctuation, etc.)
pID – word id in sentence (used by parser)
pDepID – dependency relation (parser output)
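
To show how the attributes listed above fit together, here is a minimal sketch that reads one token from a tab-separated line. The tab-separated layout, the column order (the order of the list above) and the function name `parse_token` are assumptions for illustration only; the actual export format may differ. Attributes that are unavailable are represented as `None`.

```python
# Attribute names as listed above; the column order is an assumption.
ATTRIBUTES = (
    "word", "tag", "lempos",
    "lemma", "norm", "lc", "morph", "tag2", "ptag", "kind", "pID", "pDepID",
)


def parse_token(line: str) -> dict:
    """Split one tab-separated token line into a dict keyed by attribute name.

    Missing trailing columns and empty values become None, reflecting
    that the original annotations are only partially available.
    """
    values = line.rstrip("\n").split("\t")
    if len(values) > len(ATTRIBUTES):
        raise ValueError("more columns than known attributes")
    # Pad short lines so that every attribute gets a value.
    values += [""] * (len(ATTRIBUTES) - len(values))
    return {name: (value or None) for name, value in zip(ATTRIBUTES, values)}


# Made-up sample line for illustration, not taken from the corpus.
token = parse_token("Zeytung\tNN\t\tZeitung")
print(token["word"], token["tag"], token["lemma"])
```

Keeping blank values as `None` rather than empty strings makes it easier to tell a missing annotation apart from a real value in later processing.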

Authors

The corpus was prepared by Martin Durrell, Paul Bennett, Silke Scheible and Richard J. Whitt.

Bibliography

Durrell, Martin; Ensslin, Astrid and Bennett, Paul (eds.). GerManC. A Historical Corpus of German Newspapers 1650-1800 [Electronic resource].

Attachments

Documentation of GerManC corpus (in pdf)
Appendix1: detailed information about files in the corpus (in xlsx)
Appendix2: names in the genre “newspapers” (in pdf)

Search the GerManC corpus

Sketch Engine offers a range of tools to search the GerManC corpus.

Other text corpora in Sketch Engine

Sketch Engine offers 800+ language corpora.

Use Sketch Engine in minutes

Generating collocations, frequency lists, examples in contexts, n-grams or extracting terms is easy with Sketch Engine. Use our Quick Start Guide to learn it in minutes.
