Historical Corpus of German Newspapers 1650–1800
The GerManC corpus is a representative Historical Corpus of German Newspapers covering the period 1650–1800, distributed by the University of Oxford Text Archive.
The corpus consists of short text samples of some 200 words each from German newspapers of the early modern period 1650–1800. The corpus metadata contains full bibliographic details of the original texts, e.g. region, genre, year of publication, author, title, etc. Texts are divided into three fifty-year subperiods (1650–1700, 1701–1750 and 1751–1800).
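The subperiod division can be sketched in a few lines of Python. This is only an illustration: the function name `subperiod` and the label format are assumptions, not part of the corpus metadata; only the year boundaries come from the description above.

```python
from typing import Optional

# Subperiod boundaries as listed in the corpus description.
SUBPERIODS = [(1650, 1700), (1701, 1750), (1751, 1800)]


def subperiod(year: int) -> Optional[str]:
    """Return a label for the fifty-year subperiod containing `year`.

    Returns None when the year lies outside 1650-1800.
    The label format ("1650-1700") is a hypothetical choice.
    """
    for start, end in SUBPERIODS:
        if start <= year <= end:
            return f"{start}-{end}"
    return None
```

For example, `subperiod(1701)` falls into the second subperiod, while a year such as 1649 yields `None`.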
Conversion process
The GerManC corpus in Sketch Engine is based on its LING-GATE version, which contains both linguistic and structural annotations. All annotations were preserved, except those annotating subparts of certain tokens. Certain phrases (such as headings, acts, speakers, etc.), however, were not annotated within GerManC, so the values of the corresponding attributes were left blank.
Finally, the whole corpus was retagged with the standard TreeTagger to provide word sketches, which make it possible to explore the grammatical behavior of German in the early modern period.
Part-of-speech tagset
The GerManC POS tagging scheme is based on the STTS tagset for German, with a number of modifications to account for differences between modern and Early Modern German. The POS annotations in GerManC were produced by a retrained version of the TreeTagger tool. See the STTS tagset for German.
Attributes
Attributes available in the corpus
For all tokens:
- word – original word form
- tag – TreeTagger output (see the tagset summary)
- lempos – lemma+part_of_speech (based on TreeTagger output)
Based on original tagging (partially unavailable):
- lemma – base lemma (in its modern form)
- norm – normalized word form
- lc – lowercase normalized word form
- morph – morphological information
- tag2 – part-of-speech (original tagger output)
- ptag – syntactic category (original tagger output)
- kind – token type (word, number, punctuation, etc.)
- pID – word id in sentence (used by parser)
- pDepID – dependency relation (parser output)
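The attributes above can be combined in corpus queries. As a minimal sketch, the helper below builds a single-token CQL expression of the form `[attr="value" & attr="value"]`. The function name `cql_token` is hypothetical, the example values (`ADJA`, `Zeitung`) are purely illustrative, and values are not escaped, so this assumes they contain no double quotes. Note that attributes based on the original tagging, such as `lemma`, are only partially available.

```python
def cql_token(**constraints: str) -> str:
    """Build one CQL token expression from attribute=value pairs.

    Hypothetical helper for illustration; values are inserted as-is
    (no escaping), so they must not contain double quotes.
    """
    if not constraints:
        return "[]"  # in CQL, an empty pair of brackets matches any token
    parts = [f'{attr}="{value}"' for attr, value in constraints.items()]
    return "[" + " & ".join(parts) + "]"


# An attributive adjective (STTS tag ADJA) followed by a token
# whose modern lemma is "Zeitung" (illustrative values only).
query = cql_token(tag="ADJA") + " " + cql_token(lemma="Zeitung")
print(query)  # [tag="ADJA"] [lemma="Zeitung"]
```

Keeping each token as a separate bracketed expression mirrors how CQL sequences tokens, so longer patterns can be built by joining more `cql_token` calls with spaces.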
Authors
The corpus was prepared by Martin Durrell; Paul Bennett; Silke Scheible; Richard J. Whitt.
Bibliography
Durrell, Martin; Ensslin, Astrid and Bennett, Paul (eds.). GerManC. A Historical Corpus of German Newspapers 1650-1800 [Electronic resource].
Attachments
- Documentation of GerManC corpus (in pdf)
- Appendix1: detailed information about files in the corpus (in xlsx)
- Appendix2: names in the genre “newspapers” (in pdf)
Search the GerManC corpus
Sketch Engine offers a range of tools to search the GerManC corpus.