Corpora are a good starting point for collecting historical texts. Historians and other professionals who study history can upload their texts in various formats (TXT, PDF, DOC, etc.) to create a corpus from files, or use our tool for building corpora from the web, e.g. by downloading specific websites containing historical texts or books. A corpus can be divided into smaller parts called subcorpora, which lets users work with only specific parts of the whole corpus, e.g. texts from a specific time period, by a single author, or in a single genre.
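The idea of a subcorpus can be illustrated with a minimal Python sketch. It is not Sketch Engine's API: the `Document` class, the `subcorpus` function and the sample entries are all hypothetical, and the sketch simply assumes that each text carries author, year and genre metadata that can be used as selection criteria.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Document:
    """One text in the corpus plus the metadata used for selection (hypothetical)."""
    title: str
    author: str
    year: int
    genre: str


def subcorpus(corpus, *, author=None, genre=None, year_from=None, year_to=None):
    """Return the documents that satisfy every criterion that was supplied.

    Criteria left as None are ignored; the year range is inclusive.
    """
    selected = []
    for doc in corpus:
        if author is not None and doc.author != author:
            continue
        if genre is not None and doc.genre != genre:
            continue
        if year_from is not None and doc.year < year_from:
            continue
        if year_to is not None and doc.year > year_to:
            continue
        selected.append(doc)
    return selected


# Placeholder documents, invented purely for illustration.
corpus = [
    Document("Text 1", "Author A", 1580, "drama"),
    Document("Text 2", "Author A", 1655, "letter"),
    Document("Text 3", "Author B", 1610, "drama"),
    Document("Text 4", "Author B", 1720, "newspaper"),
]

# A subcorpus limited to one period, and another limited to one author and genre.
seventeenth_century = subcorpus(corpus, year_from=1601, year_to=1700)
author_a_drama = subcorpus(corpus, author="Author A", genre="drama")
```

Here `seventeenth_century` holds only the texts dated 1601 to 1700, and `author_a_drama` holds only the drama texts by the placeholder "Author A"; the full corpus is left unchanged, so several subcorpora can be defined over the same texts.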
Historical corpora:
- Corpus of English Dialogues 1560–1760 (English)
- Early English Books Online 1473–1820 (English)
- GerManC. A Historical Corpus of German Newspapers 1650–1800 (German)
- Old French and Middle French (BFM 2022) 945–1497 (French)
- Penn Historical Corpora 1150–1900 (English)
- Polish Parliamentary Corpus (PPC) 1919–2020 (Polish)
- DraCor Drama corpora – a set of 21 corpora consisting of theater plays in 14 languages and dialects, covering a period of about 2,500 years (472 BC – AD 2017)
- Latin corpus (Latin)
Sketch Engine is also being used in the ChartEx project, which applies text mining methods to medieval Latin charters. The project will make its corpora publicly available through Sketch Engine as it proceeds.
References
Adam Kilgarriff, Miloš Husák and Robyn Woodrow (2012). The Sketch Engine as infrastructure for historical corpora. In Jeremy Jancsary (ed.). Empirical Methods in Natural Language Processing; Proceedings of the Conference on Natural Language Processing 2012, pp. 351–356.
Barbara McGillivray and Adam Kilgarriff (2012). Tools for historical corpus research, and a corpus of Latin (presentation). In New Methods in Historical Corpus Linguistics 3, Germany, 2013, pp. 247–255.