LatinISE corpus
The LatinISE corpus is a Latin text corpus compiled from three online collections: LacusCurtius, IntraText and Musisque Deoque. The texts span areas such as literature, history, philosophy and poetry. The corpus also includes rich metadata such as genre, title, century or specific date.
This Latin corpus was built by Barbara McGillivray. Please cite the paper in the Bibliography section (below) when using this corpus.
Lemmatization and part-of-speech tagset
The texts were lemmatized using Dag Haug’s Latin morphological analyser and Quick Latin. They were then POS-tagged with TreeTagger, trained on the Index Thomisticus Treebank, the Latin Dependency Treebank and the Latin treebank of the PROIEL project.
The part-of-speech tagset for the LatinISE corpus is available here.
Available tools for LatinISE corpus
A complete set of tools is available for this Latin corpus. The tools generate:
- word sketch – Latin collocations categorized by grammatical relations
- thesaurus – synonyms and similar words for every word
- keywords – terminology extraction of one-word units
- word lists – lists of Latin nouns, verbs, adjectives etc. organized by frequency
- n-grams – frequency list of multi-word units
- concordance – examples in context
- text type analysis – statistics of metadata in the corpus
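To make three of the outputs listed above concrete (word lists, n-grams and concordance), here is a minimal Python sketch of what each one computes. It is an illustration only and not how Sketch Engine implements these tools. The sample sentence, the whitespace tokenizer and the function names (`tokenize`, `word_list`, `ngrams`, `concordance`) are assumptions made for this example; the real corpus works on lemmatized and POS-tagged text.

```python
from collections import Counter

# Tiny illustrative sample, not taken from the LatinISE data files.
TEXT = ("gallia est omnis divisa in partes tres "
        "quarum unam incolunt belgae aliam aquitani")


def tokenize(text):
    """Naive tokenizer (assumption): lowercase and split on whitespace."""
    return text.lower().split()


def word_list(tokens):
    """Word list: (word, frequency) pairs, most frequent first."""
    return Counter(tokens).most_common()


def ngrams(tokens, n):
    """N-grams: frequency of every run of n consecutive tokens."""
    return Counter(
        tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
    )


def concordance(tokens, keyword, window=3):
    """Concordance: each hit of keyword with `window` tokens of context."""
    hits = []
    for i, token in enumerate(tokens):
        if token == keyword:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            hits.append((left, token, right))
    return hits


if __name__ == "__main__":
    tokens = tokenize(TEXT)
    print(word_list(tokens)[:5])
    print(ngrams(tokens, 2).most_common(5))
    for left, key, right in concordance(tokens, "partes", window=2):
        print(f"{left:>20} [{key}] {right}")
```

In this sketch, counts are over raw word forms. Counting lemmas instead, as the corpus annotation allows, would group inflected forms such as "partes" and "partem" under one entry.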
Bibliography
McGillivray, B. and Kilgarriff, A. (2013). Tools for historical corpus research, and a corpus of Latin. In Paul Bennett, Martin Durrell, Silke Scheible, Richard J. Whitt (eds.), New Methods in Historical Corpus Linguistics. Tübingen: Narr.
Changelog
version 4 (December 2019)
- manual corrections of the most frequent lemmas
- sentence boundaries have been added
version 2 (October 2014)
- part-of-speech tagging has been partially corrected (by Barbara McGillivray)
- text cleaning
- 10.9 million words
version 1 (2011)
- initial size: 11.3 million words
Acknowledgements
Bill Thayer (LacusCurtius), Nicola Mastidoro (IntraText), Linda Spinazzè (Musisque Deoque), Dag Haug (Latin morphological analyser and Latin treebank of the PROIEL project), Marco Passarotti (Index Thomisticus Treebank) and Perseus Project (Latin Dependency Treebank).