Corpus of Classical Tibetan
The Annotated Corpus of Classical Tibetan (ACTib) version 2.0 is a Tibetan corpus containing 170 million words. It consists of Classical Tibetan texts and was built as part of the Tibetan in Digital Communication project (2012-2015). ACTib is a collection of Tibetan electronic texts compiled by the Buddhist Digital Resource Center and can be downloaded from this Zenodo repository.
Part-of-speech tagging
The corpus was lemmatized and PoS-tagged with the TreeTagger tool created by Helmut Schmid. The TreeTagger model for Tibetan was trained by Yeshe Tenley (the parameter file and training corpus can be found here). The lexicon, the corpus, and the enumeration of tags in the training data come from Dr. Nathan Hill.
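As a rough illustration of working with tagged output, the sketch below reads lines in the tab-separated token, tag, lemma layout that TreeTagger prints when token and lemma output are both enabled. The exact file layout of the ACTib download is not described here, so treat that layout as an assumption. The sample lines and the tags `NOUN` and `VERB` are made up for the example and are not taken from the ACTib tagset; `TaggedToken` and `parse_tagged_lines` are hypothetical names.

```python
from typing import Iterable, List, NamedTuple


class TaggedToken(NamedTuple):
    """One token with its part-of-speech tag and lemma."""
    token: str
    tag: str
    lemma: str


def parse_tagged_lines(lines: Iterable[str]) -> List[TaggedToken]:
    """Parse tab-separated 'token<TAB>tag<TAB>lemma' lines.

    Blank lines and lines without exactly three fields (for example
    structural markup) are skipped rather than treated as errors.
    """
    rows: List[TaggedToken] = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue
        parts = line.split("\t")
        if len(parts) != 3:
            continue
        rows.append(TaggedToken(*parts))
    return rows


# Made-up sample; placeholder tags, not the real ACTib tagset.
SAMPLE = "བླ་མ\tNOUN\tབླ་མ\nཡིན\tVERB\tཡིན\n"
tokens = parse_tagged_lines(SAMPLE.splitlines())
```

Skipping lines that do not have three fields keeps the reader tolerant of markup lines; a stricter variant could raise an error instead.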
Availability
The corpus is accessible to all Sketch Engine users, including trial users, and can also be downloaded in its entirety from the Zenodo repository.
DOI for part-of-speech-tagged version: 10.5281/zenodo.3785070
Tools to work with the corpus of Classical Tibetan
A complete set of Sketch Engine tools is available to work with this Tibetan corpus to generate:
- word sketch – Tibetan collocations categorized by grammatical relations
- thesaurus – synonyms and similar words for every word
- keywords – terminology extraction of one-word units
- word lists – lists of Tibetan nouns, verbs, adjectives etc. organized by frequency
- n-grams – frequency list of multi-word units
- concordance – examples in context
- text type analysis – statistics of metadata in the corpus
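To make three of the outputs listed above concrete (word lists, n-grams, and concordance), here is a minimal sketch over an already segmented list of tokens. It is a simplified illustration of the general ideas, not how Sketch Engine computes them. The function names `ngram_counts` and `concordance` and the toy tokens are assumptions introduced for the example.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def ngram_counts(tokens: Sequence[str], n: int) -> Counter:
    """Count every contiguous run of n tokens."""
    return Counter(
        tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
    )


def concordance(
    tokens: Sequence[str], keyword: str, width: int = 2
) -> List[Tuple[List[str], str, List[str]]]:
    """Return (left context, keyword, right context) for each hit."""
    hits = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = list(tokens[max(0, i - width):i])
            right = list(tokens[i + 1:i + 1 + width])
            hits.append((left, tok, right))
    return hits


# Toy input standing in for segmented Tibetan tokens.
toy_tokens = ["a", "b", "a", "b", "c"]

word_list = Counter(toy_tokens).most_common()   # tokens by frequency
bigrams = ngram_counts(toy_tokens, 2)           # two-token units
kwic = concordance(toy_tokens, "b", width=1)    # keyword in context
```

With real data, the token list would come from the segmented corpus, and the word list could be restricted to one part of speech by filtering on the tag first.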
How to cite?
Meelen, Marieke, & Roux, Élie. (2020). The Annotated Corpus of Classical Tibetan (ACTib) – Version 2.0 (Segmented & POS-tagged) [Data set]. Zenodo. http://doi.org/10.5281/zenodo.3951503
Bibliography
Garrett, Edward, Hill, Nathan W., Kilgarriff, Adam, Vadlapudi, Ravikiran, & Zadoks, Abel. (2015). ‘The contribution of corpus linguistics to lexicography and the future of Tibetan dictionaries.’ Revue d’Etudes Tibétaines, 32, pp. 51-86.
Hill, Nathan W., & Garrett, Edward. (2017). A part-of-speech (POS) tagged corpus of Classical Tibetan [Data set]. Zenodo. http://doi.org/10.5281/zenodo.574878
Meelen, Marieke, & Hill, Nathan W. (forthcoming). ‘Segmenting and POS tagging Classical Tibetan.’ Himalayan Linguistics.
Hill, Nathan W., & Meelen, Marieke. (forthcoming). ‘Creating an Annotated Corpus of Classical Tibetan (ACTib).’
Changelog
ACTib 2.0
- 197 million tokens
ACTib 1.0
- 90 million tokens automatically segmented and POS-tagged (no manual correction)
- created word sketch grammar for the Tibetan language
Initial version of ACTib
- initial size of 21 million words automatically segmented and POS-tagged (no manual correction)