The Polish Parliamentary Corpus / Korpus Dyskursu Parlamentarnego
The Polish Parliamentary Corpus (PPC) is a Polish corpus made up of documents from the proceedings of both chambers of the Polish Parliament, the Sejm and the Senate. It incorporates the data of the Polish Sejm Corpus and consists of stenographic records of plenary and committee sittings, together with segments of interpellations and questions. The texts cover a hundred years, from 1919 to 2019.
The parliamentary data is in the public domain. The corpus annotations are available under a CC-BY licence. For more information on the corpus, including links to the source data, visit http://clip.ipipan.waw.pl/PPC
Part-of-speech tagset
The PPC was processed by the RFTagger tool using the NKJP part-of-speech tagset (compatible with the annotation in the National Corpus of Polish). The corpus also contains further annotations: tokenization and lemmatization produced with Morfeusz2, disambiguated morphosyntactic descriptions produced with Concraft2, named entities produced with Liner2 (source: Wayback Machine), and dependency structures produced with the COMBO parser.
Search the Polish Parliamentary Corpus (PPC)
Sketch Engine offers a range of tools to work with this Polish corpus.
The Polish Parliamentary Corpus in detail
Basic information
Unit | Frequency |
Tokens | 671,292,351 |
Words | 553,858,723 |
Sentences | 36,766,760 |
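The three counts above can be combined into a few rough ratios. The following is a minimal sketch that uses only the figures from the table; the function name `corpus_ratios` is hypothetical, and the reading of the results assumes that the token count includes punctuation and other non-word items while the word count does not.

```python
# Figures copied from the table above.
TOKENS = 671_292_351
WORDS = 553_858_723
SENTENCES = 36_766_760


def corpus_ratios(tokens: int, words: int, sentences: int) -> dict:
    """Derive simple descriptive ratios from the basic corpus counts."""
    return {
        # Tokens that are not counted as words (assumed: punctuation etc.).
        "non_word_tokens": tokens - words,
        # Share of all tokens that are words.
        "word_share": words / tokens,
        # Average sentence length, measured in tokens and in words.
        "tokens_per_sentence": tokens / sentences,
        "words_per_sentence": words / sentences,
    }


ratios = corpus_ratios(TOKENS, WORDS, SENTENCES)
for name, value in ratios.items():
    print(f"{name}: {value:,.3f}")
```

Under that assumption, roughly 82.5% of the tokens are words, and an average sentence is about 18 tokens (about 15 words) long.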
Tools to work with the Polish Parliamentary Corpus
A complete set of Sketch Engine tools is available to work with this Polish corpus of documents from the Polish Parliament. The tools generate:
- word sketch – Polish collocations categorized by grammatical relations
- thesaurus – synonyms and similar words for every word
- keywords – terminology extraction of one-word and multi-word units
- word lists – lists of Polish nouns, verbs, adjectives etc. organized by frequency
- n-grams – frequency list of multi-word units
- concordance – examples in context
- trends – diachronic analysis automatically identifies neologisms and changes in use
- text type analysis – statistics of metadata in the corpus
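To make two of the outputs listed above more concrete, here is a minimal sketch of an n-gram frequency list and a keyword-in-context concordance. It is an illustration only and not how Sketch Engine computes these results: the functions `ngram_counts` and `concordance` are hypothetical, and the sample sentence is invented rather than taken from the corpus.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count every run of n consecutive tokens."""
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(" ".join(gram) for gram in grams)


def concordance(tokens, keyword, width=2):
    """Return (left context, keyword, right context) for each hit."""
    lines = []
    for i, token in enumerate(tokens):
        if token == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, token, right))
    return lines


# Invented sample, already tokenized (not taken from the PPC).
sample = "pan poseł ma głos , pan poseł dziękuje".split()

bigrams = ngram_counts(sample, 2)
kwic = concordance(sample, "poseł")

print(bigrams.most_common(3))
for left, keyword, right in kwic:
    print(f"{left:>10} [{keyword}] {right}")
```

On a real corpus the same counts would be computed over the annotated tokens, so that results can also be grouped by lemma or part-of-speech tag.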
Bibliography
Maciej Ogrodniczuk and Bartłomiej Nitoń. New developments in the Polish Parliamentary Corpus. In Darja Fišer, Maria Eskevich, and Franciska de Jong, editors, Proceedings of the Second ParlaCLARIN Workshop, pages 1–4, Marseille, France, 2020. European Language Resources Association (ELRA).
Maciej Ogrodniczuk. Polish Parliamentary Corpus. In Darja Fišer, Maria Eskevich, and Franciska de Jong, editors, Proceedings of the LREC 2018 Workshop ParlaCLARIN: Creating and Using Parliamentary Corpora, pages 15–19, Paris, France, 2018. European Language Resources Association (ELRA).
Maciej Ogrodniczuk. The Polish Sejm Corpus. In Proceedings of the Eighth International Conference on Language Resources and Evaluation, LREC 2012, pages 2219–2223, Istanbul, Turkey, 2012. European Language Resources Association (ELRA).