The Polish Parliamentary Corpus / Korpus Dyskursu Parlamentarnego

The Polish Parliamentary Corpus (PPC) is a corpus of Polish made up of documents from the proceedings of the Polish Parliament (the Sejm and the Senate). It incorporates the data of the Polish Sejm Corpus and consists of stenographic records of plenary and committee sittings, together with segments of interpellations and questions. The texts in the PPC cover a hundred years, from 1919 to 2019.

The parliamentary data is in the public domain. The corpus annotations are available under a CC-BY licence. For more information on the corpus, including links to the source data, visit http://clip.ipipan.waw.pl/PPC

Part-of-speech tagset

The PPC corpus was processed by the RFTagger tool using the NKJP part-of-speech tagset (compatible with the annotation in the National Corpus of Polish). The corpus also contains further annotations: tokenization and lemmatization produced with Morfeusz2, disambiguated morphosyntactic descriptions produced with Concraft2, named entities produced with Liner2 (source: Wayback Machine), and dependency structures produced with the COMBO parser.

Search the Polish Parliamentary Corpus (PPC)

Sketch Engine offers a range of tools to work with this Polish corpus.

The Polish Parliamentary Corpus in detail

Basic information

Frequency
Tokens 671,292,351
Words 553,858,723
Sentences 36,766,760
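
The counts above can be combined into a few derived figures, such as average sentence length. The following is a minimal sketch rather than a Sketch Engine feature: the helper name `corpus_ratios` is hypothetical, and it assumes the usual Sketch Engine convention that tokens include punctuation while words do not.

```python
# Counts copied from the "Basic information" figures above.
TOKENS = 671_292_351
WORDS = 553_858_723
SENTENCES = 36_766_760


def corpus_ratios(tokens: int, words: int, sentences: int) -> dict:
    """Return simple derived statistics.

    Hypothetical helper for illustration only; it is not part of any
    Sketch Engine API.
    """
    return {
        "tokens_per_sentence": tokens / sentences,
        "words_per_sentence": words / sentences,
        "word_share_of_tokens": words / tokens,
    }


if __name__ == "__main__":
    # Print each derived figure rounded to two decimal places.
    for name, value in corpus_ratios(TOKENS, WORDS, SENTENCES).items():
        print(f"{name}: {value:.2f}")
```

These ratios are only rough descriptors of the corpus as a whole; they say nothing about variation between periods or document types.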

Tools to work with the Polish Parliamentary Corpus

A complete set of Sketch Engine tools is available for this corpus of documents from the Polish Parliament. The tools can generate:

  • word sketch – Polish collocations categorized by grammatical relations
  • thesaurus – synonyms and similar words for every word
  • keywords – terminology extraction of one-word and multi-word units
  • word lists – lists of Polish nouns, verbs, adjectives etc. organized by frequency
  • n-grams – frequency list of multi-word units
  • concordance – examples in context
  • trends – diachronic analysis that automatically identifies neologisms and changes in use
  • text type analysis – statistics of metadata in the corpus

Maciej Ogrodniczuk and Bartłomiej Nitoń. New developments in the Polish Parliamentary Corpus. In Darja Fišer, Maria Eskevich, and Franciska de Jong, editors, Proceedings of the Second ParlaCLARIN Workshop, pages 1–4, Marseille, France, 2020. European Language Resources Association (ELRA).

Maciej Ogrodniczuk. Polish Parliamentary Corpus. In Darja Fišer, Maria Eskevich, and Franciska de Jong, editors, Proceedings of the LREC 2018 Workshop ParlaCLARIN: Creating and Using Parliamentary Corpora, pages 15–19, Paris, France, 2018. European Language Resources Association (ELRA).

Maciej Ogrodniczuk. The Polish Sejm Corpus. In Proceedings of the Eighth International Conference on Language Resources and Evaluation, LREC 2012, pages 2219–2223, Istanbul, Turkey, 2012. European Language Resources Association (ELRA).

Other Polish corpora

Sketch Engine offers 20+ Polish language corpora

Use Sketch Engine in minutes

Generate collocations, frequency lists, examples in context, and n-grams, or extract terms. Use our Quick Start Guide to learn how in minutes.