Polish Parliamentary Corpus (PPC)

The Polish Parliamentary Corpus / Korpus Dyskursu Parlamentarnego

The Polish Parliamentary Corpus (PPC) is a Polish corpus of documents from the proceedings of both chambers of the Polish Parliament, the Sejm and the Senate. It incorporates the data of the Polish Sejm Corpus and consists of stenographic records of plenary and committee sittings, together with segments of interpellations and questions. The texts in the PPC cover a hundred years, from 1919 to 2019.

The parliamentary data is in the public domain. The corpus annotations are available under a CC-BY licence. For more information on the corpus, including links to the source data, visit http://clip.ipipan.waw.pl/PPC

Part-of-speech tagset

The PPC was tagged with the RFTagger tool using the NKJP part-of-speech tagset, which is compatible with the annotation in the National Corpus of Polish. The corpus also contains further annotation layers: tokenization and lemmatization produced with Morfeusz2, disambiguated morphosyntactic descriptions produced with Concraft2, named entities produced with Liner2, and dependency structures produced with the COMBO parser.

Search the Polish Parliamentary Corpus (PPC)

Sketch Engine offers a range of tools to work with this Polish corpus.

The Polish Parliamentary Corpus in detail

Basic information

Unit	Frequency
Tokens	671,292,351
Words	553,858,723
Sentences	36,766,760
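
The three counts above can be combined into a few derived figures, such as average sentence length. The following is a minimal sketch in Python, not part of Sketch Engine: the function name `corpus_ratios` is hypothetical, and it assumes that every token not counted as a word is a non-word token such as punctuation.

```python
# Counts copied from the "Basic information" table.
TOKENS = 671_292_351
WORDS = 553_858_723
SENTENCES = 36_766_760


def corpus_ratios(tokens: int, words: int, sentences: int) -> dict:
    """Derive simple ratios from raw corpus counts (hypothetical helper)."""
    # Assumption: tokens that are not words are non-word tokens (e.g. punctuation).
    non_word_tokens = tokens - words
    return {
        "tokens_per_sentence": tokens / sentences,
        "words_per_sentence": words / sentences,
        "non_word_tokens": non_word_tokens,
        "non_word_share": non_word_tokens / tokens,
    }


stats = corpus_ratios(TOKENS, WORDS, SENTENCES)
for name, value in stats.items():
    # Integers are printed with thousands separators, ratios with three decimals.
    if isinstance(value, int):
        print(f"{name}: {value:,}")
    else:
        print(f"{name}: {value:.3f}")
```

These are averages over the whole corpus, which mixes plenary sittings, committee sittings, interpellations and questions, so they say nothing about differences between text types.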

Tools to work with the Polish Parliamentary Corpus

The complete set of Sketch Engine tools is available for this corpus of Polish parliamentary documents. The tools can be used to generate:

word sketch – Polish collocations categorized by grammatical relations
thesaurus – synonyms and similar words for every word
keywords – terminology extraction of one-word and multi-word units
word lists – lists of Polish nouns, verbs, adjectives etc. organized by frequency
n-grams – frequency list of multi-word units
concordance – examples in context
trends – diachronic analysis automatically identifies neologisms and changes in use
text type analysis – statistics of metadata in the corpus

Bibliography

Maciej Ogrodniczuk and Bartłomiej Nitoń. New developments in the Polish Parliamentary Corpus. In Darja Fišer, Maria Eskevich, and Franciska de Jong, editors, Proceedings of the Second ParlaCLARIN Workshop, pages 1–4, Marseille, France, 2020. European Language Resources Association (ELRA).

Maciej Ogrodniczuk. Polish Parliamentary Corpus. In Darja Fišer, Maria Eskevich, and Franciska de Jong, editors, Proceedings of the LREC 2018 Workshop ParlaCLARIN: Creating and Using Parliamentary Corpora, pages 15–19, Paris, France, 2018. European Language Resources Association (ELRA).

Maciej Ogrodniczuk. The Polish Sejm Corpus. In Proceedings of the Eighth International Conference on Language Resources and Evaluation, LREC 2012, pages 2219–2223, Istanbul, Turkey, 2012. European Language Resources Association (ELRA).

Other Polish corpora

Sketch Engine offers 20+ Polish language corpora
