ParlaTalk: automatically updating corpora of parliamentary debates

The ParlaTalk corpora are a set of 20 corpora comprising 1.3 billion words of parliamentary debate transcriptions in 18 languages. The texts were gathered from the parliamentary websites of 20 member states of the European Union. (The missing states do not provide documents in a format suitable for automatic processing.) The ParlaTalk corpora are monitor corpora: they are updated automatically once a month. Together, they grow by about 200 million words every year.

The ParlaTalk corpora contain metadata (also called text types) such as the meeting date or the speaker’s name. The text types use a unified format across all corpora. Some corpora include additional text types, e.g. the transcriber’s notes or the speaker’s party affiliation.

Each ParlaTalk corpus covers a different period, depending on the data published by the specific parliament. Usually, the last 5 years are included, and sometimes earlier years as well. The most recent documents may be missing if:

the chamber has published the document but marked it as non-final. In this case, the document is downloaded once the final version is published.
the chamber publishes its documents in batches. The resulting delay can sometimes be as long as a year.
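
The first rule can be illustrated with a minimal sketch. It is not the actual ParlaTalk update pipeline, which is not described here; the Transcript class, its fields and the select_for_update function are hypothetical names introduced only for this example.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Transcript:
    """Hypothetical record for one published document (not a real ParlaTalk type)."""
    doc_id: str
    sitting_date: date
    is_final: bool  # assumption: the chamber exposes a final/non-final flag


def select_for_update(published, already_in_corpus):
    """Return the documents a monthly update run would add.

    Non-final documents are skipped; they are picked up by a later run
    once the chamber publishes the final version.
    """
    return [
        doc for doc in published
        if doc.is_final and doc.doc_id not in already_in_corpus
    ]


published = [
    Transcript("2024-10-01", date(2024, 10, 1), is_final=True),
    Transcript("2024-10-15", date(2024, 10, 15), is_final=False),
    Transcript("2024-09-20", date(2024, 9, 20), is_final=True),
]
new_docs = select_for_update(published, already_in_corpus={"2024-09-20"})
```

Here only the document from 1 October is added: the one from 15 October is still non-final, and the one from 20 September is already in the corpus.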

Part-of-speech tagset, lemmatization

All corpora are part-of-speech tagged and lemmatized. Tagging marks each word form with its part of speech and grammatical categories; lemmatization assigns each word form in the corpus to its base form (lemma). The particular part-of-speech tagset used for each corpus can be checked within the Sketch Engine interface.

ParlaTalk corpora – corpus sizes

The total size of the ParlaTalk corpora is 1.3 billion words as of November 2024. The table below compares the corpus sizes in 2023 and 2024, in millions of words:

EU member state	million words in 2023	million words in 2024
Bulgaria	5	15
Czechia	29	34
Denmark	79	80
Netherlands	81	94
Ireland	41	121
Estonia	9	12
Finland	21	23
Belgium	55	59
France	190	243
Austria	10	11
Germany (lower chamber only*)	125	131
Greece (lower chamber only*)	58	59
Hungary	3	3
Italy	16	23
Poland (upper chamber only*)	20	20
Portugal (lower chamber only*)	141	141
Romania	40	44
Slovakia	7	10
Slovenia (lower chamber only*)	15	26
Spain (lower chamber only*)	67	69
Sweden	132	132
sum	1,144	1,352
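
The growth of each corpus can be derived from the table above. The following is a minimal sketch that copies the rounded figures from the table; the names SIZES and growth_table are introduced only for this example. Note that the rounded 2024 figures add up to 1,350 rather than the stated 1,352. The difference is presumably a rounding effect, but that is an assumption, not something the table states.

```python
# Corpus sizes in million words, copied from the table above: (2023, 2024).
SIZES = {
    "Bulgaria": (5, 15),
    "Czechia": (29, 34),
    "Denmark": (79, 80),
    "Netherlands": (81, 94),
    "Ireland": (41, 121),
    "Estonia": (9, 12),
    "Finland": (21, 23),
    "Belgium": (55, 59),
    "France": (190, 243),
    "Austria": (10, 11),
    "Germany": (125, 131),
    "Greece": (58, 59),
    "Hungary": (3, 3),
    "Italy": (16, 23),
    "Poland": (20, 20),
    "Portugal": (141, 141),
    "Romania": (40, 44),
    "Slovakia": (7, 10),
    "Slovenia": (15, 26),
    "Spain": (67, 69),
    "Sweden": (132, 132),
}


def growth_table(sizes):
    """Return (state, absolute growth, relative growth in %) rows,
    sorted by absolute growth, largest first."""
    rows = []
    for state, (before, after) in sizes.items():
        absolute = after - before
        relative = 100.0 * absolute / before
        rows.append((state, absolute, relative))
    return sorted(rows, key=lambda row: row[1], reverse=True)


total_2023 = sum(before for before, _ in SIZES.values())
total_2024 = sum(after for _, after in SIZES.values())

if __name__ == "__main__":
    for state, absolute, relative in growth_table(SIZES)[:5]:
        print(f"{state:<12} +{absolute:>3} million words ({relative:.0f} %)")
    print(f"total 2023: {total_2023}, total 2024: {total_2024}")
```

By this calculation, the Irish and French corpora account for most of the growth between the two years (80 and 53 million words, respectively).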

Tools to work with the ParlaTalk corpora

The complete set of Sketch Engine tools is available for working with these corpora of parliamentary debates:

word sketch – collocations categorized by grammatical relations
thesaurus – synonyms and similar words for every word
keywords – terminology extraction of one-word and multi-word units
word lists – lists of nouns, verbs, adjectives etc. organized by frequency
n-grams – frequency list of multi-word units
concordance – examples in context
trends – diachronic analysis that automatically identifies neologisms and changes in use
text type analysis – statistics of metadata in the corpus
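
To make two of these notions concrete, the sketch below computes an n-gram frequency list and a simple concordance over a two-sentence sample. It is plain Python, not the Sketch Engine implementation or its API; the tokenizer is deliberately naive, and the function names and the sample text are assumptions made for this example.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens (naive tokenizer)."""
    return re.findall(r"\w+", text.lower())


def ngram_frequencies(tokens, n):
    """Count every run of n consecutive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def concordance(tokens, keyword, width=3):
    """Return (left context, keyword, right context) for every hit of keyword,
    with up to `width` tokens of context on each side."""
    hits = []
    for i, token in enumerate(tokens):
        if token == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            hits.append((left, token, right))
    return hits


# Invented sample text, not taken from any ParlaTalk corpus.
SAMPLE = (
    "The committee approved the budget. "
    "The committee rejected the amendment to the budget."
)

tokens = tokenize(SAMPLE)
bigrams = ngram_frequencies(tokens, 2)

if __name__ == "__main__":
    # Most frequent two-word units in the sample.
    for gram, count in bigrams.most_common(2):
        print(" ".join(gram), count)
    # Each occurrence of "budget" with its context.
    for left, keyword, right in concordance(tokens, "budget"):
        print(f"{left:>25} [{keyword}] {right}")
```

A real corpus query works over lemmas and part-of-speech tags as well as word forms, which this sketch leaves out.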