ParlaTalk: automatically updating corpora of parliamentary debates

The ParlaTalk corpora are a set of 20 corpora comprising 1.3 billion words of parliamentary debate transcriptions in 18 languages. The texts were gathered from the parliamentary websites of 20 member states of the European Union. (The missing states do not provide documents in a format suitable for automatic processing.) The ParlaTalk corpora are monitor corpora: they are updated automatically once a month and grow by about 200 million words in total every year.

The ParlaTalk corpora contain metadata (also called text types) such as the meeting date or the speaker’s name. The text types use a unified format across all corpora. Some corpora include additional text types, e.g. the transcriber’s notes or the speaker’s party affiliation.

Each ParlaTalk corpus covers a different period, depending on the data published by the specific parliament. Usually, the last 5 years are included, and sometimes earlier years as well. The most up-to-date documents may not be present if:

  • the chamber published the document but marked it as non-final. In this case, it will be downloaded when the final version is published.
  • the chamber publishes the documents in batches. This can delay the documents by up to a year.
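The rule for non-final documents can be sketched in a few lines of Python. This is only an illustration of the behaviour described above, not the actual update pipeline: the `Transcript` class, its field names and the `select_for_update` function are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Transcript:
    # Hypothetical record; the field names are assumptions for illustration.
    doc_id: str
    meeting_date: date
    is_final: bool


def select_for_update(published, ids_in_corpus):
    """Return transcripts that are final and not yet part of the corpus."""
    return [
        t for t in published
        if t.is_final and t.doc_id not in ids_in_corpus
    ]


published = [
    Transcript("a", date(2024, 10, 1), True),   # final, already in the corpus
    Transcript("b", date(2024, 10, 8), False),  # non-final: skipped for now
    Transcript("c", date(2024, 10, 15), True),  # final and new: downloaded
]
to_download = select_for_update(published, {"a"})
```

A non-final transcript is simply skipped in the current run and picked up in a later monthly update once the chamber marks it as final.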

Part-of-speech tagset, lemmatization

All corpora are part-of-speech tagged, i.e. each token is annotated with its part of speech and grammatical categories, and lemmatized, i.e. each word form in the corpus is assigned its base form (lemma). The part-of-speech tagset used by a particular corpus can be checked within the Sketch Engine interface.
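As a rough illustration of what tagging and lemmatization add to each token, the Python sketch below reads tokens from a tab-separated, one-token-per-line layout. The column order (word form, tag, lemma), the sample tags and the names `Token` and `parse_token_line` are assumptions made for this example; the real tagset depends on the corpus.

```python
from typing import NamedTuple


class Token(NamedTuple):
    word: str   # word form as it appears in the text
    tag: str    # part-of-speech tag (sample tags only, not a real tagset)
    lemma: str  # base form of the word


def parse_token_line(line: str) -> Token:
    """Split one tab-separated line into word form, tag and lemma."""
    word, tag, lemma = line.rstrip("\n").split("\t")
    return Token(word, tag, lemma)


# Two hypothetical annotated tokens.
sample = "debates\tNNS\tdebate\nwere\tVBD\tbe\n"
tokens = [parse_token_line(line) for line in sample.splitlines()]
lemmas = [t.lemma for t in tokens]
```

Here the word forms "debates" and "were" are mapped to the lemmas "debate" and "be", which is what allows a search for a lemma to find all of its inflected forms.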

ParlaTalk corpora – corpus sizes

The total size of the ParlaTalk corpora is 1.3 billion words as of November 2024. The table below shows the size of each corpus in 2023 and in 2024, in millions of words:

EU member state                 million words in 2023   million words in 2024
Bulgaria                        5                       15
Czechia                         29                      34
Denmark                         79                      80
Netherlands                     81                      94
Ireland                         41                      121
Estonia                         9                       12
Finland                         21                      23
Belgium                         55                      59
France                          190                     243
Austria                         10                      11
Germany (lower chamber only*)   125                     131
Greece (lower chamber only*)    58                      59
Hungary                         3                       3
Italy                           16                      23
Poland (upper chamber only*)    20                      20
Portugal (lower chamber only*)  141                     141
Romania                         40                      44
Slovakia                        7                       10
Slovenia (lower chamber only*)  15                      26
Spain (lower chamber only*)     67                      69
Sweden                          132                     132
Sum                             1,144                   1,352
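
The growth per corpus can be worked out directly from the table. The Python sketch below copies the figures from the table; the names `SIZES`, `growth` and `largest` are illustrative only. The per-corpus 2024 figures add up to 1,350 rather than the 1,352 shown in the sum row, presumably because the individual values are rounded.

```python
# (million words in 2023, million words in 2024), copied from the table.
# Chamber notes such as "lower chamber only" are omitted from the keys.
SIZES = {
    "Bulgaria": (5, 15), "Czechia": (29, 34), "Denmark": (79, 80),
    "Netherlands": (81, 94), "Ireland": (41, 121), "Estonia": (9, 12),
    "Finland": (21, 23), "Belgium": (55, 59), "France": (190, 243),
    "Austria": (10, 11), "Germany": (125, 131), "Greece": (58, 59),
    "Hungary": (3, 3), "Italy": (16, 23), "Poland": (20, 20),
    "Portugal": (141, 141), "Romania": (40, 44), "Slovakia": (7, 10),
    "Slovenia": (15, 26), "Spain": (67, 69), "Sweden": (132, 132),
}

# Growth in million words between the two snapshots.
growth = {name: after - before for name, (before, after) in SIZES.items()}

total_2023 = sum(before for before, _ in SIZES.values())
total_2024 = sum(after for _, after in SIZES.values())
largest = max(growth, key=growth.get)

print(f"2023: {total_2023}, 2024: {total_2024}, "
      f"growth: {total_2024 - total_2023}")
print(f"largest increase: {largest} (+{growth[largest]})")
```

The difference between the two totals, roughly 200 million words, matches the yearly growth stated in the introduction.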

Tools to work with the ParlaTalk corpora

A complete set of Sketch Engine tools is available to work with these corpora of parliamentary debates to generate:

  • word sketch – collocations categorized by grammatical relations
  • thesaurus – synonyms and similar words for every word
  • keywords – terminology extraction of one-word and multi-word units
  • word lists – lists of nouns, verbs, adjectives etc. organized by frequency
  • n-grams – frequency list of multi-word units
  • concordance – examples in context
  • trends – diachronic analysis automatically identifies neologisms and changes in use
  • text type analysis – statistics of metadata in the corpus
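
To give a feel for what an n-gram frequency list contains, here is a minimal Python sketch that counts multi-word units in a tokenized sentence. It is a generic illustration with a made-up sentence and a hypothetical `ngram_counts` helper, not a description of how Sketch Engine computes its n-grams.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count every run of n consecutive tokens."""
    return Counter(
        tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
    )


# Made-up example sentence, split on whitespace.
tokens = "the honourable member asked the honourable minister".split()
bigrams = ngram_counts(tokens, 2)
top = bigrams.most_common(1)
```

In this sentence the most frequent bigram is "the honourable", which occurs twice; all other bigrams occur once.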
