ParlaTalk:
automatically updating corpora of parliament speech transcriptions

The ParlaTalk corpora comprise 1.3 billion words of parliamentary debate transcriptions in 18 languages, gathered from the websites of the parliaments of 21 member states of the European Union. (The remaining member states do not publish documents in a format suitable for automatic processing.) There is a separate corpus for each parliament and each chamber.

The corpora cover the period from 2022 to the present. The most recent documents may be missing if:

  • the chamber has published the document but marked it as non-final. In this case, it will be downloaded once the final version is published.
  • the chamber publishes its documents in batches. In some cases, this delays publication by up to a year.
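
The two cases above can be read as a simple filter applied each time a corpus is updated. The following Python sketch is only an illustration of that reading: the `Transcript` record, its fields, and the `ready_for_corpus` function are hypothetical names chosen for this example, not part of ParlaTalk or Sketch Engine.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

# Earliest year covered by the corpora, as stated in the text.
START_YEAR = 2022


@dataclass
class Transcript:
    """Hypothetical record for one published transcript."""
    meeting_date: date
    is_final: bool                 # False while the chamber marks it non-final
    published_on: Optional[date]   # None if the chamber has not released it yet


def ready_for_corpus(doc: Transcript, today: date) -> bool:
    """Return True if the transcript would be picked up by an update run."""
    if doc.meeting_date.year < START_YEAR:
        return False  # outside the covered period
    if doc.published_on is None or doc.published_on > today:
        return False  # e.g. still waiting for the chamber's next batch
    return doc.is_final  # non-final versions wait for the final one
```

Under these assumptions, a transcript that is skipped in one update is picked up by a later one as soon as the chamber releases its final version.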

All corpora are tagged and lemmatized with the default tagging pipeline for the respective language. They also contain metadata (called text types in Sketch Engine), such as the meeting date or the speaker’s name. The metadata follow a unified format across all ParlaTalk corpora. Some corpora contain additional text types, such as the transcriber’s notes or the speaker’s party affiliation; these are not present in all corpora.

ParlaTalk corpus sizes

The corpora are updated automatically once a month. The following table compares the corpus sizes, in millions of words, in 2023 and 2024:

EU member state                   million words in 2023   million words in 2024
Bulgaria                          5                       15
Czechia                           29                      34
Denmark                           79                      80
Netherlands                       81                      94
Ireland                           41                      121
Estonia                           9                       12
Finland                           21                      23
Belgium                           55                      59
France                            190                     243
Austria                           10                      11
Germany (lower chamber only*)     125                     131
Greece (lower chamber only*)      58                      59
Hungary                           3                       3
Italy                             16                      23
Poland (upper chamber only*)      20                      20
Portugal (lower chamber only*)    141                     141
Romania                           40                      44
Slovakia                          7                       10
Slovenia (lower chamber only*)    15                      26
Spain (lower chamber only*)       67                      69
Sweden                            132                     132
sum                               1,144                   1,352
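
To see which corpora grew most, the figures can be compared programmatically. The Python sketch below copies the per-state values from the table; the names `SIZES`, `growth`, and `fastest_growing` are illustrative only. Note that the per-state figures are rounded to whole millions, so adding them up gives 1,350 for 2024 rather than the published 1,352, presumably because of rounding.

```python
# Corpus sizes in million words (2023, 2024), copied from the table.
# Chamber notes such as "lower chamber only" are omitted from the keys.
SIZES = {
    "Bulgaria": (5, 15), "Czechia": (29, 34), "Denmark": (79, 80),
    "Netherlands": (81, 94), "Ireland": (41, 121), "Estonia": (9, 12),
    "Finland": (21, 23), "Belgium": (55, 59), "France": (190, 243),
    "Austria": (10, 11), "Germany": (125, 131), "Greece": (58, 59),
    "Hungary": (3, 3), "Italy": (16, 23), "Poland": (20, 20),
    "Portugal": (141, 141), "Romania": (40, 44), "Slovakia": (7, 10),
    "Slovenia": (15, 26), "Spain": (67, 69), "Sweden": (132, 132),
}


def growth(before, after):
    """Return (absolute growth, relative growth in percent)."""
    absolute = after - before
    return absolute, 100.0 * absolute / before


def fastest_growing(sizes, n=3):
    """Return the n states with the largest absolute growth."""
    ranked = sorted(sizes, key=lambda s: sizes[s][1] - sizes[s][0], reverse=True)
    return ranked[:n]


if __name__ == "__main__":
    for state in fastest_growing(SIZES):
        absolute, relative = growth(*SIZES[state])
        print(f"{state}: +{absolute} million words ({relative:.0f}%)")
    # Overall growth, using the published sum row of the table.
    total_abs, total_rel = growth(1144, 1352)
    print(f"All corpora: +{total_abs} million words ({total_rel:.1f}%)")
```

Ranking by absolute growth highlights the large corpora that gained the most text; ranking by the relative value instead would favour small corpora such as Bulgaria, which tripled in size.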
