ParlaTalk: automatically updating corpora of parliamentary debates

The ParlaTalk corpora are a set of 22 corpora comprising almost 3 billion words of parliamentary debate transcriptions in 20 languages. The texts were gathered from the parliamentary websites of 22 member states of the European Union. (The remaining member states do not provide documents in a format suitable for automatic processing.) The ParlaTalk corpora are monitor corpora: they are updated automatically once a month and grow by about 200 million words in total every year.

The ParlaTalk corpora contain metadata (also called text types) such as the meeting date or the speaker’s name. The text types have a unified format across all corpora. Some corpora include additional text types, e.g. the transcriber’s notes or the speaker’s party affiliation.

Each ParlaTalk corpus covers a different period, depending on the data published by the respective parliament. Usually this means the last 5 years, but some corpora also include earlier years. The most recent documents may be missing if:

The chamber published the document but marked it as non-final. In this case, it will be downloaded when the final version is published.
The chamber publishes documents in batches. The resulting delay can be up to a year.
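
The first case above amounts to a simple selection rule in each monthly update. A minimal sketch of that rule, assuming a hypothetical record type with `doc_id`, `published` and `is_final` fields (these names are illustrative and are not taken from the actual ParlaTalk pipeline):

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Document:
    """Hypothetical record for one transcript listed on a parliament's website."""
    doc_id: str
    published: date
    is_final: bool


def select_for_download(available, already_downloaded):
    """Return final documents that are not yet part of the corpus.

    Non-final documents are skipped; a later run picks them up once
    the chamber has published the final version.
    """
    return [
        doc for doc in available
        if doc.is_final and doc.doc_id not in already_downloaded
    ]


# Invented sample data for illustration only.
available = [
    Document("2025-06-03", date(2025, 6, 5), is_final=True),
    Document("2025-06-10", date(2025, 6, 12), is_final=False),  # still provisional
    Document("2025-05-27", date(2025, 5, 29), is_final=True),
]
already_downloaded = {"2025-05-27"}

to_download = select_for_download(available, already_downloaded)
print([doc.doc_id for doc in to_download])
```

Running the same selection every month means a provisional transcript is not lost: it is simply ignored until its status changes.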

Part-of-speech tagset, lemmatization

All corpora are part-of-speech tagged (each token is labelled with its part of speech and grammatical categories) and lemmatized (each word form is assigned its base form, the lemma). The part-of-speech tagset used for a particular corpus can be checked in the Sketch Engine interface.

Specifications on processing ParlaTalk Belgium – parliamentary debates

The ParlaTalk Belgium corpus contains texts in two languages: French and Dutch. More specifically:

Chamber of Representatives (Lower House) – texts in both French and Dutch
Senate (Upper House) – texts in French only, since the Dutch texts are direct translations of the French originals

Each language version of the corpus contains the same content but is processed with different tools:

ParlaTalk Belgium (French) – processed using French-language tools
ParlaTalk Belgium (Dutch) – processed using Dutch-language tools

Text in the other language of each corpus (Dutch in the French corpus, French in the Dutch corpus) is annotated as foreign words. In the concordance results, these foreign words are shown in grey.

To enable this display:

Go to View options.
Under Show structures, select different_lang.
Click Save.
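
The same annotation could also be inspected outside the interface. A minimal sketch, assuming the corpus text is available in a vertical format (one token per line, structures written as XML-like tags) and that foreign-language spans are wrapped in a `different_lang` structure, as the name under Show structures suggests. The sample text and the helper name are invented for illustration:

```python
# Invented sample: a French sentence followed by a Dutch one marked as foreign.
SAMPLE_VERTICAL = """\
<s>
Merci
,
monsieur
le
président
.
</s>
<different_lang>
<s>
Dank
u
wel
.
</s>
</different_lang>
"""


def split_by_language(vertical):
    """Split tokens into main-language and foreign-language lists."""
    main, foreign = [], []
    depth = 0  # greater than 0 while inside a <different_lang> structure
    for line in vertical.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("<different_lang"):
            depth += 1
        elif line.startswith("</different_lang"):
            depth -= 1
        elif line.startswith("<") and line.endswith(">"):
            continue  # any other structure tag, e.g. <s> or </s>
        else:
            token = line.split("\t")[0]  # first column holds the word form
            (foreign if depth > 0 else main).append(token)
    return main, foreign


main_tokens, foreign_tokens = split_by_language(SAMPLE_VERTICAL)
print(len(main_tokens), "main-language tokens")
print(len(foreign_tokens), "foreign tokens:", foreign_tokens)
```

Whether the structure wraps whole sentences or shorter spans in the real corpora is not stated here, so the sketch only relies on the opening and closing tags.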

ParlaTalk corpora — corpus sizes

The total size of the ParlaTalk corpora is 2.8 billion words as of July 2025. The table below shows the size of each national corpus.

ParlaTalk corpus	Number of words
ParlaTalk Austria – parliamentary debates	14 million
ParlaTalk Belgium (Dutch) – parliamentary debates	60 million
ParlaTalk Belgium (French) – parliamentary debates	60 million
ParlaTalk Bulgaria – parliamentary debates	8 million
ParlaTalk Czechia – parliamentary debates	24 million
ParlaTalk Denmark – parliamentary debates	90 million
ParlaTalk Estonia – parliamentary debates	11 million
ParlaTalk Finland – parliamentary debates	26 million
ParlaTalk France – parliamentary debates	107 million
ParlaTalk Germany – parliamentary debates	286 million
ParlaTalk Greece – parliamentary debates	77 million
ParlaTalk Hungary – parliamentary debates	56 million
ParlaTalk Ireland – parliamentary debates	45 million
ParlaTalk Italy – parliamentary debates	106 million
ParlaTalk Latvia – parliamentary debates	1001 million
ParlaTalk Netherlands – parliamentary debates	105 million
ParlaTalk Poland – parliamentary debates	20 million
ParlaTalk Portugal – parliamentary debates	147 million
ParlaTalk Romania – parliamentary debates	45 million
ParlaTalk Slovakia – parliamentary debates	12 million
ParlaTalk Slovenia – parliamentary debates	87 million
ParlaTalk Spain – parliamentary debates	443 million
ParlaTalk Sweden – parliamentary debates	135 million
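
For anyone reusing these figures, the table can be read as tab-separated text. A minimal sketch, using three rows copied from the table above; the helper name `parse_sizes` is illustrative:

```python
# Three rows copied from the table; name and size are separated by a tab.
ROWS = """\
ParlaTalk Austria – parliamentary debates\t14 million
ParlaTalk Germany – parliamentary debates\t286 million
ParlaTalk Spain – parliamentary debates\t443 million
"""


def parse_sizes(text):
    """Map each corpus name to its size in words."""
    sizes = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        name, size = line.split("\t")
        number, unit = size.split()
        if unit != "million":
            raise ValueError(f"unexpected unit in row: {line!r}")
        sizes[name] = int(number) * 1_000_000
    return sizes


sizes = parse_sizes(ROWS)
largest = max(sizes, key=sizes.get)
print(largest, sizes[largest])
print("subtotal for these rows:", sum(sizes.values()))
```

Since the sizes are rounded to the nearest million, any total computed this way is approximate.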

Tools to work with the ParlaTalk corpora

A complete set of Sketch Engine tools is available to work with these corpora of parliamentary debates.
