The Europarl parallel corpus
The Europarl corpus is a parallel corpus created from the proceedings of the European Parliament in the official languages of the EU. It includes 21 European languages: Romance (French, Italian, Spanish, Portuguese, Romanian), Germanic (English, Dutch, German, Danish, Swedish), Slavic (Bulgarian, Czech, Polish, Slovak, Slovene), Finno-Ugric (Finnish, Hungarian, Estonian), Baltic (Latvian, Lithuanian), and Greek. The corpus was expanded repeatedly, reaching a final size of around 60 million words per language. The texts cover the period April 1996 – November 2011 (the exact range depends on the language pair) and correspond to version 7 of the Europarl corpus.
Most languages of the Europarl corpus were processed with the TreeTagger tool, so lemmas and part-of-speech tags are available in those corpora.
Corpus data and more information can be found on the official website: http://www.statmt.org/europarl/
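For readers who download the raw data, the following minimal Python sketch shows one way to pair up sentences from two line-aligned text files. It assumes that each language pair is distributed as two plain-text files in which line N of one file is the translation of line N of the other; this layout, and the helper name aligned_pairs, are assumptions made for illustration and should be checked against the download page.

```python
from itertools import zip_longest
from typing import Iterable, Iterator, Tuple


def aligned_pairs(
    source_lines: Iterable[str], target_lines: Iterable[str]
) -> Iterator[Tuple[str, str]]:
    """Yield (source, target) sentence pairs from two line-aligned inputs.

    Hypothetical helper: assumes line N of one input translates line N of
    the other. Pairs where either side is empty are skipped.
    """
    for src, tgt in zip_longest(source_lines, target_lines):
        if src is None or tgt is None:
            # One input ended before the other, so the alignment is broken.
            raise ValueError("inputs have different numbers of lines")
        src, tgt = src.strip(), tgt.strip()
        if src and tgt:
            yield src, tgt


# Small in-memory example; open file objects can be passed the same way.
english = ["Resumption of the session\n", "\n", "Thank you.\n"]
french = ["Reprise de la session\n", "(texte manquant)\n", "Merci.\n"]
pairs = list(aligned_pairs(english, french))
```

Because the function accepts any iterable of lines, two open file handles can be passed directly and the files are read lazily rather than loaded into memory.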
Tools to work with Europarl parallel corpora
A complete set of Sketch Engine tools is available to work with the Europarl spoken parallel corpora to generate:
- word sketch – collocations categorized by grammatical relations
- thesaurus – synonyms and similar words for every word
- keywords – terminology extraction of one-word and multi-word units
- word lists – lists of nouns, verbs, adjectives etc. organized by frequency
- n-grams – frequency list of multi-word units
- concordance – examples in context
- text type analysis – statistics of metadata in the corpus
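To make two of the outputs above concrete, here is a minimal Python sketch of what an n-gram frequency list and a concordance (keyword in context) contain. It is a simplified illustration on a whitespace-tokenized toy sentence, not Sketch Engine's implementation; the function names ngram_counts and concordance are hypothetical.

```python
from collections import Counter
from typing import List, Tuple


def ngram_counts(tokens: List[str], n: int) -> Counter:
    """Count every run of n consecutive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def concordance(
    tokens: List[str], keyword: str, width: int = 3
) -> List[Tuple[str, str, str]]:
    """Return (left context, keyword, right context) for each occurrence."""
    hits = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            hits.append((left, tok, right))
    return hits


# Toy input: naive whitespace tokenization of a made-up sentence.
tokens = "the session is resumed and the session is closed".split()
bigrams = ngram_counts(tokens, 2)
kwic = concordance(tokens, "session", width=2)
```

In this toy example the bigram "the session" is counted twice, and the concordance returns one row per occurrence of "session" with up to two words of context on each side.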
Changelog
version 7 TreeTagger (spring 2015)
- corpus tagged by TreeTagger
version 7.0 (May 2012)
- A further expanded and improved version of the corpus was released on 15th May 2012.
version 5.0 (May 2010)
- A further expanded and improved version of the corpus was released on 20th January 2010.
Bibliography
Koehn, P. (2005, September). Europarl: A parallel corpus for statistical machine translation. In MT summit (Vol. 5, pp. 79-86).
Search the Europarl spoken parallel corpus
Sketch Engine offers a range of tools to work with this spoken parallel corpus.
Tip
Create your own multilingual or parallel corpora in Sketch Engine.
See our user guide.
Use Sketch Engine in minutes
Generating collocations, frequency lists, examples in context, and n-grams, or extracting terms, is easy with Sketch Engine. Use our Quick Start Guide to learn it in minutes.