Parallel corpora in Sketch Engine

Parallel corpora are multilingual corpora made from translated texts. Finding parallel data suitable for corpora is extremely difficult. The largest volume of translated text consists of contracts and other legal documents, which are often confidential. Software localization also produces a lot of translated text, but it is rarely useful for corpora because it contains highly specialized language rather than common natural language. This is why the parallel corpora in Sketch Engine reflect what is available rather than what Sketch Engine would ideally like to have.

Texts produced by the EU

The European Union produces an enormous amount of text and most of it must be translated into all the official languages of the EU. The texts are often made publicly available, although not in the format of a parallel concordance. Sketch Engine collected these texts, aligned them and produced these parallel corpora:

EUR-Lex

This is an enormous corpus of documents covering a wide range of topics. Although the language is formal and legal in character, it covers vocabulary from cars to shrimps and from carrots to pneumatic hammers. It is therefore a good starting point for multilingual reference.
https://www.sketchengine.eu/eurlex-corpus/

EUR-Lex Judgements

This is a more specialized corpus containing judgements of the Court of Justice of the European Union.
https://www.sketchengine.eu/eurlex-judgments-corpus/

EUROPARL

This is a corpus of spoken language from the European Parliament. It even contains informal language from when the MEPs get a bit too excited.
https://www.sketchengine.eu/europarl-parallel-corpus/

DGT

The European Commission’s DGT (Directorate-General for Translation) made its multilingual Translation Memory available for download and Sketch Engine processed it into a parallel corpus.
https://www.sketchengine.eu/dgt-translation-memory/

Non-EU languages

The situation is much more complicated for languages that are not official languages of the European Union. Although governments of most countries and regions translate documents into other languages, these translations are generally not publicly available.

OpenSubtitles

The OpenSubtitles parallel corpora are a collection of corpora in 60 languages and language varieties, made up of translated movie subtitles from https://www.opensubtitles.org/. They cover many non-EU languages such as Chinese, Indonesian, Japanese and Korean. The current texts of the OpenSubtitles corpora date back to 2018.

UNPC

The United Nations Parallel Corpus (UNPC) consists of six parallel corpora created from official records and other parliamentary documents of the United Nations.
https://www.sketchengine.eu/united-nations-parallel-corpus-unpc/

OPUS

The OPUS2 collection is a set of 37 corpora drawn from various sources such as medical documents, subtitles, technical documentation, … The original data (see the project website for details) dates from 2013.
https://www.sketchengine.eu/opus-parallel-corpora/
