OpenSubtitles: multilingual corpora in 58 languages

The OpenSubtitles 2018 parallel corpora are a collection of parallel corpora built from translated movie subtitles from https://www.opensubtitles.org/. The collection consists of 60 corpora in 58 languages: Chinese is represented by two corpora, one per character standard (Chinese Simplified and Chinese Traditional), and Portuguese by two corpora, one per language variety (European Portuguese and Brazilian Portuguese).

The list of languages in the collection of the OpenSubtitles corpora includes: Afrikaans, Albanian, Arabic, Armenian, Basque, Bengali, Bosnian, Breton, Bulgarian, Catalan, Chinese (simplified characters and traditional characters), Croatian, Czech, Danish, Dutch, English, Esperanto, Estonian, Finnish, French, Galician, Georgian, German, Greek, Hebrew, Hindi, Hungarian, Icelandic, Indonesian, Italian, Japanese, Kazakh, Korean, Latvian, Lithuanian, Macedonian, Malay, Malayalam, Norwegian, Persian (Farsi), Polish, Portuguese (Brazilian and European), Romanian, Russian, Serbian, Sinhalese, Slovak, Slovenian, Spanish, Swedish, Tagalog, Tamil, Telugu, Thai, Turkish, Ukrainian, Urdu and Vietnamese.

The data were obtained from the OPUS project, which is maintained by Jörg Tiedemann. We lemmatize and part-of-speech tag the texts, and support word sketches and term grammars.

The OpenSubtitles parallel corpora are sentence-aligned, so you can search and analyze them monolingually (as a standard single corpus) or multilingually (as parallel corpora).
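As a rough illustration of what sentence alignment makes possible, the sketch below contrasts a monolingual search with a parallel one. The sentence pairs and the names aligned_pairs, tokenize, monolingual_search and parallel_search are invented for this example. They are not part of Sketch Engine or OPUS, and the real tools work on lemmatized, tagged data rather than raw strings.

```python
import re

# Hypothetical sentence-aligned data: each entry pairs an English
# sentence with its German translation (invented examples).
aligned_pairs = [
    ("Where are you going?", "Wohin gehst du?"),
    ("I am going home.", "Ich gehe nach Hause."),
    ("Are you coming home tonight?", "Kommst du heute Abend nach Hause?"),
]


def tokenize(text):
    """Split text into lowercase word tokens."""
    return re.findall(r"\w+", text.lower())


def monolingual_search(pairs, word):
    """Treat the source side as a single corpus: return matching sentences."""
    word = word.lower()
    return [src for src, _ in pairs if word in tokenize(src)]


def parallel_search(pairs, word):
    """Return matching source sentences together with their aligned translations."""
    word = word.lower()
    return [(src, tgt) for src, tgt in pairs if word in tokenize(src)]


if __name__ == "__main__":
    print(monolingual_search(aligned_pairs, "home"))
    for src, tgt in parallel_search(aligned_pairs, "going"):
        print(f"{src}  |  {tgt}")
```

The only difference between the two functions is whether the aligned translation is returned with the hit, which is the essence of the monolingual versus multilingual view.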

Tools to work with the OpenSubtitles parallel corpora

A complete set of tools is available to work with the multilingual corpora from OpenSubtitles.org to generate:

  • parallel concordance – examples of translations in context
  • word sketch – collocations categorized by grammatical relations
  • thesaurus – synonyms and similar words for every word
  • keywords – terminology extraction of one-word and multi-word units
  • word lists – lists of nouns, verbs, adjectives, etc. organized by frequency
  • n-grams – frequency list of multi-word units
  • concordance – examples in context
  • text type analysis – statistics of metadata in the corpus

The set of tools may vary depending on the particular language.
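To make two of the listed outputs concrete, here is a minimal sketch of a frequency-ordered word list and an n-gram frequency list. It assumes plain whitespace-and-punctuation tokenization of raw text. The function names and sample sentences are hypothetical, and this does not reproduce how Sketch Engine computes these lists (which relies on lemmatized, part-of-speech-tagged corpora).

```python
import re
from collections import Counter


def tokenize(text):
    """Split text into lowercase word tokens."""
    return re.findall(r"\w+", text.lower())


def word_list(sentences):
    """Return (word, frequency) pairs ordered from most to least frequent."""
    counts = Counter()
    for sentence in sentences:
        counts.update(tokenize(sentence))
    return counts.most_common()


def ngrams(sentences, n=2):
    """Return (n-gram, frequency) pairs; n-grams never cross sentence boundaries."""
    counts = Counter()
    for sentence in sentences:
        tokens = tokenize(sentence)
        # zip over n shifted copies of the token list yields consecutive n-tuples
        counts.update(zip(*(tokens[i:] for i in range(n))))
    return counts.most_common()


if __name__ == "__main__":
    sample = ["I am going home.", "I am not going.", "Are you going home?"]
    print(word_list(sample)[:3])
    print(ngrams(sample, n=2)[:3])
```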

OpenSubtitles parallel corpora – statistics

The table below shows the number of aligned sentence pairs for each language pair in the OpenSubtitles parallel corpora. For example, for Afrikaans (af) and Arabic (ar), there are ~12,000 sentences aligned in the direction Afrikaans–Arabic (see the number in the 2nd row of the 6th column) and ~12,300 sentences in the opposite direction, Arabic–Afrikaans (see the number in the 3rd row of the 5th column).

The 2nd column (files), the 3rd column (tokens), and the 4th column (sentences) show the total number of files, tokens, and sentences respectively of the particular language (the size of the corpus for a single language).
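Because the counts differ by direction, the table is best read as a lookup keyed by an ordered (source, target) pair. The sketch below illustrates this with the two approximate figures quoted above. The names aligned_sentences and pair_count are hypothetical, and all other language pairs are left out rather than guessed.

```python
# Approximate counts quoted in the text for Afrikaans (af) and Arabic (ar).
# Keys are ordered (source, target) pairs; other pairs are deliberately omitted.
aligned_sentences = {
    ("af", "ar"): 12_000,  # Afrikaans -> Arabic
    ("ar", "af"): 12_300,  # Arabic -> Afrikaans
}


def pair_count(table, source, target):
    """Return the aligned-sentence count for source -> target, or None if unknown."""
    return table.get((source, target))


if __name__ == "__main__":
    print(pair_count(aligned_sentences, "af", "ar"))
    print(pair_count(aligned_sentences, "ar", "af"))
```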

Jörg Tiedemann, 2012. Parallel Data, Tools and Interfaces in OPUS. In Proceedings of the 8th International Conference on Language Resources and Evaluation (LREC 2012).

Pierre Lison and Jörg Tiedemann, 2016. OpenSubtitles2016: Extracting Large Parallel Corpora from Movie and TV Subtitles. In Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC 2016).

Search the OpenSubtitles parallel corpora

Sketch Engine offers a range of tools to work with the OpenSubtitles parallel corpora.

Tip

Learn to work with multilingual and parallel corpora in Sketch Engine. Find more in our user guide.

More parallel corpora

DGT Translation Memory parallel corpora – European Union’s legislative documents

EUR-Lex 2/2016 parallel corpora – texts from the EUR-Lex database containing public EU documents

Eur-Lex judgments 12/2016 parallel corpora – judgments of the Court of Justice of the European Union

Europarl spoken parallel corpora – transcriptions of the European Parliament Proceedings

Open Parallel Corpus (OPUS) – translated texts from various sources, e.g. medical documents, subtitles, technical documentation, etc.

United Nations Parallel Corpus (UNPC) – official records and other parliamentary documents of the United Nations

Use Sketch Engine in minutes

Generate collocations, frequency lists, examples in context, and n-grams, or extract terms. Use our Quick Start Guide to learn how in minutes.