How to select a corpus

A corpus has to be selected before you can start using any of the Sketch Engine features. Here are a few tips for new users.

Option 1

  • click Select corpus in the left menu
  • click the language and we will select the best corpus for you

The corpus dashboard will open, giving access to the tools and features.

Option 2

  • click Select corpus in the left menu
  • click ADVANCED
  • type the beginning of the language name and/or the beginning of one or more words from the corpus name
  • select the corpus

The corpus dashboard will open, giving access to the tools and features.
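The matching described in Option 2 (typing the beginning of a language and/or of words from the corpus name) can be pictured with a small sketch. This is only an illustration of prefix matching, not the way Sketch Engine implements its search; the `CORPORA` list, the language assigned to each entry, and the function names are assumptions made up for the example.

```python
# Hypothetical mini-catalogue of (language, corpus name) pairs.
# Illustrative only; not the real Sketch Engine corpus list.
CORPORA = [
    ("English", "British National Corpus"),
    ("English", "English Trends"),
    ("English", "OpenSubtitles"),
    ("English", "EUROPARL"),
]


def matches(query: str, language: str, corpus_name: str) -> bool:
    """Return True if every typed term is the beginning of some word
    in the language or in the corpus name (case-insensitive)."""
    words = (language + " " + corpus_name).lower().split()
    terms = query.lower().split()
    return all(any(word.startswith(term) for word in words) for term in terms)


def search(query: str) -> list[str]:
    """Return the names of all corpora in the catalogue that match the query."""
    return [name for language, name in CORPORA if matches(query, language, name)]


print(search("eng tre"))  # terms may be beginnings of several words
print(search("brit"))
```

Typing more word beginnings narrows the list, which is why a few characters are usually enough to find a corpus.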


Which corpus to choose?

Featured corpora

Featured corpora are a good starting point if you need a monolingual corpus. They were pre-selected based on size, quality and the availability of the maximum number of features.


No featured corpus?

If there is no featured corpus in your language, switch to All and use the search. Type a language or a corpus name.

The TenTen corpora are excellent general-purpose corpora. Their main advantage is their large size, typically several billion words.

TenTen is a new generation of web corpora, created by crawling the web in a sophisticated way. The downloaded texts undergo a complex process before they are included in the corpus. They are cleaned of non-text, e.g. navigation menus, legal text or small print, and duplicate text is removed. The texts are also evaluated, and those which are too short or contain too much content unsuitable for use in a corpus are removed. TenTen stands for 10¹⁰ (10 billion) words. TenTen corpora in detail»
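Two of the steps mentioned above, removing duplicates and dropping texts that are too short, can be illustrated with a deliberately simplified sketch. It is not the actual TenTen pipeline: the `min_words` threshold is a hypothetical value, and exact matching after whitespace and case normalization stands in for whatever duplicate detection is really used.

```python
def clean_texts(texts: list[str], min_words: int = 5) -> list[str]:
    """Keep texts that are long enough and not seen before.

    Simplified illustration only: min_words is a made-up threshold and
    duplicates are detected by exact match after normalization.
    """
    seen = set()
    kept = []
    for text in texts:
        # Collapse whitespace and ignore case so trivial variants compare equal.
        normalized = " ".join(text.split()).lower()
        if len(normalized.split()) < min_words:
            continue  # too short to be useful in a corpus
        if normalized in seen:
            continue  # duplicate of a text that was already kept
        seen.add(normalized)
        kept.append(text)
    return kept


downloaded = [
    "Home About",
    "The quick brown fox jumps over the lazy dog.",
    "the quick  brown fox jumps over the lazy dog.",
    "Too short",
]
print(clean_texts(downloaded))
```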

The main advantage of Trends (monitor) corpora is their timestamps, i.e. information about when each text was published. This enables diachronic analysis: finding trending words, neologisms and archaisms, or studying how word usage changes over time. Moreover, the size of the corpora (billions of words) also guarantees coverage of less frequent words and expressions.

Trends corpora (sometimes called timestamped corpora) are created by regularly crawling news articles from the web around the world. Currently, the weekly updated English Trends corpus, with more than 80 billion words, is the biggest corpus in Sketch Engine.
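To show why timestamps matter, here is a minimal sketch of a diachronic comparison: the relative frequency of a word per year, expressed per million tokens. It is not how Sketch Engine computes trends; the function name, the whitespace tokenization and the tiny sample data are all assumptions invented for the illustration.

```python
from collections import Counter


def frequency_per_year(documents, word):
    """Return {year: occurrences of word per million tokens}.

    documents is an iterable of (year, text) pairs. Tokenization is a
    plain whitespace split, which is a simplification.
    """
    hits = Counter()
    totals = Counter()
    for year, text in documents:
        tokens = text.lower().split()
        totals[year] += len(tokens)
        hits[year] += tokens.count(word.lower())
    # Skip years without any tokens to avoid dividing by zero.
    return {
        year: hits[year] * 1_000_000 / totals[year]
        for year in sorted(totals)
        if totals[year]
    }


# Invented sample data, for illustration only.
sample = [
    (2019, "the market is stable"),
    (2020, "lockdown rules and lockdown news"),
    (2020, "the news today"),
]
print(frequency_per_year(sample, "lockdown"))
```

Normalizing by the number of tokens per year keeps the comparison fair when some years contain far more text than others.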

The size of the corpus

Sketch Engine provides hundreds of corpora in various sizes, from tiny (less than a million words) to really huge (10+ billion words). Generally, exploring languages requires large corpora in order to reduce unwanted bias. See the comparison of the well-known British National Corpus (BNC) with other English corpora in Sketch Engine.

Parallel corpora

Most parallel corpora in Sketch Engine are multilingual corpora, i.e. they consist of the same text in many languages. Each language can also be used separately as a monolingual corpus.

Selecting a parallel corpus

You cannot select a parallel corpus as such. What you need to do is:

The OpenSubtitles parallel corpora comprise translated movie subtitles collected from OpenSubtitles.org. This collection includes 60 corpora in 58 different languages. more on OpenSubtitles»

The EUROPARL corpus is created from the proceedings of the European Parliament and is available in 21 European languages. The nature of the corpus makes it a great resource for topics discussed in the European Parliament and for general formal language. Searching for language from topic areas which are rarely discussed in the European Parliament may not produce good results. more on EUROPARL»

EUR-Lex is a corpus created from translated documents of the European Union and is available in the 24 official EU languages. It is recommended for general formal language and subject areas covered in EU documents. Since EU documentation relates to many areas, it is suitable for general use too. more on EUR-Lex»

OPUS is a collection of translated texts from the web. It covers a wide selection of subjects and topics and is available in the largest number of languages. This should be your first choice for parallel corpora. more on OPUS»

The United Nations Parallel Corpus (UNPC) is a compilation of six parallel corpora derived from official records and parliamentary documents of the United Nations. more on UNPC»

Display corpus information

After selecting a corpus, click the (i) info button next to the corpus name at the top centre of the screen.