ELEXIS corpora

This collection includes 24 corpora corresponding to the official languages of the European Union (EU), each targeting a final size of 1 billion words per language. This target size has been reached for all languages except Irish, for which the corpus comprises only 58 million words due to the limited availability of suitable data on the Internet.

These corpora belong to the TenTen corpus family. Sketch Engine currently provides access to TenTen corpora in more than 50 languages. The corpora are built using technology specialized in collecting only linguistically valuable web content.

The ELEXIS corpora were created within the ELEXIS project, which ran from 1 April 2018 to 31 March 2022 and was funded by the EU's Horizon 2020 (H2020) research programme. The goal of the project was to establish and provide a European lexicographic infrastructure and to foster research and cooperation in lexicography and natural language processing (NLP).

ELEXIS corpora: semantically annotated samples with word sense disambiguation (WSD)

The collection of ELEXIS corpora also includes a subset of 2-million-word samples that have been semantically annotated and word-sense disambiguated. This word-sense disambiguation (WSD) process applies advanced neural models to determine the correct meaning of words in context, making the text easier to analyze and understand.

The corpora contain three additional attributes related to the WSD:

  • BabelNet synset ID
  • WordNet synset offset
  • NLTK synset

The attributes can be displayed in the Concordance or Word Sketch function.
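When these attribute values are exported and processed outside Sketch Engine, it can help to split them into their parts. The following is a minimal Python sketch, not part of Sketch Engine. It assumes the values follow the usual conventions: BabelNet IDs of the form `bn:` plus eight digits and a part-of-speech letter, WordNet offsets as numbers of up to eight digits, and NLTK synset names of the form `lemma.pos.sense`. The function names and the sample values are hypothetical, and the exact formatting used in the corpora may differ.

```python
import re
from typing import NamedTuple, Tuple


class SynsetName(NamedTuple):
    lemma: str
    pos: str
    sense: int


# Assumed BabelNet ID shape: "bn:" + 8 digits + POS letter (n, v, a, r).
_BABELNET_ID = re.compile(r"bn:(\d{8})([nvar])")


def parse_nltk_synset(name: str) -> SynsetName:
    """Split an NLTK-style synset name such as 'bank.n.01' into its parts."""
    # Split from the right so that lemmas containing dots stay intact.
    parts = name.rsplit(".", 2)
    if len(parts) != 3 or not parts[2].isdigit():
        raise ValueError(f"not an NLTK-style synset name: {name!r}")
    lemma, pos, sense = parts
    return SynsetName(lemma, pos, int(sense))


def parse_babelnet_id(value: str) -> Tuple[int, str]:
    """Return (numeric part, POS letter) of a BabelNet-style synset ID."""
    match = _BABELNET_ID.fullmatch(value)
    if match is None:
        raise ValueError(f"not a BabelNet-style synset ID: {value!r}")
    return int(match.group(1)), match.group(2)


def wordnet_key(offset: str, pos: str) -> str:
    """Combine an offset and a POS letter into the common 'offset-pos' form."""
    # Offsets are conventionally zero-padded to eight digits.
    return f"{int(offset):08d}-{pos}"
```

For example, `parse_nltk_synset("bank.n.01")` yields the lemma `bank`, the part of speech `n` and the sense number `1`. Keeping the three parsers separate means each attribute can be validated on its own, since a token may carry only some of them.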

More information about the WSD can be found in this paper: https://aclanthology.org/2021.emnlp-demo.34.pdf

Search the ELEXIS corpora

Sketch Engine offers a range of tools for working with the ELEXIS corpora, including the samples with semantic annotation.

Tools to work with the ELEXIS corpora from the web

The complete set of Sketch Engine tools can be used with these corpora to generate:

  • word sketch – collocations categorized by grammatical relations
  • thesaurus – synonyms and similar words for every word
  • keywords – terminology extraction of one-word and multi-word units
  • word lists – lists of nouns, verbs, adjectives etc. organized by frequency
  • n-grams – frequency list of multi-word units
  • concordance – examples in context
  • text type analysis – statistics of metadata in the corpus

Note: some functions may not be available for every language.

Word Sense Disambiguation

https://aclanthology.org/2021.emnlp-demo.34.pdf

http://nlp.uniroma1.it/amuse-wsd/

TenTen corpora

SUCHOMEL, Vít. Better Web Corpora For Corpus Linguistics And NLP. 2020. Available also from: https://is.muni.cz/th/u4rmz/. Doctoral thesis. Masaryk University, Faculty of Informatics, Brno. Supervised by Pavel RYCHLÝ.

Jakubíček, M., Kilgarriff, A., Kovář, V., Rychlý, P., & Suchomel, V. (2013, July). The TenTen corpus family. In 7th International Corpus Linguistics Conference CL (pp. 125-127).

Suchomel, V., & Pomikálek, J. (2012). Efficient web crawling for large text corpora. In Proceedings of the seventh Web as Corpus Workshop (WAC7) (pp. 39-43).

Genre annotation

SUCHOMEL, Vít. Genre Annotation of Web Corpora: Scheme and Issues. In Kohei Arai, Supriya Kapoor, Rahul Bhatia. Proceedings of the Future Technologies Conference (FTC) 2020, Volume 1. Vancouver, Canada: Springer Nature Switzerland AG, 2021. pp. 738-754. ISBN 978-3-030-63127-7. doi:10.1007/978-3-030-63128-4_55.
