ELEXIS corpora: semantically annotated WSD corpora

ELEXIS corpora

This collection includes 24 corpora corresponding to the official languages of the European Union (EU), each targeting a final size of 1 billion words per language. This target size has been reached for all languages except Irish, for which the corpus comprises only 58 million words due to the limited availability of suitable data on the Internet.

These corpora belong to the TenTen corpus family. Sketch Engine currently provides access to TenTen corpora in more than 50 languages. The corpora are built using technology specialized in collecting only linguistically valuable web content.

The ELEXIS corpora were created within the ELEXIS project, which ran from 1 April 2018 to 31 March 2022 and was funded by the EU's H2020 research programme. The goal of the project was to establish and provide a European lexicographic infrastructure and to foster research and cooperation in lexicography and natural language processing (NLP).

Overview of ELEXIS corpora

These web corpora were crawled and processed repeatedly over the years:

ELEXIS corpora: semantically annotated samples with word sense disambiguation (WSD)

The collection of ELEXIS corpora also includes a subset of 2-million-word samples that have been semantically annotated and word-sense disambiguated. This word-sense disambiguation (WSD) process applies advanced neural models to determine the correct meaning of words in context, making the text easier to analyze and understand.

The corpora contain three additional attributes related to the WSD:

BabelNet synset ID
WordNet synset offset
NLTK synset

The attributes can be displayed in the Concordance or Word Sketch function.
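
As a rough illustration of what the three attributes might look like when handled outside Sketch Engine, the sketch below parses and validates identifiers of each kind. It assumes the common conventions for these resources: NLTK synset names of the form `lemma.pos.nn`, WordNet offsets written as zero-padded 8-digit numbers followed by a part-of-speech letter, and BabelNet IDs of the form `bn:` plus 8 digits and a part-of-speech letter. The exact value format stored in the ELEXIS corpora is not stated here, so the function names and sample values are hypothetical.

```python
import re
from typing import NamedTuple


class NltkSynset(NamedTuple):
    lemma: str
    pos: str
    sense: int


def parse_nltk_synset(name: str) -> NltkSynset:
    """Split an NLTK-style synset name such as 'dog.n.01' into its parts."""
    # rsplit from the right so that a lemma containing dots stays intact
    lemma, pos, sense = name.rsplit(".", 2)
    return NltkSynset(lemma, pos, int(sense))


def wordnet_offset_key(offset: int, pos: str) -> str:
    """Format a WordNet offset as a zero-padded 8-digit number plus POS letter."""
    return f"{offset:08d}{pos}"


# Assumed BabelNet ID shape: 'bn:' + 8 digits + one of n, v, a, r
BABELNET_ID = re.compile(r"^bn:\d{8}[nvar]$")


def is_babelnet_id(value: str) -> bool:
    """Check whether a string has the assumed BabelNet synset ID shape."""
    return BABELNET_ID.match(value) is not None


# Sample values are for illustration only
synset = parse_nltk_synset("dog.n.01")
key = wordnet_offset_key(2084071, "n")
valid = is_babelnet_id("bn:00015267n")
```

Keeping the three identifiers as separate, validated values makes it easier to link a concordance hit to the matching WordNet or BabelNet entry later on.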

More information about the WSD can be found in this paper: https://aclanthology.org/2021.emnlp-demo.34.pdf

Overview of ELEXIS corpora with word-sense disambiguation

This is a list of 2-million-word samples of ELEXIS corpora that have been semantically annotated:

Search the ELEXIS corpora

Sketch Engine offers a range of tools to work with these ELEXIS corpora, including the samples with semantic annotation.

Tools to work with the ELEXIS corpora from the web

A complete set of Sketch Engine tools is available to work with these corpora to generate:

word sketch – collocations categorized by grammatical relations
thesaurus – synonyms and similar words for every word
keywords – terminology extraction of one-word and multi-word units
word lists – lists of nouns, verbs, adjectives etc. organized by frequency
n-grams – frequency list of multi-word units
concordance – examples in context
text type analysis – statistics of metadata in the corpus

Note: some functions may not be available for every language.
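
To make the idea behind the n-grams tool listed above concrete, here is a minimal sketch of an n-gram frequency list computed over an already tokenized text. It is not Sketch Engine's implementation. The function name `ngram_frequencies` and the sample sentence are hypothetical, and real corpus tools work on tokenized, lemmatized and tagged data rather than whitespace-split strings.

```python
from collections import Counter
from typing import Sequence


def ngram_frequencies(tokens: Sequence[str], n: int) -> Counter:
    """Count every run of n consecutive tokens in the sequence."""
    # zip over n shifted copies of the token list yields the n-grams
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(" ".join(gram) for gram in grams)


# Naive whitespace tokenization, used here only for illustration
tokens = "the cat sat on the mat and the cat slept".split()
bigrams = ngram_frequencies(tokens, 2)
top = bigrams.most_common(1)  # [('the cat', 2)]
```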

Bibliography

Word Sense Disambiguation

https://aclanthology.org/2021.emnlp-demo.34.pdf

http://nlp.uniroma1.it/amuse-wsd/

TenTen corpora

Suchomel, V. (2020). Better Web Corpora For Corpus Linguistics And NLP. Doctoral thesis. Masaryk University, Faculty of Informatics, Brno. Supervised by Pavel Rychlý. Available also from: https://is.muni.cz/th/u4rmz/.

Jakubíček, M., Kilgarriff, A., Kovář, V., Rychlý, P., & Suchomel, V. (2013, July). The TenTen corpus family. In 7th International Corpus Linguistics Conference CL (pp. 125-127).

Suchomel, V., & Pomikálek, J. (2012). Efficient web crawling for large text corpora. In Proceedings of the seventh Web as Corpus Workshop (WAC7) (pp. 39-43).

Genre annotation

Suchomel, V. (2021). Genre Annotation of Web Corpora: Scheme and Issues. In Kohei Arai, Supriya Kapoor, Rahul Bhatia (eds.), Proceedings of the Future Technologies Conference (FTC) 2020, Volume 1 (pp. 738-754). Vancouver, Canada: Springer Nature Switzerland AG. ISBN 978-3-030-63127-7. doi:10.1007/978-3-030-63128-4_55.
