ELEXIS corpora
This collection includes 24 corpora corresponding to the official languages of the European Union (EU), each targeting a final size of 1 billion words per language. This target size has been reached for all languages except Irish, for which the corpus comprises only 58 million words due to the limited availability of suitable data on the Internet.
These corpora belong to the TenTen corpus family. Sketch Engine currently provides access to TenTen corpora in more than 50 languages. The corpora are built using technology specialized in collecting only linguistically valuable web content.
The ELEXIS corpora were created within the ELEXIS project, which ran from 1 April 2018 to 31 March 2022 and was funded by the EU's Horizon 2020 (H2020) research programme. The goal of the project was to establish a European lexicographic infrastructure and to foster research and cooperation in lexicography and natural language processing (NLP).
Overview of ELEXIS corpora
These web corpora were crawled and processed repeatedly over several years:
- ELEXIS Bulgarian Web 2021 (bgTenTen21)
- ELEXIS Croatian Web 2020 (hrTenTen20)
- ELEXIS Czech Web 2019 (csTenTen19)
- ELEXIS Danish Web 2020 (daTenTen20)
- ELEXIS Dutch Web 2020 (nlTenTen20)
- ELEXIS English Web 2020 (enTenTen20, no genres and topics)
- ELEXIS Estonian Web 2021 (etTenTen21)
- ELEXIS Finnish Web 2019 (fiTenTen19)
- ELEXIS French Web 2020 (frTenTen20)
- ELEXIS German Web 2020 (deTenTen20)
- ELEXIS Greek Web 2019 (elTenTen19)
- ELEXIS Hebrew Web 2021 (heTenTen21)
- ELEXIS Hungarian Web 2020 (huTenTen20)
- ELEXIS Irish Web 2021 (gaTenTen21)
- ELEXIS Italian Web 2020 (itTenTen20)
- ELEXIS Latvian Web 2021 (lvTenTen21)
- ELEXIS Lithuanian Web 2021 (ltTenTen21)
- ELEXIS Polish Web 2019 (plTenTen19)
- ELEXIS Portuguese Web 2020 (ptTenTen20)
- ELEXIS Romanian Web 2021 (roTenTen21)
- ELEXIS Slovak Web 2021 (skTenTen21)
- ELEXIS Slovene Web 2020 (slTenTen20)
- ELEXIS Spanish Web 2020 (esTenTen20)
- ELEXIS Swedish Web 2020 (svTenTen20)
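The corpus titles in the list above appear to follow a regular pattern: a language name, the crawl year, and a short identifier made of a two-letter language prefix, "TenTen", and the last two digits of the year. The following is a minimal sketch of how such titles could be split into their parts, for example when selecting corpora by language or year in a script. The pattern is inferred from the entries on this page and is an assumption, not an official naming specification; the names `CorpusName` and `parse_corpus_name` are hypothetical.

```python
import re
from typing import NamedTuple, Optional


class CorpusName(NamedTuple):
    language: str   # language name from the title, e.g. "Bulgarian"
    year: int       # four-digit year from the title, e.g. 2021
    code: str       # short identifier, e.g. "bgTenTen21"
    lang_code: str  # two-letter prefix of the identifier, e.g. "bg"


# Assumption: this pattern is inferred from the list entries only.
# Extra text inside the parentheses (", no genres and topics") and
# any suffix after them are tolerated and ignored.
_NAME_RE = re.compile(
    r"ELEXIS (?P<language>.+?) Web (?P<year>\d{4}) "
    r"\((?P<code>(?P<lang_code>[a-z]{2})TenTen(?P<yy>\d{2}))[^)]*\)"
)


def parse_corpus_name(title: str) -> Optional[CorpusName]:
    """Return the parts of a corpus title, or None if it does not match."""
    m = _NAME_RE.search(title)
    if m is None:
        return None
    year = int(m["year"])
    # The two-digit suffix of the identifier should agree with the year.
    if int(m["yy"]) != year % 100:
        return None
    return CorpusName(m["language"], year, m["code"], m["lang_code"])


if __name__ == "__main__":
    print(parse_corpus_name("- ELEXIS Irish Web 2021 (gaTenTen21)"))
```

Titles that do not fit the assumed pattern yield `None` rather than a partial result, so unexpected entries are easy to spot.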
ELEXIS corpora: semantically annotated samples with word sense disambiguation (WSD)
The ELEXIS collection also includes a set of 2-million-word samples that have been semantically annotated with word sense disambiguation (WSD). WSD applies neural models to determine the meaning of each word in its context, which makes the text easier to analyze and understand.
The sample corpora contain three additional WSD-related attributes:
- BabelNet synset ID
- WordNet synset offset
- NLTK synset
The attributes can be displayed in the Concordance or Word Sketch function.
More information about the WSD method can be found in this paper: https://aclanthology.org/2021.emnlp-demo.34.pdf
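The three attributes listed above identify the same sense in different inventories. The following is a minimal sketch of how one token's annotation could be validated and unpacked in a script. It assumes that BabelNet IDs look like `bn:` followed by eight digits and a part-of-speech letter, that the WordNet offset is a string of digits, and that the NLTK synset uses the `lemma.pos.nn` naming of NLTK's WordNet interface. The exact value formats stored in the corpora are not documented here, so these formats, the sample values, and the names `SenseAnnotation` and `parse_sense` are assumptions for illustration.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class SenseAnnotation:
    babelnet_id: str     # e.g. "bn:00015267n" (illustrative value)
    wordnet_offset: int  # numeric WordNet synset offset
    lemma: str           # lemma part of the NLTK synset name
    pos: str             # part of speech: n, v, a, s or r
    sense_number: int    # sense index within the lemma and POS


# Assumed BabelNet ID shape: "bn:" + 8 digits + POS letter.
_BABELNET_RE = re.compile(r"bn:\d{8}[nvar]")
_NLTK_POS = {"n", "v", "a", "s", "r"}


def parse_sense(babelnet_id: str, wordnet_offset: str,
                nltk_synset: str) -> SenseAnnotation:
    """Validate the three attribute values and return them in parsed form."""
    if not _BABELNET_RE.fullmatch(babelnet_id):
        raise ValueError(f"unexpected BabelNet ID: {babelnet_id!r}")
    # Split from the right so that lemmas containing dots stay intact.
    lemma, pos, number = nltk_synset.rsplit(".", 2)
    if pos not in _NLTK_POS:
        raise ValueError(f"unexpected part of speech: {pos!r}")
    return SenseAnnotation(babelnet_id, int(wordnet_offset),
                           lemma, pos, int(number))


if __name__ == "__main__":
    ann = parse_sense("bn:00015267n", "02084071", "dog.n.01")
    print(ann.lemma, ann.pos, ann.sense_number, ann.wordnet_offset)
```

With NLTK and its WordNet data installed, the parsed name or the pair of part of speech and offset could then be passed to `nltk.corpus.wordnet.synset` or `synset_from_pos_and_offset` to look up definitions and related senses.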
Overview of ELEXIS corpora with word-sense disambiguation
This is a list of 2-million-word samples of ELEXIS corpora that have been semantically annotated:
- ELEXIS Bulgarian Web 2021 (bgTenTen21) WSD sample
- ELEXIS Croatian Web 2020 (hrTenTen20) WSD sample
- ELEXIS Czech Web 2019 (csTenTen19) WSD sample
- ELEXIS Danish Web 2020 (daTenTen20) WSD sample
- ELEXIS Dutch Web 2020 (nlTenTen20) WSD sample
- ELEXIS English Web 2020 (enTenTen20, no genres and topics) WSD sample
- ELEXIS Estonian Web 2021 (etTenTen21) WSD sample
- ELEXIS Finnish Web 2019 (fiTenTen19) WSD sample
- ELEXIS French Web 2020 (frTenTen20) WSD sample
- ELEXIS German Web 2020 (deTenTen20) WSD sample
- ELEXIS Greek Web 2019 (elTenTen19) WSD sample
- ELEXIS Hebrew Web 2021 (heTenTen21) WSD sample
- ELEXIS Hungarian Web 2020 (huTenTen20) WSD sample
- ELEXIS Irish Web 2021 (gaTenTen21) WSD sample
- ELEXIS Italian Web 2020 (itTenTen20) WSD sample
- ELEXIS Latvian Web 2021 (lvTenTen21) WSD sample
- ELEXIS Lithuanian Web 2021 (ltTenTen21) WSD sample
- ELEXIS Polish Web 2019 (plTenTen19) WSD sample
- ELEXIS Portuguese Web 2020 (ptTenTen20) WSD sample
- ELEXIS Romanian Web 2021 (roTenTen21) WSD sample
- ELEXIS Slovak Web 2021 (skTenTen21) WSD sample
- ELEXIS Slovene Web 2020 (slTenTen20) WSD sample
- ELEXIS Spanish Web 2020 (esTenTen20) WSD sample
- ELEXIS Swedish Web 2020 (svTenTen20) WSD sample
Search the ELEXIS corpora
Sketch Engine offers a range of tools for working with the ELEXIS corpora, including the samples with semantic annotation.
Tools to work with the ELEXIS corpora from the web
A complete set of Sketch Engine tools is available to work with these corpora to generate:
- word sketch – collocations categorized by grammatical relations
- thesaurus – synonyms and similar words for every word
- keywords – terminology extraction of one-word and multi-word units
- word lists – lists of nouns, verbs, adjectives etc. organized by frequency
- n-grams – frequency list of multi-word units
- concordance – examples in context
- text type analysis – statistics of metadata in the corpus
Note: some functions may not be available for every language.
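To make two of the outputs listed above concrete, the sketch below computes an n-gram frequency list and a simple keyword-in-context concordance on a toy sentence. It only illustrates the kind of result these tools return. It is not Sketch Engine's implementation, and the whitespace tokenizer and the function names are assumptions chosen for brevity.

```python
from collections import Counter
from typing import List, Tuple


def tokenize(text: str) -> List[str]:
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def ngram_counts(tokens: List[str], n: int) -> Counter:
    """Count every run of n consecutive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def concordance(tokens: List[str], keyword: str,
                width: int = 3) -> List[Tuple[str, str, str]]:
    """Return (left context, keyword, right context) for each occurrence."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines


if __name__ == "__main__":
    tokens = tokenize("the cat sat on the mat and the cat slept")
    print(ngram_counts(tokens, 2).most_common(1))
    for left, kw, right in concordance(tokens, "cat", width=2):
        print(f"{left:>10} [{kw}] {right}")
```

On a corpus of a billion words the same ideas require indexing rather than a linear scan, which is what a dedicated corpus tool provides.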
Bibliography
Word Sense Disambiguation
https://aclanthology.org/2021.emnlp-demo.34.pdf
http://nlp.uniroma1.it/amuse-wsd/
TenTen corpora
Suchomel, V. (2020). Better Web Corpora For Corpus Linguistics And NLP. Doctoral thesis. Masaryk University, Faculty of Informatics, Brno. Supervised by Pavel Rychlý. Available from: https://is.muni.cz/th/u4rmz/
Jakubíček, M., Kilgarriff, A., Kovář, V., Rychlý, P., & Suchomel, V. (2013, July). The TenTen corpus family. In 7th International Corpus Linguistics Conference CL (pp. 125-127).
Suchomel, V., & Pomikálek, J. (2012). Efficient web crawling for large text corpora. In Proceedings of the seventh Web as Corpus Workshop (WAC7) (pp. 39-43).
Genre annotation
Suchomel, V. (2021). Genre Annotation of Web Corpora: Scheme and Issues. In K. Arai, S. Kapoor, & R. Bhatia, Proceedings of the Future Technologies Conference (FTC) 2020, Volume 1 (pp. 738-754). Vancouver, Canada: Springer Nature Switzerland AG. ISBN 978-3-030-63127-7. doi:10.1007/978-3-030-63128-4_55.