MaCoCu Corpora from the web

The MaCoCu corpora were built by crawling the national Internet top-level domain of each target language in 2021 and 2022, with the crawl extended dynamically to other domains as well. The crawler is available in a GitHub repository.

Considerable effort was devoted to cleaning the extracted texts to provide high-quality web corpora. This was achieved by removing boilerplate (jusText) and near-duplicate paragraphs (Onion), and by discarding very short texts as well as texts that are not in the target language. Despite these extensive efforts, the corpora might still contain a small amount of undesirable content. The dataset is characterized by extensive metadata, which allows filtering based on text quality and other criteria (Monotextor), making the corpora highly useful for corpus linguistics studies as well as for training language models and other language technologies.
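
To give a rough idea of two of the cleaning steps described above, the following is a minimal sketch in plain Python. It is not the code of jusText, Onion or Monotextor: the function names, the word-trigram comparison and the thresholds (`min_words`, `dup_threshold`) are assumptions chosen purely for illustration.

```python
import re


def shingles(text, n=3):
    """Return the set of word n-grams (shingles) found in a paragraph."""
    words = re.findall(r"\w+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def clean_paragraphs(paragraphs, min_words=5, dup_threshold=0.5, n=3):
    """Drop very short paragraphs and paragraphs made mostly of n-grams seen before.

    Illustrative only: the thresholds are hypothetical, not the values
    used to build the MaCoCu corpora.
    """
    seen = set()   # every n-gram from paragraphs kept so far
    kept = []
    for par in paragraphs:
        if len(par.split()) < min_words:
            continue  # discard very short texts
        grams = shingles(par, n)
        if grams:
            dup_ratio = len(grams & seen) / len(grams)
            if dup_ratio > dup_threshold:
                continue  # near-duplicate of earlier content
        seen |= grams
        kept.append(par)
    return kept


docs = [
    "The quick brown fox jumps over the lazy dog today.",
    "Click here",
    "The quick brown fox jumps over the lazy dog today!",
    "A completely different paragraph about web corpora and crawling.",
]
cleaned = clean_paragraphs(docs)
```

In this toy input the second paragraph is dropped for being too short and the third because all of its trigrams already occur in the first, so only the first and last paragraphs remain.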

Thanks to the MaCoCu project, corpora in multiple languages are now available in Sketch Engine. If you want to find out more about this project and individual corpora, please refer to this website: https://macocu.eu/

Search the MaCoCu corpora

Sketch Engine offers a range of tools to work with these MaCoCu corpora.

Overview of MaCoCu corpora

The following MaCoCu corpora are available in Sketch Engine:

MaCoCu Albanian Web v1 (2022) – 617 million words
MaCoCu Bosnian Web v1 (2021-2022) – 715 million words
MaCoCu Croatian Web v2 (2021–2022) – 2.3 billion words
MaCoCu Macedonian Web v2 (2021) – 512 million words
MaCoCu Maltese Web v2 (2021) – 331 million words
MaCoCu Montenegrin Web v1 (2021-2022) – 157 million words
MaCoCu Serbian Web v1 (2021-2022) – 2.4 billion words
MaCoCu Slovene Web v2 (2021-2022) – 1.8 billion words
MaCoCu Turkish Web v2 (2021) – 4.2 billion words
MaCoCu Ukrainian Web v1 (2021-2022) – 5.9 billion words

Tools to work with the MaCoCu corpora from the web

A complete set of Sketch Engine tools is available to work with these MaCoCu corpora to generate:

word sketch – collocations categorized by grammatical relations
thesaurus – synonyms and similar words for every word
keywords – terminology extraction of one-word and multi-word units
word lists – lists of nouns, verbs, adjectives etc. organized by frequency
n-grams – frequency list of multi-word units
concordance – examples in context
text type analysis – statistics of metadata in the corpus

Note: Some of the functions may not be available for some of the MaCoCu corpora.
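
As a rough illustration of what two of these outputs contain, the sketch below computes an n-gram frequency list and a simple keyword-in-context concordance over a tokenized toy sentence. It does not show how Sketch Engine implements these tools; the function names and the `width` parameter are hypothetical.

```python
from collections import Counter


def ngram_frequencies(tokens, n=2):
    """Count every run of n consecutive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def concordance(tokens, keyword, width=3):
    """Return (left context, keyword, right context) for each occurrence."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines


tokens = "the web corpus is a large web corpus built from the web".split()
bigrams = ngram_frequencies(tokens, n=2)
hits = concordance(tokens, "corpus", width=2)

for left, kw, right in hits:
    print(f"{left:>12} [{kw}] {right}")
```

A real corpus query additionally relies on lemmas and part-of-speech tags, which this sketch leaves out.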

Changelog

MaCoCu

MaCoCu Bosnian Web v1 (2021-2022) (April 2024) – newly added part-of-speech tagging and lemmatization
MaCoCu Maltese Web v2 (2021) (March 2024)
MaCoCu Albanian Web v1 (2022) (December 2023)
MaCoCu Montenegrin Web v1 (2021-2022) (December 2023)
MaCoCu Serbian Web v1 (2021-2022) (December 2023)
MaCoCu Bosnian Web v1 (2021-2022) (November 2023) – this corpus became obsolete on May 2nd due to the release of a new version that contains POS tagging and lemmatization
MaCoCu Croatian Web v2 (2021–2022) (November 2023)
MaCoCu Slovene Web v2 (2021-2022) (November 2023)
MaCoCu Ukrainian Web v1 (2021-2022) (November 2023)
MaCoCu Macedonian Web v2 (2021-2022) (November 2023)

Bibliography

MaCoCu corpora

Marta Bañón, Miquel Esplà-Gomis, Mikel L. Forcada, Cristian García-Romero, Taja Kuzman, Nikola Ljubešić, Rik van Noord, Leopoldo Pla Sempere, Gema Ramírez-Sánchez, Peter Rupnik, Vít Suchomel, Antonio Toral, Tobias van der Werff, and Jaume Zaragoza. 2022. MaCoCu: Massive collection and curation of monolingual and bilingual data: focus on under-resourced languages. In Proceedings of the 23rd Annual Conference of the European Association for Machine Translation, pages 303–304, Ghent, Belgium. European Association for Machine Translation.

Other text corpora

Sketch Engine offers 800+ language corpora.

Use Sketch Engine in minutes

Generate collocations, frequency lists, examples in contexts, n-grams or extract terms. Use our Quick Start Guide to learn it in minutes.