MaCoCu: Corpora from the Web
The MaCoCu corpora were built by crawling the relevant Internet top-level domains in 2021 and 2022 and dynamically extending the crawl to other domains. The crawler is available in a GitHub repository.
Considerable effort was devoted to cleaning the extracted texts to provide high-quality web corpora. This was achieved by removing boilerplate (jusText) and near-duplicate paragraphs (Onion), and by discarding very short texts as well as texts that are not in the target language. Despite these efforts, the corpora might still contain a small amount of undesirable content. The dataset includes extensive metadata that allows filtering by text quality and other criteria (Monotextor), making the corpora highly useful for corpus linguistics studies as well as for training language models and other language technologies.
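To make the cleaning steps more concrete, here is a minimal Python sketch of two of them: discarding very short paragraphs and dropping near-duplicate paragraphs. It is an illustration only and does not reproduce jusText, Onion, or the actual MaCoCu pipeline. The function names (`shingles`, `clean_paragraphs`), the word n-gram overlap measure, and the thresholds (`min_words`, `max_overlap`) are all assumptions chosen for the example.

```python
import re


def shingles(text, n=3):
    """Return the set of lowercased word n-grams found in a paragraph."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < n:
        # Too few words for an n-gram: treat the whole paragraph as one unit.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def clean_paragraphs(paragraphs, min_words=5, max_overlap=0.5, n=3):
    """Keep paragraphs that are long enough and not mostly seen before.

    Hypothetical thresholds: a paragraph is dropped if it has fewer than
    `min_words` whitespace-separated tokens, or if more than `max_overlap`
    of its n-grams already occurred in previously kept paragraphs.
    """
    seen = set()
    kept = []
    for para in paragraphs:
        if len(para.split()) < min_words:
            continue  # very short text
        grams = shingles(para, n)
        if not grams:
            continue  # no word characters at all
        overlap = len(grams & seen) / len(grams)
        if overlap > max_overlap:
            continue  # near-duplicate of earlier content
        seen |= grams
        kept.append(para)
    return kept


sample = [
    "The quick brown fox jumps over the lazy dog near the river bank.",
    "Read more",
    "The quick brown fox jumps over the lazy dog near the river bank today.",
    "A completely different paragraph about corpus linguistics and language models.",
]
kept = clean_paragraphs(sample)
for para in kept:
    print(para)
```

With this sample input, the second entry is dropped as too short and the third as a near-duplicate of the first, so only the first and last paragraphs are kept.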
Thanks to the MaCoCu project, corpora in multiple languages are now available in Sketch Engine. If you want to find out more about this project and individual corpora, please refer to this website: https://macocu.eu/
Search the MaCoCu corpora
Sketch Engine offers a range of tools to work with these MaCoCu corpora.
Overview of MaCoCu corpora
The following MaCoCu corpora are available in Sketch Engine:
- MaCoCu Albanian Web v1 (2022) – 617 million words
- MaCoCu Bosnian Web v1 (2021-2022) – 715 million words
- MaCoCu Croatian Web v2 (2021–2022) – 2.3 billion words
- MaCoCu Macedonian Web v2 (2021) – 512 million words
- MaCoCu Maltese Web v2 (2021) – 331 million words
- MaCoCu Montenegrin Web v1 (2021-2022) – 157 million words
- MaCoCu Serbian Web v1 (2021-2022) – 2.4 billion words
- MaCoCu Slovene Web v2 (2021-2022) – 1.8 billion words
- MaCoCu Turkish Web v2 (2021) – 4.2 billion words
- MaCoCu Ukrainian Web v1 (2021-2022) – 5.9 billion words
Tools to work with the MaCoCu corpora from the web
A complete set of Sketch Engine tools is available to work with these MaCoCu corpora to generate:
- word sketch – collocations categorized by grammatical relations
- thesaurus – synonyms and similar words for every word
- keywords – terminology extraction of one-word and multi-word units
- word lists – lists of nouns, verbs, adjectives, etc., organized by frequency
- n-grams – frequency list of multi-word units
- concordance – examples in context
- text type analysis – statistics of metadata in the corpus
Note: Some of the functions may not be available for some of the MaCoCu corpora.
Changelog
MaCoCu
- MaCoCu Bosnian Web v1 (2021-2022) (April 2024) – newly added part-of-speech tagging and lemmatization
- MaCoCu Maltese Web v2 (2021) (March 2024)
- MaCoCu Albanian Web v1 (2022) (December 2023)
- MaCoCu Montenegrin Web v1 (2021-2022) (December 2023)
- MaCoCu Serbian Web v1 (2021-2022) (December 2023)
- MaCoCu Bosnian Web v1 (2021-2022) (November 2023) – this corpus became obsolete on May 2nd following the release of a new version that includes POS tagging and lemmatization
- MaCoCu Croatian Web v2 (2021–2022) (November 2023)
- MaCoCu Slovene Web v2 (2021-2022) (November 2023)
- MaCoCu Ukrainian Web v1 (2021-2022) (November 2023)
- MaCoCu Macedonian Web v2 (2021-2022) (November 2023)
Bibliography
MaCoCu corpora
Marta Bañón, Miquel Esplà-Gomis, Mikel L. Forcada, Cristian García-Romero, Taja Kuzman, Nikola Ljubešić, Rik van Noord, Leopoldo Pla Sempere, Gema Ramírez-Sánchez, Peter Rupnik, Vít Suchomel, Antonio Toral, Tobias van der Werff, and Jaume Zaragoza. 2022. MaCoCu: Massive collection and curation of monolingual and bilingual data: focus on under-resourced languages. In Proceedings of the 23rd Annual Conference of the European Association for Machine Translation, pages 303–304, Ghent, Belgium. European Association for Machine Translation.