MaCoCu: Corpora from the Web

The MaCoCu corpora were built by crawling the relevant Internet top-level domains in 2021 and 2022, with the crawl extended dynamically to other domains. The crawler is available on GitHub.

Considerable effort was devoted to cleaning the extracted texts to provide high-quality web corpora. This was achieved by removing boilerplate (jusText) and near-duplicate paragraphs (onion), and by discarding very short texts as well as texts that are not in the target language. Despite these efforts, the corpora might still contain a small amount of undesirable content. The corpora come with extensive metadata (Monotextor), which allows filtering by text quality and other criteria. This makes them well suited to corpus linguistics studies, as well as to training language models and other language technologies.
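The cleaning steps described above can be illustrated with a minimal Python sketch. It is not the MaCoCu pipeline and does not call jusText or onion. It assumes each document is a dict with the hypothetical keys "lang" and "paragraphs", uses a made-up length threshold (MIN_WORDS), and stands in for near-duplicate detection with a much simpler check: paragraphs that are identical after normalization are treated as duplicates.

```python
import hashlib
import re

# Hypothetical threshold; the cut-off used by the real pipeline is not stated here.
MIN_WORDS = 5


def normalize(paragraph: str) -> str:
    """Lowercase the text and collapse punctuation and whitespace to single spaces."""
    return re.sub(r"[\W_]+", " ", paragraph.lower()).strip()


def clean_documents(documents, target_lang):
    """Yield documents in the target language with short and repeated paragraphs removed.

    Each document is assumed to be a dict with the hypothetical keys
    "lang" (a language code) and "paragraphs" (a list of strings).
    """
    seen = set()  # fingerprints of paragraphs that were already kept
    for doc in documents:
        # Discard texts that are not in the target language.
        if doc["lang"] != target_lang:
            continue
        kept = []
        for paragraph in doc["paragraphs"]:
            key = normalize(paragraph)
            # Discard very short paragraphs.
            if len(key.split()) < MIN_WORDS:
                continue
            # Simplified stand-in for near-duplicate removal:
            # drop paragraphs whose normalized form was seen before.
            fingerprint = hashlib.sha1(key.encode("utf-8")).hexdigest()
            if fingerprint in seen:
                continue
            seen.add(fingerprint)
            kept.append(paragraph)
        if kept:
            yield {"lang": doc["lang"], "paragraphs": kept}
```

Real near-duplicate detection compares overlapping word sequences rather than whole normalized paragraphs, so this sketch would miss paragraphs that differ by even a single word.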

Thanks to the MaCoCu project, corpora in multiple languages are now available in Sketch Engine. If you want to find out more about this project and individual corpora, please refer to this website: https://macocu.eu/

Search the MaCoCu corpora

Sketch Engine offers a range of tools to work with these MaCoCu corpora.

Overview of MaCoCu corpora

The following MaCoCu corpora are available in Sketch Engine:

Tools to work with the MaCoCu corpora from the web

A complete set of Sketch Engine tools is available to work with these MaCoCu corpora to generate:

  • word sketch – collocations categorized by grammatical relations
  • thesaurus – synonyms and similar words for every word
  • keywords – terminology extraction of one-word and multi-word units
  • word lists – lists of nouns, verbs, adjectives etc. organized by frequency
  • n-grams – frequency list of multi-word units
  • concordance – examples in context
  • text type analysis – statistics of metadata in the corpus

Note: Some of the functions may not be available for some of the MaCoCu corpora.

MaCoCu

  • MaCoCu Bosnian Web v1 (2021-2022) (April 2024) – now with part-of-speech tagging and lemmatization
  • MaCoCu Maltese Web v2 (2021) (March 2024)
  • MaCoCu Albanian Web v1 (2022) (December 2023)
  • MaCoCu Montenegrin Web v1 (2021-2022) (December 2023)
  • MaCoCu Serbian Web v1 (2021-2022) (December 2023)
  • MaCoCu Bosnian Web v1 (2021-2022) (November 2023) – this corpus became obsolete on May 2nd with the release of a new version that includes POS tagging and lemmatization
  • MaCoCu Croatian Web v2 (2021-2022) (November 2023)
  • MaCoCu Slovene Web v2 (2021-2022) (November 2023)
  • MaCoCu Ukrainian Web v1 (2021-2022) (November 2023)
  • MaCoCu Macedonian Web v2 (2021-2022) (November 2023)

MaCoCu corpora

Marta Bañón, Miquel Esplà-Gomis, Mikel L. Forcada, Cristian García-Romero, Taja Kuzman, Nikola Ljubešić, Rik van Noord, Leopoldo Pla Sempere, Gema Ramírez-Sánchez, Peter Rupnik, Vít Suchomel, Antonio Toral, Tobias van der Werff, and Jaume Zaragoza. 2022. MaCoCu: Massive collection and curation of monolingual and bilingual data: focus on under-resourced languages. In Proceedings of the 23rd Annual Conference of the European Association for Machine Translation, pages 303–304, Ghent, Belgium. European Association for Machine Translation.
