euWaC: Corpus of the Basque Web
The Basque Web Corpus (euWaC) is a Basque language corpus made up of texts collected from the Internet. The corpus was prepared by Dr. Igor Leturia.
Part-of-speech tagset
The Basque Web corpus is lemmatized and part-of-speech tagged with the following list of part-of-speech tags.
Tools to work with the Basque Web corpus
A complete set of Sketch Engine tools is available to work with this Basque Web corpus to generate:
- word sketch – Basque collocations categorized by grammatical relations
- thesaurus – synonyms and similar words for every word
- keywords – terminology extraction of one-word units
- word lists – lists of Basque nouns, verbs, adjectives etc. organized by frequency
- n-grams – frequency list of multi-word units
- concordance – examples in context
- text type analysis – statistics of metadata in the corpus
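To make three of the outputs listed above more concrete, here is a minimal Python sketch of a word list, an n-gram frequency list and a concordance computed over a toy token sequence. It is only an illustration of the general ideas, not Sketch Engine's implementation or API: the function names `ngram_frequencies` and `concordance`, the `window` parameter and the sample tokens are all assumptions made for this example, and the text is split on whitespace rather than lemmatized or tagged.

```python
from collections import Counter


def ngram_frequencies(tokens, n):
    """Count every run of n consecutive tokens (hypothetical helper)."""
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(grams)


def concordance(tokens, keyword, window=2):
    """Return (left context, keyword, right context) for each hit (hypothetical helper)."""
    hits = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            hits.append((left, tok, right))
    return hits


# Toy sample, split on whitespace; a real corpus would be tokenized and lemmatized.
tokens = "kaixo mundua kaixo lagun kaixo mundua".split()

word_list = Counter(tokens).most_common()      # words ordered by frequency
bigrams = ngram_frequencies(tokens, 2)         # frequency list of 2-word units
lines = concordance(tokens, "lagun", window=1) # keyword with one word of context per side
```

On a real corpus the same counts would normally be taken over lemmas or part-of-speech-filtered tokens, which is what makes a lemmatized and tagged corpus useful for these tools.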
Changelog
version 3 (March 2017)
- corpus tagged by the RFTagger tool with the NKJP tagset
- created lempos
version 2 (1 July 2013)
- corpus tagged by the WCRFT tagger
version 1 (23 July 2012)
- initial version – 7.7 billion words, untagged
a sample for Cesar (25 October 2012)
- 640 million words sample
- tagged by WCRFT (source: Wayback Machine) with the NKJP tagset
Bibliography
TenTen corpora
Jakubíček, M., Kilgarriff, A., Kovář, V., Rychlý, P., & Suchomel, V. (2013, July). The TenTen corpus family. In 7th International Corpus Linguistics Conference CL (pp. 125-127).
Suchomel, V., & Pomikálek, J. (2012). Efficient web crawling for large text corpora. In Proceedings of the seventh Web as Corpus Workshop (WAC7) (pp. 39-43).
Word Sketches
Radziszewski, A., Kilgarriff, A., & Lew, R. (2011). Polish word sketches.