The Web as Corpus (WaC) corpora were built using the Corpus Factory method, which is described in full in the paper listed in the bibliography below. The corpora are listed by language:
A
Amharic (AmWaC web corpus), Arabic (arWaC web corpus)
B
Basque (euWaC), Bengali (bnWaC), Bosnian (bsWaC)
C
Cantonese (yueWaC), Chinese (ChineseTaiwanWaC), Croatian (hrWaC)
D
Danish (dkWaC), Dutch (Dutch web corpus)
E
English (pukWaC, ukWaC – British English corpus, ukWaCsst)
F
Filipino (filWaC), French (frWaC), Frisian (fyWaC)
G
Georgian (kaWaC), German (deWaC, Parsed deWaC (sdeWaC)), Greek (gkWaC), Gujarati (guWaC)
H
Hausa (haWaC), Hebrew (hebWaC), Hindi (hindiWaC)
I
Igbo (igWaC), Indonesian (idWaC), Italian (itWaC)
J
Japanese (jpWaC)
K
Kannada (knWaC)
L
Latvian (lvWaC – Latvian web corpus), Lithuanian (ltWaC – Lithuanian web corpus)
M
Malayalam (mlWaC web corpus), Malaysian (zsmWaC – Malaysian web corpus), Maltese (mtWaC – Maltese WaC corpus), Maori (miWaC – Maori web corpus), Mongolian (mnWaC – Mongolian web corpus)
N
Nepali (neWaC – Nepali web corpus)
O
Oromo (orWaC – Oromo web corpus)
P
Polish (plWaC – Polish Web corpus)
R
Russian (ruWac – Russian Web Corpus)
S
Samoan (smWaC – Samoan web corpus), Serbian (srWaC – Serbian Web corpus), Setswana (tnWaC – Setswana web corpus), Slovenian (slWaC2.1), Somali (soWaC – Somali web corpus), Spanish (esWaC – Spanish web corpus), Swahili (swWaC), Swedish (svWaC)
T
Tamil (taWaC), Tatar (Tatar Sample), Telugu (teWaC), Thai (thWaC), Tigrinya (tiWaC), Turkish (trWaC – Turkish Web Corpus)
U
Urdu (urWaC web corpus)
V
Vietnamese (viWaC)
W
Welsh (cyWaC)
Y
Yoruba (yoWaC)
Bibliography
Adam Kilgarriff, Siva Reddy, Jan Pomikálek and Avinesh PVS (2010). A corpus factory for many languages. In LREC workshop on Web Services and Processing Pipelines, Malta, May 2010.