Afrikaans | Afrikaans Web 2024 (afTenTen24-stanza) | 141,774,410
Albanian | Albanian Web 2020 (sqTenTen20) | 528,084,150
Amharic | Amharic Web 2013-17 (amWaC17) | 25,975,846
Arabic | Arabic Web 2024 (arTenTen24) | 6,572,150,262
Armenian | Armenian Wikipedia corpus 2020 (hywiki20) | 51,349,694
Assamese | Assamese Wikipedia 2023 (asWiki23) | 2,581,684
Azerbaijani | Turkic web – Azerbaijani | 94,267,206
Bashkir | Bashkir Drama Corpus | 18,723
Basque | Basque Web (BasqueWaC v2) | 99,719,584
Belarusian | Belarusian Web 2016 (beTenTen16) | 63,327,264
Bengali | Bengali Web 2021 (bnTenTen21) | 470,732,738
Bosnian | MaCoCu Bosnian Web v1 (2021-2022) | 715,708,157
Breton | OpenSubtitles 2018 parallel – Breton | 85,503
Bulgarian | Bulgarian Web 2021 (bgTenTen21) | 4,695,125,771
Cantonese | Cantonese Web (CantoneseWaC) | 30,898,663
Catalan | Catalan Web 2014 (caTenTen14) | 182,608,420
Chinese Simplified | Chinese Web 2017 (zhTenTen17) Simplified | 13,531,331,169
Chinese Traditional | Chinese Web 2017 (zhTenTen17) Traditional | 2,400,405,372
Crimean Tatar | Crimean Tatar National Monolingual & Parallel Corpora, Crimean Tatar | 2,958,868
Croatian | Croatian Web (hrWaC 2.2, RFTagger) | 1,211,328,660
Czech | Czech Web 2023 (csTenTen23) | 4,456,427,977
Danish | Danish Web 2020 (daTenTen20) | 3,480,275,804
Dutch | Dutch Web 2020 (nlTenTen20) | 5,890,009,964
English | English Web 2021 (enTenTen21) | 52,268,286,493
Estonian | Estonian Web 2021 (etTenTen21) | 725,832,092
Filipino | Tagalog (Filipino) Web 2019 (tlTenTen19) | 198,303,250
Finnish | Finnish Web 2014 (fiTenTen14) | 1,404,083,812
French | French Web 2023 (frTenTen23) | 23,874,070,858
Frisian | Western Frisian Web 2013 (FrisianWaC) | 3,116,119
Georgian | Georgian Web 2013 (kaWaC) | 50,713,604
German | German Web 2020 (deTenTen20) | 17,512,733,172
Greek | Greek Web 2019 (elTenTen19) | 2,342,091,029
Gujarati | Gujarati Web 2021 (guTenTen21) | 88,574,710
Hausa (Boko) | Hausa Web 2015 (hausaWaC15) | 5,304,300
Hebrew | Hebrew Web 2021 (heTenTen21) | 2,775,686,699
Hindi | Hindi Web 2021 (hiTenTen21) | 792,395,313
Hungarian | Hungarian Web 2023 (huTenTen23) | 3,494,350,960
Icelandic | Icelandic Web 2020 (isTenTen20) | 518,620,759
Igbo | Igbo Web 2015 (IgboWaC15) | 331,042
Indonesian | Indonesian Web 2024 (idTenTen24) | 7,108,841,939
Irish | Irish Web 2022 (gaTenTen22) | 125,040,541
Italian | Italian Web 2020 (itTenTen20) | 12,451,734,885
Japanese | Japanese Web 2011 sample (jaTenTen11, LUW) | 163,837,764
Kannada | Kannada Web 2012 (knWaC12) | 11,056,526
Kazakh | Turkic web – Kazakh | 139,417,763
Khmer | Khmer Web 2018 (kmTenTen18) | 16,500,379
Korean | Korean Web 2018 (koTenTen18) | 1,668,851,720
Kyrgyz | Turkic web – Kyrgyz | 19,369,507
Lao | Lao Web 2019 (loTenTen19) | 105,018,584
Latin | LatinISE historical corpus v2.2 | 11,036,900
Latvian | Latvian Web 2014 (lvTenTen14) | 530,367,474
Lithuanian | Lithuanian Web 2014 (ltTenTen14) | 778,151,979
Macedonian | MaCoCu Macedonian Web v2 (2021) | 512,171,886
Malay | Malay Web 2020 (msTenTen20) | 296,419,465
Malayalam | Malayalam Web (malayalamWaC) | 15,950,663
Maldivian | Maldivian Wikipedia corpus 2019 (dvwiki) | 548,211
Maltese | Maltese MLRS Corpus | 110,714,844
Maori | Maori Web 2013 and 2020 (miTenTen20) | 11,814,825
Nepali | Nepali National Corpus | 13,440,835
Norwegian | Norwegian Web 2017 (noTenTen17, Bokmål) | 2,461,704,417
Norwegian Bokmål | Norwegian Web 2017 (noTenTen17, Bokmål) | 2,461,704,417
Norwegian Nynorsk | Norwegian Web 2017 (noTenTen17, Nynorsk) | 169,145,386
Oromo | Oromo Web 2016 (orWaC16) | 4,249,953
Persian | TalkBank Persian (blog posts) | 269,753,238
Polish | Polish Web 2019 (plTenTen19) | 3,994,024,317
Portuguese | Portuguese Web 2023 (ptTenTen23) | 16,976,742,883
Punjabi (Gurmukhi) | Western Punjabi Web 2017 in Shahmukhi script (pnbTenTen17) | 2,806,904
Romanian | Romanian Web 2021 (roTenTen21) | 2,763,173,824
Russian | Russian Web 2017 (ruTenTen17) | 9,034,837,939
Samoan | Samoan Web (SamoanWac1) | 3,115,385
Scottish Gaelic | Scottish Gaelic Wiki 2015 (gdWiki) | 980,026
Serbian | Serbian Web (srWaC 1.2 processed by Hunpos) | 477,724,164
Serbian (Latin) | Serbian Web (srWaC 1.2 processed by RFTagger v1) | 441,888,202
Setswana | Setswana/Tswana Web (SetswanaWaC v2) | 11,496,687
Sinhalese | OpenSubtitles 2018 parallel – Sinhalese | 3,430,727
Slovak | Slovak Web 2023 (skTenTen23) | 898,031,101
Slovenian | Slovenian Web 2015 (slTenTen15, TreeTagger v2) | 829,544,337
Somali | Somali Web 2016 (soWaC16) | 71,871,585
Spanish | Spanish Web 2023 (esTenTen23) | 28,652,392,686
Swahili | Swahili Web 2014 (swWaC) | 17,882,483
Swedish | Swedish Web 2014 (svTenTen14) | 3,401,035,817
Tagalog | Tagalog (Filipino) Web 2019 (tlTenTen19) | 198,303,250
Tajik | Tajik Web (TajikWaC) | 93,151,897
Tamil | Tamil Web 2021 (taTenTen21) | 823,837,031
Tatar | Tatar Mixed Corpus | 102,779,803
Telugu | Telugu Web (TeluguWaC) | 3,691,203
Thai | Thai Web 2018 (thTenTen18) | 640,530,227
Tigrinya | Tigrinya Web 2016 (tiWaC16) | 2,087,613
Turkish | Turkish Web 2020 (trTenTen20) | 4,980,168,485
Turkmen | Turkic web – Turkmen | 2,105,359
Ukrainian | Ukrainian Web 2022 (ukTenTen22) | 7,594,784,148
Urdu | Urdu Web (UrduWaC) | 53,269,273
Uzbek | Turkic web – Uzbek | 18,720,334
Vietnamese | Vietnamese Web 2017 (viTenTen17) | 6,056,899,600
Welsh | Welsh Web 2013 (WelshWaC) | 12,458,397
Yoruba | Yoruba Web 2015 (YorubaWaC15) | 2,816,965