The New Model Corpus is a domain corpus of about 100 million words, built from web data in 2008. For more information, see the attachments below.
Text types
Genres
Genre | # documents |
---|---|
blog | 13,957 |
news | 12,388 |
general | 10,216 |
business | 1,433 |
speech (subtitles) | 1,088 |
medical | 516 |
law | 451 |
fiction | 123 |
Web top-level domains
TLD | # documents |
---|---|
com | 15,954 |
uk | 12,077 |
org | 2,852 |
net | 944 |
edu | 379 |
gov | 237 |
ca | 154 |
us | 104 |
au | 94 |
ie | 92 |
info | 30 |
other | 116 |
unknown | 7,139 |
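
The two tables above can be cross-checked with a little arithmetic. The following is a minimal sketch in Python, not part of the corpus tooling: the counts are copied from the tables, while the names `genre_counts`, `tld_counts` and `shares` are illustrative assumptions. It sums each table and converts the counts to percentages of all documents.

```python
# Document counts copied from the Genres table above.
genre_counts = {
    "blog": 13957, "news": 12388, "general": 10216, "business": 1433,
    "speech (subtitles)": 1088, "medical": 516, "law": 451, "fiction": 123,
}

# Document counts copied from the Web top-level domains table above.
tld_counts = {
    "com": 15954, "uk": 12077, "org": 2852, "net": 944, "edu": 379,
    "gov": 237, "ca": 154, "us": 104, "au": 94, "ie": 92, "info": 30,
    "other": 116, "unknown": 7139,
}


def shares(counts):
    """Return each category's share of the total, in percent (one decimal)."""
    total = sum(counts.values())
    return {name: round(100 * n / total, 1) for name, n in counts.items()}


genre_total = sum(genre_counts.values())
tld_total = sum(tld_counts.values())
genre_shares = shares(genre_counts)
tld_shares = shares(tld_counts)

print(genre_total, tld_total)  # both tables sum to 40172 documents
for name, pct in genre_shares.items():
    print(f"{name}: {pct}%")
```

Both tables sum to the same number of documents (40,172), which suggests that every document carries exactly one genre label and one TLD label, with "unknown" covering documents whose domain could not be determined.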
Attachments
Further information about the New Model Corpus.