Check out our new 15-billion-word English corpus (enTenTen), composed of texts from the Web up to the end of 2015.
We used our newest cleaning method to filter out spam and advertisements. The texts were annotated with a newer version (2.1) of the TreeTagger tool, which provides more accurate tokenization.