ukWaC – British Web corpus from the .uk domain
The British Web (ukWaC) is an English corpus collected from the .uk domain using medium-frequency words from the British National Corpus as seed words. Together, these two facts support the claim that it is a corpus of mainly British English, although other varieties are likely to be included wherever they appeared on a .uk domain.
The corpus was prepared by Adriano Ferraresi, and the word sketches, which enable users to explore the grammatical relations of words, were prepared by David Tugwell. The full preparation of the corpus is described in Introducing and evaluating ukWaC, a very large web-derived corpus of English (LREC conference, 2008; crawled from Webarchive).
Sketch Engine provides access to the version of ukWaC tagged with SuperSenseTagger (sst-light) described in Ciaramita and Altun (2006).
Part-of-speech tagset
The corpus was part-of-speech tagged and lemmatized using TreeTagger, a leading part-of-speech tagger that has been trained for a number of languages. The tagging uses the Penn Treebank tagset.
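As a rough illustration of what tagged and lemmatized text looks like, the sketch below reads a three-column, tab-separated layout (word, tag, lemma) and counts the tags. The sample sentence, the function names and the exact tag labels are assumptions made for illustration: the tags follow the standard Penn Treebank inventory, and the labels actually found in the corpus may differ.

```python
from collections import Counter

# Hypothetical sample, not taken from ukWaC: one token per line,
# with word form, part-of-speech tag and lemma separated by tabs.
SAMPLE = (
    "The\tDT\tthe\n"
    "corpora\tNNS\tcorpus\n"
    "were\tVBD\tbe\n"
    "crawled\tVBN\tcrawl\n"
)


def parse_tagged(text):
    """Return a list of (word, tag, lemma) tuples from tab-separated text."""
    rows = []
    for line in text.splitlines():
        parts = line.split("\t")
        if len(parts) != 3:
            continue  # skip blank lines or lines that are not token rows
        rows.append(tuple(parts))
    return rows


def tag_counts(rows):
    """Count how often each part-of-speech tag occurs."""
    return Counter(tag for _, tag, _ in rows)


rows = parse_tagged(SAMPLE)
counts = tag_counts(rows)
lemmas = [lemma for _, _, lemma in rows]
```

Here `counts` maps each tag to its frequency and `lemmas` holds the base form of every token, which is the kind of information that word lists organized by part of speech are built on.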
A complete set of tools is available for working with this British Web 2007 corpus. The tools generate:
- word sketch – English collocations categorized by grammatical relations
- thesaurus – synonyms and similar words for every word
- keywords – terminology extraction of one-word and multi-word units
- word lists – lists of English nouns, verbs, adjectives etc. organized by frequency
- n-grams – frequency list of multi-word units
- concordance – examples in context
- text type analysis – statistics of metadata in the corpus
Bibliography
BARONI, Marco, et al. The WaCky wide web: a collection of very large linguistically processed web-crawled corpora. Language resources and evaluation, 2009, 43.3: 209-226.
FERRARESI, Adriano, et al. Introducing and evaluating ukWaC, a very large web-derived corpus of English [crawled from Webarchive]. In Proceedings of the 4th Web as Corpus Workshop (WAC-4): Can we beat Google? 2008, pp. 47–54.
CIARAMITA, Massimiliano; ALTUN, Yasemin. Broad-coverage sense disambiguation and information extraction with a supersense sequence tagger. In Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2006, pp. 594–602.