This page describes a corpus-building method that Sketch Engine no longer uses. Sketch Engine still contains older corpora built with this method, mainly the WaC corpora.
Nowadays, Sketch Engine builds corpora using the method used for TenTen corpora and described here.
Corpus Factory is a method for developing large general-language corpora that can be applied to many languages.
Corpus Factory performs the following steps to collect a corpus for a language:
- Download a Wikipedia dump and parse it to obtain a Wiki corpus
- Generate a frequency list for the language from the Wiki corpus
- Build queries from the mid-frequency words in the frequency list
- Send the queries to Bing, Google or Yahoo, and download the search hit pages
- Clean the corpus
  - Remove boilerplate (HTML tags and advertisements)
  - Using the Wiki frequency list, compute the ratio of frequent words to non-frequent words on each page and use it to decide whether the page is continuous (i.e. meaningful) text
  - Remove duplicates
- Tokenise and (if tools are available) lemmatise and part-of-speech tag
- Load into our corpus query tool, Sketch Engine
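The query-building and cleaning steps above can be illustrated with a minimal Python sketch. It is not the Corpus Factory implementation: all function names, the rank band used for "mid-frequency" words, the number of words per query, the ratio threshold and the exact-match duplicate check are assumptions chosen for illustration. Downloading, boilerplate removal and tokenisation are left out, so the functions take lists of tokens or page texts as input.

```python
import hashlib
import random
from collections import Counter


def frequency_list(tokens):
    """Return (word, count) pairs from the Wiki corpus, most frequent first."""
    return Counter(t.lower() for t in tokens).most_common()


def mid_frequency_words(freq_list, skip_top=1000, take=5000):
    """Pick a mid-frequency band by rank (the band limits are assumptions)."""
    return [word for word, _ in freq_list[skip_top:skip_top + take]]


def build_queries(seed_words, n_queries, words_per_query=3, seed=0):
    """Combine randomly chosen seed words into search-engine query strings."""
    rng = random.Random(seed)
    return [" ".join(rng.sample(seed_words, words_per_query))
            for _ in range(n_queries)]


def frequent_to_other_ratio(page_tokens, frequent_words):
    """Ratio of frequent to non-frequent tokens on one page."""
    frequent = sum(1 for t in page_tokens if t.lower() in frequent_words)
    other = len(page_tokens) - frequent
    if other == 0:
        # No non-frequent tokens: avoid dividing by zero.
        return float("inf") if frequent else 0.0
    return frequent / other


def looks_like_running_text(page_tokens, frequent_words, min_ratio=0.5):
    """Keep a page only if its ratio reaches a (hypothetical) threshold."""
    return frequent_to_other_ratio(page_tokens, frequent_words) >= min_ratio


def drop_exact_duplicates(pages):
    """Remove pages whose case- and whitespace-normalised text was already seen.

    This catches exact duplicates only; near-duplicate detection is out of
    scope for this sketch.
    """
    seen, unique = set(), []
    for text in pages:
        normalised = " ".join(text.lower().split())
        key = hashlib.sha1(normalised.encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(text)
    return unique
```

In this sketch, a page made up mostly of navigation labels or word lists has few frequent function words and so falls below the threshold, while ordinary prose passes. A suitable threshold would have to be tuned per language.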
Bibliography
Adam Kilgarriff, Siva Reddy, Jan Pomikálek and Avinesh PVS (2010). A corpus factory for many languages. In LREC workshop on Web Services and Processing Pipelines, Malta, May 2010.