The corpus was prepared using the Corpus Factory method. Full details are described in Kilgarriff et al. (LREC 2010).
An important aim in creating this corpus was to make it comparable to the PAROLE corpus of the Swedish Department of Gothenburg University. To achieve that, the corpus was gathered by Håkan Jansson of the Swedish Department of Gothenburg University in two steps: first, collecting the URLs/domains of Swedish newspapers, magazines, political parties, ministries, public agencies and NGOs; second, searching these URLs/domains with the help of WebBootCaT. This means that most of the content of the corpus has passed through some sort of editing process. There are, however, some portions of text that come from magazine “forums”, and the language there is likely to be rather colloquial.
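The domain restriction in the second step can be illustrated with a minimal sketch. This is not how WebBootCaT itself is implemented; it only shows the idea of keeping pages whose host belongs to a previously collected seed list. The domain names and the function names below are hypothetical placeholders, since the actual seed list is not given here.

```python
from urllib.parse import urlparse

# Hypothetical seed domains standing in for the list collected in step one.
SEED_DOMAINS = {"example-newspaper.se", "example-agency.se"}


def in_seed_domains(url, seeds=SEED_DOMAINS):
    """Return True if the URL's host is a seed domain or one of its subdomains."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in seeds)


def filter_candidates(urls, seeds=SEED_DOMAINS):
    """Keep only the candidate URLs that fall inside the seed domains."""
    return [u for u in urls if in_seed_domains(u, seeds)]
```

Matching on the full host or on a suffix preceded by a dot keeps subdomains such as "www." while rejecting unrelated hosts that merely end in the same characters.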
It was part-of-speech tagged by Dimitrios Kokkinakis of the Swedish Department of Gothenburg University using TnT, the statistical part-of-speech tagger developed by Thorsten Brants of the Department of Computational Linguistics, Universität des Saarlandes, and trained on the Swedish SUC corpus. It was then lemmatized by Jan Pomikálek of the Faculty of Informatics, Masaryk University, Brno, using the LEMPAS lemmatizer developed by Silvie Cinková and Jan Pomikálek.
The Sketch Grammar was developed by Silvie Cinková, Charles University, Prague.