CZES is a Czech corpus consisting of newspaper and magazine articles from the years 1995–1998 and 2002.
- The data was downloaded from trafika.cz and newspapers’ home sites: Lidové noviny, Mladá fronta, Českomoravský profit, Právo and others.
- Some data (articles, books) was taken from many small websites (students’ work).
- Another part was obtained from CD archives of PC magazines.
- Some parts taken from newspapers’ home sites were added around 2002 (students’ work).
Tagging
CZES was tagged using Ajka tags.
Changelog
v2.0 (26 October 2010)
- removed duplicate and near-duplicate documents
v2.1 (2015)
- retokenised and retagged