A tagset is a list of part-of-speech tags (POS tags for short), i.e. labels used to indicate the part of speech and sometimes also other grammatical categories (case, tense etc.) of each token in a text corpus.
However, some languages still have no part-of-speech tagging tool, or cannot be tagged with an existing tagger.
For such cases, we developed a simple part-of-speech notation called shallow tagging, which is based on regular expressions and the frequency properties of tokens. Once a corpus is tagged with this simple tagset, it can be processed with the Universal Sketch Grammar prepared by Siva Reddy, Adam Kilgarriff and Pavel Rychlý.
Tagset legend for shallow tagging
Example of using a tag in the CQL concordance search box: the query [tag="FREQ"]
finds the 200 most frequent words in the language.
FREQ | frequent words (the 200 most frequent words in the language) |
CONTENT | other words |
CRD | numerals |
PUN | punctuation |
OTHER | other |
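
The legend above can be illustrated with a minimal sketch of a shallow tagger in Python. This is not the actual implementation: the regular expressions, the function names `frequent_words` and `shallow_tag`, and the order in which the rules are tried (punctuation, then numerals, then words) are all assumptions made for illustration. Only the five tag names and the cut-off of 200 most frequent words come from the legend.

```python
import re
from collections import Counter

# Hypothetical patterns: the real regular expressions are not given in the text.
PUNCT_RE = re.compile(r"^[^\w\s]+$")         # only non-word, non-space characters
NUMBER_RE = re.compile(r"^\d+(?:[.,]\d+)*$")  # digits with optional . or , separators
WORD_RE = re.compile(r"^[^\W\d_]+(?:[-'][^\W\d_]+)*$")  # letters, optionally joined by - or '


def frequent_words(tokens, top_n=200):
    """Return the set of the top_n most frequent lowercased word tokens."""
    counts = Counter(t.lower() for t in tokens if WORD_RE.match(t))
    return {word for word, _ in counts.most_common(top_n)}


def shallow_tag(tokens, top_n=200):
    """Pair each token with one of FREQ, CONTENT, CRD, PUN or OTHER."""
    frequent = frequent_words(tokens, top_n)
    tagged = []
    for token in tokens:
        if PUNCT_RE.match(token):
            tag = "PUN"
        elif NUMBER_RE.match(token):
            tag = "CRD"
        elif WORD_RE.match(token):
            # Assumed rule: a word is FREQ if it is among the top_n, else CONTENT.
            tag = "FREQ" if token.lower() in frequent else "CONTENT"
        else:
            tag = "OTHER"  # mixed or unrecognised tokens such as "x2"
        tagged.append((token, tag))
    return tagged


# Tiny demonstration with top_n=1 so that only "the" counts as frequent.
print(shallow_tag(["the", "cat", "saw", "the", "dog", ",", "3", "x2"], top_n=1))
# [('the', 'FREQ'), ('cat', 'CONTENT'), ('saw', 'CONTENT'), ('the', 'FREQ'),
#  ('dog', 'CONTENT'), (',', 'PUN'), ('3', 'CRD'), ('x2', 'OTHER')]
```

In this sketch the frequency list is computed from the same tokens that are being tagged; for a real corpus, it would be computed once over the whole corpus with the default `top_n=200`.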