A tagset is a list of part-of-speech tags (POS tags for short), i.e. labels used to indicate the part of speech and sometimes also other grammatical categories (case, tense etc.) of each token in a text corpus.
The Polish NKJP part-of-speech tagset is used in Polish corpora annotated with the grammatical categories of the National Corpus of Polish (NKJP).
The tagset comprises 36 grammatical classes, which correspond approximately to the traditional parts of speech, and 13 grammatical categories with their possible values. Each grammatical class takes its own set of grammatical categories, each of which may be obligatory or optional for that class. In an actual tag, the grammatical class and the values of its categories are separated by colons.
An example of a tag query in the CQL concordance search box: [tag="subst:sg:gen:f"]
finds all feminine genitive singular nouns, e.g. pracy, strony (note: please make sure that you use straight double quotation marks, not typographic ones). For the grammatical class ‘noun’ (subst), the tag specifies the grammatical categories number (sg), case (gen), and gender (f).
It is also possible to search for individual grammatical categories, such as case or gender, using the same syntax. For example, [case = "gen"]
searches for all words in the genitive, and [degree = "sup"]
finds words in the superlative (it matches only words whose part of speech includes that category). The attribute names are the names of the grammatical categories written in lowercase and without any punctuation (i.e. number, case, gender, person, degree, aspect, negation, accentability, postprepositionality, accommodability, agglutination, vocalicity, fullstoppedness).
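The query syntax described above can be sketched in Python as a small string builder. This is only an illustration: the function name `cql_token` is hypothetical, and the sketch assumes nothing beyond the `[attribute="value"]` form and the attribute names given in the text.

```python
# Attribute names listed in the text above (lowercase, no punctuation).
CATEGORY_ATTRIBUTES = {
    "number", "case", "gender", "person", "degree", "aspect", "negation",
    "accentability", "postprepositionality", "accommodability",
    "agglutination", "vocalicity", "fullstoppedness",
}


def cql_token(attribute: str, value: str) -> str:
    """Return a one-token CQL query such as [case="gen"] (hypothetical helper)."""
    if attribute != "tag" and attribute not in CATEGORY_ATTRIBUTES:
        raise ValueError(f"unknown attribute: {attribute!r}")
    if '"' in value:
        raise ValueError("value must not contain a double quotation mark")
    # The search box requires straight double quotation marks around the value.
    return f'[{attribute}="{value}"]'


print(cql_token("tag", "subst:sg:gen:f"))  # [tag="subst:sg:gen:f"]
print(cql_token("case", "gen"))            # [case="gen"]
```

Generating the query from code avoids typographic quotation marks that a word processor may insert when a query is typed by hand.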
Tagset
Elementary part-of-speech tagset (legend)
Part of speech | Tag pattern |
---|---|
noun | subst.* |
adjective | adj.* |
pronoun | ppron.*|siebie.* |
numeral | num.* |
verb | fin.*|bedzie.*|aglt.*|praet.*|impt.*|imps.*|inf.*|pcon.*|pant.*|ger.*|pact.*|ppas.* |
adverb | adv.* |
preposition | prep.* |
conjunction | conj.*|comp.* |
particle-adverb | qub.* |
interjection | interj |
punctuation | interp.* |
foreign word | xxx |
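The legend can also be applied outside the search box, for instance to collapse full tags into elementary parts of speech. The following Python sketch assumes that each pattern must match the whole tag (hence `re.fullmatch`); the function name `elementary_pos` is hypothetical, and the noun pattern is written as `subst.*` so that it matches complete noun tags.

```python
import re

# Patterns taken from the legend table, in the same order.
ELEMENTARY_POS = [
    ("noun", r"subst.*"),
    ("adjective", r"adj.*"),
    ("pronoun", r"ppron.*|siebie.*"),
    ("numeral", r"num.*"),
    ("verb", r"fin.*|bedzie.*|aglt.*|praet.*|impt.*|imps.*|inf.*"
             r"|pcon.*|pant.*|ger.*|pact.*|ppas.*"),
    ("adverb", r"adv.*"),
    ("preposition", r"prep.*"),
    ("conjunction", r"conj.*|comp.*"),
    ("particle-adverb", r"qub.*"),
    ("interjection", r"interj"),
    ("punctuation", r"interp.*"),
    ("foreign word", r"xxx"),
]


def elementary_pos(tag):
    """Return the elementary part of speech for a full tag, or None."""
    for name, pattern in ELEMENTARY_POS:
        # Assumption: the pattern has to cover the entire tag.
        if re.fullmatch(pattern, tag):
            return name
    return None  # classes not covered by the legend, e.g. depr or winien


print(elementary_pos("subst:sg:gen:f"))    # noun
print(elementary_pos("praet:sg:m1:perf"))  # verb
```

Tags of classes that the legend does not list fall through to `None` rather than being guessed.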
Grammatical classes
Noun | noun | subst | subst:number:case:gender |
Noun | depreciative form | depr | depr:number:case:gender |
Adjective | adjective | adj | adj:number:case:gender:degree |
Adjective | ad-adj. adjective | adja | adja |
Adjective | post-prep. adjective | adjp | adjp |
Adjective | predicative adjective | adjc | adjc |
Pronoun | non-3rd person pronoun | ppron12 | ppron12:number:case:gender:person:accentability |
Pronoun | 3rd-person pronoun | ppron3 | ppron3:number:case:gender:person:accentability:post-prepositionality |
Pronoun | pronoun siebie | siebie | siebie:case |
Numeral | main numeral | num | num:number:case:gender:accommodability |
Numeral | collective numeral | numcol | numcol:number:case:gender:accommodability |
Verb | non-past form | fin | fin:number:person:aspect |
Verb | future być | bedzie | bedzie:number:person:aspect |
Verb | agglutinate być | aglt | aglt:number:person:aspect:vocalicity |
Verb | l-participle | praet | praet:number:gender:aspect:agglutination |
Verb | imperative | impt | impt:number:person:aspect |
Verb | impersonal | imps | imps:aspect |
Verb | infinitive | inf | inf:aspect |
Verb | contemporary adv. participle | pcon | pcon:aspect |
Verb | anterior adv. participle | pant | pant:aspect |
Verb | gerund | ger | ger:number:case:gender:aspect:negation |
Verb | active adj. participle | pact | pact:number:case:gender:aspect:negation |
Verb | passive adj. participle | ppas | ppas:number:case:gender:aspect:negation |
Verb | winien-like verb | winien | winien:number:gender:aspect |
Adverb | adverb | adv | adv:degree |
Preposition | preposition | prep | prep:case |
Conjunction | coordinating conjunction | conj | conj |
Conjunction | subordinating conjunction | comp | comp |
Particle-adverb | particle-adverb | qub | qub |
Interjection | interjection | interj | interj |
Others
Abbreviation | brev | brev:fullstoppedness |
Bound word | burk | burk |
Punctuation | interp | interp |
Alien | xxx | xxx |
Unknown form | ign | ign |
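The tag structures in the tables above can be turned into regular expressions for the tag attribute, fixing some categories and leaving the others open. The Python sketch below is a rough illustration under two assumptions: the helper name `tag_pattern` is hypothetical, and every category in the structure is assumed to be present in the tag, so tags that omit an optional category would not be matched.

```python
import re

# A few structures copied from the tables above: class -> ordered categories.
TAG_STRUCTURES = {
    "subst": ["number", "case", "gender"],
    "adj": ["number", "case", "gender", "degree"],
    "fin": ["number", "person", "aspect"],
    "prep": ["case"],
}


def tag_pattern(grammatical_class, fixed):
    """Build a tag regex; categories missing from `fixed` match any value."""
    categories = TAG_STRUCTURES[grammatical_class]
    unknown = set(fixed) - set(categories)
    if unknown:
        raise ValueError(f"{grammatical_class} has no category {sorted(unknown)}")
    parts = [grammatical_class]
    for category in categories:
        # "[^:]+" stands for any single value between two colons.
        parts.append(fixed.get(category, "[^:]+"))
    return ":".join(parts)


pattern = tag_pattern("subst", {"case": "gen"})
print(pattern)                                        # subst:[^:]+:gen:[^:]+
print(bool(re.fullmatch(pattern, "subst:sg:gen:f")))  # True
print(bool(re.fullmatch(pattern, "subst:sg:nom:f")))  # False
```

A pattern built this way could be placed inside a tag query, e.g. [tag="subst:[^:]+:gen:[^:]+"], to find genitive nouns of any number and gender.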
Grammatical categories and their possible values
Number
(for nouns, adjectives, pronouns, numerals, some verbs)
singular | sg | subst:sg:nom:m3 | chrzest |
plural | pl | subst:pl:nom:m3 | zbory |
Case
(for nouns, adjectives, pronouns, numerals, prepositions)
nominative | nom | subst:sg:nom:f, subst:sg:nom:m3 | praca, rozkład |
genitive | gen | subst:sg:gen:f, subst:sg:gen:m3 | pracy, rozkładu |
dative | dat | subst:sg:dat:f, subst:sg:dat:m3 | pracy, rozkładowi |
accusative | acc | subst:sg:acc:f, subst:sg:acc:m3 | pracę, rozkład |
vocative | voc | subst:sg:voc:f | praco |
locative | loc | subst:sg:loc:f, subst:sg:loc:m3 | pracy, rozkładzie |
instrumental | inst | subst:sg:inst:f, subst:sg:inst:m3 | pracą, rozkładem |
Gender
human masculine (virile) | m1 | papież, kto, wujostwo |
animate masculine | m2 | baranek, walc, babsztyl |
inanimate masculine | m3 | stół |
feminine | f | stuła |
neuter | n | dziecko, okno, co, skrzypce, spodnie |
Person
first | pri | bredzę, my |
second | sec | bredzisz, wy |
third | ter | bredzi, oni |
Degree
positive | pos | cudny |
comparative | com | cudniejszy |
superlative | sup | najcudniejszy |
Aspect
imperfective | imperf | iść |
perfective | perf | zajść |
Negation
affirmative | aff | pisanie, czytanego |
negative | neg | niepisanie, nieczytanego |
Accentability
accented (strong) | akc | jego, niego, tobie |
non-accented (weak) | nakc | go, -ń, ci |
Post-prepositionality
post-prepositional | praep | niego, -ń |
non-post-prepositional | npraep | jego, go |
Accommodability
agreeing | congr | dwaj, pięcioma |
governing | rec | dwóch, dwu, pięciorgiem |
Agglutination
non-agglutinative | nagl | niósł |
agglutinative | agl | niosł- |
Vocalicity
vocalic | wok | -em |
non-vocalic | nwok | -m |
Fullstoppedness
with full stop | pun | tzn |
without full stop | npun | wg |
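Because no value abbreviation in the tables above is shared by two categories, a tag can be decoded by looking up each value rather than relying on its position. The Python sketch below illustrates this; the function name `decode_tag` is hypothetical, the category names follow the attribute names used in queries, and only simple colon-separated tags are handled.

```python
# Values copied from the tables above, keyed by category name.
CATEGORY_VALUES = {
    "number": ["sg", "pl"],
    "case": ["nom", "gen", "dat", "acc", "voc", "loc", "inst"],
    "gender": ["m1", "m2", "m3", "f", "n"],
    "person": ["pri", "sec", "ter"],
    "degree": ["pos", "com", "sup"],
    "aspect": ["imperf", "perf"],
    "negation": ["aff", "neg"],
    "accentability": ["akc", "nakc"],
    "postprepositionality": ["praep", "npraep"],
    "accommodability": ["congr", "rec"],
    "agglutination": ["nagl", "agl"],
    "vocalicity": ["wok", "nwok"],
    "fullstoppedness": ["pun", "npun"],
}

# Reverse index: value abbreviation -> category name.
VALUE_TO_CATEGORY = {
    value: category
    for category, values in CATEGORY_VALUES.items()
    for value in values
}


def decode_tag(tag):
    """Split a tag into its grammatical class and named category values."""
    grammatical_class, *values = tag.split(":")
    decoded = {"class": grammatical_class}
    for value in values:
        if value not in VALUE_TO_CATEGORY:
            raise ValueError(f"unknown value {value!r} in tag {tag!r}")
        decoded[VALUE_TO_CATEGORY[value]] = value
    return decoded


print(decode_tag("subst:sg:gen:f"))
# {'class': 'subst', 'number': 'sg', 'case': 'gen', 'gender': 'f'}
```

Looking values up by name means the decoder still works when an optional category is absent from a tag.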
Source: A comparison of two morphosyntactic tagsets of Polish
Reference
PRZEPIÓRKOWSKI, Adam. A comparison of two morphosyntactic tagsets of Polish. In: Representing Semantics in Digital Lexicography: Proceedings of MONDILEX Fourth Open Workshop. Warsaw, 2009. pp. 138–144.