A tagset is a list of part-of-speech tags, i.e. labels used to indicate the part of speech and often also other grammatical categories (case, tense etc.) of each token in a text corpus.
Penn Treebank tagset
The English Penn Treebank tagset is used with English corpora annotated by the TreeTagger tool, developed by Helmut Schmid in the TC project at the Institute for Computational Linguistics of the University of Stuttgart. This version of the tagset contains modifications developed by Sketch Engine (earlier version).
See a more recent version of this tagset.
The table shows the English Penn Treebank tagset with Sketch Engine modifications (earlier version).
Example: [tag="NNS"]
finds all plural nouns, e.g. people, years, when used in the CQL concordance search. Always use straight double quotation marks in CQL.
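Queries like the one above are often generated by scripts or pasted from word processors, which tend to replace straight quotation marks with curly ones. The following is a minimal Python sketch of two helpers for that situation. The function names `tag_query` and `straighten_quotes` are hypothetical and not part of Sketch Engine. The sketch assumes that CQL attribute values are treated as regular expressions, so a pattern such as `VV.*` would be expected to match every tag beginning with VV.

```python
# Hypothetical helpers for preparing CQL tag queries; not a Sketch Engine API.

# Map common curly double quotation marks to the straight one CQL expects.
_CURLY_TO_STRAIGHT = str.maketrans({
    "\u201c": '"',  # left double quotation mark
    "\u201d": '"',  # right double quotation mark
    "\u201e": '"',  # double low-9 quotation mark
})


def tag_query(tag_pattern: str) -> str:
    """Build a single-token CQL expression that matches on the tag attribute."""
    if '"' in tag_pattern:
        raise ValueError("tag pattern must not contain a double quotation mark")
    return f'[tag="{tag_pattern}"]'


def straighten_quotes(cql: str) -> str:
    """Replace curly double quotation marks in a pasted query with straight ones."""
    return cql.translate(_CURLY_TO_STRAIGHT)


print(tag_query("NNS"))                            # plural nouns
print(tag_query("VV.*"))                           # assumed: any tag starting with VV
print(straighten_quotes("[tag=\u201cNNS\u201d]"))  # repairs a pasted query
```

The resulting strings can then be pasted into the CQL concordance search.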
POS Tag | Description | Example |
CC | coordinating conjunction | and |
CD | cardinal number | 1, third |
DT | determiner | the |
EX | existential there | there is |
FW | foreign word | les |
IN | preposition, subordinating conjunction | in, of, like |
IN/that | that as subordinator | that |
JJ | adjective | green |
JJR | adjective, comparative | greener |
JJS | adjective, superlative | greenest |
LS | list marker | 1) |
MD | modal | could, will |
NN | noun, singular or mass | table |
NNS | noun plural | tables |
NP | proper noun, singular | John |
NPS | proper noun, plural | Vikings |
PDT | predeterminer | both the boys |
POS | possessive ending | friend’s |
PP | personal pronoun | I, he, it |
PPZ | possessive pronoun | my, his |
RB | adverb | however, usually, naturally, here, good |
RBR | adverb, comparative | better |
RBS | adverb, superlative | best |
RP | particle | give up |
SENT | Sentence-break punctuation | . ! ? |
SYM | Symbol | / [ = * |
TO | infinitive ‘to’ | to go |
UH | interjection | uhhuhhuhh |
VB | verb be, base form | be |
VBD | verb be, past tense | was, were |
VBG | verb be, gerund/present participle | being |
VBN | verb be, past participle | been |
VBP | verb be, sing. present, non-3rd person | am, are |
VBZ | verb be, 3rd person sing. present | is |
VH | verb have, base form | have |
VHD | verb have, past tense | had |
VHG | verb have, gerund/present participle | having |
VHN | verb have, past participle | had |
VHP | verb have, sing. present, non-3rd person | have |
VHZ | verb have, 3rd person sing. present | has |
VV | verb, base form | take |
VVD | verb, past tense | took |
VVG | verb, gerund/present participle | taking |
VVN | verb, past participle | taken |
VVP | verb, sing. present, non-3rd person | take |
VVZ | verb, 3rd person sing. present | takes |
WDT | wh-determiner | which |
WP | wh-pronoun | who, what |
WP$ | possessive wh-pronoun | whose |
WRB | wh-adverb | where, when |
# | # | # |
$ | $ | $ |
“ | Quotation marks | ‘ “ |
`` | Opening quotation marks | ‘ “ |
( | Opening brackets | ( { |
) | Closing brackets | ) } |
, | Comma | , |
: | Punctuation | – ; : — … |
Main differences from the default Penn tagset
In TreeTagger
- Distinguishes be (VB) and have (VH) from other (non-modal) verbs (VV)
- For proper nouns, NNP and NNPS have become NP and NPS
- SENT for end-of-sentence punctuation (other punctuation tags may also differ)
In TreeTagger tool + Sketch Engine modifications
- the word ‘to’ is tagged IN when used as a preposition and TO when used as an infinitive marker
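The TreeTagger differences listed above can be sketched as a tag conversion. The following Python function is an illustration only, not the method used by TreeTagger or Sketch Engine. The name `penn_to_treetagger` is hypothetical, and the function assumes the lemma of each token is already known. It covers only the three TreeTagger points above: the be/have/other verb split, the proper noun tags, and SENT, taking `.` as the default Penn tag for sentence-final punctuation. It does not handle the ‘to’ distinction, which depends on context, or any other punctuation tags.

```python
# Illustrative sketch: map default Penn tags to the TreeTagger variant.
# Only the differences listed in this section are handled.

PROPER_NOUNS = {"NNP": "NP", "NNPS": "NPS"}
PENN_VERB_TAGS = {"VB", "VBD", "VBG", "VBN", "VBP", "VBZ"}


def penn_to_treetagger(tag: str, lemma: str) -> str:
    """Convert one default Penn tag, given the token's lemma."""
    if tag in PROPER_NOUNS:
        return PROPER_NOUNS[tag]
    if tag == ".":
        # Sentence-final punctuation becomes SENT.
        return "SENT"
    if tag in PENN_VERB_TAGS:
        suffix = tag[2:]  # "", "D", "G", "N", "P" or "Z"
        if lemma == "be":
            return "VB" + suffix
        if lemma == "have":
            return "VH" + suffix
        return "VV" + suffix
    # Everything else, including modals (MD), is left unchanged here.
    return tag


tagged = [("John", "John", "NNP"), ("has", "have", "VBZ"),
          ("taken", "take", "VBN"), ("it", "it", "PRP"), (".", ".", ".")]
for word, lemma, tag in tagged:
    print(word, penn_to_treetagger(tag, lemma))
```

Keeping the suffix and swapping only the two-letter prefix reflects how the verb rows of the table are organised: VB, VH and VV share the same set of form endings.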
Bibliography
M. Marcus, B. Santorini and M.A. Marcinkiewicz (1993). Building a large annotated corpus of English: The Penn Treebank. In Computational Linguistics, volume 19, number 2, pp. 313–330.