English TreeTagger PoS tagset with Sketch Engine modifications
A tagset is a list of part-of-speech tags (POS tags for short), i.e. labels used to indicate the part of speech and sometimes also other grammatical categories (case, tense, etc.) of each token in a text corpus.
English corpora annotated by the TreeTagger tool use the English TreeTagger part-of-speech tagset, which was developed by Helmut Schmid in the TC project at the Institute for Computational Linguistics of the University of Stuttgart and which contains modifications made by Sketch Engine (currently pipeline version 3).
It is therefore a modified version of the default TreeTagger tagset.
An example of a tag query in the CQL concordance search box: [tag="NNS"]
This finds all plural nouns, e.g. people, years. (Note: make sure to use straight double quotation marks.)
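When queries like this are generated by a script rather than typed by hand, the quotation-mark requirement can be enforced in code. The following is a minimal Python sketch under that assumption; the helper name `cql_tag` is hypothetical and not part of any Sketch Engine API, and it only builds the query string.

```python
# Hypothetical helper (not a Sketch Engine API): builds a CQL token
# expression that matches on the "tag" attribute.
CURLY_DOUBLE_QUOTES = "\u201c\u201d\u201e\u201f"  # typographic double quotes


def cql_tag(tag: str) -> str:
    """Return a CQL expression such as [tag="NNS"] with straight double quotes."""
    cleaned = tag.strip()
    # Drop any quotation marks pasted in with the tag, curly or straight.
    for quote in CURLY_DOUBLE_QUOTES + '"':
        cleaned = cleaned.replace(quote, "")
    if not cleaned:
        raise ValueError("tag must not be empty")
    return f'[tag="{cleaned}"]'


print(cql_tag("NNS"))                # plural nouns
print(cql_tag("\u201cNNS\u201d"))    # same tag pasted with curly quotes
```

Because CQL attribute values are regular expressions, the same helper can also produce broader queries, for example `cql_tag("VV.*")` for all lexical verb tags in the table below.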
Tagset
POS Tag | Description | Example |
(empty tag) | HTML and other entities enclosed in angle brackets | |
” or “ | single or double quotation marks | ” ‘ |
( | left brackets | ( [ { |
) | right brackets | ) ] } |
, | comma | , |
$ | currency symbols | $ £ € |
# | hash (number sign) | # |
: | dashes, ellipsis, underscore, (semi)colon | – … .. _ ; : |
CC | coordinating conjunction | and |
CD | cardinal number | 1, one |
CDZ | possessive numeral | one’s |
DT | determiner | the |
EX | existential there | there is |
FW | foreign word | d’oeuvre |
IN | preposition, subordinating conjunction | in, of, like |
IN/that | that as subordinator | that |
JJ | adjective | green |
JJR | adjective, comparative | greener |
JJS | adjective, superlative | greenest |
LS | list marker | 1) |
MD | modal (verbs) | could, will, should, would |
NN | noun, singular or mass | table |
NNS | noun plural | tables |
NNSZ | possessive noun plural | people’s, women’s |
NNZ | possessive noun, singular or mass | year’s, world’s |
NP | proper noun, singular | John |
NPS | proper noun, plural | Vikings |
NPSZ | possessive proper noun, plural | Boys’, Workers’ |
NPZ | possessive proper noun, singular | Britain’s, God’s |
PDT | predeterminer | both the boys |
PP | personal pronoun | I, he, it |
PPZ | possessive pronoun | my, his |
RB | adverb | however, usually, naturally, here, good |
RBR | adverb, comparative | better |
RBS | adverb, superlative | best |
RP | particle | give up |
SENT | Sentence-break punctuation | . ! ? |
SYM | Symbols (except for those listed above) | / = * |
TO | infinitive ‘to’ | to go |
UH | interjection | uhhuhhuhh |
VB | verb be, base form | be |
VBD | verb be, past tense | was, were |
VBG | verb be, gerund/present participle | being |
VBN | verb be, past participle | been |
VBP | verb be, present, non-3rd person | am, are |
VBZ | verb be, 3rd person sing. present | is |
VH | verb have, base form | have |
VHD | verb have, past tense | had |
VHG | verb have, gerund/present participle | having |
VHN | verb have, past participle | had |
VHP | verb have, present, non-3rd person | have |
VHZ | verb have, 3rd person sing. present | has |
VV | verb, base form | take |
VVD | verb, past tense | took |
VVG | verb, gerund/present participle | taking |
VVN | verb, past participle | taken |
VVP | verb, present, not 3rd person | take |
VVZ | verb, 3rd person sing. present | takes |
WDT | wh-determiner | which |
WP | wh-pronoun | who, what |
WPZ | possessive wh-pronoun | whose |
WRB | wh-adverb | where, when |
Z | possessive ending | ’s |
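The verb tags in the table follow a regular pattern: a two-letter prefix for the verb class (VB, VH, VV) plus an optional one-letter suffix for the form. The Python sketch below illustrates that pattern; the function name `describe_verb_tag` and the wording of the labels are illustrative assumptions, not part of TreeTagger or Sketch Engine.

```python
# Sketch of the naming pattern visible in the table above.
# Names and label wording here are illustrative assumptions.
VERB_PREFIXES = {"VB": "be", "VH": "have", "VV": "other verb"}
VERB_FORMS = {
    "": "base form",
    "D": "past tense",
    "G": "gerund/present participle",
    "N": "past participle",
    "P": "present, non-3rd person",
    "Z": "3rd person singular present",
}


def describe_verb_tag(tag):
    """Return (verb class, form) for a VB*/VH*/VV* tag, or None for other tags."""
    prefix, suffix = tag[:2], tag[2:]
    if prefix in VERB_PREFIXES and suffix in VERB_FORMS:
        return VERB_PREFIXES[prefix], VERB_FORMS[suffix]
    return None  # e.g. MD (modal) is not covered by this pattern


for tag in ("VBD", "VHZ", "VV", "MD"):
    print(tag, describe_verb_tag(tag))
```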
Main differences from the default Penn tagset
In the TreeTagger tagset
- Distinguishes be (VB) and have (VH) from other (non-modal) verbs (VV)
- For proper nouns, NNP and NNPS have become NP and NPS
- SENT for end-of-sentence punctuation (other punctuation tags may also differ)
In TreeTagger version 2 + Sketch Engine
- the token “to” is tagged IN when it is a preposition and TO only when it is an infinitive marker
In TreeTagger version 3 + Sketch Engine
- the indefinite articles “a” and “an” are both lemmatized as “a”
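The tag-level differences from the Penn tagset listed above (separate tags for be and have, renamed proper-noun tags, and SENT) can be sketched as a small conversion function in Python. This is an illustration only: the function name `penn_to_treetagger` is hypothetical, the lemma is assumed to be already known, and other punctuation tags, which may also differ, are passed through unchanged.

```python
# Illustrative sketch, not an official converter.
PENN_VERB_TAGS = {"VB", "VBD", "VBG", "VBN", "VBP", "VBZ"}
VERB_PREFIX_BY_LEMMA = {"be": "VB", "have": "VH"}  # every other verb gets VV


def penn_to_treetagger(tag, lemma):
    """Map a Penn tag to its TreeTagger counterpart, given the token's lemma."""
    if tag == "NNP":
        return "NP"
    if tag == "NNPS":
        return "NPS"
    if tag == ".":  # Penn tag for sentence-final punctuation
        return "SENT"
    if tag in PENN_VERB_TAGS:
        suffix = tag[2:]  # "", D, G, N, P or Z
        return VERB_PREFIX_BY_LEMMA.get(lemma.lower(), "VV") + suffix
    return tag  # assumption: all remaining tags are left as they are


print(penn_to_treetagger("VBD", "take"))   # lexical verb
print(penn_to_treetagger("VBZ", "have"))   # have
print(penn_to_treetagger("NNP", "John"))   # proper noun
```

The “to” and “a/an” changes are not covered here, since the first depends on sentence context and the second concerns lemmas rather than tags.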
Changelog
2020-08-18:
- updated TreeTagger model
- improved lemma guessing for possessives
- tokenizer recognizes hashtags, user handles, emojis
- improved sentence boundary detection around speech marks
2016-04-10:
- updated TreeTagger model
- various changes in tokenization
- “would” lemmatized as “would”
- possessive clitic (Saxon genitive) re-tokenized (joined to preceding word) in almost all cases
- changed tags: POS -> Z, PP$ -> PPZ, WP$ -> WPZ
- new tags: NNZ, NNSZ, NPZ, NPSZ, CDZ
- Default word sketch grammar has human-readable relation names
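Scripts or saved queries written before the 2016-04-10 update may still use the old tag names. A minimal Python sketch of the rename, using only the three changed tags listed above, could look like this; the names `RENAMED_TAGS` and `upgrade_tag` are hypothetical.

```python
# Tag renames taken from the 2016-04-10 changelog entry above.
RENAMED_TAGS = {"POS": "Z", "PP$": "PPZ", "WP$": "WPZ"}


def upgrade_tag(tag: str) -> str:
    """Return the current name of a tag; tags that were not renamed pass through."""
    return RENAMED_TAGS.get(tag, tag)


old_tags = ["PP$", "NN", "POS", "WP$"]
print([upgrade_tag(t) for t in old_tags])  # ['PPZ', 'NN', 'Z', 'WPZ']
```

Note that a plain rename may not return the same hits as before: since the possessive clitic is now joined to the preceding word in almost all cases, possessives are mostly covered by the new tags NNZ, NNSZ, NPZ, NPSZ and CDZ rather than by Z.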
Bibliography
M. Marcus, B. Santorini and M.A. Marcinkiewicz (1993). Building a large annotated corpus of English: The Penn Treebank. In Computational Linguistics, volume 19, number 2, pp. 313–330.