English TreeTagger PoS tagset with Sketch Engine modifications

A tagset is a list of part-of-speech tags (POS tags for short), i.e. labels used to indicate the part of speech and sometimes also other grammatical categories (case, tense, etc.) of each token in a text corpus.

English corpora annotated with the TreeTagger tool use this English TreeTagger part-of-speech tagset. It was developed by Helmut Schmid in the TC project at the Institute for Computational Linguistics of the University of Stuttgart and contains modifications developed by Sketch Engine (currently pipeline version 3).

This is a modified version of the default TreeTagger tagset.

An example of a tag in the CQL concordance search box: [tag="NNS"] finds all plural nouns, e.g. people, years. (Note: make sure that you use straight double quotation marks.)
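To make the matching behaviour concrete, the following Python sketch imitates what a query such as [tag="NNS"] does on a tiny hand-tagged sample. It assumes that the attribute value is treated as a regular expression that must match the whole tag; the function name find_by_tag and the sample tokens are hypothetical and are not part of Sketch Engine.

```python
import re

# Hypothetical hand-tagged sample: (word, tag) pairs using tags from this tagset.
TOKENS = [
    ("people", "NNS"), ("have", "VHP"), ("waited", "VVN"),
    ("many", "JJ"), ("years", "NNS"), (".", "SENT"),
]

def find_by_tag(tokens, pattern):
    """Return the words whose tag is matched in full by the regex pattern."""
    regex = re.compile(pattern)
    return [word for word, tag in tokens if regex.fullmatch(tag)]

print(find_by_tag(TOKENS, "NNS"))  # plural nouns only
print(find_by_tag(TOKENS, "V.*"))  # any tag starting with V, i.e. all verb tags
```

Under that whole-tag assumption, the pattern "NN" would not return the plural nouns, which is why a pattern such as "NN.*" is needed to cover NN, NNS, NNZ and NNSZ together.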

Tagset

(empty tag) HTML and other entities enclosed in angle brackets
” or “ single or double quotation marks ” ‘
( left brackets ( [ {
) right brackets ) ] }
, comma ,
$ currency symbols $ £ €
# hash (number sign) #
: dashes, ellipsis, underscore, (semi)colon – … .. _ ; :
POS Tag Description Example
CC coordinating conjunction and
CD cardinal number 1, one
CDZ possessive numeral one’s
DT determiner the
EX existential there there is
FW foreign word d’oeuvre
IN preposition, subordinating conjunction in, of, like
IN/that that as subordinator that
JJ adjective green
JJR adjective, comparative greener
JJS adjective, superlative greenest
LS list marker 1)
MD modal (verbs) could, will, should, would
NN noun, singular or mass table
NNS noun plural tables
NNSZ possessive noun plural people’s, women’s
NNZ possessive noun, singular or mass year’s, world’s
NP proper noun, singular John
NPS proper noun, plural Vikings
NPSZ possessive proper noun, plural Boys’, Workers’
NPZ possessive proper noun, singular Britain’s, God’s
PDT predeterminer both the boys
PP personal pronoun I, he, it
PPZ possessive pronoun my, his
RB adverb however, usually, naturally, here, good
RBR adverb, comparative better
RBS adverb, superlative best
RP particle give up
SENT Sentence-break punctuation . ! ?
SYM Symbols (except for those listed above) / = *
TO infinitive ‘to’ to go
UH interjection uhhuhhuhh
VB verb be, base form be
VBD verb be, past tense was, were
VBG verb be, gerund/present participle being
VBN verb be, past participle been
VBP verb be, present, non-3rd person am, are
VBZ verb be, 3rd person sing. present is
VH verb have, base form have
VHD verb have, past tense had
VHG verb have, gerund/present participle having
VHN verb have, past participle had
VHP verb have, present, non-3rd person have
VHZ verb have, 3rd person sing. present has
VV verb, base form take
VVD verb, past tense took
VVG verb, gerund/present participle taking
VVN verb, past participle taken
VVP verb, present, non-3rd person take
VVZ verb, 3rd person sing. present takes
WDT wh-determiner which
WP wh-pronoun who, what
WPZ possessive wh-pronoun whose
WRB wh-adverb where, when
Z possessive ending ’s
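The verb rows of the table follow a regular pattern: a two-letter prefix for the verb class (VB, VH, VV) plus an optional one-letter suffix for the form. The Python sketch below decodes a tag along those lines. The dictionaries are transcribed from the table above, and the function name describe_verb_tag is a hypothetical helper, not part of TreeTagger or Sketch Engine.

```python
# Prefix: which verb the tag is reserved for.
VERB_CLASS = {"VB": "be", "VH": "have", "VV": "other verb"}

# Suffix: the verb form, as described in the table.
VERB_FORM = {
    "": "base form",
    "D": "past tense",
    "G": "gerund/present participle",
    "N": "past participle",
    "P": "present, non-3rd person",
    "Z": "3rd person sing. present",
}

def describe_verb_tag(tag):
    """Return (verb class, form) for a verb tag, or None for any other tag."""
    prefix, suffix = tag[:2], tag[2:]
    if prefix in VERB_CLASS and suffix in VERB_FORM:
        return VERB_CLASS[prefix], VERB_FORM[suffix]
    return None

print(describe_verb_tag("VHZ"))
print(describe_verb_tag("MD"))
```

Modal verbs (MD) are deliberately not decoded, since the table gives them a single tag with no form distinctions.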

Main differences from the default Penn tagset

In TreeTagger tagset

  • Distinguishes be (VB) and have (VH) from other (non-modal) verbs (VV)
  • For proper nouns, NNP and NNPS have become NP and NPS
  • SENT for end-of-sentence punctuation (other punctuation tags may also differ)
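When comparing output with tools that use Penn-style tags, the three differences listed above can be reversed mechanically. The Python sketch below does only that and nothing more: other punctuation tags and the Sketch Engine possessive tags are passed through unchanged. The function name to_penn is hypothetical, and the mapping is lossy because the be/have/other distinction cannot be recovered afterwards.

```python
def to_penn(tag):
    """Map a TreeTagger tag to a Penn-style tag for the three listed differences."""
    if tag in ("NP", "NPS"):
        return "N" + tag        # NP -> NNP, NPS -> NNPS
    if tag == "SENT":
        return "."              # Penn tags sentence-final punctuation as "."
    if tag[:2] in ("VH", "VV"):
        return "VB" + tag[2:]   # e.g. VHD -> VBD, VVZ -> VBZ
    return tag                  # everything else is left as it is

print([to_penn(t) for t in ["NP", "VHD", "VVZ", "VB", "SENT", "NN"]])
```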

In TreeTagger version 2 + Sketch Engine

  • the token “to” is tagged as IN when it is a preposition and as TO only when it is an infinitive marker

In TreeTagger version 3 + Sketch Engine

  • the indefinite articles “a/an” are both lemmatized as “a”

2020-08-18:

  • updated TreeTagger model
  • improved lemma guessing for possessives
  • tokenizer recognizes hashtags, user handles, emojis
  • improved sentence boundary detection around speech marks

2016-04-10:

  • updated TreeTagger model
  • various changes in tokenization
  • “would” lemmatized as “would”
  • possessive clitic (Saxon genitive) re-tokenized (joined to preceding word) in almost all cases
    • changed tags: POS -> Z, PP$ -> PPZ, WP$ -> WPZ
    • new tags: NNZ, NNSZ, NPZ, NPSZ, CDZ
  • Default word sketch grammar has human-readable relation names
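The possessive re-tokenization from the 2016-04-10 notes can be illustrated with a short Python sketch that converts old-style (word, tag) pairs to the new scheme. It is only an approximation under stated assumptions: the real pipeline joins the clitic "in almost all cases", whereas this sketch joins it only when the preceding tag has one of the five new combined tags. The function name join_possessives and the sample tokens are hypothetical.

```python
# New combined tags introduced for a word plus its possessive clitic.
POSSESSIVE_TAG = {"NN": "NNZ", "NNS": "NNSZ", "NP": "NPZ", "NPS": "NPSZ", "CD": "CDZ"}

# Tags that were simply renamed.
RENAMED = {"POS": "Z", "PP$": "PPZ", "WP$": "WPZ"}

def join_possessives(tokens):
    """Join each clitic (old tag POS) to the preceding word where a combined tag exists."""
    result = []
    for word, tag in tokens:
        if tag == "POS" and result and result[-1][1] in POSSESSIVE_TAG:
            prev_word, prev_tag = result.pop()
            result.append((prev_word + word, POSSESSIVE_TAG[prev_tag]))
        else:
            # Not joinable: keep the token and apply any plain rename.
            result.append((word, RENAMED.get(tag, tag)))
    return result

old_style = [("Britain", "NP"), ("'s", "POS"), ("economy", "NN")]
print(join_possessives(old_style))
```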

M. Marcus, B. Santorini and M.A. Marcinkiewicz (1993). Building a large annotated corpus of English: The Penn Treebank. In Computational Linguistics, volume 19, number 2, pp. 313–330.
