English Penn Treebank tagset with modifications

English TreeTagger PoS tagset with Sketch Engine modifications

A tagset is a list of part-of-speech tags (POS tags for short), i.e. labels used to indicate the part of speech and sometimes also other grammatical categories (case, tense, etc.) of each token in a text corpus.

English corpora annotated with the TreeTagger tool use this English TreeTagger part-of-speech tagset. It was developed by Helmut Schmid in the TC project at the Institute for Computational Linguistics of the University of Stuttgart and contains modifications developed by Sketch Engine (currently pipeline version 3).


This tagset is a modified version of the default TreeTagger tagset.

An example of a tag in the CQL concordance search box: [tag="NNS"] finds all plural nouns, e.g. people, years. (Note: make sure that you use straight double quotation marks.)
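The query above can also be assembled programmatically. The following is a minimal sketch, not part of Sketch Engine itself: the helper names `cql_tag` and `straighten_quotes` are hypothetical. It wraps a tag (or a regular expression over tags, which CQL attribute values accept) in straight double quotes and repairs curly quotes pasted from a word processor.

```python
# Hypothetical helpers for building one-token CQL tag queries.
CURLY_DOUBLE_QUOTES = "\u201c\u201d\u201e"  # left, right and low curly double quotes


def straighten_quotes(query: str) -> str:
    """Replace curly double quotes with the straight ones CQL expects."""
    for ch in CURLY_DOUBLE_QUOTES:
        query = query.replace(ch, '"')
    return query


def cql_tag(tag_pattern: str) -> str:
    """Return a CQL query matching a single token by its tag."""
    return f'[tag="{tag_pattern}"]'


print(cql_tag("NNS"))   # [tag="NNS"]
print(cql_tag("N.*"))   # [tag="N.*"]
print(straighten_quotes("[tag=\u201cNNS\u201d]"))  # [tag="NNS"]
```

With this tagset, the pattern `N.*` would cover every tag that starts with N, i.e. common and proper nouns including their possessive variants.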

Tagset

POS Tag	Description	Example
(empty tag)	HTML and other entities enclosed in angle brackets
” or “	single or double quotation marks	” ‘
(	left brackets	( [ {
)	right brackets	) ] }
,	comma	,
$	currency symbols	$ £ €
#	hash (number sign)	#
:	dashes, ellipsis, underscore, (semi)colon	– … .. _ ; :
CC	coordinating conjunction	and
CD	cardinal number	1, one
CDZ	possessive numeral	one’s
DT	determiner	the
EX	existential there	there is
FW	foreign word	d’oeuvre
IN	preposition, subordinating conjunction	in, of, like
IN/that	that as subordinator	that
JJ	adjective	green
JJR	adjective, comparative	greener
JJS	adjective, superlative	greenest
LS	list marker	1)
MD	modal (verbs)	could, will, should, would
NN	noun, singular or mass	table
NNS	noun plural	tables
NNSZ	possessive noun plural	people’s, women’s
NNZ	possessive noun, singular or mass	year’s, world’s
NP	proper noun, singular	John
NPS	proper noun, plural	Vikings
NPSZ	possessive proper noun, plural	Boys’, Workers’
NPZ	possessive proper noun, singular	Britain’s, God’s
PDT	predeterminer	both the boys
PP	personal pronoun	I, he, it
PPZ	possessive pronoun	my, his
RB	adverb	however, usually, naturally, here, good
RBR	adverb, comparative	better
RBS	adverb, superlative	best
RP	particle	give up
SENT	Sentence-break punctuation	. ! ?
SYM	Symbols (except for those listed above)	/ = *
TO	infinitive ‘to’	to go
UH	interjection	uhhuhhuhh
VB	verb be, base form	be
VBD	verb be, past tense	was, were
VBG	verb be, gerund/present participle	being
VBN	verb be, past participle	been
VBP	verb be, present, non-3rd person	am, are
VBZ	verb be, 3rd person sing. present	is
VH	verb have, base form	have
VHD	verb have, past tense	had
VHG	verb have, gerund/present participle	having
VHN	verb have, past participle	had
VHP	verb have, present, non-3rd person	have
VHZ	verb have, 3rd person sing. present	has
VV	verb, base form	take
VVD	verb, past tense	took
VVG	verb, gerund/present participle	taking
VVN	verb, past participle	taken
VVP	verb, present, not 3rd person	take
VVZ	verb, 3rd person sing. present	takes
WDT	wh-determiner	which
WP	wh-pronoun	who, what
WPZ	possessive wh-pronoun	whose
WRB	wh-adverb	where, when
Z	possessive ending	’s
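The verb tags in the table follow a regular pattern: a two-letter prefix for the verb class (VB, VH, VV) followed by an optional one-letter suffix for the form. The sketch below only illustrates that pattern; the function name `describe_verb_tag` and the dictionaries are hypothetical and not part of TreeTagger or Sketch Engine.

```python
# Prefix and suffix meanings taken from the verb rows of the tagset table.
VERB_CLASS = {"VB": "be", "VH": "have", "VV": "other verb"}
VERB_FORM = {
    "": "base form",
    "D": "past tense",
    "G": "gerund/present participle",
    "N": "past participle",
    "P": "present, non-3rd person",
    "Z": "3rd person singular present",
}


def describe_verb_tag(tag: str):
    """Split a verb tag into (verb class, form); return None for other tags."""
    prefix, suffix = tag[:2], tag[2:]
    if prefix not in VERB_CLASS or suffix not in VERB_FORM:
        return None
    return VERB_CLASS[prefix], VERB_FORM[suffix]


print(describe_verb_tag("VHZ"))  # ('have', '3rd person singular present')
print(describe_verb_tag("VVG"))  # ('other verb', 'gerund/present participle')
print(describe_verb_tag("NN"))   # None
```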

Main differences to default Penn tagset

In TreeTagger tagset

Distinguishes be (VB) and have (VH) from other (non-modal) verbs (VV)
For proper nouns, NNP and NNPS have become NP and NPS
SENT for end-of-sentence punctuation (other punctuation tags may also differ)
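The three differences listed above can be expressed as a small conversion from Penn-style tags. This is a rough sketch under stated assumptions: the function `penn_to_treetagger` is hypothetical, it needs the lemma to tell be and have apart from other verbs, and it covers only the differences named in this list (other punctuation and pronoun tags are passed through unchanged).

```python
PENN_VERB_TAGS = {"VB", "VBD", "VBG", "VBN", "VBP", "VBZ"}
PROPER_NOUN_MAP = {"NNP": "NP", "NNPS": "NPS"}


def penn_to_treetagger(tag: str, lemma: str) -> str:
    """Map a Penn tag to the TreeTagger-style tag, given the token's lemma."""
    if tag in PROPER_NOUN_MAP:
        return PROPER_NOUN_MAP[tag]
    if tag == ".":  # Penn tag for sentence-final punctuation
        return "SENT"
    if tag in PENN_VERB_TAGS:
        suffix = tag[2:]  # "", "D", "G", "N", "P" or "Z"
        if lemma == "be":
            return "VB" + suffix
        if lemma == "have":
            return "VH" + suffix
        return "VV" + suffix
    return tag  # assumption: everything else is left as-is


print(penn_to_treetagger("VBZ", "have"))  # VHZ
print(penn_to_treetagger("VBD", "take"))  # VVD
print(penn_to_treetagger("NNP", "John"))  # NP
print(penn_to_treetagger(".", "."))       # SENT
```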

In TreeTagger version 2 + Sketch Engine

the token “to” is tagged IN when it is a preposition and TO only when it is an infinitive marker

In TreeTagger version 3 + Sketch Engine

the indefinite articles “a” and “an” are both lemmatized as “a”

Changelog

2020-08-18:

updated TreeTagger model
improved lemma guessing for possessives
tokenizer recognizes hashtags, user handles, emojis
improved sentence boundary detection around speech marks

2016-04-10:

updated TreeTagger model
various changes in tokenization
“would” lemmatized as “would”
possessive clitic (Saxon genitive) re-tokenized (joined to preceding word) in almost all cases
- changed tags: POS -> Z, PP$ -> PPZ, WP$ -> WPZ
- new tags: NNZ, NNSZ, NPZ, NPSZ, CDZ
Default word sketch grammar has human-readable relation names
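The possessive re-tokenization in the 2016-04-10 entry can be pictured as joining a clitic token to the word before it and switching to the matching possessive tag. The sketch below is an illustration only, not the actual Sketch Engine pipeline: the function `join_possessives` is hypothetical, the input is assumed to be (word, tag) pairs with the clitic as a separate token tagged POS, and the condition for leaving a clitic separate (here: the preceding tag has no possessive counterpart) is an assumption, since the source only says "almost all cases".

```python
# New possessive tags from the changelog, keyed by the tag of the host word.
POSSESSIVE_TAGS = {
    "NN": "NNZ",
    "NNS": "NNSZ",
    "NP": "NPZ",
    "NPS": "NPSZ",
    "CD": "CDZ",
}


def join_possessives(tokens):
    """Join POS-tagged clitics to the preceding token where a *Z tag exists."""
    merged = []
    for word, tag in tokens:
        if tag == "POS" and merged and merged[-1][1] in POSSESSIVE_TAGS:
            prev_word, prev_tag = merged[-1]
            merged[-1] = (prev_word + word, POSSESSIVE_TAGS[prev_tag])
        elif tag == "POS":
            merged.append((word, "Z"))  # clitic stays separate, POS renamed to Z
        else:
            merged.append((word, tag))
    return merged


tokens = [("the", "DT"), ("world", "NN"), ("'s", "POS"), ("end", "NN")]
print(join_possessives(tokens))
# [('the', 'DT'), ("world's", 'NNZ'), ('end', 'NN')]
```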

Bibliography

M. Marcus, B. Santorini and M.A. Marcinkiewicz (1993). Building a large annotated corpus of English: The Penn Treebank. In Computational Linguistics, volume 19, number 2, pp. 313–330.

