Words, tags, lemmas, lemposes, lowercase – what are they for?

When using Sketch Engine, users regularly come across the term attribute and its values: word, tag, lemma, lempos, lowercase and others, depending on the corpus and language. This blog post explains how these positional attributes, to use the correct terminology, work in Sketch Engine and how users can benefit from them.

Attributes – versions of a corpus

As soon as some text is uploaded to Sketch Engine, it is divided into tokens, i.e. tokenized. A token is the smallest unit of a corpus. Each word or punctuation mark is a token. Hello is one token. Hello! is two tokens. The next step is to convert the original text into additional versions. Each version takes its name from the attribute into which the original corpus is converted.
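The idea of splitting text into tokens can be illustrated with a minimal sketch. It is not the tokenizer Sketch Engine uses: real tokenizers are language-specific and handle cases such as the contraction n’t, which this toy pattern does not. The names TOKEN_PATTERN and tokenize are made up for this illustration.

```python
import re

# Toy rule (an assumption, not Sketch Engine's tokenizer): a run of word
# characters is one token, and any other non-space character, such as a
# punctuation mark, is a token of its own.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")


def tokenize(text: str) -> list[str]:
    """Split text into word tokens and punctuation tokens."""
    return TOKEN_PATTERN.findall(text)


print(tokenize("Hello"))   # ['Hello']
print(tokenize("Hello!"))  # ['Hello', '!']
```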

word

Each token will immediately become part of the corpus version called word which is short for word form. Word represents each token exactly as it was written in the original sentence. It is not modified by Sketch Engine in any way. The first word in the sentence will keep its capital letter, contractions such as n’t in don’t will stay as n’t. This is what a sentence will look like when it is tokenized.

word
The
Cook
Islands
were
n’t
named
after
a
cook
but
after
James
Cook
who
landed
on
the
islands
in
1773
to
explore
the
land
.

Vertical text

The sentence is presented in the format of a vertical text, i.e. one token per line. This is the standard format for storing corpora in Sketch Engine. Because each token has its own line, more attributes can easily be added as extra columns.
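As a rough sketch of why the vertical layout makes extra columns easy, the snippet below writes tokens one per line and then adds a second, tab-separated column. The second column (the token's length in characters) is purely hypothetical and is not a real Sketch Engine attribute; the helper names are also invented for this example.

```python
tokens = ["The", "Cook", "Islands", "were", "n’t", "named", "."]


def to_vertical(rows: list[list[str]]) -> str:
    """One token per line, attribute values separated by tabs."""
    return "\n".join("\t".join(row) for row in rows)


def from_vertical(text: str) -> list[list[str]]:
    """Read vertical text back into one list of values per token."""
    return [line.split("\t") for line in text.splitlines() if line]


# One attribute per token: just the word form.
one_column = to_vertical([[t] for t in tokens])

# Adding an attribute only means adding a column to every line.
# The length column is a made-up attribute used for illustration.
two_columns = to_vertical([[t, str(len(t))] for t in tokens])

print(two_columns)
```

Reading the file back is a matter of splitting each line on the tab character, which is what from_vertical does.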

Structures

For the sake of simplicity, structures such as sentence or glue are not shown.

word (lowercase) or lc

Word (lowercase), sometimes displayed as lc, is the next version of the corpus, i.e. the next positional attribute. To generate this attribute, all tokens in the corpus are converted to lowercase, including proper nouns (London⇢london, Peter⇢peter, WiFi⇢wifi, WIFI⇢wifi). Using this attribute (column) for searching makes the search case insensitive. Searching for cook will find both Cook and cook. Searching for Cook will find nothing because the lc column contains no capital letters.

When generating frequency lists using this attribute, lowercase and upper case variants of the word will be treated as the same words. To get a separate frequency for WiFi, WIFI and wifi, use the word attribute.

lc is added as an additional column to the vertical text. Logically, the lc attribute is only present if the script distinguishes between lowercase and upper case. There is no lc attribute for Chinese corpora or for corpora in Indian scripts.
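The effect of the lc attribute on searching and on frequency lists can be sketched as follows. The sketch assumes that Python's str.lower() is a fair stand-in for the lowercasing Sketch Engine applies, and the search helper is a hypothetical exact-match lookup, not a Sketch Engine function.

```python
from collections import Counter

words = ["WiFi", "WIFI", "wifi", "Cook", "cook", "London"]

# The lc column is the word column converted to lowercase.
lc = [w.lower() for w in words]


def search(attribute_values: list[str], query: str) -> list[int]:
    """Return the positions whose attribute value equals the query."""
    return [i for i, value in enumerate(attribute_values) if value == query]


print(search(lc, "cook"))     # [3, 4]  finds both Cook and cook
print(search(lc, "Cook"))     # []      lc holds no capital letters
print(search(words, "Cook"))  # [3]     the word attribute is case sensitive

# Frequency lists: lc merges the case variants, word keeps them apart.
print(Counter(lc)["wifi"])     # 3
print(Counter(words)["WiFi"])  # 1
```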

To make the search or analysis work with the lc or word (lowercase) attribute:

activate the A = a option in the input form
if not available, choose the word (lowercase), word_lc or lc option from the list of available attributes (the names can differ between corpora).

word	lc
The	the
Cook	cook
Islands	islands
were	were
n’t	n’t
named	named
after	after
a	a
cook	cook
but	but
after	after
James	james
Cook	cook
who	who
landed	landed
on	on
the	the
islands	islands
in	in
1773	1773
to	to
explore	explore
the	the
land	land
.	.

lemma

The lemma is the form of the word found in dictionaries, sometimes called the base form. Introducing lemmas makes it possible to treat different word forms of the word as the same word. This is especially useful with morphologically rich languages, i.e. languages where lemmas can have many different word forms (Spanish, French, Polish, Japanese, Turkish, Russian etc.).

The existence of the lemma makes it possible to type go and find go, goes, going, gone and went automatically. A wordlist generated on the lemma attribute will count the frequencies of go, goes, going, gone and went together and display them as one item: go. To find their individual frequencies, the word or lc attribute should be used.
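A small sketch of the difference between a wordlist on lemma and a wordlist on word is shown below. The LEMMAS lookup table is hand-written for this example; in Sketch Engine, lemmas are assigned by a lemmatizer, not by a table like this.

```python
from collections import Counter

# Hand-written toy lookup (hypothetical), standing in for a lemmatizer.
LEMMAS = {"go": "go", "goes": "go", "going": "go", "gone": "go", "went": "go"}

tokens = ["went", "goes", "going", "go", "gone", "went"]

# A wordlist on the lemma attribute counts all word forms together.
lemma_freq = Counter(LEMMAS[t] for t in tokens)

# A wordlist on the word attribute keeps the forms apart.
word_freq = Counter(tokens)


def find_by_lemma(tokens: list[str], lemma: str) -> list[str]:
    """Return every word form whose lemma matches the query."""
    return [t for t in tokens if LEMMAS.get(t, t) == lemma]


print(lemma_freq["go"])             # 6
print(word_freq["went"])            # 2
print(find_by_lemma(tokens, "go"))  # all six tokens
```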

The lemma preserves the original capitalization but, typically, the first word of a sentence will be lowercased.

In most languages (German is one of the exceptions), when a capitalized word is found in the middle of the sentence, the lemmatizer identifies it as unusual usage, possibly a brand name or proper noun, and will assign a lemma which is identical to the word form. Compare islands ⇢ island but Islands ⇢ Islands.

word	lc	lemma
The	the	the
Cook	cook	Cook
Islands	islands	Islands
were	were	be
n’t	n’t	not
named	named	name
after	after	after
a	a	a
cook	cook	cook
but	but	but
after	after	after
James	james	James
Cook	cook	Cook
who	who	who
landed	landed	land
on	on	on
the	the	the
islands	islands	island
in	in	in
1773	1773	[number]
to	to	to
explore	explore	explore
the	the	the
land	land	land
.	.	.

lemma (lowercase)

Lemma (lowercase), sometimes shown as lemma_lc, is used to ignore differences in lemma capitalisation. This is analogous to the difference between word and lc (see above). Searching a corpus with the lemma (lowercase) attribute allows the user to type cook and find cook, cooks and Cook.
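The following sketch contrasts a search on lemma with a search on lemma_lc. The three-token mini corpus and the find helper are invented for illustration, and str.lower() is assumed to approximate how lemma_lc is derived.

```python
# (word, lemma) pairs for a hypothetical three-token corpus.
ROWS = [("cook", "cook"), ("cooks", "cook"), ("Cook", "Cook")]


def find(rows, query, lowercase=False):
    """Return word forms whose lemma (or lowercased lemma) equals the query."""
    hits = []
    for word, lemma in rows:
        value = lemma.lower() if lowercase else lemma
        if value == query:
            hits.append(word)
    return hits


print(find(ROWS, "cook"))                  # ['cook', 'cooks']
print(find(ROWS, "cook", lowercase=True))  # ['cook', 'cooks', 'Cook']
```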

To make the search or analysis work with the lemma (lowercase) attribute:

activate the A = a option in the input form
if not available, choose the lemma (lowercase) or lemma_lc option from the list of available attributes (the names can differ between corpora).

word	lc	lemma	lemma_lc
The	the	the	the
Cook	cook	Cook	cook
Islands	islands	Islands	islands
were	were	be	be
n’t	n’t	not	not
named	named	name	name
after	after	after	after
a	a	a	a
cook	cook	cook	cook
but	but	but	but
after	after	after	after
James	james	James	james
Cook	cook	Cook	cook
who	who	who	who
landed	landed	land	land
on	on	on	on
the	the	the	the
islands	islands	island	island
in	in	in	in
1773	1773	[number]	[number]
to	to	to	to
explore	explore	explore	explore
the	the	the	the
land	land	land	land
.	.	.	.

tag or POS tag or part-of-speech tag

The tag attribute contains POS tags with information about the part of speech of each token and usually also other grammatical or morphological information such as number, gender, tense etc. Tags are assigned automatically by a tagger.
Using the tag for searching makes it possible to find all words with the same part of speech. Combining the tag with other attributes makes it possible to only find words when used (or not used) as a specific part of speech.
A frequency list of tags will provide information about how frequent each part of speech is in the corpus.
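The three uses of the tag attribute described above can be sketched on the example sentence from this post. The find helper and its parameters are hypothetical; matching the tag against a regular expression is a design choice made here so that a pattern such as V.* covers all verb tags.

```python
import re
from collections import Counter

# (word, lemma, tag) triples for the example sentence used in this post.
ROWS = [
    ("The", "the", "DT"), ("Cook", "Cook", "NP"), ("Islands", "Islands", "NP"),
    ("were", "be", "VBD"), ("n’t", "not", "RB"), ("named", "name", "VVN"),
    ("after", "after", "IN"), ("a", "a", "DT"), ("cook", "cook", "NN"),
    ("but", "but", "CC"), ("after", "after", "IN"), ("James", "James", "NP"),
    ("Cook", "Cook", "NP"), ("who", "who", "WP"), ("landed", "land", "VVD"),
    ("on", "on", "IN"), ("the", "the", "DT"), ("islands", "island", "NNS"),
    ("in", "in", "IN"), ("1773", "[number]", "CD"), ("to", "to", "TO"),
    ("explore", "explore", "VV"), ("the", "the", "DT"), ("land", "land", "NN"),
    (".", ".", "SENT"),
]


def find(rows, lemma=None, tag_pattern=None):
    """Return word forms matching an optional lemma and an optional tag regex."""
    hits = []
    for word, row_lemma, tag in rows:
        if lemma is not None and row_lemma != lemma:
            continue
        if tag_pattern is not None and not re.fullmatch(tag_pattern, tag):
            continue
        hits.append(word)
    return hits


# All tokens with the same part of speech.
print(find(ROWS, tag_pattern="NP"))                # ['Cook', 'Islands', 'James', 'Cook']

# The same lemma, restricted to one part of speech.
print(find(ROWS, lemma="land", tag_pattern="V.*"))  # ['landed']
print(find(ROWS, lemma="land", tag_pattern="N.*"))  # ['land']

# Frequency list of tags.
tag_freq = Counter(tag for _, _, tag in ROWS)
print(tag_freq["DT"], tag_freq["NN"])              # 4 2
```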

word	lc	lemma	lemma_lc	tag
The	the	the	the	DT
Cook	cook	Cook	cook	NP
Islands	islands	Islands	islands	NP
were	were	be	be	VBD
n’t	n’t	not	not	RB
named	named	name	name	VVN
after	after	after	after	IN
a	a	a	a	DT
cook	cook	cook	cook	NN
but	but	but	but	CC
after	after	after	after	IN
James	james	James	james	NP
Cook	cook	Cook	cook	NP
who	who	who	who	WP
landed	landed	land	land	VVD
on	on	on	on	IN
the	the	the	the	DT
islands	islands	island	island	NNS
in	in	in	in	IN
1773	1773	[number]	[number]	CD
to	to	to	to	TO
explore	explore	explore	explore	VV
the	the	the	the	DT
land	land	land	land	NN
.	.	.	.	SENT

Tagset

The complete list of tags used in a corpus is called a tagset and can be accessed via the corpus info page.

lempos and lempos_lc

The lempos attribute was introduced mainly to make the computation of the word sketch and thesaurus possible. Lempos stands for lemma + POS. It is a combination of the lemma and a one-letter abbreviation of the part of speech, joined by a hyphen (e.g. cook-n, land-v). Parts of speech not supported by the word sketch all use the same suffix -x.

The lempos_lc or lempos (lowercase) is the lowercase version of lempos.
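How a lempos value is put together can be sketched as below. The tag-to-suffix mapping is inferred only from the example table in this post; it is an assumption, not the official mapping, and real corpora define it per tagset. Parts of speech that do not appear in the example sentence are not covered.

```python
# Tag prefix -> lempos suffix, inferred from the example table in this post.
# This mapping is an illustrative assumption, not an official definition.
SUFFIX_BY_TAG_PREFIX = [
    ("N", "n"),   # NN, NNS, NP
    ("V", "v"),   # VBD, VVN, VVD, VV
    ("RB", "a"),
    ("IN", "i"),
    ("CC", "c"),
    ("CD", "m"),
]


def lempos(lemma: str, tag: str) -> str:
    """Join the lemma and a one-letter part-of-speech suffix with a hyphen."""
    for prefix, suffix in SUFFIX_BY_TAG_PREFIX:
        if tag.startswith(prefix):
            return f"{lemma}-{suffix}"
    return f"{lemma}-x"  # everything else shares the suffix -x


def lempos_lc(lemma: str, tag: str) -> str:
    """Lowercase version of lempos."""
    return lempos(lemma, tag).lower()


print(lempos("Cook", "NP"))      # Cook-n
print(lempos_lc("Cook", "NP"))   # cook-n
print(lempos("land", "VVD"))     # land-v
print(lempos("the", "DT"))       # the-x
```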

word	lc	lemma	lemma_lc	tag	lempos	lempos_lc
The	the	the	the	DT	the-x	the-x
Cook	cook	Cook	cook	NP	Cook-n	cook-n
Islands	islands	Islands	islands	NP	Islands-n	islands-n
were	were	be	be	VBD	be-v	be-v
n’t	n’t	not	not	RB	not-a	not-a
named	named	name	name	VVN	name-v	name-v
after	after	after	after	IN	after-i	after-i
a	a	a	a	DT	a-x	a-x
cook	cook	cook	cook	NN	cook-n	cook-n
but	but	but	but	CC	but-c	but-c
after	after	after	after	IN	after-i	after-i
James	james	James	james	NP	James-n	james-n
Cook	cook	Cook	cook	NP	Cook-n	cook-n
who	who	who	who	WP	who-x	who-x
landed	landed	land	land	VVD	land-v	land-v
on	on	on	on	IN	on-i	on-i
the	the	the	the	DT	the-x	the-x
islands	islands	island	island	NNS	island-n	island-n
in	in	in	in	IN	in-i	in-i
1773	1773	[number]	[number]	CD	[number]-m	[number]-m
to	to	to	to	TO	to-x	to-x
explore	explore	explore	explore	VV	explore-v	explore-v
the	the	the	the	DT	the-x	the-x
land	land	land	land	NN	land-n	land-n
.	.	.	.	SENT	.-x	.-x

Vertical file download

A user corpus can be downloaded as a plain text file or as vertical text. The latter option includes only three attributes (word, tag and lempos) together with structures and their attributes (metadata).

Vertical text cannot be downloaded with all the columns shown on this page. They are included here for clarity only.
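A downloaded vertical file could be read with a short script along these lines. The sketch assumes that the columns appear in the order word, tag, lempos, that they are separated by tabs, and that structures sit on their own lines as XML-like tags; the sample content and the doc title are made up. A token line that itself starts with < and ends with > would be misread as a structure by this simple check.

```python
# Made-up sample in the assumed layout: structures on their own lines,
# tokens as tab-separated word, tag, lempos.
SAMPLE = "\n".join([
    '<doc title="Cook Islands">',
    "<s>",
    "The\tDT\tthe-x",
    "Cook\tNP\tCook-n",
    "Islands\tNP\tIslands-n",
    "</s>",
    "</doc>",
])


def parse_vertical(text: str):
    """Split vertical text into token records and structure lines."""
    tokens, structures = [], []
    for line in text.splitlines():
        if not line:
            continue
        if line.startswith("<") and line.endswith(">"):
            structures.append(line)
        else:
            word, tag, lempos = line.split("\t")
            tokens.append({"word": word, "tag": tag, "lempos": lempos})
    return tokens, structures


tokens, structures = parse_vertical(SAMPLE)
print(tokens[1])        # {'word': 'Cook', 'tag': 'NP', 'lempos': 'Cook-n'}
print(len(structures))  # 4
```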

How to display attributes

Concordance

Attributes can be viewed easily in the concordance. The concordance can be generated:

from scratch using a concordance search,
by jumping to the concordance via the local menu next to each result in other tools.

In the concordance, the view options offer the complete selection of attributes.

Tip

Sketch Engine remembers your view settings for each corpus. Only keep the attributes displayed if you really need to see them. Otherwise hide them to keep the screen neat and tidy and easy to work with. The display of many attributes and many concordance lines on one screen can slow your browser down.

Wordlist and n-grams

To include other attributes in the wordlist, use the ADVANCED tab to select the required attribute. To include more than one attribute, use the Display as option.

Word Sketch and thesaurus

The default attribute is set in the configuration file. Changing it may require writing a new sketch grammar.

Keywords & terms

Keywords – use the advanced tab to change the attribute for keywords.

Terms – the attribute is set in the term grammar. Changing the attribute may require writing a new term grammar. The reference corpus must be processed with the same term grammar as the focus corpus.