Words, tags, lemmas, lemposes, lowercase – what are they for?
When using Sketch Engine, every now and then the user comes across the word attribute and its values: words, tags, lemmas, lempos, lowercase and some others depending on the corpus and language. This blog post explains how these positional attributes, to use the correct terminology, work in Sketch Engine and how the user can benefit from them.
Attributes – versions of a corpus
As soon as some text is uploaded to Sketch Engine, it is divided into tokens, i.e. tokenized. A token is the smallest unit of a corpus. Each word or punctuation mark is a token. Hello is one token. Hello! is two tokens. The next step is to convert the original text into additional versions. Each version takes its name from the attribute into which the original corpus is converted.
word
Each token immediately becomes part of the corpus version called word, which is short for word form. Word represents each token exactly as it was written in the original sentence. It is not modified by Sketch Engine in any way. The first word in the sentence keeps its capital letter, and contractions such as n’t in don’t stay as n’t. This is what a sentence looks like when it is tokenized.
see also word form
word |
---|
The |
Cook |
Islands |
were |
n’t |
named |
after |
a |
cook |
but |
after |
James |
Cook |
who |
landed |
on |
the |
islands |
in |
1773 |
to |
explore |
the |
land |
. |
Vertical text
The sentence is presented in the format of a vertical text, i.e. one token per line. This is the standard format of storing corpora in Sketch Engine. This format allows adding more attributes (columns) to each token easily.
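The idea of tokenizing a text and writing it one token per line can be sketched in a few lines of Python. This is only an illustration: the regular expression and the function names `tokenize` and `to_vertical` are assumptions made for this example, not the tokenizer Sketch Engine actually uses. In particular, this naive pattern does not split contractions the way the table above does (were + n’t).

```python
import re

# Hypothetical pattern: a run of word characters, or one non-space symbol.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")

def tokenize(text):
    """Split text into tokens (words and punctuation marks)."""
    return TOKEN_RE.findall(text)

def to_vertical(tokens):
    """Join tokens into vertical text: one token per line."""
    return "\n".join(tokens)

print(tokenize("Hello"))   # one token
print(tokenize("Hello!"))  # two tokens: the word and the exclamation mark
print(to_vertical(tokenize("James Cook landed on the islands in 1773.")))
```

Keeping one token per line is what makes it easy to add further columns later: each new attribute is simply another field on the same line.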
Structures
For the sake of simplicity, structures such as sentence or glue are not shown.
word (lowercase) or lc
Word (lowercase), sometimes displayed as lc, is the next version of the corpus, i.e. the next positional attribute. To generate this attribute, all tokens in the corpus are converted to lowercase, including proper nouns (London⇢london, Peter⇢peter, WiFi⇢wifi, WIFI⇢wifi). Using this attribute (column) for searching makes the search case insensitive. Searching for cook will find both Cook and cook. Searching for Cook will find nothing because the lc attribute contains no capital letters.
When generating frequency lists using this attribute, lowercase and upper case variants of the word will be treated as the same words. To get a separate frequency for WiFi, WIFI and wifi, use the word attribute.
lc is added as an additional column to the vertical text. Logically, the lc attribute is only present if the script distinguishes between lowercase and upper case. For example, there is no lc attribute for Chinese or for corpora in Indian scripts.
To make the search or analysis work with the lc or word (lowercase) attribute:
- activate the A = a option in the input form
- if not available, choose the word (lowercase), word_lc or lc option from the list of available attributes (the names can differ between corpora).
see also lc
Please study this vertical text and compare the frequency lists generated on the word and word (lowercase) attributes. The lc list will never be longer than the word list and will usually be shorter: it contains a smaller variety of items because the distinction between upper case and lowercase is lost.
word | lc |
---|---|
The | the |
Cook | cook |
Islands | islands |
were | were |
n’t | n’t |
named | named |
after | after |
a | a |
cook | cook |
but | but |
after | after |
James | james |
Cook | cook |
who | who |
landed | landed |
on | on |
the | the |
islands | islands |
in | in |
1773 | 1773 |
to | to |
explore | explore |
the | the |
land | land |
. | . |
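The effect on frequency lists can be checked with a short Python sketch. It uses the tokens from the table above and assumes that Python's `str.lower()` is a good enough stand-in for the lowercasing done by Sketch Engine, which may differ in detail for some scripts.

```python
from collections import Counter

# The word attribute of the example sentence, one token per item.
words = ("The Cook Islands were n’t named after a cook but after James Cook "
         "who landed on the islands in 1773 to explore the land .").split()

# Assumption: str.lower() stands in for Sketch Engine's lowercasing.
lc = [w.lower() for w in words]

word_freq = Counter(words)  # frequency list on the word attribute
lc_freq = Counter(lc)       # frequency list on the lc attribute

print(word_freq["Cook"], word_freq["cook"])  # counted separately on word
print(lc_freq["cook"])                       # counted together on lc
print(len(word_freq), len(lc_freq))          # distinct items in each list
```

For this sentence the word list has 22 distinct items and the lc list has 19, because The/the, Cook/cook and Islands/islands are merged.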
lemma
The lemma is the form of the word found in dictionaries, sometimes called the base form. Introducing lemmas makes it possible to treat different word forms of a word as the same word. This is especially useful with morphologically rich languages, i.e. languages where lemmas can have many different word forms (Spanish, French, Polish, Japanese, Turkish, Russian etc.).
The existence of the lemma makes it possible to type go and find go, goes, going, gone and went automatically. A wordlist generated on the lemma attribute will count the frequencies of go, goes, going, gone and went together and display them as one item: go. To find their individual frequencies, the word or lc attribute should be used.
The lemma preserves the original capitalization but, typically, the first word of a sentence will be lowercased.
In most languages (German is one of the exceptions), when a capitalized word is found in the middle of the sentence, the lemmatizer identifies it as unusual usage, possibly a brand name or proper noun, and will assign a lemma which is identical to the word form. Compare islands ⇢ island but Islands ⇢ Islands.
see also lemma
Compare these frequency lists generated on the word and lemma attributes. The lemma itself does not differentiate between parts of speech, therefore landed and land are counted as the same lemma despite being a verb and a noun. The Sketch Engine interface, however, features functionality to take the part of speech into account if needed.
word | lc | lemma |
---|---|---|
The | the | the |
Cook | cook | Cook |
Islands | islands | Islands |
were | were | be |
n’t | n’t | not |
named | named | name |
after | after | after |
a | a | a |
cook | cook | cook |
but | but | but |
after | after | after |
James | james | James |
Cook | cook | Cook |
who | who | who |
landed | landed | land |
on | on | on |
the | the | the |
islands | islands | island |
in | in | in |
1773 | 1773 | [number] |
to | to | to |
explore | explore | explore |
the | the | the |
land | land | land |
. | . | . |
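A frequency list on the lemma attribute can be imitated in Python as follows. The (word, lemma) pairs are copied from the table above. No lemmatizer is involved, so this only illustrates the counting, not how lemmas are assigned.

```python
from collections import Counter, defaultdict

# (word, lemma) pairs copied from the table above.
rows = [
    ("The", "the"), ("Cook", "Cook"), ("Islands", "Islands"), ("were", "be"),
    ("n’t", "not"), ("named", "name"), ("after", "after"), ("a", "a"),
    ("cook", "cook"), ("but", "but"), ("after", "after"), ("James", "James"),
    ("Cook", "Cook"), ("who", "who"), ("landed", "land"), ("on", "on"),
    ("the", "the"), ("islands", "island"), ("in", "in"), ("1773", "[number]"),
    ("to", "to"), ("explore", "explore"), ("the", "the"), ("land", "land"),
    (".", "."),
]

# Frequency list on the lemma attribute.
lemma_freq = Counter(lemma for _, lemma in rows)

# Which word forms were counted under each lemma.
forms = defaultdict(set)
for word, lemma in rows:
    forms[lemma].add(word)

print(lemma_freq["land"])     # landed and land are counted together
print(sorted(forms["land"]))
print(sorted(forms["the"]))   # The and the share the lemma the
```

Note that island and Islands remain two separate items in this list because the capitalized form kept a lemma identical to its word form.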
lemma (lowercase)
Lemma (lowercase), sometimes shown as lemma_lc, is used to ignore differences in lemma capitalisation. This is analogous to the difference between word and lc (see above). Searching a corpus with the lemma (lowercase) attribute allows the user to type cook and find cook, cooks and Cook.
To make the search or analysis work with the lemma (lowercase) attribute:
- activate the A = a option in the input form
- if not available, choose the lemma (lowercase) or lemma_lc option from the list of available attributes (the names can differ between corpora).
see also lemma_lc
word | lc | lemma | lemma_lc |
---|---|---|---|
The | the | the | the |
Cook | cook | Cook | cook |
Islands | islands | Islands | islands |
were | were | be | be |
n’t | n’t | not | not |
named | named | name | name |
after | after | after | after |
a | a | a | a |
cook | cook | cook | cook |
but | but | but | but |
after | after | after | after |
James | james | James | james |
Cook | cook | Cook | cook |
who | who | who | who |
landed | landed | land | land |
on | on | on | on |
the | the | the | the |
islands | islands | island | island |
in | in | in | in |
1773 | 1773 | [number] | [number] |
to | to | to | to |
explore | explore | explore | explore |
the | the | the | the |
land | land | land | land |
. | . | . | . |
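The difference between searching on lemma and on lemma (lowercase) can be sketched in Python. The `search` function is a hypothetical helper written for this example, and the lowercase attributes are assumed to be plain `str.lower()` copies. Only a few tokens of the example sentence are used.

```python
# (word, lemma) pairs for a few tokens of the example sentence.
pairs = [("Cook", "Cook"), ("Islands", "Islands"), ("cook", "cook"),
         ("James", "James"), ("Cook", "Cook"), ("islands", "island")]

# Assumption: the lowercase attributes are plain str.lower() copies.
tokens = [
    {"word": w, "lc": w.lower(), "lemma": l, "lemma_lc": l.lower()}
    for w, l in pairs
]

def search(tokens, attribute, value):
    """Return the word forms of all tokens whose attribute equals value."""
    return [t["word"] for t in tokens if t[attribute] == value]

print(search(tokens, "lemma", "cook"))       # case-sensitive lemma search
print(search(tokens, "lemma_lc", "cook"))    # case-insensitive lemma search
print(search(tokens, "lemma_lc", "island"))  # finds the plural form too
```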
tag or POS tag or part-of-speech tag
The tag attribute contains POS tags with information about the part of speech of each token and usually also other grammatical or morphological information such as number, gender, tense etc. Tags are assigned automatically by a tagger.
Using the tag for searching makes it possible to find all words with the same part of speech. Combining the tag with other attributes makes it possible to only find words when used (or not used) as a specific part of speech.
A frequency list of tags will provide information about how frequent each part of speech is in the corpus.
see also POS tag
word | lc | lemma | lemma_lc | tag |
---|---|---|---|---|
The | the | the | the | DT |
Cook | cook | Cook | cook | NP |
Islands | islands | Islands | islands | NP |
were | were | be | be | VBD |
n’t | n’t | not | not | RB |
named | named | name | name | VVN |
after | after | after | after | IN |
a | a | a | a | DT |
cook | cook | cook | cook | NN |
but | but | but | but | CC |
after | after | after | after | IN |
James | james | James | james | NP |
Cook | cook | Cook | cook | NP |
who | who | who | who | WP |
landed | landed | land | land | VVD |
on | on | on | on | IN |
the | the | the | the | DT |
islands | islands | island | island | NNS |
in | in | in | in | IN |
1773 | 1773 | [number] | [number] | CD |
to | to | to | to | TO |
explore | explore | explore | explore | VV |
the | the | the | the | DT |
land | land | land | land | NN |
. | . | . | . | SENT |
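Combining the tag with another attribute can be illustrated with a small Python sketch. The `find` function is a hypothetical helper, not part of Sketch Engine, and the tag patterns (`V.*`, `N.*`) simply rely on the tags shown in the table above starting with V for verbs and N for nouns.

```python
import re
from collections import Counter

# (word, lemma, tag) triples copied from the table above.
rows = [
    ("The", "the", "DT"), ("Cook", "Cook", "NP"), ("Islands", "Islands", "NP"),
    ("were", "be", "VBD"), ("n’t", "not", "RB"), ("named", "name", "VVN"),
    ("after", "after", "IN"), ("a", "a", "DT"), ("cook", "cook", "NN"),
    ("but", "but", "CC"), ("after", "after", "IN"), ("James", "James", "NP"),
    ("Cook", "Cook", "NP"), ("who", "who", "WP"), ("landed", "land", "VVD"),
    ("on", "on", "IN"), ("the", "the", "DT"), ("islands", "island", "NNS"),
    ("in", "in", "IN"), ("1773", "[number]", "CD"), ("to", "to", "TO"),
    ("explore", "explore", "VV"), ("the", "the", "DT"), ("land", "land", "NN"),
    (".", ".", "SENT"),
]

def find(rows, lemma=None, tag_pattern=None):
    """Return word forms matching an exact lemma and/or a regex over the tag."""
    hits = []
    for word, row_lemma, tag in rows:
        if lemma is not None and row_lemma != lemma:
            continue
        if tag_pattern is not None and not re.fullmatch(tag_pattern, tag):
            continue
        hits.append(word)
    return hits

print(find(rows, tag_pattern="V.*"))                # all verbs
print(find(rows, lemma="land", tag_pattern="N.*"))  # land used as a noun
print(find(rows, lemma="land", tag_pattern="V.*"))  # land used as a verb

# Frequency list on the tag attribute.
tag_freq = Counter(tag for _, _, tag in rows)
print(tag_freq.most_common(3))
```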
Tagset
The complete list of tags used in a corpus is called a tagset and can be accessed via the corpus info page.
lempos and lempos_lc
The lempos attribute was introduced mainly to make the computation of the word sketch and thesaurus possible. Lempos stands for lemma + POS. It is a combination of the lemma, a hyphen and a one-letter abbreviation of the part of speech, e.g. cook-n or land-v. Parts of speech not supported by the word sketch all use the same suffix -x.
The lempos_lc or lempos (lowercase) is the lowercase version of lempos.
word | lc | lemma | lemma_lc | tag | lempos | lempos_lc |
---|---|---|---|---|---|---|
The | the | the | the | DT | the-x | the-x |
Cook | cook | Cook | cook | NP | Cook-n | cook-n |
Islands | islands | Islands | islands | NP | Islands-n | islands-n |
were | were | be | be | VBD | be-v | be-v |
n’t | n’t | not | not | RB | not-a | not-a |
named | named | name | name | VVN | name-v | name-v |
after | after | after | after | IN | after-i | after-i |
a | a | a | a | DT | a-x | a-x |
cook | cook | cook | cook | NN | cook-n | cook-n |
but | but | but | but | CC | but-c | but-c |
after | after | after | after | IN | after-i | after-i |
James | james | James | james | NP | James-n | james-n |
Cook | cook | Cook | cook | NP | Cook-n | cook-n |
who | who | who | who | WP | who-x | who-x |
landed | landed | land | land | VVD | land-v | land-v |
on | on | on | on | IN | on-i | on-i |
the | the | the | the | DT | the-x | the-x |
islands | islands | island | island | NNS | island-n | island-n |
in | in | in | in | IN | in-i | in-i |
1773 | 1773 | [number] | [number] | CD | [number]-m | [number]-m |
to | to | to | to | TO | to-x | to-x |
explore | explore | explore | explore | VV | explore-v | explore-v |
the | the | the | the | DT | the-x | the-x |
land | land | land | land | NN | land-n | land-n |
. | . | . | . | SENT | .-x | .-x |
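How a lempos is put together can be sketched in Python. The tag-to-suffix mapping below is read off the table above and covers only the tags in the example sentence. It is an assumption for illustration, not the complete mapping used for any tagset. The function names `lempos` and `lempos_lc` are likewise chosen for this example.

```python
# Tag-to-suffix mapping read off the table above. It only covers the tags
# in the example sentence and is not a complete mapping for any tagset.
SUFFIX = {
    "NN": "n", "NNS": "n", "NP": "n",
    "VBD": "v", "VV": "v", "VVD": "v", "VVN": "v",
    "RB": "a", "IN": "i", "CC": "c", "CD": "m",
}

def lempos(lemma, tag):
    """Combine a lemma with its one-letter suffix; other tags get -x."""
    return f"{lemma}-{SUFFIX.get(tag, 'x')}"

def lempos_lc(lemma, tag):
    """Lowercase version of the lempos."""
    return lempos(lemma, tag).lower()

print(lempos("Cook", "NP"), lempos_lc("Cook", "NP"))
print(lempos("land", "VVD"), lempos("land", "NN"))
print(lempos("the", "DT"))
```

Unlike the plain lemma, the lempos keeps land-v and land-n apart, which is what the word sketch and thesaurus need.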
Vertical file download
A user corpus can be downloaded as a plain text file or as vertical text. The latter option includes only three attributes (word, tag and lempos) together with structures and their attributes (metadata).
Vertical text cannot be downloaded with all the columns shown on this page. They are included here for clarity only.
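A downloaded vertical file can be read back with a few lines of Python. The sketch below rests on several assumptions that should be checked against a real download: that columns are separated by tabs, that they come in the order word, tag, lempos, and that structures appear on their own lines as XML-like tags such as `<s>` or `<g/>`. The sample text and the function name `parse_vertical` are made up for this example.

```python
# Made-up sample in the assumed layout: tab-separated columns,
# structures (sentence, glue) on their own lines.
SAMPLE = (
    "<s>\n"
    "James\tNP\tJames-n\n"
    "Cook\tNP\tCook-n\n"
    "landed\tVVD\tland-v\n"
    "<g/>\n"
    ".\tSENT\t.-x\n"
    "</s>\n"
)

# Assumed column order of the download.
COLUMNS = ("word", "tag", "lempos")

def parse_vertical(text):
    """Return one dict per token line, skipping structure lines."""
    tokens = []
    for line in text.splitlines():
        if not line.strip():
            continue
        # A line that looks like a tag and has no tab is treated as a structure.
        if line.startswith("<") and line.endswith(">") and "\t" not in line:
            continue
        tokens.append(dict(zip(COLUMNS, line.split("\t"))))
    return tokens

for token in parse_vertical(SAMPLE):
    print(token["word"], token["tag"], token["lempos"])
```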
How to display attributes
Concordance
Attributes can be viewed easily in the concordance. The concordance can be generated:
- from scratch using a concordance search,
- by jumping to the concordance via the local menu next to each result in other tools.
In the concordance, the view options offer the complete selection of attributes.
Tip
Sketch Engine remembers your view settings for each corpus. Only keep the attributes displayed if you really need to see them. Otherwise hide them to keep the screen neat and tidy and easy to work with. The display of many attributes and many concordance lines on one screen can slow your browser down.
Wordlist and n-grams
To include other attributes in the wordlist, use the ADVANCED tab to select the required attribute. To include more than one attribute, use the Display as option.
Word Sketch and thesaurus
The default attribute is set in the configuration file. Changing it may require writing a new sketch grammar.
Keywords & terms
Keywords – use the advanced tab to change the attribute for keywords.
Terms – the attribute is set in the term grammar. Changing the attribute may require writing a new term grammar. The reference corpus must be processed with the same term grammar as the focus corpus.