Automatic identification of word senses

Sketch Engine can identify word senses automatically. This function is part of the Word Sketch tool. Sketch Engine performs word sense induction, not word sense disambiguation. The info box "Do not confuse these" explains the difference.

The word sense induction tool categorizes the collocations identified by the word sketch into groups corresponding to the different senses of the word. This makes it possible to display only the collocations related to a particular sense and to investigate how the word is used in that sense alone. Each sense is assigned a colour and the related collocations are colour-coded accordingly.

Sense names

The senses are not named, but labelled. The label consists of a number of words which are representative of the word sense. The user must infer the sense name from the labels.

User data and languages

Word sense induction works in all corpora in the supported languages, both user corpora and preloaded corpora. The user does not need to configure the corpus. It is enough to activate the function in the View options of the word sketch (see below).

At the time of writing (January 2024), the supported languages are English, Spanish, Italian, French, Ukrainian, Czech, Slovak, and Estonian. Additional languages are added as their AI models are computed.

How to use the word sense induction

Generate a word sketch of any lemma. On the result screen:

  • click View options (1)
  • activate Show word senses (2)

Note that classifying the collocations may take a couple of seconds.

Do not confuse these

There are two terms associated with the topic of automatic identification of word senses:

Word sense induction (WSI)

WSI is a process of discovering the senses which the word may have. The senses are not known at the beginning.

Word sense disambiguation (WSD)

WSD is a process of matching a word used in a concrete context to one of its known senses. The senses of the word are known.

Sketch Engine performs word sense induction, but not word sense disambiguation.
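
The difference between the two tasks can be illustrated with a toy sketch. It is not how Sketch Engine works internally: the function names induce_senses and disambiguate, the word-overlap heuristic and the sample contexts are all assumptions made purely for illustration.

```python
from typing import Dict, List, Set


def induce_senses(contexts: List[Set[str]]) -> List[Set[str]]:
    """WSI (toy version): no senses are known in advance.

    Contexts that share at least one word are merged into one group;
    each resulting group stands for one discovered sense.
    """
    clusters: List[Set[str]] = []
    for ctx in contexts:
        overlapping = [c for c in clusters if c & ctx]
        merged = set(ctx)
        for c in overlapping:
            merged |= c
            clusters.remove(c)
        clusters.append(merged)
    return clusters


def disambiguate(context: Set[str], inventory: Dict[str, Set[str]]) -> str:
    """WSD (toy version): the sense inventory is given.

    Returns the known sense whose signature words overlap most
    with the words of the context.
    """
    return max(inventory, key=lambda sense: len(inventory[sense] & context))


# Hypothetical contexts of the lemma "card".
contexts = [
    {"play", "deck"},
    {"credit", "pay"},
    {"deck", "shuffle"},
    {"pay", "bank"},
]
discovered = induce_senses(contexts)

# Hypothetical, hand-written sense inventory for the WSD step.
inventory = {
    "playing card": {"deck", "play", "shuffle"},
    "payment card": {"pay", "bank", "credit"},
}
chosen = disambiguate({"shuffle", "deck"}, inventory)
```

In the first function the groups emerge from the data and carry no names. In the second function the senses and their names must be supplied beforehand.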

Word sense disambiguation

Lexical Computing offers word sense disambiguation based on our data or the client’s data. Please contact us for more information.

Look at the screenshot showing the collocations and the senses of the lemma card. Sketch Engine identified a total of 4 senses. Clicking on the senses hides or shows the related collocations. Collocations used with several senses and collocations which did not provide sufficient data for the sense to be reliably identified are marked “No sense”.

Click the hotspots in the image for details.

(1) green – the labels suggest that these collocations are related to lemma card when used in the sense of playing cards

(2) purple – collocations related to a card of paper

(3) tick this to hide collocations which could not be assigned to any of the senses

(4) orange – collocations related to a card as a means of payment

(5) blue – collocations related to a card as a computer component
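
The show and hide behaviour described above can be pictured as a simple filter over sense-tagged collocations. The following is a minimal sketch only. The Collocation class, the visible function and the sample data are hypothetical and do not reflect Sketch Engine's code or any API.

```python
from dataclasses import dataclass
from typing import Iterable, List, Set

NO_SENSE = "No sense"  # label used for collocations without a reliable sense


@dataclass(frozen=True)
class Collocation:
    word: str
    sense: str  # a sense label, or NO_SENSE


def visible(
    collocations: Iterable[Collocation],
    active_senses: Set[str],
    hide_unassigned: bool = False,
) -> List[Collocation]:
    """Return the collocations that remain on screen.

    A collocation with a sense is shown only if that sense is switched on.
    Collocations marked NO_SENSE are shown unless hide_unassigned is ticked.
    """
    shown = []
    for c in collocations:
        if c.sense == NO_SENSE:
            if not hide_unassigned:
                shown.append(c)
        elif c.sense in active_senses:
            shown.append(c)
    return shown


# Hypothetical collocates of the lemma "card".
sample = [
    Collocation("deck", "playing cards"),
    Collocation("credit", "payment"),
    Collocation("graphics", "computer component"),
    Collocation("new", NO_SENSE),
]
only_payment = visible(sample, {"payment"}, hide_unassigned=True)
```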

How are senses identified?

A sense model was trained on a large corpus. When the function is activated, the model is applied to the collocations in the word sketch.

In more detail

There is one model per language. The model (an Adaptive Skip-gram model) is trained on word sketch triples and represents the senses as word embeddings. To cluster the collocations, the senses from the model are mapped onto the collocations in the word sketch. Not every collocation receives a sense.
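
One way to picture the mapping step is as a nearest-sense lookup in the embedding space. The sketch below is an assumption for illustration, not the procedure Sketch Engine uses: the two-dimensional vectors are invented, the cosine measure and the min_margin threshold are hypothetical choices, and a real Adaptive Skip-gram model would supply learned, high-dimensional sense vectors.

```python
import math
from typing import Dict, Sequence

NO_SENSE = "No sense"


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def assign_sense(
    colloc_vec: Sequence[float],
    sense_vecs: Dict[str, Sequence[float]],
    min_margin: float = 0.1,  # hypothetical threshold
) -> str:
    """Assign a collocation to its most similar sense.

    If the best sense does not beat the runner-up by at least min_margin,
    the collocation is treated as ambiguous and labelled NO_SENSE.
    """
    scores = sorted(
        ((cosine(colloc_vec, vec), name) for name, vec in sense_vecs.items()),
        reverse=True,
    )
    best_score, best_name = scores[0]
    runner_up = scores[1][0] if len(scores) > 1 else -1.0
    if best_score - runner_up < min_margin:
        return NO_SENSE
    return best_name


# Invented sense vectors for the lemma "card".
senses = {"playing cards": [1.0, 0.0], "payment": [0.0, 1.0]}

clear_case = assign_sense([0.9, 0.1], senses)      # close to one sense only
ambiguous_case = assign_sense([0.5, 0.5], senses)  # equally close to both
```

The margin check mirrors the behaviour described earlier, where collocations used with several senses are marked "No sense" rather than forced into one group.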

The same model is applied to all corpora in the same language, including user corpora. A sense can only be identified if it appears in the training data. For example, archaic senses may be identified incorrectly or not at all because the model was trained on contemporary data.

Unlike all other tools in Sketch Engine, which rely on traditional statistics, word sense induction in the word sketch makes use of AI, specifically the Adaptive Skip-gram model.

Why no sense names?

Automatically naming senses is a problem which has not been solved. This is why Sketch Engine displays words from the same semantic field instead. It is up to the user to infer the sense they relate to.

It seems possible that the latest LLMs (large language models such as ChatGPT) could solve the naming problem, but further research into their reliability is required before this can be implemented in Sketch Engine.
