Automatic identification of word senses
Sketch Engine can identify word senses automatically. This function is part of the Word Sketch tool. Sketch Engine performs word sense induction, not word sense disambiguation. Please see the info box for the difference.
The word sense induction tool categorizes the collocations identified by the word sketch into groups corresponding to the different senses of the word. This makes it possible to display only the collocations related to a particular sense and to investigate how the word is used in that sense alone. Each sense is assigned a colour and the related collocations are colour-coded accordingly.
Sense names
The senses are not named, but labelled. The label consists of a number of words which are representative of the word sense. The user must infer the sense name from the labels.
User data and languages
Word sense induction works in all corpora in the supported languages, i.e. in both user corpora and preloaded corpora. The user does not need to configure the corpus to make it work. It is enough to activate it in the View Options of the word sketch (see below).
At the time of writing this text (January 2024), the supported languages are: English, Spanish, Italian, French, Ukrainian, Czech, Slovak, and Estonian. Additional languages are added as their AI models are computed.
Do not confuse these
There are two terms associated with the topic of automatic identification of word senses:
Word sense induction (WSI)
WSI is a process of discovering the senses which the word may have. The senses are not known at the beginning.
Word sense disambiguation (WSD)
WSD is a process of matching a word used in a concrete context to one of its known senses. The senses of the word are known.
Sketch Engine performs word sense induction, but not word sense disambiguation.
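The difference between the two terms can be illustrated with a toy sketch. It is not how Sketch Engine works internally. The function names, the sense inventory and the example contexts are all hypothetical, and both functions use plain word overlap only to keep the contrast visible: disambiguation needs a list of known senses as input, induction does not.

```python
def disambiguate(context_words, sense_inventory):
    """WSD: pick the KNOWN sense whose signature words overlap most with the context."""
    context = set(context_words)
    best_sense, best_overlap = None, 0
    for sense, signature in sense_inventory.items():
        overlap = len(context & set(signature))
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense  # None if no known sense matches


def induce(contexts):
    """WSI: group contexts that share a word; each group is an unnamed, discovered sense."""
    groups = []  # list of (words seen in the group, indexes of member contexts)
    for index, words in enumerate(contexts):
        words = set(words)
        merged_words, merged_members = set(words), [index]
        remaining = []
        for group_words, members in groups:
            if group_words & words:
                # The new context shares a word with this group: merge them.
                merged_words |= group_words
                merged_members = members + merged_members
            else:
                remaining.append((group_words, members))
        remaining.append((merged_words, merged_members))
        groups = remaining
    return [sorted(members) for _, members in groups]


# Hypothetical contexts of the lemma "card" (illustrative data only).
CONTEXTS = [
    ["shuffle", "deck"],
    ["credit", "pay"],
    ["deck", "deal"],
    ["pay", "debit"],
]

# Hypothetical inventory of known senses, required by WSD but not by WSI.
INVENTORY = {
    "playing": ["deck", "shuffle", "deal"],
    "payment": ["pay", "credit", "debit"],
}

INDUCED = sorted(induce(CONTEXTS))
CHOSEN = disambiguate(["pay", "by", "debit", "card"], INVENTORY)
```

Note that `induce` returns groups of context indexes without names, which mirrors the fact that induced senses are labelled rather than named.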
Word sense disambiguation
Lexical Computing offers word sense disambiguation based on our data or the client’s data. Please contact us for more information.
Look at the screenshot showing the collocations and the senses of the lemma card. Sketch Engine identified a total of 4 senses. Clicking on the senses hides or shows the related collocations. Collocations used with several senses and collocations which did not provide sufficient data for the sense to be reliably identified are marked “No sense”.
The marked areas in the screenshot are:
green – the labels suggest that these collocations are related to the lemma card when used in the sense of playing cards
purple – collocations related to a card of paper
tick this to hide collocations which could not be assigned to any of the senses
orange – collocations related to a card as a means of payment
blue – collocations related to a card as a computer component
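The show/hide behaviour described above can be pictured as a simple filter over sense-tagged collocations. The following is only a sketch under assumptions: the `Collocation` class, the `visible_collocations` function and the example collocates are hypothetical and do not correspond to any Sketch Engine API.

```python
from dataclasses import dataclass

NO_SENSE = "No sense"


@dataclass(frozen=True)
class Collocation:
    word: str
    sense: str  # a sense colour such as "green", or NO_SENSE


def visible_collocations(collocations, hidden_senses=frozenset(), hide_no_sense=False):
    """Return the collocates that stay on screen after senses are toggled off."""
    shown = []
    for item in collocations:
        if item.sense == NO_SENSE:
            # Controlled by the separate "hide unassigned collocations" option.
            if hide_no_sense:
                continue
        elif item.sense in hidden_senses:
            continue
        shown.append(item.word)
    return shown


# Hypothetical collocates of the lemma "card" (illustrative data only).
SKETCH = [
    Collocation("shuffle", "green"),    # playing cards
    Collocation("greeting", "purple"),  # card of paper
    Collocation("debit", "orange"),     # means of payment
    Collocation("graphics", "blue"),    # computer component
    Collocation("new", NO_SENSE),       # could not be assigned to a sense
]

PAYMENT_AND_HARDWARE = visible_collocations(
    SKETCH, hidden_senses={"green", "purple"}, hide_no_sense=True
)
```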
How are senses identified?
A sense model was trained on a large corpus. When the function is activated, the model is applied to the collocations in the word sketch.
In more detail
There is one model per language. The language model (the Adaptive Skip-gram model) is trained on word sketch triples and represents the senses as word embeddings. The senses from the model are mapped onto (some of) the collocations from the word sketch to cluster the collocations.
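As a rough illustration of mapping sense embeddings onto collocations, the sketch below assigns each collocate to the sense vector it is most similar to. It is not the Adaptive Skip-gram model and not Sketch Engine's actual procedure: the vectors are toy values, and the function names and both thresholds (`min_similarity`, `min_margin`) are assumptions made for the example. A collocate falls into "No sense" when no sense is similar enough or when two senses are almost equally similar.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def assign_senses(collocate_vectors, sense_vectors, min_similarity=0.5, min_margin=0.1):
    """Map each collocate to its most similar sense, or to "No sense" if unclear."""
    assignment = {}
    for collocate, vec in collocate_vectors.items():
        ranked = sorted(
            ((cosine(vec, s_vec), name) for name, s_vec in sense_vectors.items()),
            reverse=True,
        )
        best_score, best_sense = ranked[0]
        runner_up = ranked[1][0] if len(ranked) > 1 else -1.0
        too_weak = best_score < min_similarity            # not enough evidence
        ambiguous = best_score - runner_up < min_margin   # fits several senses
        assignment[collocate] = "No sense" if too_weak or ambiguous else best_sense
    return assignment


# Toy three-dimensional embeddings (assumed values, not taken from a real model).
SENSES = {
    "playing card": [1.0, 0.0, 0.0],
    "payment card": [0.0, 1.0, 0.0],
    "computer component": [0.0, 0.0, 1.0],
}
COLLOCATES = {
    "shuffle": [0.9, 0.1, 0.0],
    "debit": [0.1, 0.95, 0.0],
    "graphics": [0.0, 0.1, 0.9],
    "new": [0.5, 0.5, 0.5],  # equally close to every sense
}

RESULT = assign_senses(COLLOCATES, SENSES)
```

In this toy data, "new" is equally similar to all three senses, so the margin test leaves it unassigned, while the other collocates each have one clearly dominant sense.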
The same model is applied to all corpora in the same language, including user corpora. A sense can only be identified if it appears in the training data. This means that, for example, archaic senses may be identified incorrectly or not at all because the model was trained on contemporary data.
Unlike all other tools in Sketch Engine, which use traditional statistics, the word sense induction in the word sketch uses AI, specifically the Adaptive Skip-gram model.
Why no sense names?
Automatically generating names for senses is a problem which has not been solved. This is why Sketch Engine displays words from the same semantic field instead. It is up to the user to infer the sense they relate to.
It seems possible that the latest LLMs (large language models such as ChatGPT) might offer a solution to the naming, but further research into how reliable this would be is required before it can be implemented in Sketch Engine.