Simple maths is the keyness score used in Sketch Engine to identify keywords, terms, key n-grams and key word sketch collocations. It identifies the items that are typical of the corpus or that represent the corpus best. One could say that it identifies the ‘DNA’ of the corpus.
Simple maths compares the frequencies in the focus corpus with the frequencies in the reference corpus. Alternatively, two subcorpora in the same corpus or in different corpora can be used.
The statistic is a variation on “word W is so-and-so many times more frequent in corpus X than in corpus Y”. The formula is:
$$\frac{fpm_{focus} + N}{fpm_{ref} + N}$$
where
$fpm_{focus}$ is the normalized (per million) frequency of the word in the focus corpus,
$fpm_{ref}$ is the normalized (per million) frequency of the word in the reference corpus,
$N$ is the smoothing parameter ($N = 1$ is the default value).
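As a minimal sketch, the score can be written as a small Python function, assuming it has the form (focus frequency per million + N) / (reference frequency per million + N) as described in Kilgarriff (2009). The function names `per_million` and `simple_maths` are illustrative and are not part of any Sketch Engine API; the check that N is positive is an assumption of this sketch.

```python
def per_million(count: int, corpus_size: int) -> float:
    """Normalize a raw count to a frequency per million tokens."""
    if corpus_size <= 0:
        raise ValueError("corpus_size must be positive")
    return count * 1_000_000 / corpus_size


def simple_maths(fpm_focus: float, fpm_ref: float, n: float = 1.0) -> float:
    """Keyness score: (fpm_focus + N) / (fpm_ref + N)."""
    # Assumption of this sketch: N must be positive so that the
    # denominator can never be zero.
    if n <= 0:
        raise ValueError("n must be positive")
    return (fpm_focus + n) / (fpm_ref + n)


# A word absent from the reference corpus still gets a finite score,
# because the smoothing parameter keeps the denominator above zero.
print(simple_maths(fpm_focus=20.0, fpm_ref=0.0))   # 21.0
# Equal relative frequencies in both corpora give a score of 1.
print(simple_maths(fpm_focus=20.0, fpm_ref=20.0))  # 1.0
```

Adding N to both the numerator and the denominator is what allows words with zero frequency in the reference corpus to be scored at all.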
The value of N determines whether the score favours more frequent or less frequent items.
A higher N value shifts the focus towards higher-frequency (more common) words, whereas a lower N value shifts it towards lower-frequency (rarer) words. The value should be changed by orders of magnitude, e.g. 0.1, 1, 10, 100, 1000, 10000. Smaller changes rarely produce any noticeable effect.
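The effect of N can be illustrated with a short Python sketch. The two words and their per-million frequencies below are made up for illustration, and the score is assumed to be (focus frequency + N) / (reference frequency + N).

```python
def simple_maths(fpm_focus, fpm_ref, n=1.0):
    """Keyness score: (fpm_focus + N) / (fpm_ref + N)."""
    return (fpm_focus + n) / (fpm_ref + n)


# Hypothetical per-million frequencies: (focus corpus, reference corpus).
words = {
    "rare_term": (2.0, 0.01),
    "common_word": (5000.0, 1000.0),
}


def ranking(n):
    """Return the words ordered from highest to lowest keyness for a given N."""
    return sorted(words, key=lambda w: simple_maths(*words[w], n=n), reverse=True)


# Step N through orders of magnitude and print the ranking and scores.
for n in (0.1, 1, 10, 100, 1000):
    scores = {w: round(simple_maths(*words[w], n=n), 2) for w in words}
    print(n, ranking(n), scores)
```

With these made-up figures the rare term ranks first at N = 0.1, while the common word ranks first at N = 1000.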
Example
Your focus corpus (BNC): 112,289,776 tokens
Frequency of the lemma (shard) in the corpus: 35
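Relative frequency: $fpm_{focus} = 35 \times 1{,}000{,}000 / 112{,}289{,}776 \approx 0.31$ per million
Selected reference corpus (ukWaC): 1,559,716,979 tokens
Frequency of the lemma (shard) in the corpus: 263
Relative frequency: $fpm_{ref} = 263 \times 1{,}000{,}000 / 1{,}559{,}716{,}979 \approx 0.17$ per million
Keyness score (with the default $N = 1$): $(0.31 + 1) / (0.17 + 1) \approx 1.12$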
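The figures in this example can be checked with a few lines of Python. This is only a sketch: it assumes the default smoothing parameter N = 1 and a score of the form (focus frequency per million + N) / (reference frequency per million + N); the variable names are illustrative.

```python
# Corpus sizes and raw frequencies of the lemma "shard" from the example.
focus_size, focus_count = 112_289_776, 35   # BNC
ref_size, ref_count = 1_559_716_979, 263    # ukWaC
N = 1  # assumed default smoothing parameter

# Normalize the raw counts to frequencies per million tokens.
fpm_focus = focus_count * 1_000_000 / focus_size
fpm_ref = ref_count * 1_000_000 / ref_size

# Keyness score: smoothed ratio of the two relative frequencies.
score = (fpm_focus + N) / (fpm_ref + N)

print(f"{fpm_focus:.2f}")  # 0.31
print(f"{fpm_ref:.2f}")    # 0.17
print(f"{score:.2f}")      # 1.12
```

A score this close to 1 suggests that the lemma is only slightly more typical of the focus corpus than of the reference corpus.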
For more details see:
Adam Kilgarriff. Simple maths for keywords. In Proceedings of Corpus Linguistics Conference CL2009, Mahlberg, M., González-Díaz, V. & Smith, C. (eds.), University of Liverpool, UK, July 2009.
Statistic used in Sketch Engine (Chapter 5). Lexical Computing Ltd., 8 July 2015.