Variation in hit counts | Sketch Engine

It often seems that you get a different hit count for the same search when, for example, you compare the hits for a concordance with the hits in a frequency list.

The usual reason is that the searches were not identical. Even if they were nearly identical, in a large corpus the unusual cases where they differ will have occurred at least a few times.

Explanation 1

The usual sources of confusion:

is the word capitalised?
is the lemma capitalised?
what word class is it?

So: English ‘the’ often occurs at the beginning of a sentence, where it is capitalised, but it is almost always lemmatised in lowercase, so there are fewer hits for [word="the"] than for [lemma="the"].

The correct way to think of this is:

There are five relevant features in most of our major corpora:

word	the word as it appears
lemma	the word as lemmatised by TreeTagger or another tagger we have used
	(if TreeTagger thinks it’s a name, it lemmatises with an initial capital; if not, with an initial lowercase)
lempos	lemma + part of speech: the lemma as lemmatised by TreeTagger, followed by a 2-character suffix whose first character is always a hyphen (-) and whose second is n for a noun, v for a verb, etc.
lc	lowercase version of the word (so if the word is “the”, “The” or “THE”, then lc="the")
lemma_lc	lowercase version of the lemma
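The relationship between the five features can be sketched in a few lines of Python. This is only an illustration: the `Token` class, the `count` helper and the hand-annotated tokens are hypothetical, the one-letter class `x` for "other" is an assumption, and nothing here calls TreeTagger or Sketch Engine.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    word: str   # the token as it appears in the text
    lemma: str  # base form, as a tagger might assign it
    pos: str    # hypothetical one-letter word class: n, v, x, ...

    @property
    def lempos(self) -> str:
        # lemma, a hyphen, then the one-letter word class
        return f"{self.lemma}-{self.pos}"

    @property
    def lc(self) -> str:
        return self.word.lower()

    @property
    def lemma_lc(self) -> str:
        return self.lemma.lower()


def count(tokens, attribute, value):
    """Count tokens whose attribute is exactly equal to value."""
    return sum(1 for t in tokens if getattr(t, attribute) == value)


# Hand-annotated toy text: "The dog saw the Dogs. THE end."
tokens = [
    Token("The", "the", "x"),
    Token("dog", "dog", "n"),
    Token("saw", "see", "v"),
    Token("the", "the", "x"),
    Token("Dogs", "dog", "n"),
    Token("THE", "the", "x"),
    Token("end", "end", "n"),
]

word_hits = count(tokens, "word", "the")    # only the lowercase form
lemma_hits = count(tokens, "lemma", "the")  # every form of the lemma
lc_hits = count(tokens, "lc", "the")        # case-insensitive word match
print(word_hits, lemma_hits, lc_hits)
```

In this toy text the count for word is lower than the count for lemma, which is the same effect as the ‘the’ example above.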

Some queries test a single feature, e.g.

[word="and"]

Others are combinations, e.g.

[word="bruising" & lempos="bruise-v"]

‘Simple’ searches are, like Google searches, simple for the user but not simple from a technical point of view. A simple search for, e.g., ‘and’ is turned into the query

[lc="and" | lemma_lc="and"]

because we believe this is the query least likely to miss concordance lines that users think they ought to have found. (‘|’, called the or-bar, means “or”.)
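As a rough sketch of that expansion, the Python below builds the query string and mimics the or-bar on a handful of tokens. The function names are hypothetical, lowercasing the search term is an assumption, and no escaping of special characters is attempted; this is not how Sketch Engine itself is implemented.

```python
def simple_query(term: str) -> str:
    """Return the CQL that the text says a simple search is turned into."""
    t = term.lower()  # assumption: the search term is lowercased first
    return f'[lc="{t}" | lemma_lc="{t}"]'


def matches_simple(word: str, lemma: str, term: str) -> bool:
    """Mimic the or-bar: true if either lowercased feature equals the term."""
    t = term.lower()
    return word.lower() == t or lemma.lower() == t


# (word, lemma) pairs, annotated by hand for illustration.
tokens = [("And", "and"), ("and", "and"), ("sand", "sand"), ("AND", "and")]

query = simple_query("and")
hits = [word for word, lemma in tokens if matches_simple(word, lemma, "and")]
print(query)
print(hits)
```

Because either side of the or-bar is enough for a match, a simple search can return more lines than a query on any single feature.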

Frequency lists are for exact matches to one of ‘word’, ‘lemma’, ‘lempos’, ‘lc’ or ‘lemma_lc’.
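To see why a frequency list and a simple search can disagree, here is a small Python sketch. The (word, lemma) annotation is invented for illustration, and the simple search is modelled as "lc or lemma_lc equals the term", following the description above.

```python
from collections import Counter

# (word, lemma) pairs; the annotation is made up for illustration.
tokens = [
    ("Bruising", "bruise"),     # verb form at the start of a sentence
    ("bruising", "bruising"),   # tagged as a noun here, so the lemma differs
    ("bruised", "bruise"),
    ("bruising", "bruise"),
    ("bruisings", "bruising"),  # plural noun
]

# Frequency lists: exact matches on one feature at a time.
freq_word = Counter(word for word, _ in tokens)
freq_lc = Counter(word.lower() for word, _ in tokens)
freq_lemma_lc = Counter(lemma.lower() for _, lemma in tokens)

# Simple search: a token is a hit if lc OR lemma_lc equals the term.
term = "bruising"
simple_hits = sum(
    1 for word, lemma in tokens
    if word.lower() == term or lemma.lower() == term
)

print(freq_word[term], freq_lc[term], freq_lemma_lc[term], simple_hits)
```

Each frequency list gives its own figure for ‘bruising’, and the simple search, which accepts a match on either feature, gives a higher one than any of them.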

If you want to explore, use the CQL option for the search box and see the CQL tutorial. All the square-bracketed expressions above are valid CQL. This is the language that the Sketch Engine ‘really’ speaks!

Explanation 2

Some corpora limit the number of concordance lines shown as the result of a query (HARDCUT). If your query has more than HARDCUT results and you ask for frequencies of the headword in the concordance, you will generally get different numbers from those in the word list for the same words.

This is because word lists take the whole corpus into account, while frequencies take only the current concordance into account.
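The effect can be imitated with a toy corpus in Python. The cap value and the assumption that the concordance keeps the first HARDCUT hits are illustrative only; real corpora set their own limit and the actual truncation behaviour may differ.

```python
from collections import Counter

HARDCUT = 5  # hypothetical cap on concordance lines

corpus = [
    "The", "the", "cat", "the", "the", "The", "the", "cat",
    "the", "the", "The", "the", "the", "cat", "The",
]

# Word list: counts over the whole corpus.
word_list = Counter(corpus)

# Concordance for lc="the", cut off after the first HARDCUT hits (assumed).
all_hits = [w for w in corpus if w.lower() == "the"]
concordance = all_hits[:HARDCUT]

# Frequencies computed from the concordance see only the lines that were kept.
concordance_freq = Counter(concordance)

print(len(all_hits), len(concordance))
print(word_list["the"], concordance_freq["the"])
print(word_list["The"], concordance_freq["The"])
```

The word list reports every occurrence, while the frequencies computed from the truncated concordance can never add up to more than HARDCUT.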
