Using case sensitive and case insensitive searches with corpora
This blog post explains how to analyse corpora while either taking into account or ignoring the difference between lowercase and uppercase. In other words, how to use Sketch Engine to:
type wifi and find wifi, WIFI, WiFi and Wifi
OR
type WiFi and only find WiFi but not the other variants
A short introduction to the lowercase attribute is required to fully understand how this can be achieved.
Lowercase
Lowercase is the key concept for case sensitive and case insensitive searches and analysis. When data are uploaded to Sketch Engine to build a corpus, they are automatically converted into several versions. To use the exact terminology, each token is assigned several positional attributes. Attributes can be understood as corpus versions. Each column in the following vertical text represents a positional attribute (a version of the corpus).
The table shows some of the attributes (columns) into which this sentence would be converted:
The Cook Islands weren’t named after a cook but after James Cook who landed on the islands in 1773 to explore the land.
word | lc | lemma | lemma_lc | tag | lempos | lempos_lc |
---|---|---|---|---|---|---|
The | the | the | the | DT | the-x | the-x |
Cook | cook | Cook | cook | NP | Cook-n | cook-n |
Islands | islands | Islands | islands | NP | Islands-n | islands-n |
were | were | be | be | VBD | be-v | be-v |
n’t | n’t | not | not | RB | not-a | not-a |
named | named | name | name | VVN | name-v | name-v |
after | after | after | after | IN | after-i | after-i |
a | a | a | a | DT | a-x | a-x |
cook | cook | cook | cook | NN | cook-n | cook-n |
but | but | but | but | CC | but-c | but-c |
after | after | after | after | IN | after-i | after-i |
James | james | James | james | NP | James-n | james-n |
Cook | cook | Cook | cook | NP | Cook-n | cook-n |
who | who | who | who | WP | who-x | who-x |
landed | landed | land | land | VVD | land-v | land-v |
on | on | on | on | IN | on-i | on-i |
the | the | the | the | DT | the-x | the-x |
islands | islands | island | island | NNS | island-n | island-n |
in | in | in | in | IN | in-i | in-i |
1773 | 1773 | [number] | [number] | CD | [number]-m | [number]-m |
to | to | to | to | TO | to-x | to-x |
explore | explore | explore | explore | VV | explore-v | explore-v |
the | the | the | the | DT | the-x | the-x |
land | land | land | land | NN | land-n | land-n |
. | . | . | . | SENT | .-x | .-x |
The first attribute (column), called word, represents the text in its original form. No transformation is applied. The second attribute (column), called lc, lowercase or word (lowercase), is the same as word but converted into lowercase. All uppercase letters, including those in proper nouns and acronyms, are lowercased (WiFi⇢wifi, WIFI⇢wifi, Paris⇢paris, Hugo⇢hugo, UNESCO⇢unesco). Similarly, lemma_lc and lempos_lc are the lowercased versions of the respective attributes. A separate blog post, listed under See also, explains all the different positional attributes.
The point of the lowercased attributes is to allow case insensitive searches and analysis when uppercase and lowercase variants of a token should be treated as the same thing.
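As a rough illustration of how the lowercased attributes relate to the original ones, the sketch below derives lc and lemma_lc from word and lemma for a few tokens of the example sentence. The Token class is a hypothetical name used only for this illustration; Sketch Engine computes these attributes itself when the corpus is compiled, and its internal implementation may differ.

```python
from dataclasses import dataclass


@dataclass
class Token:
    """One row of the vertical text; only two attributes are modelled here."""
    word: str
    lemma: str

    @property
    def lc(self) -> str:
        # lc is the word attribute converted to lowercase
        return self.word.lower()

    @property
    def lemma_lc(self) -> str:
        # lemma_lc is the lemma attribute converted to lowercase
        return self.lemma.lower()


# A few tokens taken from the table above
tokens = [
    Token("The", "the"),
    Token("Cook", "Cook"),
    Token("Islands", "Islands"),
    Token("cook", "cook"),
]

for t in tokens:
    print(t.word, t.lc, t.lemma, t.lemma_lc, sep="\t")
```

In the lowercased columns, the proper noun Cook and the common noun cook become indistinguishable, which is exactly what a case insensitive search relies on.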
How to switch to case insensitive
There are two ways to switch a tool in Sketch Engine to case insensitive mode.
Option 1
Many tools have a case sensitivity switch, often found on the ADVANCED tab, not the SIMPLE tab.
Option 2
Some tools do not have the switch but the user can select the required attribute directly.
Selecting a lowercased attribute makes the tool compute its statistics in a case insensitive way. This means that the uppercase and lowercase versions of the same token will be counted together.
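The effect on frequencies can be sketched in a few lines of Python. The token list is made-up sample data, and the counting is only an approximation of what a tool does when a lowercased attribute is selected.

```python
from collections import Counter

# Values of the word attribute for a handful of tokens (made-up sample data)
words = ["WiFi", "wifi", "WIFI", "Wifi", "wifi", "router"]

# Case sensitive: counting the word attribute keeps every spelling separate
by_word = Counter(words)

# Case insensitive: counting the lc attribute merges the spellings
by_lc = Counter(w.lower() for w in words)

print(by_word)
print(by_lc)
```

Counting on word keeps four separate entries for the WiFi spellings, while counting on lc merges them into a single wifi entry.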
Typing words
When lowercase is selected, the input form will automatically adjust the input to match the setting. All of these options:
WIFI
WiFi
Wifi
wifi
will be lowercased first and will produce the same result as typing wifi.
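A minimal sketch of this input adjustment, assuming the interface simply lowercases the typed text whenever a lowercased attribute is in use. The function name normalise_query and the LOWERCASED_ATTRIBUTES set are hypothetical and not part of Sketch Engine.

```python
# Attributes whose values contain no uppercase letters
LOWERCASED_ATTRIBUTES = {"lc", "lemma_lc", "lempos_lc"}


def normalise_query(query: str, attribute: str) -> str:
    """Lowercase the typed text when a lowercased attribute is selected."""
    if attribute in LOWERCASED_ATTRIBUTES:
        return query.lower()
    # Case sensitive attributes keep the input exactly as typed
    return query


for typed in ["WIFI", "WiFi", "Wifi", "wifi"]:
    print(typed, "->", normalise_query(typed, "lc"))
```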
Tools in detail
Certain tools and operations have a predefined attribute to work with and the user cannot change it. This is how individual tools behave with regard to case sensitivity:
Word sketch
The word sketch always uses a predefined attribute, typically the lempos. The attribute is defined in the sketch grammar. The user cannot change the attribute on the fly. Word sketches are precalculated during compilation and changing the attribute would require recalculation. For user corpora, the user can write their own sketch grammar that uses a different attribute.
With lempos, apple will produce different collocations from Apple. Combined collocations for Apple and apple cannot be displayed.
With lempos_lc (only possible if the user writes their own sketch grammar based on this attribute), apple produces combined collocations for both apple and Apple. Typing Apple will not produce any results because lempos_lc does not contain any lemmas starting with an uppercase letter.
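The difference between the two attributes can be illustrated with a toy index built from the three Cook/cook tokens of the example sentence. This is only a sketch: build_index is a hypothetical helper, token positions are counted from zero, and real word sketches store collocations rather than bare positions.

```python
from collections import defaultdict

# (position, lempos) pairs for three tokens of the example sentence
lempos_column = [(1, "Cook-n"), (8, "cook-n"), (12, "Cook-n")]


def build_index(pairs, lowercase=False):
    """Map each attribute value to the positions where it occurs."""
    index = defaultdict(list)
    for position, value in pairs:
        key = value.lower() if lowercase else value
        index[key].append(position)
    return dict(index)


lempos_index = build_index(lempos_column)
lempos_lc_index = build_index(lempos_column, lowercase=True)

print(lempos_index.get("Cook-n", []))     # the proper noun only
print(lempos_lc_index.get("cook-n", []))  # all three occurrences combined
print(lempos_lc_index.get("Cook-n", []))  # empty: no uppercase keys exist
```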
Word sketch difference
The information for the word sketch above applies to word sketch difference too.
Thesaurus
The thesaurus is based on comparing word sketches and therefore always uses the same attribute as the word sketch. To change the attribute for the thesaurus, the attribute for word sketch should be changed in the sketch grammar.
With the attribute set to lempos, apple and Apple will produce different lists of synonyms.
Concordance and Parallel concordance
Simple search searches simultaneously in several attributes, typically word, lowercase and lemma. For user corpora, this can be changed in the corpus configuration file.
Other searches use the switch to activate the use of lowercased attributes.
CQL search – the attribute is set individually for each token.
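Because CQL names the attribute inside each token, case sensitive and case insensitive conditions can be mixed in one query. The sketch below builds such a query as a string; cql_token is a hypothetical helper, and it does not escape special characters in the value.

```python
def cql_token(attribute: str, value: str) -> str:
    """Return one CQL token condition such as [lc="wifi"]."""
    return f'[{attribute}="{value}"]'


# Case sensitive first token followed by a case insensitive second token
query = cql_token("word", "James") + " " + cql_token("lc", "cook")
print(query)  # [word="James"] [lc="cook"]
```

Here the first token must be spelled exactly James, while the second matches cook, Cook or COOK because it is compared against the lc attribute.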
Concordance result screen
The tools for working with the concordance result, located in the toolbar above the concordance lines, contain the BASIC and ADVANCED tabs. The latter contains either the switch or the attribute selector for switching between case sensitive and case insensitive.
Wordlist
On the ADVANCED tab, use the case sensitivity switch or select the required attribute. Additional options such as starting with/containing/ending with must match the selected attribute. This is how the combinations of settings affect the result:
input (starting with, containing, ending with or from this list) | attribute | case insensitive | result | note |
---|---|---|---|---|
apple | word | ☐ | apple | will not find Apple
Apple | word | ☐ | Apple | will not find apple
apple | word | ☑ | apple | results include both Apple and apple but are displayed as apple
Apple | word | ☑ | apple | as above; the interface will lowercase the input
Use Display as to display a different result. For example, these criteria:
input | attribute | case insensitive | result | note |
---|---|---|---|---|
apple | word | ☑ | apple | results include both Apple and apple but are displayed as apple
normally count Apple and apple together and display them as apple. Set Display as: to word to display a separate result for Apple and for apple.
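The interplay between the search criteria and Display as can be approximated as follows. This is a simplified sketch under stated assumptions: the wordlist function, its parameters and the sample tokens are all hypothetical, only the starting with criterion is modelled, and the real tool works on the compiled corpus rather than a Python list.

```python
from collections import Counter


def wordlist(words, starting_with, ignore_case, display_as=None):
    """Count matching tokens; optionally display them in their original form."""
    if ignore_case:
        # The interface lowercases the input in case insensitive mode
        starting_with = starting_with.lower()
    counts = Counter()
    for w in words:
        candidate = w.lower() if ignore_case else w
        if not candidate.startswith(starting_with):
            continue
        # By default a result is shown in the form that was matched;
        # display_as="word" shows the original spelling instead
        shown = w if display_as == "word" else candidate
        counts[shown] += 1
    return counts


sample = ["Apple", "apple", "apple", "pear"]  # made-up tokens

print(wordlist(sample, "apple", ignore_case=False))
print(wordlist(sample, "Apple", ignore_case=True))
print(wordlist(sample, "apple", ignore_case=True, display_as="word"))
```

The first call counts only apple, the second merges all three occurrences under apple, and the third finds the same three tokens but reports Apple and apple separately.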
N-grams
On the ADVANCED tab, use the case sensitivity switch or select the required attribute.
Keywords and terms
Keywords
By default, the word attribute is used. It can be changed on the ADVANCED tab.
Terms
The attribute is defined in the term grammar. It is usually the lemma and cannot be changed by the user.
Trends
The attribute can be selected on both the BASIC and ADVANCED tabs.
See also
Words, tags, lemmas, lemposes, lowercase – explanation of all attributes in the corpus
POS tags – explanation of part-of-speech tags