Regular expressions are a convention in which certain characters stand in for unspecified letters or numbers. They are used to set criteria for strings of characters, e.g. words or tags, that share a common pattern, e.g. start the same way, end the same way or contain certain characters.
Regular expressions are used mainly inside CQL, in word lists and n-grams.
This page gives only a few basic examples. For more, please refer to Wikipedia, try our regular expressions exercises or this interactive course.
Wild cards
Wild cards are not regular expressions but users know them from other software. They are only supported in the simple concordance search.
Using wild cards in simple concordance search
Only in simple concordance search, the asterisk (*), question mark (?) and double dashes (--) can be used like this:
asterisk (*) stands for zero or more characters
test* will find
test, tests, tested, testing
c*t will find
CT, cut, cat, craft, construct
question mark (?) stands for exactly 1 character
test? will find
tests, Testa, testy
but will not find
test
c?t will find these lemmas
cat, cut
BUT! simple search always treats each search word as a lemma, thus c?t will search for the lemmas cut, cat and cot. These lemmas will produce results which include all word forms. The final concordance will thus show: cut, cutting, cat, cats, cot, cots, etc.
To search for the asterisk and question mark, use a backslash (\), such as \* and \?
double dashes (--) stand for a dash, a space or no character
multi--million will find
multi-million, multi million, multimillion
vertical bar (|) stands for OR
cat|dog|horse will find
cat, dog, horse
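The wild card rules above can be pictured as a translation into an ordinary regular expression. The Python sketch below is only an illustration and not how the simple search is implemented: the function names `wildcard_to_regex` and `wildcard_matches` are hypothetical, and lemmatisation, letter case and backslash-escaped wild cards are left out.

```python
import re

def wildcard_to_regex(query: str) -> str:
    """Translate a wild card query into a regex string (illustration only)."""
    parts = []
    i = 0
    while i < len(query):
        if query.startswith("--", i):
            parts.append("[- ]?")   # a dash, a space or nothing
            i += 2
        elif query[i] == "*":
            parts.append(".*")      # zero or more characters
            i += 1
        elif query[i] == "?":
            parts.append(".")       # exactly one character
            i += 1
        elif query[i] == "|":
            parts.append("|")       # OR between alternatives
            i += 1
        else:
            parts.append(re.escape(query[i]))  # any other character stands for itself
            i += 1
    return "".join(parts)

def wildcard_matches(query: str, text: str) -> bool:
    # fullmatch requires the whole text to fit the pattern
    return re.fullmatch(wildcard_to_regex(query), text) is not None

print(wildcard_matches("test?", "tests"))                  # True
print(wildcard_matches("test?", "test"))                   # False
print(wildcard_matches("multi--million", "multi million")) # True
print(wildcard_matches("cat|dog|horse", "dog"))            # True
```

The translation also shows how the two notations differ: the wild card * corresponds to the regular expression .* and the wild card ? corresponds to a single dot.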
Regular expressions
Regular expressions (not wildcards!) are used in all the other concordance searches, in CQL to specify patterns for values and with wordlists to only include/exclude certain types of items.
Regular expressions and CQL
Regular expressions are used in CQL to specify patterns for values.
[word = “dis.*“] [tag = “V.*“] finds words beginning dis- followed by a verb
[tag=”J.*“] [word=”[[:upper:]]*“] finds adjectives followed by an acronym (=word in capitals)
To copy & paste, use these:
[word = "dis.*"] [tag = "V.*"]
[tag="J.*"] [word="[[:upper:]]*"]
Spaces in CQL and regular expressions
Spaces are used in CQL to make the code easier to read for the human eye. The use of spaces in CQL does not have any effect on the result.
In regular expressions, a space refers to a real space, e.g. space between two words. Since CQL criteria are set for individual tokens separately, the use of a space is generally a mistake and will not produce the required result.
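One way to picture why a space inside a regular expression rarely helps is to check each condition against one token at a time. The Python sketch below does this with a made-up list of (word, tag) pairs. The tokens, the tags and the function name `find_pairs` are invented for the illustration, and the corpus engine itself is not claimed to work this way.

```python
import re

# A made-up tokenised text: one (word, tag) pair per token.
tokens = [("they", "PP"), ("dislike", "VV"), ("disco", "NN"), ("dancing", "VVG")]

def find_pairs(word_regex, tag_regex):
    """Find adjacent tokens where the first word and the second tag match."""
    hits = []
    for (word, _), (next_word, next_tag) in zip(tokens, tokens[1:]):
        if re.fullmatch(word_regex, word) and re.fullmatch(tag_regex, next_tag):
            hits.append((word, next_word))
    return hits

# Similar in spirit to [word = "dis.*"] [tag = "V.*"]
print(find_pairs(r"dis.*", r"V.*"))          # [('disco', 'dancing')]

# No single token contains a space, so this pattern cannot match any token.
print(find_pairs(r"disco dancing", r".*"))   # []
```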
CQL tutorial – introduction to corpus query language
Regex exercise
Learn regular expressions with our regex online tutorial.
dot ‘ . ‘
A dot stands for a single unspecified character.
regular expression | matching result(s) |
---|---|
w.n | win won wen wun wan |
ca. | cat car cap cab can |
question mark ‘ ? ‘
A question mark stands for zero or 1 occurrence of the preceding character
regular expression | matching result(s) |
---|---|
be?t | bt bet (but will not find beet beeet beeeet) |
bet? | be bet (but will not find bets betting) |
.?at | at hat bat cat mat (zero or one unspecified character at the beginning) |
asterisk ‘ * ‘
An asterisk stands for zero or more occurrences of the preceding character.
regular expression | matching result(s) |
---|---|
co*l | cl col cool coool cooool |
hallo* | hall hallo halloo hallooo halloooo |
c.*ing | words starting with c- and ending with -ing (i.e. having any number of unspecified characters between c and ing) cycling camping cutting cooking contemplating |
*ool | produces error, no character precedes the asterisk |
c.* | word beginning with the letter c (c is followed by any number of any character) |
.*ed | word ending with -ed (the word starts with any number of any character) |
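The dot, question mark and asterisk examples can be tried out in Python. This is a sketch under two assumptions: Python's `re` module is a different regular expression engine from the one behind CQL, and `re.fullmatch` is used so that a pattern has to cover the whole word. The helper name `whole_word_matches` is made up.

```python
import re

def whole_word_matches(pattern, words):
    """Return the words that the pattern covers from start to end."""
    return [w for w in words if re.fullmatch(pattern, w)]

print(whole_word_matches(r"w.n", ["win", "won", "wn", "wine"]))       # ['win', 'won']
print(whole_word_matches(r"be?t", ["bt", "bet", "beet"]))             # ['bt', 'bet']
print(whole_word_matches(r"co*l", ["cl", "col", "cool", "coal"]))     # ['cl', 'col', 'cool']
print(whole_word_matches(r"c.*ing", ["cycling", "cooking", "king"]))  # ['cycling', 'cooking']

# A quantifier with nothing before it is rejected, as with *ool above.
try:
    re.compile(r"*ool")
    star_first_is_valid = True
except re.error as exc:
    star_first_is_valid = False
    print("invalid pattern:", exc)
```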
range ‘ [ ] ‘
use square brackets to specify a list or range
[bmpg] stands for b OR m OR p OR g
[a-d] stands for a letter between a and d
[3-5] stands for a digit between 3 and 5
regular expression | matching result(s) |
---|---|
[mpgb]et | met pet get bet |
m[2-5] | m2 m3 m4 m5 |
m[2-5]* | m m22 m52 m3425 m23453234 m222345 (m followed by zero or more digits between 2 and 5) |
not ‘ ^ ‘
use ^ to indicate that the character(s) should not be included. The ^ and the characters have to be enclosed in square brackets, with ^ placed first
regular expression | matching result(s) |
---|---|
[^m]et | pet get bet let (but will not find met) |
[^mpg]et | set let (but will not find met pet get) |
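The bracket examples, with and without ^, can be compared side by side. The sketch uses Python's `re` module for illustration only, with `re.fullmatch` so that the whole word has to fit. The word list is made up.

```python
import re

words = ["met", "pet", "get", "bet", "set", "let", "et", "meet"]

# [mpgb] accepts exactly one of the listed characters
listed = [w for w in words if re.fullmatch(r"[mpgb]et", w)]
# [^mpg] accepts exactly one character that is NOT m, p or g
excluded = [w for w in words if re.fullmatch(r"[^mpg]et", w)]

print(listed)    # ['met', 'pet', 'get', 'bet']
print(excluded)  # ['bet', 'set', 'let']

# A range followed by * allows zero or more characters from the range
in_range = bool(re.fullmatch(r"m[2-5]*", "m3425"))
out_of_range = bool(re.fullmatch(r"m[2-5]*", "m26"))
print(in_range, out_of_range)  # True False (6 is outside 2-5)
```

Note that [^mpg] still stands for one character, which is why the bare "et" is not matched.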
letters and digits
letters can be specified by a range or by character class
regular expression | matching result(s) |
---|---|
[A-Z] | finds any upper-case character (of the English alphabet, not characters such as é í č ß etc.) |
[a-z] | finds any lowercase character (of the English alphabet) |
[A-Za-z]* | finds any word consisting of upper-case and lowercase characters (of the English alphabet) |
[[:alpha:]]* | finds a word consisting of letters of any alphabet including accented characters and special characters, see character classes further below |
\d stands for a digit, i.e. characters 0-9, \D stands for any non-digit character
regular expression | matching result(s) |
---|---|
b\d | b1 b2 b3 b4 |
b\d* | b b1 b12 b89 b43958 (zero or more digits after b) |
\d\db | 58b 46b 89b (b preceded by two digits) |
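The \d examples can be checked in Python, with one caveat. This page defines \d as the characters 0-9, while Python's `re` module, used here only for illustration, also accepts other Unicode decimal digits unless the `re.ASCII` flag is given. The token list is made up.

```python
import re

# The last token is "b" followed by the Arabic-Indic digit three (U+0663).
tokens = ["b1", "b", "b43958", "58b", "b\u0663"]

unicode_hits = [t for t in tokens if re.fullmatch(r"b\d*", t)]
ascii_hits = [t for t in tokens if re.fullmatch(r"b\d*", t, flags=re.ASCII)]
two_digits_then_b = [t for t in tokens if re.fullmatch(r"\d\db", t)]

print(len(unicode_hits))   # 4, the non-ASCII digit is accepted by default
print(ascii_hits)          # ['b1', 'b', 'b43958']
print(two_digits_then_b)   # ['58b']
```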
character classes
Character classes are special codes used to refer to a group of characters.
character class | meaning |
---|---|
[[:alpha:]] | any letter, including accented and special characters; for English only, the equivalent is [A-Za-z] |
[[:digit:]] | any digit, equivalent to [0-9] or \d |
[[:alnum:]] | any alphanumeric character; for English only, the equivalent is [0-9A-Za-z] |
[[:lower:]] | any lower-case character; for English only, the equivalent is [a-z] |
[[:upper:]] | any upper-case character; for English only, the equivalent is [A-Z] |
[[:punct:]] | punctuation, such as !"#$%&'()*+,-./:;<=>?@[\]^_`{}~ and the vertical bar |
[[:space:]] | whitespace character (space, new line, tab, carriage return) |
Example:
[[:alpha:]]* finds all words composed of letters
[[:alpha:]][[:alnum:]]* finds all words starting with a letter and otherwise composed of letters and digits, e.g. H2SO4 but not 4you
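Not every regular expression engine understands the [[:alpha:]] notation. Python's `re` module, for example, does not, so the sketch below uses rough counterparts built from \W, \d and _ instead. These are approximations chosen for the illustration, not exact equivalents of the character classes above, and the names `ALPHA` and `ALNUM` are made up.

```python
import re

# Rough Python counterparts (approximations only).
ALPHA = r"[^\W\d_]"   # a letter of any alphabet, similar to [[:alpha:]]
ALNUM = r"[^\W_]"     # a letter or a digit, similar to [[:alnum:]]

# Similar in spirit to [[:alpha:]][[:alnum:]]*
starts_with_letter = re.compile(f"{ALPHA}{ALNUM}*")

results = {}
for token in ["H2SO4", "4you", "čaj", "semi-final"]:
    results[token] = bool(starts_with_letter.fullmatch(token))
    print(token, results[token])
```

In this sketch H2SO4 and čaj are accepted, 4you is rejected because it starts with a digit, and semi-final is rejected because the hyphen is neither a letter nor a digit.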
or ‘ | ‘
the pipe | is used to indicate OR
regular expression | matching result(s) |
---|---|
get|met | will find lines which contain the word get OR the word met |
plus ‘+’
the plus stands for ‘one or more repetitions of the preceding character’
regular expression | matching result(s) |
---|---|
hallo+ | hallo halloo hallooo hallooooooooo (but not hall) |
.+at | bat, great, format, cat (but not ‘at’; to include ‘at’, use .*at) |
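The difference between the plus and the asterisk shows up only when nothing precedes the fixed part. A short Python check follows, using `re.fullmatch` from Python's own `re` module as a stand-in for whole-word matching.

```python
import re

comparison = {}
for word in ["at", "bat", "great"]:
    plus = bool(re.fullmatch(r".+at", word))   # needs at least one character before "at"
    star = bool(re.fullmatch(r".*at", word))   # also accepts nothing before "at"
    comparison[word] = (plus, star)
    print(f"{word}: .+at={plus} .*at={star}")
```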
case sensitivity switch (?i)
Regular expressions are always case sensitive, i.e. Bill is different from bill. To make the whole regular expression case insensitive, put these four characters at the beginning: (?i)
regular expression | matching result(s) |
---|---|
(?i)monday | Monday monday MONDAY |
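Python's `re` module happens to accept the same (?i) switch at the start of a pattern, so the effect can be tried there. The list of forms is made up, and `re.fullmatch` is used so that the whole word has to fit.

```python
import re

forms = ["Monday", "monday", "MONDAY", "Mondays"]

case_sensitive = [f for f in forms if re.fullmatch(r"monday", f)]
case_insensitive = [f for f in forms if re.fullmatch(r"(?i)monday", f)]

print(case_sensitive)    # ['monday']
print(case_insensitive)  # ['Monday', 'monday', 'MONDAY']
```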
repetition { }
use curly brackets to indicate repetition of the preceding character
regular expression | matching result(s) |
---|---|
halo{3} | halooo (exactly 3 repetitions of the letter o) |
hallo{2,4} | halloo hallooo halloooo (from 2 to 4 repetitions of the letter o) |
.{6} | anyone bottle (words consisting of any 6 characters; it is equivalent to typing 6 dots ...... ) |
[a-z]{4,} | bake mother corporation (words consisting of 4 or more letters) |
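The repetition counts can be verified with a few pattern and word pairs. This is a Python sketch for illustration: `re.fullmatch` stands in for whole-word matching and the pairs are chosen to show one success or failure per rule.

```python
import re

checks = [
    (r"halo{3}", "halooo"),      # exactly three o
    (r"hallo{2,4}", "halloo"),   # two o, inside the 2-4 range
    (r"hallo{2,4}", "hallo"),    # one o, too few
    (r".{6}", "bottle"),         # any six characters
    (r"......", "bottle"),       # six dots, same meaning as .{6}
    (r"[a-z]{4,}", "cat"),       # three letters, fewer than four
]

outcomes = [bool(re.fullmatch(pattern, word)) for pattern, word in checks]
for (pattern, word), ok in zip(checks, outcomes):
    print(pattern, word, ok)
```

The printed results are True, True, False, True, True and False, in that order.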
grouping ( )
any part of a regular expression can be surrounded by parentheses to make it a single unit onto which other regular expressions can be applied
regular expression | matching result(s) |
---|---|
(dis)?connect | connect disconnect (question mark makes the preceding element ‘(dis)’ optional) |
(bla){3,4} | blablabla blablablabla |
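What the parentheses change is the unit that the following ? or { } applies to. A small Python comparison follows, using Python's `re` module only as an illustration.

```python
import re

# Without parentheses the ? applies only to the preceding character (s);
# with them it applies to the whole group (dis).
without_group = bool(re.fullmatch(r"dis?connect", "connect"))
with_group = bool(re.fullmatch(r"(dis)?connect", "connect"))
print(without_group, with_group)   # False True

# The repetition count applies to the whole group (bla).
three_times = bool(re.fullmatch(r"(bla){3,4}", "blablabla"))
two_times = bool(re.fullmatch(r"(bla){3,4}", "blabla"))
print(three_times, two_times)      # True False
```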
escaping
To search for the characters . ? * which already have a special function in regular expressions, put a backslash in front of them. This is called escaping (e.g. you have to escape a question mark). The characters $ and # in part-of-speech tags also have to be escaped.
regular expression | matching result |
---|---|
. | a b c d e f g h etc. (any single character) |
\. | . |
ok? | o ok (question mark makes the preceding character optional) |
ok\? | ok? |
\ | produces error, backslash escapes the following character but no such character exists |
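Escaping can be tried out in Python, whose `re` module follows the same backslash convention. The sketch also uses `re.escape`, a Python function that adds the backslashes automatically. The tag NP$ is only a sample value.

```python
import re

unescaped = bool(re.fullmatch(r"ok?", "ok?"))    # ? makes k optional
escaped = bool(re.fullmatch(r"ok\?", "ok?"))     # \? is a literal question mark
print(unescaped, escaped)                        # False True

# re.escape puts a backslash before every special character in a string.
escaped_tag = re.escape("NP$")
print(escaped_tag)                               # NP\$

# A lone backslash has nothing to escape and is rejected.
try:
    re.compile("\\")
    lone_backslash_is_valid = True
except re.error as exc:
    lone_backslash_is_valid = False
    print("invalid pattern:", exc)
```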
not starting with ‘ ?! ‘
Use ?! to say “not starting with”, also called negative lookahead. The brackets around ?! and the excluded beginning are required. The brackets have to be followed by a regular expression defining what the token should consist of. Use .* for any token. Use ... (three dots) for 3-character tokens. Use [[:upper:]]* for tokens consisting of uppercase characters, etc.
regular expression | matching result(s) |
---|---|
(?!NP).* | all POS tags not starting with NP |
(?!th)... | all 3-character words not starting with “th” |
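Negative lookahead is also available in Python's `re` module, which makes it easy to experiment with. In the sketch below the tags and words are made-up samples, and `re.fullmatch` is used so that the part after the brackets has to cover the whole token.

```python
import re

tags = ["NP", "NPS", "NN", "VVD", "JJ"]   # sample tags, made up for this sketch
not_np = [t for t in tags if re.fullmatch(r"(?!NP).*", t)]
print(not_np)         # ['NN', 'VVD', 'JJ']

words = ["the", "cat", "thy", "this", "dog"]
# Three dots: exactly three characters, and the first two must not be "th".
three_not_th = [w for w in words if re.fullmatch(r"(?!th)...", w)]
print(three_not_th)   # ['cat', 'dog']
```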
backreferences
Since manatee 2.65, it is possible to place brackets around one or several parts of a regular expression and refer to those parts later. The first part in brackets is referred to with number 1, the second with number 2, etc. This only works within one token, e.g. [word="(ba).*\1.*"] to find baseball, basketball, etc. The N-grams tool also supports backreferences across different tokens, e.g. (.*) or \1 to find occurrences such as may or may, do or do, etc.
regular expression | matching result(s) |
---|---|
(abra)kad\1 (the number must be escaped) | abrakadabra |
(a)(b)(c)\3\2\1 | abccba |
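Backreferences can be tested in Python as well. This sketch uses Python's `re` module rather than the corpus engine, and the last two checks treat a phrase as one plain string, which only imitates a multi-token search. Raw strings (the r prefix) keep Python from interpreting \1 itself.

```python
import re

single = bool(re.fullmatch(r"(abra)kad\1", "abrakadabra"))
reversed_order = bool(re.fullmatch(r"(a)(b)(c)\3\2\1", "abccba"))
print(single, reversed_order)    # True True

# A backreference repeats the text that was captured, not the pattern.
same_word = bool(re.fullmatch(r"(.*) or \1", "may or may"))
other_word = bool(re.fullmatch(r"(.*) or \1", "may or do"))
print(same_word, other_word)     # True False
```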