word

Note: This entry is for the type of token. For the positional attribute, see word form.

A word is a type of token. All tokens in a corpus are divided into two groups: words and nonwords. Words are tokens that begin with a letter of the alphabet. Tokens such as book, working, Mary, T-shirt, post-1945, mp3 or CO2 are words because they start with a letter.

In non-alphabetic scripts, a word is a token that starts with a kanji, kana, hanzi or hanja.

The regular expression Sketch Engine uses to identify words is [[:alpha:]].*
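
The rule can be illustrated with a short Python sketch. This is only an approximation, not Sketch Engine's own implementation: Python's re module does not support POSIX classes such as [[:alpha:]], so the sketch assumes that str.isalpha() on the first character is a close enough stand-in. The function name is_word and the sample token list are hypothetical.

```python
def is_word(token: str) -> bool:
    """Approximate [[:alpha:]].* : the token's first character must be a letter.

    Assumption: str.isalpha() stands in for the POSIX [[:alpha:]] class;
    the two may disagree on some characters.
    """
    return bool(token) and token[0].isalpha()


# Hypothetical sample tokens, partly taken from the examples in this entry.
tokens = ["book", "T-shirt", "post-1945", "mp3", "CO2", "1945", ",", "3D"]

words = [t for t in tokens if is_word(t)]
nonwords = [t for t in tokens if not is_word(t)]

print(words)     # tokens that start with a letter
print(nonwords)  # everything else
```

Only the first character is tested, which is why post-1945 and CO2 count as words while 1945 and 3D do not. Under the same assumption, a token that starts with a kanji or kana character also passes, since str.isalpha() treats those characters as letters.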

Compare to nonword.

See also token.
