Tokenization

Before a corpus can be searched or analyzed, its text must first be divided into individual tokens. Tokenization is the automatic process of dividing text into tokens, and it is performed by tools called tokenizers.
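
As a minimal sketch of what a tokenizer does, the Python function below splits text into word tokens and standalone punctuation marks using a regular expression; this simple rule-based approach is an illustrative assumption, not the method of any particular tokenizer.

```python
import re

def tokenize(text):
    """Split text into tokens: runs of word characters,
    or single non-space punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

print(tokenize("The corpus isn't ready yet."))
# ['The', 'corpus', 'isn', "'", 't', 'ready', 'yet', '.']
```

Even this tiny example shows a design decision every tokenizer must make: here the contraction "isn't" is broken into three tokens, whereas other tokenization rules might keep it whole or split it as "is" + "n't".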