Text tokenization

Tokenization converts plain text into a sequence of numerical values, which machine learning models require as input. The text tokenizer performs this conversion as the first step of text processing.

How does text tokenization work?

  1. The tokenizer splits the input text into small pieces, called tokens.
    There can be more tokens than words if parts of a word (like prefixes and suffixes) are more common than the word itself.

  2. Words may be split into several tokens. For example, when a model sees the word "wordlessness", it might represent it with the tokens "word", "less", "ne", "ss".

  3. Rare words that the model was not trained on, such as Peltarion, are represented with smaller pieces: "Pel", "ta", "r", "ion".

  4. If the tokenizer is case sensitive, words with uppercase and lowercase letters produce different tokens: Peltarion could be tokenized differently from peltarion.

  5. The sequence length is enforced by truncating or padding the sequence of tokens.
    Longer sequences take longer to process, but shorter sequences may cut off the end of the original text.

  6. Special tokens that may be required by some NLP models are added, and every token is then replaced with an integer value, known as its token ID.

  7. The sequence of integers is ready to be processed by the core of a language processing block, such as Multilingual BERT, Sentence XLM-R, or the Universal sentence encoder.
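The word-splitting behavior in steps 1–4 can be sketched with a greedy longest-match subword tokenizer, similar in spirit to WordPiece. The vocabulary below is a made-up example (real vocabularies have tens of thousands of entries, and WordPiece marks word-internal pieces with a "##" prefix, omitted here for simplicity):

```python
# Toy vocabulary of subword pieces. A real tokenizer's vocabulary is
# learned from a training corpus; this one is hand-picked for the demo.
VOCAB = {"word", "less", "ne", "ss", "Pel", "ta", "r", "ion"}

def tokenize(word, vocab):
    """Split a word into subword tokens by greedy longest-match."""
    tokens = []
    start = 0
    while start < len(word):
        # Try the longest possible piece first, shrinking until a match.
        for end in range(len(word), start, -1):
            piece = word[start:end]
            if piece in vocab:
                tokens.append(piece)
                start = end
                break
        else:
            # No piece matches at this position: the word is unknown.
            return ["[UNK]"]
    return tokens

print(tokenize("wordlessness", VOCAB))  # ['word', 'less', 'ne', 'ss']
print(tokenize("Peltarion", VOCAB))     # ['Pel', 'ta', 'r', 'ion']
```

Because "wordless" is not in this toy vocabulary but "word" and "less" are, the common pieces win, which matches the behavior described in step 1.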

Different NLP models use different tokenizers, which split and convert text in slightly different ways. Sentence XLM-R uses SentencePiece under the hood, while English BERT uses WordPiece.
