Text tokenization

Tokenization converts plain text into a sequence of numerical values that machine-learning models can process. The text tokenizer performs this conversion as the first step of text processing.

How does text tokenization work?

  1. The tokenizer splits the input text into small pieces, called tokens.
    There can be more tokens than words if parts of a word (like prefixes and suffixes) are more common than the word itself.

  2. The sequence length is enforced by truncating or padding the sequence of tokens.
    Longer sequences take longer to process, but shorter sequences may ignore the end of the original text.

  3. Special tokens that may be required by some NLP models are added, and every token is then replaced with an integer value.

  4. The sequence of integers is ready to be processed by the core of a language processing block, such as Multilingual BERT, Sentence XLM-R, or the Universal sentence encoder.
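The four steps above can be sketched in a few lines of Python. The vocabulary, the hard-coded subword split, and the sequence length are all illustrative assumptions, not the output of any real tokenizer:

```python
# Toy vocabulary mapping tokens to integer ids (illustrative only).
TOY_VOCAB = {"[PAD]": 0, "[CLS]": 1, "[SEP]": 2, "[UNK]": 3,
             "token": 4, "##ization": 5, "is": 6, "fun": 7}


def tokenize(text: str) -> list[str]:
    # Step 1: split the text into tokens. A real tokenizer also splits
    # words into subwords; here one split is hard-coded as an example.
    pieces = []
    for word in text.lower().split():
        if word in TOY_VOCAB:
            pieces.append(word)
        elif word == "tokenization":
            pieces += ["token", "##ization"]  # subword split (hard-coded)
        else:
            pieces.append("[UNK]")
    return pieces


def encode(text: str, seq_len: int = 8) -> list[int]:
    tokens = tokenize(text)
    # Step 2: truncate, reserving room for the two special tokens.
    tokens = tokens[: seq_len - 2]
    # Step 3: add the special tokens required by BERT-style models.
    tokens = ["[CLS]"] + tokens + ["[SEP]"]
    # Step 2 (continued): pad up to the fixed sequence length.
    tokens += ["[PAD]"] * (seq_len - len(tokens))
    # Step 3 (continued): replace every token with its integer id.
    return [TOY_VOCAB[t] for t in tokens]


print(encode("tokenization is fun"))  # → [1, 4, 5, 6, 7, 2, 0, 0]
```

The resulting list of integers is what step 4 hands to the model.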

Different NLP models use different tokenizers, which split and convert text in slightly different ways. Sentence XLM-R uses SentencePiece under the hood, while English BERT uses WordPiece.
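One visible difference between these tokenizer families is how they mark subword boundaries: WordPiece prefixes word-internal pieces with "##", while SentencePiece prefixes word-initial pieces with "▁" (U+2581). The sketch below uses hand-written splits to illustrate the two conventions; they are not the output of any real model:

```python
# Hand-written example splits of "tokenization converts text"
# in the two marking styles (illustrative, not real model output).
wordpiece_style = ["token", "##ization", "converts", "text"]
sentencepiece_style = ["\u2581token", "ization", "\u2581converts", "\u2581text"]


def detokenize_wordpiece(tokens: list[str]) -> str:
    # "##" means "glue this piece to the previous one".
    text = ""
    for t in tokens:
        text += t[2:] if t.startswith("##") else ((" " if text else "") + t)
    return text


def detokenize_sentencepiece(tokens: list[str]) -> str:
    # "\u2581" marks the start of a new word.
    return "".join(tokens).replace("\u2581", " ").strip()


print(detokenize_wordpiece(wordpiece_style))        # → tokenization converts text
print(detokenize_sentencepiece(sentencepiece_style))  # → tokenization converts text
```

Both conventions recover the same text; they only differ in which side of a word boundary carries the marker.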
