Tokenizer

The Tokenizer block converts plain text into a sequence of integer values that machine learning models can process. Because it uses the WordPiece method, the same block can handle text written in over 100 languages.

How does text tokenization work?

  1. The tokenizer splits the input text into small pieces, called tokens.
    There can be more tokens than words if parts of a word (like prefixes and suffixes) are more common than the word itself.

  2. The Sequence length is enforced by truncating or padding the sequence of tokens.

  3. Special tokens required by the Multilingual BERT encoder and English BERT encoder blocks are added, and every token is then replaced with an integer value.

  4. The sequence of integers is ready to be processed by one of the language processing blocks, as sketched in the example below.
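
The four steps can be sketched with the Hugging Face transformers library, whose BertTokenizer implements WordPiece. The model names below (such as bert-base-uncased) are assumptions for illustration; the Tokenizer block may use a different WordPiece implementation and vocabulary internally.

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

text = "Tokenizers split uncommon words into subwords."

# 1. Split the text into WordPiece tokens (subword pieces start with "##").
#    The exact split depends on the vocabulary, e.g. ['token', '##izer', '##s', ...].
print(tokenizer.tokenize(text))

# 2.-3. Enforce the sequence length, add the special [CLS] and [SEP] tokens,
#       and replace every token with its integer id in one call.
encoded = tokenizer(
    text,
    max_length=16,         # the Sequence length parameter
    truncation=True,       # drop tokens beyond max_length
    padding="max_length",  # fill short sequences with [PAD]
)

# 4. The integer sequence is ready for a BERT encoder block.
print(encoded["input_ids"])
```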

The Tokenizer block uses WordPiece under the hood.
This means that it can process input text features written in over 100 languages and can be connected directly to a Multilingual BERT encoder or English BERT encoder block for advanced Natural Language Processing.

Parameters

Sequence length: The total number of tokens kept in the sequence. It’s necessary to fix the sequence length, since models require fixed-size inputs.
Minimum: 3
Maximum: 512

If the text input is longer than the Sequence length, the end of the text is ignored. If the text input is shorter, the sequence is padded with PAD tokens.

Choose a length that matches your typical text size so that you use all the data while avoiding unnecessary computation on padding tokens.

Figure 1. A Sequence length of 5 causes long sentences to be truncated, and short sentences to be padded.
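
As a concrete illustration of Figure 1, the following sketch pads and truncates two inputs to a fixed length of five tokens. It again assumes the Hugging Face tokenizer and vocabulary used above.

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Both outputs end up exactly 5 ids long: the long sentence is truncated,
# the short one is padded (id 0 is [PAD] in this assumed vocabulary).
long_ids = tokenizer("this sentence is clearly too long to fit",
                     max_length=5, truncation=True, padding="max_length")["input_ids"]
short_ids = tokenizer("hello",
                      max_length=5, truncation=True, padding="max_length")["input_ids"]

print(len(long_ids), len(short_ids))  # 5 5
print(short_ids)                      # [CLS] hello [SEP] [PAD] [PAD], as integer ids
```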

Vocabulary: The known vocabulary used to tokenize the text and assign numerical values.
Use English uncased if you connect the Tokenizer block to an English BERT encoder block. Letter case (capitalization) in the text is ignored.
Use Multilingual cased if you connect the Tokenizer block to a Multilingual BERT encoder block. Letter case (capitalization) is preserved to retain additional linguistic information. For example, i and I get different token values.
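
The effect of case can be checked with the two kinds of vocabulary. The sketch below assumes the Hugging Face bert-base-uncased and bert-base-multilingual-cased vocabularies as stand-ins for the two options.

```python
from transformers import BertTokenizer

uncased = BertTokenizer.from_pretrained("bert-base-uncased")
cased = BertTokenizer.from_pretrained("bert-base-multilingual-cased")

# English uncased: text is lowercased first, so "i" and "I" share one id.
print(uncased.encode("i", add_special_tokens=False),
      uncased.encode("I", add_special_tokens=False))  # the same id twice

# Multilingual cased: case is preserved, so "i" and "I" get different ids.
print(cased.encode("i", add_special_tokens=False),
      cased.encode("I", add_special_tokens=False))    # two different ids
```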
