Text tokenization

Tokenization converts plain text into a sequence of numerical values, which machine learning models require as input. The text tokenizer performs this conversion as the first step of text processing.

How does text tokenization work?

  1. The tokenizer splits the input text into small pieces, called tokens.
    There can be more tokens than words if parts of a word (like prefixes and suffixes) are more common than the word itself.

  2. Words may be split into several tokens. For example, when a model sees the word "wordlessness", it might represent it with the tokens "word", "less", "ne", "ss".

  3. Rare words that the model was not trained on, such as Peltarion, are represented with smaller pieces: "Pel", "ta", "r", "ion".

  4. If the tokenizer is case sensitive, words with uppercase and lowercase letters produce different tokens: Peltarion could be tokenized differently from peltarion.

  5. The sequence length is enforced by truncating or padding the sequence of tokens.
    Longer sequences take longer to process, but shorter sequences may cut off the end of the original text.

  6. Special tokens that may be required by some NLP models are added, and every token is then replaced with an integer value, known as its token ID.

  7. The sequence of integers is ready to be processed by the core of a language processing block, such as Multilingual BERT, Sentence XLM-R, or the Universal sentence encoder.
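The word-splitting behavior in steps 1–4 can be sketched with a greedy longest-match subword tokenizer, similar in spirit to WordPiece. The vocabulary below is a made-up example (real vocabularies have tens of thousands of entries, and WordPiece marks word-internal pieces with a "##" prefix, omitted here for simplicity):

```python
# Toy vocabulary of subword pieces. A real tokenizer's vocabulary is
# learned from a training corpus; this one is hand-picked for the demo.
VOCAB = {"word", "less", "ne", "ss", "Pel", "ta", "r", "ion"}

def tokenize(word, vocab):
    """Split a word into subword tokens by greedy longest-match."""
    tokens = []
    start = 0
    while start < len(word):
        # Try the longest possible piece first, shrinking until a match.
        for end in range(len(word), start, -1):
            piece = word[start:end]
            if piece in vocab:
                tokens.append(piece)
                start = end
                break
        else:
            # No piece matches at this position: the word is unknown.
            return ["[UNK]"]
    return tokens

print(tokenize("wordlessness", VOCAB))  # ['word', 'less', 'ne', 'ss']
print(tokenize("Peltarion", VOCAB))     # ['Pel', 'ta', 'r', 'ion']
```

Because "wordless" is not in this toy vocabulary but "word" and "less" are, the common pieces win, which matches the behavior described in step 1.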

Different NLP models use different tokenizers, which split and convert text in slightly different ways. Sentence XLM-R uses SentencePiece under the hood, while English BERT uses WordPiece.
