Tokenization converts plain text into a sequence of numerical values that AI models can process. The text tokenizer performs this conversion as the first step of text processing.
How does text tokenization work?
The tokenizer splits the input text into small pieces, called tokens.
There can be more tokens than words if parts of a word (like prefixes and suffixes) are more common than the word itself.
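The subword splitting described above can be sketched with a greedy longest-match approach. The tiny vocabulary below (and the `##` continuation prefix, borrowed from BERT-style tokenizers) is purely illustrative:

```python
# Hypothetical toy vocabulary; "##" marks a piece that continues a word.
VOCAB = {"token", "##ization", "##ize", "un", "##believ", "##able", "the"}

def split_word(word, vocab=VOCAB):
    """Split one word into subword tokens by greedy longest-match."""
    pieces, start = [], 0
    while start < len(word):
        prefix = "##" if start > 0 else ""
        end = len(word)
        while end > start:
            candidate = prefix + word[start:end]
            if candidate in vocab:
                pieces.append(candidate)
                break
            end -= 1
        else:
            return ["[UNK]"]  # no vocabulary entry matches this position
        start = end
    return pieces

print(split_word("tokenization"))  # -> ['token', '##ization']
```

Here "tokenization" is not in the vocabulary, but its pieces "token" and "##ization" are, so the single word yields two tokens.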
Sequence length is enforced by truncating or padding the sequence of tokens. Longer sequences take longer to process, but shorter sequences may discard the end of the original text.
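The truncate-or-pad step can be sketched in a few lines. The pad token name `[PAD]` is an assumption here; real tokenizers use whatever padding token their model expects:

```python
def pad_or_truncate(tokens, max_len, pad_token="[PAD]"):
    """Force a token sequence to exactly max_len entries."""
    if len(tokens) > max_len:
        return tokens[:max_len]  # truncate: the tail of the text is dropped
    # pad: append pad tokens until the sequence reaches max_len
    return tokens + [pad_token] * (max_len - len(tokens))

print(pad_or_truncate(["a", "b", "c"], 5))  # -> ['a', 'b', 'c', '[PAD]', '[PAD]']
print(pad_or_truncate(["a", "b", "c"], 2))  # -> ['a', 'b']
```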
Special tokens required by some NLP models are added, and each token is then replaced with an integer value.
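Putting the last two steps together, a sketch of adding special tokens and mapping tokens to integers might look like the following. The special token names (`[CLS]`, `[SEP]`) and the toy vocabulary with its ids are illustrative, modeled loosely on BERT-style tokenizers:

```python
# Hypothetical token-to-id table for illustration only.
TOKEN_TO_ID = {"[PAD]": 0, "[CLS]": 101, "[SEP]": 102,
               "hello": 7592, "world": 2088}

def encode(tokens, vocab=TOKEN_TO_ID):
    """Wrap tokens in special tokens, then map each one to its integer id."""
    wrapped = ["[CLS]"] + tokens + ["[SEP]"]  # model-specific special tokens
    return [vocab[t] for t in wrapped]

print(encode(["hello", "world"]))  # -> [101, 7592, 2088, 102]
```

The resulting list of integers is what the model actually consumes.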