XLM-R Tokenizer
The XLM-R tokenizer block converts plain text into a sequence of numerical values that can be passed to the XLM-R Encoder block. The XLM-R Tokenizer block can process text written in over 100 languages thanks to the SentencePiece method.
How does text tokenization work?
-
The tokenizer splits the input text into small pieces, called tokens.
There can be more tokens than words if parts of a word (like prefixes and suffixes) are more common than the word itself. -
The Sequence length is enforced by truncating or padding the sequence of tokens.
-
Special tokens required by the XLM-R Encoder block are added, and every token is then replaced with an integer value.
-
The sequence of integers is ready to be processed by the XLM-R Encoder block.
The XLM-R Tokenizer block uses SentencePiece under the hood, which means that it can process input text features written in any language.
However, the XLM-R Encoder block that is compatible with this tokenizer was pretrained with 100 particular languages.
Parameters
Sequence length: The total number of tokens kept in the sequence. It’s necessary to fix the sequence length, since models require fixed size inputs.
Minimum: 3
Maximum: 512
If the text input is longer than the Sequence length, the end of the text will be ignored.
If the text input is smaller, the sequence will be padded with PAD
tokens.
Choose a length that matches your typical text size to utilize all the data while avoiding unnecessary calculations on the padding tokens.