The Sentence XLM-R block provides natural language processing capabilities for 100 languages, and is specifically trained for text similarity applications.
Using Sentence XLM-R
Like the Multilingual BERT block, Sentence XLM-R is well suited to processing natural text in many languages.
However, Sentence XLM-R was pretrained more specifically for sentence embedding, making it a better choice for text similarity tasks out of the box. Text similarity tasks can also be performed by the faster but less accurate Universal sentence encoder block.
The Sentence XLM-R block tokenizes text features and embeds each token in a context-aware vector using self-attention.
The Sentence XLM-R block outputs the mean of all the token embedding vectors. This is a useful representation of the text as a whole, which you can use to compare different text examples, for example in text similarity and text clustering tasks.
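The mean pooling step can be sketched as follows. This is a minimal illustration, not the block's actual implementation: the random array stands in for the context-aware token embeddings that the XLM-R transformer would produce, and the 768 hidden dimension is assumed.

```python
import numpy as np

# Hypothetical token embeddings for one text: shape (num_tokens, hidden_dim).
# In the real block these come from the XLM-R transformer's self-attention
# layers; random values are used here purely to illustrate the pooling step.
rng = np.random.default_rng(0)
token_embeddings = rng.normal(size=(12, 768))

# Mean pooling: average over the token axis to get one vector per text.
sentence_embedding = token_embeddings.mean(axis=0)
print(sentence_embedding.shape)  # (768,)

def cosine_similarity(a, b):
    """Compare two sentence embeddings; 1.0 means identical direction."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
```

Because every text is reduced to a single fixed-length vector, two texts of different lengths can be compared directly with a measure such as cosine similarity.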
Sequence length: The maximum length of text that the model processes, in number of tokens. There are generally 1 to 3 tokens per word.
Aim for a Sequence length that matches the typical size of your text feature: larger values require more computation time, while smaller values may cause the end of your text to be ignored.
The Sentence XLM-R block supports sequence lengths between 3 and 512 tokens.
Note that 2 tokens are reserved for internal use, so a Sequence length of 3 processes a single token of text.
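The constraints above can be captured in a small helper. This is an illustrative sketch (the function name is hypothetical, not part of the platform):

```python
def effective_tokens(sequence_length: int) -> int:
    """Number of text tokens actually processed for a given Sequence length.

    2 tokens are reserved for internal use, and the supported range
    is 3 to 512 tokens.
    """
    if not 3 <= sequence_length <= 512:
        raise ValueError("Sentence XLM-R supports sequence lengths of 3 to 512")
    return sequence_length - 2

print(effective_tokens(3))    # 1
print(effective_tokens(512))  # 510
```

At roughly 1 to 3 tokens per word, the maximum Sequence length of 512 corresponds to a few hundred words of text.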
Trainable: Whether the training algorithm is allowed to change the weights of this block during training. In some cases you will want to keep parts of the network static.
Training the Sentence XLM-R block
The Sentence XLM-R block is initialized with weights pretrained on CommonCrawl, SNLI, and STS-b.
This means that you don’t have to fine-tune Sentence XLM-R for similarity tasks. Simply make sure that all the blocks in your model graph have the Trainable setting unchecked to skip training and save time.
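With Trainable unchecked, the pretrained embeddings are used as-is: you embed each text once, then rank candidates by similarity to a query. The sketch below assumes this workflow with stand-in vectors; in practice each vector would be the mean-pooled Sentence XLM-R output for one text.

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-in embeddings (hypothetical): one query and five candidate texts.
query = rng.normal(size=768)
candidates = rng.normal(size=(5, 768))
# Make candidate 2 a near-duplicate of the query, as a paraphrase would be.
candidates[2] = query + 0.01 * rng.normal(size=768)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Rank candidates by cosine similarity to the query.
scores = [cosine(query, c) for c in candidates]
best = int(np.argmax(scores))
print(best)  # 2: the near-duplicate ranks highest
```

The same frozen embeddings can feed a clustering algorithm instead of a ranking step when you want to group similar texts rather than search for them.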
Why use a multilingual model?
More than a simple convenience, multilingual models often perform better than monolingual models.
One reason is that the training data available is generally more limited in any single language. In addition, many languages share common patterns that the model can pick up more easily when it is trained with a variety of languages.
Languages supported by Sentence XLM-R
Here is the list of languages used in pretraining, which are supported out of the box. More details, such as how much of each language was used in pretraining, can be found in Appendix 1 of Unsupervised Cross-lingual Representation Learning at Scale.
Nils Reimers, Iryna Gurevych: Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation, 2020.
Alexis Conneau, Kartikay Khandelwal, et al.: Unsupervised Cross-lingual Representation Learning at Scale, 2020.
Guillaume Wenzek, Marie-Anne Lachaux, et al.: CCNet: Extracting High Quality Monolingual Datasets from Web Crawl Data, 2019.