Sentence XLM-R
The Sentence XLM-R block provides natural language processing capabilities for 100 languages and is trained specifically for text similarity applications.
Note
Disclaimer: Please note that datasets, machine-learning models, weights, topologies, research papers and other content, including open source software, (collectively referred to as “Content”) provided and/or suggested by Peltarion for use in the Platform and otherwise, may be subject to separate third party terms of use or license terms. You are solely responsible for complying with the applicable terms. Peltarion makes no representations or warranties about Content. You expressly relieve us from any and all liability, loss or risk arising (directly or indirectly) from Your use of any third party content.
Using Sentence XLM-R
Use cases
Like the Multilingual BERT block, Sentence XLM-R is a good model for processing naturally written text in many languages.
However, Sentence XLM-R was pretrained more specifically for sentence embedding, making it a better choice for text similarity tasks out of the box.
Text similarity tasks can also be performed by the faster but less accurate Universal sentence encoder block.
Input
The input of the Sentence XLM-R block must come from an Input block that provides a text encoded feature.
Output
The Sentence XLM-R block tokenizes text features and embeds each token in a context-aware vector using self-attention.
The Sentence XLM-R block outputs the mean of all the token embedding vectors. This mean is a useful representation of the text as a whole, which you can use to compare text examples, for instance in text similarity and text clustering tasks.
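For illustration, the sketch below reproduces this mean-pooling idea outside the Platform, using the Hugging Face transformers library with a generic xlm-roberta-base checkpoint. Both choices are assumptions made for the example; this is not the block's internal implementation.

import torch
from transformers import AutoModel, AutoTokenizer

# Generic XLM-R checkpoint, used purely for illustration.
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModel.from_pretrained("xlm-roberta-base")

sentences = ["A quick test sentence.", "Une phrase de test rapide."]
encoded = tokenizer(sentences, padding=True, truncation=True,
                    max_length=128, return_tensors="pt")

with torch.no_grad():
    token_embeddings = model(**encoded).last_hidden_state  # (batch, tokens, dim)

# Mean over real tokens only, ignoring padding positions.
mask = encoded["attention_mask"].unsqueeze(-1).float()
sentence_embeddings = (token_embeddings * mask).sum(dim=1) / mask.sum(dim=1)

# Similar texts should get a high cosine similarity.
similarity = torch.nn.functional.cosine_similarity(
    sentence_embeddings[0], sentence_embeddings[1], dim=0)
print(float(similarity))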
Parameters
Sequence length: The maximum length of text that the model processes, in number of tokens. There are generally 1 to 3 tokens per word.
Aim for a Sequence length that matches the typical length of your text feature: larger values require more computation time, while smaller values may cause the end of your text to be ignored (see the token-counting sketch at the end of this section).
The Sentence XLM-R block supports sequence lengths between 3 and 512 tokens.
Note that 2 tokens are reserved for internal use, so a Sequence length of 3 processes a single token of text.
Trainable: Whether the training algorithm may change the weights of this block during training. In some cases, you will want to keep parts of the network static.
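To estimate a suitable Sequence length, you can count the tokens that an XLM-R tokenizer produces for a sample of your texts. The sketch below uses the Hugging Face transformers library and the xlm-roberta-base tokenizer as an approximation; the Platform's own tokenizer may differ slightly.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")

samples = [
    "Short example.",
    "A somewhat longer example sentence used to measure token counts.",
]

for text in samples:
    # add_special_tokens=True includes the 2 reserved tokens mentioned above.
    n_tokens = len(tokenizer.encode(text, add_special_tokens=True))
    print(n_tokens, text)

# Choose a Sequence length near the typical count, capped at 512 tokens.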
Training the Sentence XLM-R block
The Sentence XLM-R block is initialized with weights pretrained on CommonCrawl, SNLI, and STS-b.
This means that you don’t have to fine-tune Sentence XLM-R for similarity tasks. Simply make sure that all the blocks in your model graph have the Trainable setting unchecked to skip training and save time.
Languages
Why use a multilingual model?
A multilingual model allows you to deploy a single model able to work with any of the 100 languages.
More than a simple convenience, multilingual models often perform better than monolingual models.
One reason is that the training data available is generally more limited in any single language.
In addition, many languages share common patterns that the model can pick up more easily when it is trained with a variety of languages.
Languages supported by Sentence XLM-R
Here is the list of languages used in pretraining, which are supported out of the box. More details, like how much of each language was used in pretraining, can be found in Appendix 1 of Unsupervised Cross-lingual Representation Learning at Scale.
Afrikaans, Estonian, Kyrgyz, Sindhi, … (the full list of 100 languages is given in Appendix 1 of the paper referenced above).
Available weights
The Sentence XLM-R block uses the xlm-r-100langs-bert-base-nli-stsb-mean-tokens model with weights pretrained by Hugging Face on 100 languages from CommonCrawl, SNLI, and STS-b.
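As a rough equivalent outside the Platform, the same pretrained weights can be loaded with the sentence-transformers library under the model name quoted above. Availability under that exact name is an assumption and may change between library releases.

import numpy as np
from sentence_transformers import SentenceTransformer

# Model name taken from the text above; availability may vary by release.
model = SentenceTransformer("xlm-r-100langs-bert-base-nli-stsb-mean-tokens")

embeddings = model.encode(["Where is the train station?",
                           "Où se trouve la gare ?"])

# Cosine similarity between the English and French sentence embeddings.
a, b = embeddings
print(float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))))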
Terms
When using pretrained blocks, additional terms apply: Sentence XLM-R with weights licence.
References
- Nils Reimers, Iryna Gurevych: Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation, 2020.
- Alexis Conneau, Kartikay Khandelwal, et al.: Unsupervised Cross-lingual Representation Learning at Scale, 2020.
- Guillaume Wenzek, Marie-Anne Lachaux, et al.: CCNet: Extracting High Quality Monolingual Datasets from Web Crawl Data, 2019.