Sentence XLM-R

The Sentence XLM-R block provides natural language processing capabilities for 100 languages, and is trained specifically for text similarity applications.

Note
Disclaimer
Please note that datasets, machine-learning models, weights, topologies, research papers and other content, including open source software, (collectively referred to as “Content”) provided and/or suggested by Peltarion for use in the Platform and otherwise, may be subject to separate third party terms of use or license terms. You are solely responsible for complying with the applicable terms. Peltarion makes no representations or warranties about Content. You expressly relieve us from any and all liability, loss or risk arising (directly or indirectly) from Your use of any third party content.

Using Sentence XLM-R

Use cases

Like the Multilingual BERT block, Sentence XLM-R is a good model for processing natural text written in many languages.
However, Sentence XLM-R was pretrained specifically for sentence embedding, making it a better choice for text similarity tasks out of the box. Text similarity tasks can also be performed with the Universal sentence encoder block, which is faster but less accurate.

Input

The input of the Sentence XLM-R block must come from an Input block that provides a feature encoded as text.

Output

The Sentence XLM-R block tokenizes text features and embeds each token into a context-aware vector using self-attention.
The block outputs the mean of all the token embedding vectors, which is a useful representation of the text as a whole that you can use to compare different text examples, for instance in text similarity and text clustering tasks.
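To make the mean-pooling step concrete, here is a minimal sketch of the same idea outside the Platform. It assumes the Hugging Face transformers library and uses the generic xlm-roberta-base checkpoint as a stand-in for the block's internal encoder, which uses similarity fine-tuned weights.

import torch
from transformers import AutoTokenizer, AutoModel

# Stand-in for the block's internal encoder (assumption: the block wraps a
# standard XLM-R style transformer encoder).
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModel.from_pretrained("xlm-roberta-base")

texts = ["A sentence to embed.", "Une phrase à encoder."]
batch = tokenizer(texts, padding=True, truncation=True, max_length=128, return_tensors="pt")

with torch.no_grad():
    token_embeddings = model(**batch).last_hidden_state   # (batch, tokens, hidden)

# Mean pooling: average only over real tokens, ignoring padding positions.
mask = batch["attention_mask"].unsqueeze(-1).float()
sentence_embeddings = (token_embeddings * mask).sum(dim=1) / mask.sum(dim=1)
print(sentence_embeddings.shape)   # one vector per input text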

Parameters

Sequence length: The maximum length of text that the model processes, in number of tokens. There are generally 1 to 3 tokens per word.
Aim for a Sequence length that matches the typical length of your text feature, since larger values require more computation time while smaller values may cause the end of your text to be ignored.

The Sentence XLM-R block supports sequence lengths between 3 and 512 tokens.
Note that 2 tokens are reserved for internal use, so a Sequence length of 3 processes a single token of text.
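To see how a piece of text translates into tokens, you can run a small sketch like the one below. It assumes the Hugging Face transformers library; the xlm-roberta-base tokenizer uses the XLM-R SentencePiece vocabulary, and the start and end markers it adds correspond to the 2 tokens reserved for internal use.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")

text = "Sequence length is measured in tokens, not words."
ids = tokenizer(text, truncation=True, max_length=16)["input_ids"]
print(len(ids))                                # token count, including <s> and </s>
print(tokenizer.convert_ids_to_tokens(ids))    # the actual subword tokens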

Trainable: Whether the training algorithm is allowed to change the weights of this block during training. In some cases, you will want to keep parts of the network static.
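In a framework like PyTorch, un-checking Trainable corresponds to freezing the block's weights so the optimizer never updates them. A minimal sketch, again using xlm-roberta-base as a stand-in for the block's encoder:

from transformers import AutoModel

encoder = AutoModel.from_pretrained("xlm-roberta-base")

# Freeze every weight in the encoder, keeping the pretrained values static.
for param in encoder.parameters():
    param.requires_grad = False

# Only parameters with requires_grad=True would be handed to the optimizer.
print(sum(p.requires_grad for p in encoder.parameters()))   # 0, nothing left to train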

Training the Sentence XLM-R block

The Sentence XLM-R block is initialized with weights pretrained on CommonCrawl, SNLI, and STS-b.

This means that you don’t have to fine-tune Sentence XLM-R for similarity tasks. Simply make sure that all the blocks in your model graph have the Trainable setting unchecked to skip training and save time.
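As a point of reference, the same pretrained weights can be used for similarity directly, with no training step at all. A minimal sketch, assuming the sentence-transformers package and that the checkpoint is published under the name used by this block:

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("xlm-r-100langs-bert-base-nli-stsb-mean-tokens")

# Embed two paraphrases and compare them with cosine similarity.
embeddings = model.encode(["How old are you?", "What is your age?"])
print(util.cos_sim(embeddings[0], embeddings[1]))   # close to 1 for similar meanings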

Languages

Why use a multilingual model?

A multilingual model allows you to deploy a single model able to work with any of the 100 languages.

Figure 1. Example of sentiment classification with a multilingual training dataset. The training data combines examples from English and French, which are easily available; the model predicts the sentiment of a sentence in any language.

More than a simple convenience, multilingual models often perform better than monolingual models.
One reason is that the training data available is generally more limited in any single language. In addition, many languages share common patterns that the model can pick up more easily when it is trained with a variety of languages.

Languages supported by Sentence XLM-R

Here is the list of languages used in pretraining, which are supported out of the box. More details, like how much of each language was used in pretraining, can be found in Appendix 1 of Unsupervised Cross-lingual Representation Learning at Scale.

Table 1. Languages used to pretrain the Sentence XLM-R.

Afrikaans
Albanian
Amharic
Arabic
Armenian
Assamese
Azerbaijani
Basque
Belarusian
Bengali
Bengali Romanized
Bosnian
Breton
Bulgarian
Burmese
Catalan
Chinese (Simplified)
Chinese (Traditional)
Croatian
Czech
Danish
Dutch
English
Esperanto
Estonian
Filipino
Finnish
French
Galician
Georgian
German
Greek
Gujarati
Hausa
Hebrew
Hindi
Hindi Romanized
Hungarian
Icelandic
Indonesian
Irish
Italian
Japanese
Javanese
Kannada
Kazakh
Khmer
Korean
Kurdish (Kurmanji)
Kyrgyz
Lao
Latin
Latvian
Lithuanian
Macedonian
Malagasy
Malay
Malayalam
Marathi
Mongolian
Nepali
Norwegian
Oriya
Oromo
Pashto
Persian
Polish
Portuguese
Punjabi
Romanian
Russian
Sanskrit
Scottish Gaelic
Serbian
Sindhi
Sinhala
Slovak
Slovenian
Somali
Spanish
Sundanese
Swahili
Swedish
Tamil
Tamil Romanized
Telugu
Telugu Romanized
Thai
Turkish
Ukrainian
Urdu
Urdu Romanized
Uyghur
Uzbek
Vietnamese
Welsh
Western Frisian
Xhosa
Yiddish

Available weights

The Sentence XLM-R block uses the xlm-r-100langs-bert-base-nli-stsb-mean-tokens model, with weights pretrained by Hugging Face on 100 languages from CommonCrawl, SNLI, and STS-b.
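If you want to inspect these weights outside the Platform, a minimal sketch using the sentence-transformers package (and assuming the checkpoint is still available under this name) looks like this:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("xlm-r-100langs-bert-base-nli-stsb-mean-tokens")

print(model.get_sentence_embedding_dimension())   # size of each sentence vector
print(model.max_seq_length)                       # default maximum sequence length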

Terms

When using pretrained blocks, additional terms apply: Sentence XLM-R with weights licence.

References

Conneau, A. et al. Unsupervised Cross-lingual Representation Learning at Scale. arXiv:1911.02116, 2019.
