XLM-R Encoder

The XLM-R Encoder block provides the xlm-roberta-base model, introduced in the paper Unsupervised Cross-lingual Representation Learning at Scale.

Like the Multilingual BERT Encoder block, the XLM-R Encoder is a good model for processing natural text written in many languages. However, the XLM-R Encoder was pretrained more specifically for sentence embedding, making it a better choice for text similarity tasks out of the box.

Note
Disclaimer
Please note that datasets, machine-learning models, weights, topologies, research papers and other content, including open source software, (collectively referred to as “Content”) provided and/or suggested by Peltarion for use in the Platform and otherwise, may be subject to separate third party terms of use or license terms. You are solely responsible for complying with the applicable terms. Peltarion makes no representations or warranties about Content. You expressly relieve us from any and all liability, loss or risk arising (directly or indirectly) from Your use of any third party content.

Using the XLM-R Encoder

The XLM-R Encoder block is initialized with weights pretrained on CommonCrawl data from 100 languages.

Use the XLM-R Embedding snippet to directly get a complete model for text embedding that uses the XLM-R Encoder.

Input

The input of the XLM-R Encoder must come from an XLM-R Tokenizer block.
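
Outside the Platform, the tokenizer-to-encoder pairing can be sketched with the Hugging Face transformers library. This is an illustration only, not how the blocks are wired on the Platform; the library calls and tensors shown below are assumptions based on the publicly available xlm-roberta-base checkpoint.

# Minimal sketch (assumption): a Hugging Face transformers equivalent of
# feeding XLM-R Tokenizer output into the XLM-R Encoder.
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
encoder = AutoModel.from_pretrained("xlm-roberta-base")

# The tokenizer turns raw text into the token IDs the encoder expects.
inputs = tokenizer("Ceci est une phrase en français.", return_tensors="pt")
outputs = encoder(**inputs)

print(outputs.last_hidden_state.shape)  # (batch, sequence length, 768)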

Output

The XLM-R Encoder returns the so-called CLS output. This output is a vector that can be used as an embedding for text similarity tasks, or passed to other blocks to perform regression or classification.
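
As a sketch of how the CLS output can serve as a sentence embedding, the snippet below compares two sentences with cosine similarity, again using the Hugging Face transformers library outside the Platform. The function name and the use of the first token of the last hidden state as the CLS vector are assumptions for illustration; on the Platform this comparison would be built with blocks instead.

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
encoder = AutoModel.from_pretrained("xlm-roberta-base")

def cls_embedding(text):
    # Take the CLS vector: the first token of the last hidden state.
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        outputs = encoder(**inputs)
    return outputs.last_hidden_state[:, 0, :].squeeze(0)

a = cls_embedding("The weather is nice today.")
b = cls_embedding("Il fait beau aujourd'hui.")  # the same sentence in French

print(float(torch.nn.functional.cosine_similarity(a, b, dim=0)))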

Languages

Here is the list of languages used in pretraining, which are supported out of the box by the XLM-R Encoder block and the XLM-R Embedding snippet.
More details, like how much of each language was used in pretraining, can be found in Appendix 1 of Unsupervised Cross-lingual Representation Learning at Scale.

Table 1. Languages used to pretrain the XLM-R Encoder.

Afrikaans
Albanian
Amharic
Arabic
Armenian
Assamese
Azerbaijani
Basque
Belarusian
Bengali
Bengali Romanized
Bosnian
Breton
Bulgarian
Burmese
Burmese
Catalan
Chinese (Simplified)
Chinese (Traditional)
Croatian
Czech
Danish
Dutch
English
Esperanto
Estonian
Filipino
Finnish
French
Galician
Georgian
German
Greek
Gujarati
Hausa
Hebrew
Hindi
Hindi Romanized
Hungarian
Icelandic
Indonesian
Irish
Italian
Japanese
Javanese
Kannada
Kazakh
Khmer
Korean
Kurdish (Kurmanji)
Kyrgyz
Lao
Latin
Latvian
Lithuanian
Macedonian
Malagasy
Malay
Malayalam
Marathi
Mongolian
Nepali
Norwegian
Oriya
Oromo
Pashto
Persian
Polish
Portuguese
Punjabi
Romanian
Russian
Sanskrit
Scottish Gaelic
Serbian
Sindhi
Sinhala
Slovak
Slovenian
Somali
Spanish
Sundanese
Swahili
Swedish
Tamil
Tamil Romanized
Telugu
Telugu Romanized
Thai
Turkish
Ukrainian
Urdu
Urdu Romanized
Uyghur
Uzbek
Vietnamese
Welsh
Western Frisian
Xhosa
Yiddish

Available weights

The XLM-R Encoder block uses the xlm-roberta-base model, with weights pretrained by Hugging Face on CommonCrawl data from 100 languages.

Terms

When using pretrained snippets, additional terms apply: XLM-R with weights licence.

Reference

Alexis Conneau, Kartikay Khandelwal, et al.: Unsupervised Cross-lingual Representation Learning at Scale, 2020.

Guillaume Wenzek, Marie-Anne Lachaux, et al.: CCNet: Extracting High Quality Monolingual Datasets from Web Crawl Data, 2019.
