The XLM-R model (introduced in Unsupervised Cross-lingual Representation Learning at Scale) performs very well on Natural Language Processing (NLP) tasks, in particular similarity tasks.
The XLM-R Embedding snippet allows you to quickly get started with your language-based model.
Why use a multilingual model?
More than a simple convenience, multilingual models often perform better than monolingual models.
One reason is that less training data is generally available in any single language than across many languages combined. In addition, many languages share common patterns that the model can pick up more easily when it is trained on a variety of languages.
The XLM-R Embedding snippet
The XLM-R Embedding snippet includes:
An XLM-R Tokenizer block.
An XLM-R Encoder block with pre-trained weights.
An Output block, which allows you to extract the encoded text feature as a sentence embedding when the model is deployed (an equivalent pipeline is sketched below).
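As an informal illustration of what these blocks do, the sketch below builds the same tokenize-encode-pool pipeline with the open-source Hugging Face transformers library and the public xlm-roberta-base checkpoint. The library, the checkpoint name, and the mean-pooling step are assumptions made for this example; they are not part of the snippet itself.

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Assumed stand-ins for the snippet's blocks: a Hugging Face tokenizer and encoder.
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
encoder = AutoModel.from_pretrained("xlm-roberta-base")

def embed(sentences):
    # Tokenizer block: text -> token ids and attention mask
    batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        # Encoder block: token ids -> contextual token features
        outputs = encoder(**batch)
    # Output block: mean-pool the token features (ignoring padding)
    # into one sentence embedding per input sentence
    mask = batch["attention_mask"].unsqueeze(-1).float()
    return (outputs.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)

embeddings = embed(["Hello world.", "Bonjour le monde."])
print(embeddings.shape)  # e.g. torch.Size([2, 768]) for xlm-roberta-base
```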
How to train the XLM-R Embedding snippet
The provided weights were pre-trained on 100 languages and make the model particularly well suited for similarity tasks without further training, as the example below illustrates.
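For instance, reusing the embed helper from the sketch above (itself an assumption of that sketch, not a platform function), you can compare sentences in different languages directly with cosine similarity, with no fine-tuning involved:

```python
import torch.nn.functional as F

en, fr, unrelated = embed([
    "The weather is nice today.",        # English
    "Il fait beau aujourd'hui.",         # French, same meaning
    "The quarterly report is overdue.",  # unrelated meaning
])
print(F.cosine_similarity(en, fr, dim=0))         # expected to be comparatively high
print(F.cosine_similarity(en, unrelated, dim=0))  # expected to be lower
```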
If you want to fine-tune the model on your own data, you can follow this procedure for fine-tuning pre-trained snippets.
Fine-tuning an XLM-R model
XLM-R is also a powerful model that can fit most fine-tuning datasets very easily. This means that it is prone to catastrophic forgetting and to overfitting the new dataset when trained with inappropriate settings.
To avoid these issues, train your model with a very low Learning rate, on the order of 10⁻⁵ to 10⁻⁶.
In addition, only train for a few Epochs, between 1 and 3.
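As a rough illustration of these settings, the sketch below fine-tunes the public xlm-roberta-base checkpoint with the Hugging Face Trainer. The classification task, the tiny placeholder dataset, and the exact hyperparameter values are assumptions for this example, not the platform's own training setup.

```python
import torch
from torch.utils.data import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForSequenceClassification.from_pretrained("xlm-roberta-base", num_labels=2)

# Placeholder data: replace with your own labelled fine-tuning dataset.
texts = ["I really liked this film.", "Ce film était excellent.",
         "This was a waste of time.", "Ce film était très décevant."]
labels = [1, 1, 0, 0]
encodings = tokenizer(texts, padding=True, truncation=True)

class TinyDataset(Dataset):
    """Wraps the tokenized texts and labels for the Trainer."""
    def __init__(self, encodings, labels):
        self.encodings, self.labels = encodings, labels
    def __len__(self):
        return len(self.labels)
    def __getitem__(self, idx):
        item = {k: torch.tensor(v[idx]) for k, v in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item

training_args = TrainingArguments(
    output_dir="xlmr-finetuned",
    learning_rate=1e-5,             # very low, on the order of 10^-5 to 10^-6
    num_train_epochs=2,             # only a few epochs, between 1 and 3
    per_device_train_batch_size=2,
)

trainer = Trainer(model=model, args=training_args,
                  train_dataset=TinyDataset(encodings, labels))
trainer.train()
```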
Alexis Conneau, Kartikay Khandelwal, et al.: Unsupervised Cross-lingual Representation Learning at Scale, 2020.
Guillaume Wenzek, Marie-Anne Lachaux, et al.: CCNet: Extracting High Quality Monolingual Datasets from Web Crawl Data, 2019.