The table above, adapted from the paper by Conneau et al. (2019), contains the results of evaluating the models we have discussed on XNLI. Note that mBERT uses a smaller architecture (BERT-base), whereas XLM and XLM-R use the BERT-large architecture, so part of the performance gap may simply reflect model size.
We see that XLM-R performs significantly better than the other models. XLM outperforms mBERT, but it also covers fewer languages and uses a larger model, so the gap may be smaller in a fairer comparison. Surprisingly, cross-lingual transfer works very well: XLM-R reaches 80% accuracy averaged across all languages, despite being fine-tuned only on English training data. Machine-translating the training set into all languages (totaling 6 million training examples) yields even better performance, but not by much. Considering that machine translation can be prohibitively expensive, cross-lingual transfer is very competitive.
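To make the zero-shot cross-lingual transfer setting concrete, here is a minimal sketch using the Hugging Face `transformers` and `datasets` libraries: the model is fine-tuned on the English XNLI training data only (which is the MultiNLI training set) and then evaluated on other languages without any further training. The checkpoint name, hyperparameters, and the handful of evaluated languages are illustrative assumptions, not the exact setup of Conneau et al. (2019).

```python
import numpy as np
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_name = "xlm-roberta-base"  # the paper also reports a larger xlm-roberta-large
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=3)

def tokenize(batch):
    # XNLI examples are premise/hypothesis pairs with a 3-way entailment label.
    return tokenizer(batch["premise"], batch["hypothesis"],
                     truncation=True, max_length=128)

def accuracy(eval_pred):
    logits, labels = eval_pred
    return {"accuracy": (np.argmax(logits, axis=-1) == labels).mean()}

# Fine-tune on English only: the XNLI "en" train split is the MultiNLI training data.
train_data = load_dataset("xnli", "en", split="train").map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="xlmr-xnli", num_train_epochs=2,
                           per_device_train_batch_size=32, learning_rate=2e-5),
    train_dataset=train_data,
    tokenizer=tokenizer,
    compute_metrics=accuracy,
)
trainer.train()

# Zero-shot transfer: evaluate the English-tuned model on other languages
# without any additional training.
for lang in ["de", "sw", "ur"]:
    test_data = load_dataset("xnli", lang, split="test").map(tokenize, batched=True)
    print(lang, trainer.evaluate(test_data))
```

In the translate-train setting reported above, the same pipeline would instead be fine-tuned on machine-translated copies of the training set in every target language, which is where the roughly 6 million training examples come from.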
A highlight of the XLM-R paper is its evaluation on GLUE, a standard English NLP benchmark, which shows that XLM-R is competitive with monolingual models on a monolingual benchmark despite covering 100 languages. XLM-R achieves an average score of 91.5, compared to 90.2, 92.0, and 92.8 for BERT, XLNet, and RoBERTa, respectively. So while XLM-R doesn't beat its monolingual counterpart RoBERTa, it comes remarkably close.
Our experiments with XLM-R on Swedish political data echo this finding. XLM-R performed as well as (or slightly better than) the Swedish BERT-base models provided by the Swedish Public Employment Service and the National Library of Sweden: XLM-R reached ~80% accuracy, compared to ~79% for the Swedish BERT models. While this is not a significant difference, it suggests that training monolingual models for smaller languages may be unnecessary.
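In practice, the comparison only requires swapping the pretrained checkpoint; the rest of the fine-tuning pipeline stays the same. The sketch below is a simplified stand-in for our setup: the CSV paths, the `text`/`label` columns, the label count, and the hyperparameters are hypothetical placeholders, and we use the National Library of Sweden's publicly released `KB/bert-base-swedish-cased` checkpoint as the Swedish model.

```python
import numpy as np
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Hypothetical CSV files with "text" and "label" columns standing in for the
# Swedish political-speech data used in our experiments.
data = load_dataset("csv", data_files={"train": "speeches_train.csv",
                                       "test": "speeches_test.csv"})

def accuracy(eval_pred):
    logits, labels = eval_pred
    return {"accuracy": (np.argmax(logits, axis=-1) == labels).mean()}

# Compare the multilingual and monolingual checkpoints under identical settings.
for checkpoint in ["xlm-roberta-base", "KB/bert-base-swedish-cased"]:
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForSequenceClassification.from_pretrained(
        checkpoint, num_labels=8)  # label count is a placeholder

    encoded = data.map(
        lambda batch: tokenizer(batch["text"], truncation=True, max_length=256),
        batched=True)

    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir=f"out-{checkpoint.replace('/', '-')}",
                               num_train_epochs=3),
        train_dataset=encoded["train"],
        eval_dataset=encoded["test"],
        tokenizer=tokenizer,
        compute_metrics=accuracy,
    )
    trainer.train()
    print(checkpoint, trainer.evaluate())
```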