Though these models learn very powerful representations, this strength comes with a limitation inherent in their architecture. Computing full self-attention scales quadratically with the length of the input, so most models cap it at around 512 tokens (sub-word units). Because the models are trained with such a short attention span, they do not learn long-range dependencies that stretch across longer passages of text. More recent research has focused on methods for lifting this limit in existing Transformers, such as the Reformer, Transformer-XL, and Longformer models. These models, as with most deep learning research, have focused exclusively on the English language, which limits their real-world use. To use them on, e.g., Swedish data, we would need vast amounts of long-context data, which might not be available, and this process would then have to be repeated for every new language we want to support.
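To make the cost concrete, here is a small sketch in plain PyTorch (not tied to any particular model above) that counts the entries of the attention-score matrix for a few input lengths; doubling the number of tokens quadruples the number of scores that must be computed and stored.

```python
import torch

hidden_size = 64  # illustrative hidden dimension

for seq_len in (512, 1024, 2048):
    queries = torch.randn(seq_len, hidden_size)
    keys = torch.randn(seq_len, hidden_size)
    # Full self-attention compares every token with every other token,
    # so the score matrix has seq_len * seq_len entries.
    scores = queries @ keys.T
    print(f"{seq_len:4d} tokens -> {scores.numel():>9,} attention scores")

#  512 tokens ->   262,144 attention scores
# 1024 tokens -> 1,048,576 attention scores
# 2048 tokens -> 4,194,304 attention scores
```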
It is not feasible to retrain research models for even a fraction of the more than 7,000 languages spoken in the world. However, some recent models, such as XLM-R and mBERT, have instead been trained on over 100 of the world's most common languages. These models have been shown to be on par with existing monolingual models, and even to outperform them in some instances. Additionally, they have proven to adapt surprisingly well to new languages that the model has never seen.
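As an illustration (a minimal sketch assuming the Hugging Face transformers library, not a description of the exact setup used here), the same pretrained multilingual checkpoint can encode text in many languages, for example Swedish, without any language-specific retraining:

```python
from transformers import AutoModel, AutoTokenizer

# XLM-R was pretrained on text from roughly 100 languages.
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModel.from_pretrained("xlm-roberta-base")

# Swedish input, even though nothing Swedish-specific was configured.
inputs = tokenizer("Stockholm är Sveriges huvudstad.", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # torch.Size([1, num_tokens, 768])
```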
Training a language model from scratch takes a long time; a long-context model takes even longer, and a multilingual model longer still. We speculated that the ideal way to create a model that can leverage longer context in multiple languages would be to start from an existing multilingual model and extend its context with additional training. Luckily, such models do exist; one of them is called the Longformer.
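As a rough sketch of this idea (assuming PyTorch and the Hugging Face transformers library; the target length of 4,096 tokens is only an example), one way to extend a multilingual model's context is to tile its learned position embeddings into a longer table and then continue pretraining. The sliding-window attention that a full Longformer-style conversion also requires is left out here.

```python
import torch
from transformers import XLMRobertaModel

model = XLMRobertaModel.from_pretrained("xlm-roberta-base")
old_table = model.embeddings.position_embeddings.weight.detach()

old_max, hidden = old_table.shape      # 514 for XLM-R: 512 positions + 2 offset slots
new_max = 4096 + 2                     # example target length, keeping the same offsets
pad_idx = model.embeddings.position_embeddings.padding_idx

new_table = old_table.new_empty(new_max, hidden)
new_table[:2] = old_table[:2]          # keep the special offset positions
step = old_max - 2
for start in range(2, new_max, step):  # tile the learned embeddings to fill the longer table
    end = min(start + step, new_max)
    new_table[start:end] = old_table[2:2 + (end - start)]

model.embeddings.position_embeddings = torch.nn.Embedding.from_pretrained(
    new_table, freeze=False, padding_idx=pad_idx
)
model.config.max_position_embeddings = new_max
# The copied embeddings are only a starting point: the model still needs
# additional (long-context, possibly multilingual) pretraining to use them well.
```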