
Improving multilingual models for low resource languages

September 22 / 7 min read
  • Stella Katsarou, Data Scientist

The development of the Transformer architecture, introduced in the “Attention is all you need” paper, has been acclaimed as the ImageNet moment for the field of Natural Language Processing (NLP). Operationalizing Transformer models offers opportunities for economic growth and access to education, and can thereby work as an engine of prosperity for those who utilize them. 

Nevertheless, most state-of-the-art models are monolingual due to the abundance of datasets in high-resource languages, predominantly English. This undermines the democratization of their usage, since it leaves the benefits they can offer inaccessible to non-English speakers, who make up approximately 80% of the world's population.

02/ What has been done to address this problem?

The solution provided by the community is to develop multilingual models that have been pre-trained on a mixture of many languages. It has been shown that, under certain experimental settings, multilingual models can perform as well as their monolingual counterparts, or even surpass them.

Nevertheless, multilingual models still underperform on a specific language when compared to a similarly sized monolingual model that has been trained solely on that language (see, for example, Table 3 in this paper).

Consequently, it could be tempting to fall back on the trend of developing big monolingual models from scratch, which is essentially an energy-intensive game (see section 6.3 here) of compute power and data abundance: two resources that not everyone has equal access to. 

What if we shift our attention to improving the multilingual models for a specific language instead? A lot of research has focused on this topic lately, employing a variety of methods. We at Peltarion have explored the method of further pre-training multilingual models, focusing on the Swedish language and the multilingual model mT5. Ideally, we want to transfer the knowledge acquired by the model during pre-training to this low-resource target language, reducing the need for annotated data. 

This idea of extracting relevant information from one domain, task, or language and applying it to another, is known as transfer learning.

03/ Multilingual models and Transfer Learning

Transfer learning is a key concept in the context of improving multilingual models for low-resource languages.

In the context of modern NLP, transfer learning refers to training a model in an unsupervised way on unlabeled data and then fine-tuning it on a smaller, labeled dataset. The intention is that the model acquires background world knowledge and knowledge about the meaning of words during pre-training. This knowledge can later be used by the model to perform a downstream task successfully. This is known as the pre-train/fine-tune approach (PT-FT), and it has proven to be a successful recipe that has delivered state-of-the-art (SOTA) results.
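As a concrete illustration of the PT-FT recipe, the sketch below fine-tunes an already pre-trained multilingual checkpoint on a small labeled dataset. It assumes the Hugging Face transformers and datasets libraries; the checkpoint, file names, and hyperparameters are illustrative choices, not the exact setup from our experiments.

    # A minimal sketch of the pre-train/fine-tune (PT-FT) recipe.
    # Assumes Hugging Face transformers/datasets; names and hyperparameters are illustrative.
    from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                              Trainer, TrainingArguments)
    from datasets import load_dataset

    checkpoint = "xlm-roberta-base"      # already pre-trained on unlabeled multilingual text
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

    # Small labeled dataset for the downstream task (hypothetical columns: "text", "label").
    dataset = load_dataset("csv", data_files={"train": "train.csv", "validation": "dev.csv"})

    def tokenize(batch):
        return tokenizer(batch["text"], truncation=True, max_length=128)

    dataset = dataset.map(tokenize, batched=True)

    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir="ft-out", num_train_epochs=3,
                               per_device_train_batch_size=16),
        train_dataset=dataset["train"],
        eval_dataset=dataset["validation"],
        tokenizer=tokenizer,
    )
    trainer.train()   # fine-tuning: the only supervised step in the PT-FT recipe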

Further pre-training is defined as the continuation of the unsupervised training process, starting from a released checkpoint of an already pre-trained model. Since pre-training does not require any annotated data, finding raw text in our language of interest gets easier and might boil down to, for example, using an open-source service to simply crawl natural text from the web. By further pre-training we aim to boost the model’s performance for a target language.  
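To make the idea concrete, here is a deliberately simplified sketch, assuming the Hugging Face transformers library: it loads a released mT5 checkpoint and takes one training step on a hand-corrupted Swedish sentence. In real further pre-training, spans are masked randomly on the fly over a large raw-text corpus; the example sentence and learning rate are illustrative.

    # A simplified sketch of further pre-training from a released checkpoint.
    # Assumes Hugging Face transformers; the sentence and masking are hand-made here,
    # whereas real pre-training corrupts random spans over a large corpus.
    import torch
    from transformers import AutoTokenizer, MT5ForConditionalGeneration

    checkpoint = "google/mt5-small"                  # released, already pre-trained
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = MT5ForConditionalGeneration.from_pretrained(checkpoint)

    # Raw Swedish text with a span replaced by a sentinel token; the dropped
    # span becomes the target, i.e. the T5-style denoising objective.
    inputs = tokenizer("Stockholm är <extra_id_0> i Sverige.", return_tensors="pt")
    labels = tokenizer("<extra_id_0> huvudstaden <extra_id_1>", return_tensors="pt").input_ids

    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
    loss = model(**inputs, labels=labels).loss       # cross-entropy over the dropped span
    loss.backward()
    optimizer.step()                                 # one step of continued pre-training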

Multilingual models can support downstream tasks in any of the languages they have been pre-trained on, provided that they are fine-tuned on a labeled dataset in that language. It has been shown that they can perform surprisingly well on cross-lingual tasks too: they can be fine-tuned on a labeled dataset in one language, usually a high-resource one, and then applied zero-shot to another language. We have validated this through our experiments with XLM-R on the task of political authorship classification.
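The sketch below illustrates this zero-shot setup with XLM-R, assuming Hugging Face transformers: the model is fine-tuned on English labels only (as in the earlier fine-tuning sketch) and then applied directly to Swedish input. The label count and the example sentence are illustrative and not taken from our authorship-classification experiments.

    # A minimal sketch of zero-shot cross-lingual transfer with XLM-R.
    # Assumes Hugging Face transformers; labels and example text are illustrative.
    import torch
    from transformers import AutoTokenizer, AutoModelForSequenceClassification

    tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
    model = AutoModelForSequenceClassification.from_pretrained("xlm-roberta-base", num_labels=2)

    # ... fine-tune `model` here on an English labeled dataset only ...

    # Zero-shot: classify a Swedish input the model never saw labels for.
    batch = tokenizer("Vi föreslår sänkta skatter för småföretag.", return_tensors="pt")
    with torch.no_grad():
        predicted_label = model(**batch).logits.argmax(dim=-1).item()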

This likely happens because, during pre-training, the model learns universal structures that are common to several languages. Research to fully support this hypothesis is ongoing, but the predominant view in the field today attributes the phenomenon to language-agnostic representations created during pre-training, thanks to commonalities such as sub-words shared across many languages or structural similarities.

In our case, our experiments included further pre-training of the mT5 model on either a mixture of Swedish and English, or on Swedish only. 

Why did we choose to mix English with Swedish? mT5 was pre-trained on a massive dataset comprising 6.3 trillion tokens in 101 languages. Of these, 2,733B tokens are in English, more than in any other language, while only 45B are in Swedish. By further pre-training on a mix of the two languages, we aim to enhance the alignment between them. We further pre-trained mT5 using two datasets of size 700KB each and another two datasets of 7KB each.

Each pair consists of one dataset in Swedish only and one containing a mix of English and Swedish. All components of the experimental process are represented in the diagram below.
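As a sketch of how such a mix can be assembled, the snippet below interleaves a Swedish and an English raw-text corpus with the Hugging Face datasets library; the file names and the 50/50 sampling ratio are assumptions for illustration, not the exact recipe we used.

    # A sketch of mixing Swedish and English raw text for further pre-training.
    # Assumes Hugging Face datasets; file names and the 50/50 ratio are illustrative.
    from datasets import load_dataset, interleave_datasets

    swedish = load_dataset("text", data_files="swedish_corpus.txt", split="train")
    english = load_dataset("text", data_files="english_corpus.txt", split="train")

    # Draw each training example from Swedish or English with equal probability.
    mixed = interleave_datasets([swedish, english], probabilities=[0.5, 0.5], seed=42)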

To assess the effectiveness of our method, we later fine-tuned both the original model released by the authors and our further pre-trained model on the task of semantic text similarity. To do so, we used a parallel dataset, one that is available in both English and Swedish. This allowed us to also experiment in a cross-lingual setting: we both fine-tuned and tested the model in the same language, and we fine-tuned it in one language and then tested it on the other. In this way we could also evaluate whether the model's generalization ability was affected, that is, whether further pre-training in Swedish could compromise the model's performance in English.
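As one way to frame this task for a text-to-text model like mT5, the sketch below formats a sentence pair as an input/target pair (the T5-style "stsb" convention) and scores predictions with Spearman rank correlation; the Swedish sentences, the prefix wording, and the score values are illustrative assumptions rather than our exact pipeline.

    # A sketch of semantic text similarity as a text-to-text task, plus the
    # correlation metric used to compare predicted and gold scores.
    # Assumes Hugging Face transformers and scipy; all values are illustrative.
    from transformers import AutoTokenizer
    from scipy.stats import spearmanr

    tokenizer = AutoTokenizer.from_pretrained("google/mt5-small")

    # One Swedish sentence pair framed as input text and a score string target.
    source = "stsb sentence1: En man spelar gitarr. sentence2: En person spelar ett instrument."
    target = "3.8"                                   # gold similarity score as text
    model_inputs = tokenizer(source, return_tensors="pt")
    labels = tokenizer(target, return_tensors="pt").input_ids

    # After fine-tuning, predictions are evaluated with Spearman correlation, both when
    # the test language matches the fine-tuning language and when it does not.
    gold = [3.8, 1.0, 4.6, 2.2]
    predicted = [3.5, 1.4, 4.4, 2.0]                 # illustrative model outputs
    correlation, _ = spearmanr(predicted, gold)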

In all cases, we found that further pre-training either improved performance or had no negative effect. Surprisingly, the model's performance improved even for English after further pre-training on Swedish only.

04/ Looking ahead

It remains to be explored whether we can improve the performance even more by trying out different dataset sizes and domains when we further pre-train our models, or by finding more sophisticated ways of sampling languages during training.

Using multilingual models as starting points and improving them for specific languages is a good opportunity to be inventive in finding the optimal way to push the boundaries of understanding and exploring our models. It is also an opportunity to save energy and to make these models available to audiences that may not have the computational or financial resources to invest in pre-training from scratch.

    • Stella Katsarou, Data Scientist

      Stella is a member of the Peltarion AI Research team working as a data scientist. She focuses on NLP and has a strong interest in Bias, Fairness and Explainability in AI.