Transfer learning is a key concept for improving multilingual models on low-resource languages.
In modern NLP, transfer learning typically refers to training a model in an unsupervised way on unlabeled data and then fine-tuning it on a smaller, labeled dataset. The intention is that, during pre-training, the model acquires background world knowledge and knowledge about the meaning of words, which it can later draw on to perform a downstream task successfully. This is known as the pre-train/fine-tune approach (PT-FT), and it has proven to be a successful recipe that has produced state-of-the-art (SOTA) results.
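As an illustration of the fine-tuning half of this recipe, the sketch below fine-tunes a released multilingual checkpoint on a small labeled dataset. It assumes the Hugging Face transformers and datasets libraries and a hypothetical labeled.csv file with "text" and "label" columns; it is not the exact setup used in our experiments.

```python
# Minimal sketch of the fine-tuning step of the PT-FT recipe.
# Assumptions: Hugging Face `transformers`/`datasets`, a hypothetical
# `labeled.csv` with "text" and "label" columns, and a binary task.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Start from a publicly released, already pre-trained multilingual checkpoint.
checkpoint = "xlm-roberta-base"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

# Load the (much smaller) labeled downstream dataset and tokenize it.
dataset = load_dataset("csv", data_files={"train": "labeled.csv"})["train"]
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=256),
    batched=True,
)

# Fine-tune: only this supervised step uses labels.
trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="ft-out", num_train_epochs=3,
                           per_device_train_batch_size=16),
    train_dataset=dataset,
    tokenizer=tokenizer,
)
trainer.train()
```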
Further pre-training is defined as the continuation of the unsupervised training process, starting from a released checkpoint of an already pre-trained model. Since pre-training does not require any annotated data, finding raw text in the language of interest is comparatively easy and might boil down to, for example, using an open-source service to crawl natural text from the web. By further pre-training, we aim to boost the model's performance for a target language.
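The following is a minimal sketch of what further pre-training can look like in practice. It assumes the Hugging Face libraries, a hypothetical raw-text file swedish_corpus.txt, and an encoder model trained with a masked-language-modelling objective; mT5 itself uses a span-corruption objective, which would require a different data collator.

```python
# Minimal sketch of further pre-training from a released checkpoint.
# Assumptions: Hugging Face libraries, a hypothetical `swedish_corpus.txt`,
# and a masked-language-modelling objective (not mT5's span corruption).
from datasets import load_dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

checkpoint = "xlm-roberta-base"  # released, already pre-trained checkpoint
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForMaskedLM.from_pretrained(checkpoint)

# Raw, unlabeled text is all that is needed: no annotation step.
corpus = load_dataset("text", data_files={"train": "swedish_corpus.txt"})["train"]
corpus = corpus.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True,
    remove_columns=["text"],
)

# The collator masks tokens on the fly, recreating the pre-training objective.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="further-pt", num_train_epochs=1,
                           per_device_train_batch_size=8),
    train_dataset=corpus,
    data_collator=collator,
)
trainer.train()
```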
Multilingual models can support downstream tasks in any of the languages they have been pre-trained on, provided that they are fine-tuned on a labeled dataset in that language. They have also been shown to perform surprisingly well on cross-lingual tasks: they can be fine-tuned on a labeled dataset in one language, usually a high-resource one, and then applied zero-shot to another language. We validated this through our experiments with XLM-R on the task of political authorship classification.
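As a sketch of this cross-lingual zero-shot setting, the snippet below loads a hypothetical checkpoint that was fine-tuned on English labels only and classifies a Swedish sentence; no Swedish labels are used at any point. The checkpoint path and example sentence are illustrative assumptions.

```python
# Minimal sketch of cross-lingual zero-shot inference.
# Assumption: a hypothetical local directory `path/to/finetuned-xlmr`
# containing a model fine-tuned on English labels only.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("path/to/finetuned-xlmr")
model = AutoModelForSequenceClassification.from_pretrained("path/to/finetuned-xlmr")

# Swedish input, a language never seen with labels during fine-tuning.
swedish_example = "Detta är ett exempel på en svensk mening."
inputs = tokenizer(swedish_example, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits
predicted_label = logits.argmax(dim=-1).item()
print(predicted_label)
```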
A likely explanation is that, during pre-training, the model learns universal structures that are common to several languages. Research to fully confirm this hypothesis is ongoing, but the predominant view in the field today attributes the phenomenon to language-agnostic representations formed during pre-training, enabled by commonalities such as sub-words shared across many languages or structural similarities between languages.
In our experiments, we further pre-trained the mT5 model either on a mixture of Swedish and English or on Swedish only.
Why did we choose to mix English with Swedish? mT5 was pre-trained on a massive dataset comprising 6.3 trillion tokens in 101 languages, of which 2,733B tokens are in English and only 45B in Swedish; English is thus the language represented by the most tokens in this dataset. By further pre-training on a mix of the two languages, we aim to strengthen the alignment between them. We further pre-trained mT5 using two datasets of 700KB each and another two datasets of 7KB each.
Each pair consists of one Swedish-only dataset and one mixed dataset of English and Swedish. All components of the experimental process are represented in the diagram below.
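As a sketch of how such a mixed corpus can be assembled, the snippet below interleaves two hypothetical raw-text files with the datasets library; the file names and the 50/50 mixing ratio are assumptions for illustration, not our exact configuration.

```python
# Minimal sketch of building a mixed English/Swedish further pre-training corpus.
# Assumptions: hypothetical files `swedish.txt` and `english.txt`, equal mixing ratio.
from datasets import interleave_datasets, load_dataset

swedish = load_dataset("text", data_files={"train": "swedish.txt"})["train"]
english = load_dataset("text", data_files={"train": "english.txt"})["train"]

# Interleave examples so both languages are seen throughout training.
mixed = interleave_datasets([swedish, english], probabilities=[0.5, 0.5], seed=42)
```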