Since the deep learning revolution began, the capabilities of natural language processing (NLP) have increased by leaps and bounds. The invention of the Transformer architecture introduced a new family of language models (the BERT model being their poster child), which have pushed the state-of-the-art limits in basically every NLP task. However, these models are typically trained and evaluated in English. Wouldn't it be great if we could fit several languages inside a single model? As part of the Language models for Swedish authorities project, we investigated the power of multilingual NLP models, and their potential to generalize across languages.
Breaking the language barriers with multilingual NLP
Multilingual NLP models
The driving idea behind multilingual NLP models is to create a single model which can understand multiple languages, instead of training a single model for every language. An enticing thought; however, creating such a model requires training on a massive multilingual corpus. Because of this, we've mainly seen results from research groups at major players such as Facebook, Google and DeepMind. Fortunately, some of their results have been open-sourced on GitHub together with associated research articles, so if you are a data scientist specializing in deep learning, you can download the model parameters and conduct your own experiments. It should be noted, however, that this is not production-quality code and it requires significant knowledge and work to properly operationalize these models into production. In particular, Google and Facebook have released the multilingual models “Multilingual BERT” and “XLM-R,” respectively. These two models have been pre-trained on 100 different languages, so the odds are high that your particular language is among them.
One model to rule them all
So what's the big deal with multilingual models? What practical benefits do they actually provide? We'll cover a concrete example in the next section, but in short, they are surprisingly useful.
As more and more of the everyday discourse moves online, the borders between different markets become increasingly blurred. While a lot of the online discourse is in English, there are still many other languages out there. Even though people are set apart by language, we still talk about similar things independent of geographic location or native tongue, including topics such as news, politics, brands, etc. Should you want to perform a thorough analysis of the discourse about a topic, it would traditionally require a separate model for each language, or you would miss out on considerable parts of the discussion. Training and maintaining a single model is difficult enough; imagine needing to maintain one model for each language! A multilingual model vastly simplifies the required software engineering effort, shoring up time for other projects. Fine-tuning a pre-trained model is also significantly cheaper and has a much smaller carbon footprint than pre-training models from scratch. As an added benefit, a multilingual model has no cold-start period should you present it with a new language – just feed the new language right into your existing one (given it was pre-trained on the new language).
Even if you only care about a single language, there's a chance multilingual models still offer value. They have been shown to perform better than monolingual models for languages that do not have copious amounts of training data. This can be explained by the model being able to learn similar underlying language patterns across languages; allowing languages with only a small amount of training data to piggyback on other languages.
Example: Political authorship classification
To investigate what is possible with multilingual models, we experimented with Facebook's XLM-R on a specific Swedish use case. We trained XLM-R to classify which Swedish political party had authored specific text. All Swedish government documents are available to the general public, and for this experiment we used all the motions from the eight major parties in the Swedish Parliament from 2010 to present as training data to fine-tune the model.
After being trained, the model reached 80% accuracy, which is the same accuracy we got using a monolingual Swedish BERT model. This shows that XLM-R does indeed perform well for this problem.
To investigate if this training would transfer to other languages, we evaluated XLM-R qualitatively in several different languages. This was done by translating previously unseen Swedish sentences to English, Spanish and Russian, among other languages. To our surprise, the model was able to perform semantically similar predictions, even though it was translated to another language. This suggests that training solely on a Swedish dataset gave us a model that is usable on 99 other languages for free! How cool is that? There are many languages that have very few labeled datasets to train on, Swedish being one of them. These findings suggest that it is possible to train a multilingual model on an English dataset, and then use it in another language. This could be a huge boon for smaller languages, and a true democratization of NLP.
We have made a public web app where you can test the model’s capabilities yourself, which is available here.
In this article, we've discussed the benefits of multilingual NLP models and evaluated XLM-R on an example task. We've also seen that after training, XLM-R is capable of performing the learned task on different languages, despite not being explicitly trained on them. If you're curious about how this works and want to read more about multilingual NLP models, check out this blog post written by my fellow colleague John.
02/ More on Data science
Search text by semantic similarity
Building a Stack Overflow question tagging model with public BigQuery data
In this blog post, we'll show how you can shape one of these public datasets, the Stack Overflow posts dataset, to train a question tagging BERT model.
Connecting the dots in Neural Networks
Image similarity with deep learning explained
How can a neural network deal with a vague concept like similarity? Why do we measure angles with the cosine similarity? Is it that far from classification?
Peltarion on Microsoft's AI Show