
Multilingual BERT - Unlock insights from text in 100 languages

November 3 / 10 min read
  • Reynaldo Boulogne

AI has come a long way in understanding natural language the way humans do. Until recently, however, many of these AI possibilities were limited to text written in English. Not anymore!

Before we begin: this article uses some basic ideas and concepts related to the field of Natural Language Processing (NLP) and the BERT model. If you don’t know what these terms mean, we highly recommend reading our non-technical articles Making sense of NLP - Part I and What is BERT?

Text data is one of the most common and abundant kinds of data available in companies today, and it is often a treasure trove of business opportunities waiting to be tapped.

For example, your customers’ reviews and social media posts are a direct link to their opinions about you, and their emails and support tickets are a continuous source of ideas for improving customer satisfaction. Similarly, your own internal reports are sources of operational risks and opportunities, and of historical data that can tell powerful stories.

In recent years, AI has made big strides in automatically identifying relationships and context in text data, and through that extracting the critical information locked inside it. In other words, AI has made it possible to automate the processing of text data and make it a genuine source of quantifiable and actionable insights.

However, as with many AI advances, many of these capabilities were only available for text written in English. But not anymore.

As of today, you can use the Peltarion Platform and its Multilingual BERT implementation to create models that can analyze text data written in any of over 100 languages.

Do you have a lot of text or documents that you or your company need to check manually to find relevant information? Or do you perhaps have a lot of valuable historical information that is locked away in piles of documents? 

AI can finally help you process these no matter the language they are written in and the Peltarion Platform makes building a custom AI model a breeze.

What’s new?

If you have used the platform before, you might know that we already had a BERT implementation, so maybe you are wondering what’s new. A picture is worth a thousand words, so here are two to help clarify the differences.

BERT

The original model was only able to learn from English text and thus could only be used to process English text once in service.

Multilingual BERT

The new model is able to learn from text written in any of over 100 languages and thus can be used to process text in your language of choice.

What’s in it for you?

While Multilingual BERT can be used to perform different NLP tasks, we have focused our current implementation on text classification, since this is the task that enables the greatest number of business applications. For an inspirational list of sample applications, we recommend reading our article What is NLP and how can I make use of it in my business?

With that in mind, let's briefly go over the three different ways in which Multilingual BERT can be trained and used in practice, since two of the three might not be immediately obvious.

Use a state-of-the-art AI model for your language of choice

This scenario is the main use case of the new Multilingual BERT implementation. In short, you can now fine-tune and use this BERT model in any of over 100 languages.

What this means for you is that you’re no longer restricted to AI language models that perform poorly or not at all in your language. You can take advantage of state-of-the-art models to identify and sort information from your text and turn it into powerful stories today.
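To give a feel for what this scenario looks like under the hood, here is a minimal sketch of fine-tuning a multilingual BERT classifier in code. It uses the open-source bert-base-multilingual-cased checkpoint and the Hugging Face transformers library rather than the Peltarion Platform, and the two Swedish sentiment examples are made up purely for illustration.

    # Minimal sketch: fine-tuning multilingual BERT for text classification with
    # Hugging Face transformers (illustrative only, not the Peltarion Platform).
    import torch
    from transformers import AutoTokenizer, AutoModelForSequenceClassification

    tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
    model = AutoModelForSequenceClassification.from_pretrained(
        "bert-base-multilingual-cased", num_labels=2
    )

    # Two made-up Swedish training examples with hypothetical sentiment labels.
    texts = ["Produkten var fantastisk!", "Leveransen var alldeles för sen."]
    labels = torch.tensor([1, 0])  # 1 = positive, 0 = negative

    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    loss = model(**batch, labels=labels).loss
    loss.backward()  # one training step; optimizer and training loop omitted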

Reduce the effort needed to create a fine-tuning dataset - Method 1

Many English-language datasets have been created over the years, so why not leverage them?

If your application doesn’t require a high degree of precision (for example, because you’re building a PoC), you can save time and costs by using an existing dataset in English to fine-tune your model and then use it to make predictions on text written in any other language (for example, Swedish).
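As a rough sketch of this method, the snippet below assumes you already have a multilingual BERT classifier fine-tuned only on English data, saved in a hypothetical local directory called english-finetuned-mbert, and applies it directly to a Swedish review. Again, this uses Hugging Face transformers for illustration, not the Peltarion Platform.

    # Sketch of Method 1: a classifier fine-tuned only on English data is used
    # to predict on Swedish text it never saw during fine-tuning.
    import torch
    from transformers import AutoTokenizer, AutoModelForSequenceClassification

    # Hypothetical local directory holding the English-fine-tuned model.
    model_dir = "english-finetuned-mbert"
    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    model = AutoModelForSequenceClassification.from_pretrained(model_dir)

    swedish_review = "Kundtjänsten svarade snabbt och löste mitt problem."
    batch = tokenizer(swedish_review, return_tensors="pt")
    with torch.no_grad():
        predicted_class = model(**batch).logits.argmax(dim=-1).item()
    print(predicted_class)  # e.g. 1 = positive, using labels learned from English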

Reduce the effort needed to create a fine-tuning dataset - Method 2

Another way to facilitate the creation of a fine-tuning dataset is to combine text data from multiple languages into a single dataset. This is actually what we did to create the dataset for our Multilingual BERT tutorial.

For example, if you have or are part of an international organization, you could ask colleagues from around the world to provide you with examples of the problem you’re trying to solve and quickly compile a large dataset. You can then use it to fine-tune your model and make predictions on text written in the language of your choice.

As in the previous case, the performance of the model will be lower compared to fine-tuning and using the model on a single language, so it’s best suited for applications where a high degree of precision is not crucial (for example, when you’re building a PoC).
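For this method, the dataset preparation itself is straightforward: collect labeled examples in each language and merge them into a single file before fine-tuning. Here is a small sketch using pandas and hypothetical per-language CSV files, each assumed to have a text and a label column.

    # Sketch of Method 2: merge per-language files into one fine-tuning dataset.
    import pandas as pd

    # Hypothetical CSV files collected from colleagues in different countries.
    files = ["reviews_en.csv", "reviews_sv.csv", "reviews_de.csv"]
    frames = [pd.read_csv(path) for path in files]  # each has "text" and "label" columns

    dataset = pd.concat(frames, ignore_index=True)
    dataset = dataset.sample(frac=1, random_state=42)  # shuffle so languages are mixed
    dataset.to_csv("multilingual_finetuning_set.csv", index=False)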

Where do I go from here?

Why not jump directly into getting some hands-on experience and find out for yourself how you can create your own powerful text classification model with a few simple clicks, in just a couple of minutes?

And if you're looking for technical documentation about how to use Multilingual BERT on the Peltarion Platform, make sure to read the Multilingual BERT articles in our Knowledge Center.

  • Reynaldo Boulogne

    With over 15 years of experience, Reynaldo has worked at the intersection of business and technology across multiple sectors, most recently at Klarna and Spotify. He is passionate about innovation, leadership, and building things from scratch. Reynaldo is also a former Vice-chairman of the Stockholm-based AI forum, Stockholm AI.