Classify text in any language

Multilingual AI at your fingertips

If you have followed our previous tutorials, you might be aware that AI has made big strides in automatically identifying relationships and context in text data, and in extracting the critical information locked inside it. However, as is the nature of many AI advances, many of these capabilities were long available only for text written in English.
But not anymore!

In this tutorial we will show you how you can use the Peltarion Platform and its Multilingual BERT snippet to create a model that is able to work with multiple languages simultaneously!

Target audience: Beginners

You will learn to
  • Build, train, and deploy a Multilingual BERT model, the state-of-the-art AI in language processing.

  • Automatically classify text extracts depending on their topic.

  • Mix the available languages when training the model, and test it in any language.

Multilingual BERT tutorial
Figure 1. Let the computer tirelessly sort out piles of text, no matter the language!

The data – books in many languages

Like in our writing style tutor tutorial, the training data is taken from Project Gutenberg: a fantastic library of over 60,000 free ebooks. There, you’ll find the world’s greatest literature, with a focus on older works for which U.S. copyright has expired.

We selected 1,779 books that were marked with one of the following topics: biography, children’s stories, drama, history, mystery, and poetry.
We will train our model to read a piece of text in any language, and attribute it to one of these topics.

Not every topic is equally available in every language. For example, there were no mystery books available in Italian or Swedish. This is surely an artefact of the keywords used to filter the books, or maybe their copyright hasn’t expired yet.
Regardless, this is not a problem at all.

The strength of multilingual models is that they can be trained with whatever language is available, and still work with other languages.
For example, English and Swedish are sufficiently similar that the model can be trained with examples of mystery text in English, and pick up on language patterns that it can use later to process Swedish text examples.

Here is the number of books in our dataset, for each topic and language:

Language     biography   children’s stories   drama   history   mystery   poetry
English             91                   22      92        93        24       80
Finnish             33                   17      99        99        26       93
French              94                   23      99        98        23       95
German              49                   10      98        98         3       83
Italian              8                    3      85       100         0       49
Swedish              1                    4      12        12         0       15

We extracted 10 passages of about 100 words from every available book, chosen at random and without any overlap.
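For intuition, here is a minimal sketch of how such non-overlapping passages could be extracted. The exact extraction script isn’t part of this tutorial; the function name and parameters below are purely illustrative:

```python
import random

def extract_passages(book_text, n_passages=10, passage_words=100, seed=0):
    """Cut a book into consecutive ~100-word chunks and sample a few
    of them at random; sampling whole chunks guarantees no overlap."""
    words = book_text.split()
    chunks = [
        " ".join(words[i:i + passage_words])
        for i in range(0, len(words) - passage_words + 1, passage_words)
    ]
    rng = random.Random(seed)
    return rng.sample(chunks, min(n_passages, len(chunks)))
```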


Create a project

Let’s begin!

Log in to platform.peltarion.com, and click on New project to start working on this problem.


Import the data

After creating the project, you will be taken to the Datasets view, where you can import data.

Click the Data library button and look for the Multilingual book corpus dataset in the list. Click on it to get more information.

Click Accept and import. This will import the dataset into your project, and you will be taken to the dataset’s details, where you can edit features and subsets.

Feature encoding

Click on Table to get a detailed view of the dataset.

Verify that the default feature encoding is correct:

  • The sentence feature should use the Text encoding.

  • The genre feature should use the Categorical encoding, with One-hot as the Type.

The other features are there to give you more context about the dataset, but the model won’t use them.

Edit feature encoding on the Peltarion Platform
Figure 2. Click on the wrench icon to edit the encoding of the dataset features.

If a feature uses the wrong settings, click the wrench icon to change it.
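If you are curious what One-hot encoding means in practice: each of the six genre categories becomes a six-dimensional indicator vector. A minimal sketch (the platform handles this automatically when One-hot is selected as the Type):

```python
# The six genre categories used in this tutorial.
genres = ["biography", "children's stories", "drama",
          "history", "mystery", "poetry"]

def one_hot(genre):
    """Return a 6-dimensional indicator vector for the given genre."""
    vector = [0] * len(genres)
    vector[genres.index(genre)] = 1
    return vector

print(one_hot("drama"))  # [0, 0, 1, 0, 0, 0]
```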

Save the dataset

You’ve now created a dataset ready to be used in the platform. Click Save version.

Then, click on Use in new experiment and the Experiment wizard will pop up.


Multilingual BERT - the text processing AI

The Experiment wizard helps you to set up your training experiment and recommends snippets, our pre-built models.

Let’s go over the Experiment wizard tab by tab.

Dataset tab

The platform selects the correct subsets by default.

  • Training uses 80% of the available examples.

  • Validation uses the remaining 20% of the examples to check how the model generalizes to unknown data (the sketch below illustrates such a split).
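The platform performs this split for you. Conceptually, a shuffled 80/20 split amounts to something like this minimal sketch (the function name and seed are illustrative):

```python
import random

def train_validation_split(examples, validation_fraction=0.2, seed=0):
    """Shuffle the examples and hold out a fraction for validation."""
    rng = random.Random(seed)
    shuffled = list(examples)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - validation_fraction))
    return shuffled[:cut], shuffled[cut:]

training, validation = train_validation_split(range(100))
print(len(training), len(validation))  # 80 20
```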

Input(s) / target tab

Select the features that you want the model to use as input and as training target.

  • Input(s) column: select only sentence, and deselect any other feature that might be selected.

  • Target column: make sure that genre (6) is selected.

Snippet tab

The platform recommends an appropriate Problem type and some snippets based on the input and target features.

Make sure the Problem type is Single-label text classification, since we want to classify text examples into single topics.

Select the Multilingual BERT cased snippet. This snippet will build a model around the Multilingual BERT encoder which is adapted to your features.


Create the experiment

Click Create to build the model in the Modeling canvas.

The ready to use Multilingual BERT snippet
Figure 3. The snippet creates and configures the model automatically. Blocks can be further edited and moved around before running.

Note that the last Dense block must output a tensor of shape 6. This matches the shape of the genre feature, which has 6 categories and is the target that the model learns to predict.

You can find the shape of features in the datasets view.
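Conceptually, this classification head is just a dense layer with one output per genre, followed by a softmax. A minimal Keras sketch of the idea (the platform builds the actual block for you; the 768-dimensional input is an assumption matching BERT’s usual pooled output size):

```python
import tensorflow as tf
from tensorflow import keras

# One output per genre category, normalized into probabilities.
head = keras.layers.Dense(6, activation="softmax")

pooled = tf.zeros((1, 768))   # stand-in for BERT's pooled output
print(head(pooled).shape)     # (1, 6): one probability per genre
```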


Run the experiment

Click Run to fine-tune Multilingual BERT on your dataset!

BERT is a large model, so training will take some time. Expect about 30 minutes per epoch.


Evaluate your model

Navigate to the Evaluation view (click the black Evaluation tab at the top of the page).

Loss and metrics

In the Evaluation view, you can see how the loss and metrics improve in real time.

By default, the thin line shows the loss on the training subset. The thick line shows the loss on the validation subset: this is a subset of the data that is never shown to the model during training, and it’s used to validate that the model will perform well even on unknown data.

You can also click on different metrics from the list, like the Categorical accuracy, to see how it increases during training.

Predictions inspection

Once at least 1 epoch has been completed, you can also click on Predictions inspection to see how individual examples are classified.

Confusion matrix

If you select the Validation subset, the checkpoint marked with (best), and click Inspect, you will see the confusion matrix and the scores of individual examples for each category.

Higher values on the diagonal of the confusion matrix indicate that more predictions correctly match the actual category.

You can click on cells of the confusion matrix to filter the examples shown. For example, it seems that the model has some trouble distinguishing between the history and biography topics.

Click on one of the cells crossing history and biography to filter examples, which will look similar to this.

Confusion matrix for text classification
Figure 4. The confusion matrix shows that most predictions fall in the correct (actual) category. Clicking on a cell will filter the examples shown on the side.
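Under the hood, a confusion matrix simply counts how the predictions for each actual category are distributed. A minimal sketch with scikit-learn, using made-up labels for illustration:

```python
from sklearn.metrics import confusion_matrix

actual    = ["history", "history", "biography", "poetry", "biography"]
predicted = ["history", "biography", "biography", "poetry", "history"]

labels = ["biography", "history", "poetry"]
print(confusion_matrix(actual, predicted, labels=labels))
# Rows are actual categories, columns are predicted ones;
# the diagonal counts the correctly classified examples.
```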

Get insights from inspecting the predictions

Even for a human, it is difficult to tell whether these text examples belong to history or biography. This tells us that the errors made by the model come from a real difficulty in the data, and not from a technical problem or design mistake.

We could try to improve the performance on these categories by adding more relevant examples to the dataset.
We could also rethink whether our approach is the right one. If the model says that the topic of a text example is 50% history and 50% biography, there might be some truth to it. That’s how researchers have determined how much of Shakespeare was actually written by Shakespeare.

So in some cases, displaying scores for each topic, or even solving a multi-label classification problem, like in our build your own music critic tutorial, could be more appropriate.


Enable the deployment

Finally, when the training is complete, navigate to the Deployment view and create a New deployment.

Click Enable to set your model live in the cloud.

Using the model URL and secret token, you can now get predictions from your model anywhere over the Internet.

Use the Deployment API to let your applications use the model.
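Calling a deployment typically boils down to an authenticated HTTP POST. The URL, token, and payload format below are placeholders and assumptions: copy the real values from the Deployment view, and check the Deployment API documentation for the exact request format.

```python
import requests

URL = "https://<your-deployment-url>"   # copy from the Deployment view
TOKEN = "<your-secret-token>"           # copy from the Deployment view

# We assume a JSON payload with a "rows" list; verify the exact
# format against the Deployment API documentation.
response = requests.post(
    URL,
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={"rows": [{"sentence": "It was a dark and stormy night..."}]},
)
print(response.json())  # scores for each of the six genre categories
```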

If you want to continue the no-code experience, check out how to create a no-code AI app, or integrate directly with existing products using the Peltarion connector in Microsoft Power Apps or Peltarion AI for Sheets.


Test the deployment

Once the model is deployed, you can also test it by clicking on Test deployment above the API information.

This will open a new page configured with your model’s API information, where you can type in an extract from your favorite book.
Click on the Play button to check the prediction.


Recap

Congrats! In this tutorial, you have:

  • Used a corpus of text written in several languages.

  • Trained a model to classify pieces of text into one of several possible topics.
    Not every category was available in every language.

  • Deployed a model for production usage.

  • Requested classifications of new text from your model over the Internet.

A true end-to-end project, and no coding involved!
