Classify text in any language

Multilingual AI at your fingertips

Learn how you can use the Peltarion Platform and its Multilingual BERT to create a model that works with multiple languages simultaneously!

AI has made big strides in automatically identifying relationships and context in text data, and thereby extracting the critical information locked inside it.
However, many of these capabilities were long available only for text written in English. Not anymore!

  • Target audience: Beginners
  • Tutorial type: Get started tutorial
  • Problem type: Text classification

You will learn to
  • Use a Multilingual BERT model, a state-of-the-art AI model for language processing.
  • Automatically classify text extracts depending on their topic.
  • Mix the available languages when training the model, and test it in any language.

Create a project

Let’s begin!

Log in to the Peltarion Platform, and click New project to start working on this problem.

New project icon

Import the data

After creating the project, you will be taken to the Datasets view, where you can import data.

Click the Import free datasets button.

Import free datasets button

Look for the Multilingual book corpus - tutorial data dataset in the list. Click on it to get more information.

Click Accept and import. This will import the dataset into your project and take you to the dataset’s details, where you can edit features and subsets.


The Multilingual book corpus dataset

The training data is taken from Project Gutenberg, a library of over 60,000 free ebooks. There, you’ll find the world’s greatest literature, with a focus on older works for which U.S. copyright has expired.

We selected 1,779 books tagged with one of the following topics: biography, children’s stories, drama, history, mystery, and poetry.

From every book, we extracted 10 passages of about 100 words each, selected at random and without overlap.
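One simple way to get random, guaranteed non-overlapping passages is to split each book into consecutive ~100-word chunks and sample from those chunks. The sketch below illustrates the idea in plain Python; it is a hypothetical reconstruction, not the exact script used to prepare this dataset.

```python
import random

def sample_passages(text, n=10, words_per_passage=100, seed=0):
    """Pick n random, non-overlapping ~100-word passages from a book.

    Splitting the text into consecutive chunks first, then sampling
    whole chunks, guarantees the passages never overlap.
    """
    words = text.split()
    chunks = [
        " ".join(words[i:i + words_per_passage])
        for i in range(0, len(words), words_per_passage)
    ]
    rng = random.Random(seed)
    return rng.sample(chunks, min(n, len(chunks)))
```

Sampling chunk indices rather than arbitrary word offsets is what makes the "without overlap" property trivial to enforce.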

Build a multilingual model in the Experiment wizard

Click Use in new experiment to open the Experiment wizard.

Use in new experiment button

Name the experiment in the Experiment wizard.

  • Dataset tab
    Make sure that the Multilingual book corpus dataset is selected.

  • Inputs / target tab
    Select the features that you want the model to use as input and as training target.

    • Inputs column: select only sentence.

    • Target column: select genre.

  • Problem type tab
    Select Single-label text classification, since we want to classify text examples into single topics. The platform recommends an appropriate Problem type based on the input and target features.

Click Create, and all blocks needed will be added to the Modeling canvas.

Create button

Modeling canvas

This builds a model in the Modeling canvas around Multilingual BERT, adapted to your features.

The ready to use Multilingual BERT wizard
Figure 1. The wizard creates and configures the model automatically. Blocks can be further edited and moved around before running.

Note that the last Dense block must output a tensor of shape 6. This matches the shape of the genre feature, which has 6 categories and is the target that the model learns to predict.

You can find the shape of each feature in the Datasets view.
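Conceptually, that last Dense block produces one raw score (logit) per genre, and a softmax turns the 6 scores into probabilities. The plain-Python sketch below shows that final step with hypothetical logits; on the platform this all happens inside the model, so no code is needed.

```python
import math

# The 6 categories of the genre target feature.
genres = ["biography", "children's stories", "drama",
          "history", "mystery", "poetry"]

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)                            # subtract max for stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical output of the last Dense block for one text example.
logits = [0.2, -1.3, 2.4, 0.1, -0.5, 1.1]
probs = softmax(logits)
predicted = genres[probs.index(max(probs))]
```

The predicted category is simply the one with the highest probability, which is why the Dense block's output shape must equal the number of categories.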

Run the experiment

Click Run to fine-tune Multilingual BERT on your dataset!

BERT is a large model, so training will take some time. Expect about 30 minutes to finish 1 epoch.

Run button

Evaluate your model

Navigate to the Evaluation view (click the Evaluation tab at the top of the page).

Predictions inspection

Once at least 1 epoch has been completed, you can also click on Predictions inspection to see how individual examples are classified.

Confusion matrix

The confusion matrix shows, by default, the predictions for the validation subset from the best epoch.

In a good model, the diagonal cells have the highest counts: higher values on the diagonal indicate more predictions that correctly match the actual category.

Confusion matrix for text classification
Figure 2. The confusion matrix shows that most predictions fall in the correct (actual) category. Clicking on a cell will filter the examples shown on the side.
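To make the tallying concrete, here is a toy sketch of how a confusion matrix is built: rows are actual categories, columns are predicted categories, and each prediction increments one cell. The genres and predictions below are hypothetical examples.

```python
labels = ["drama", "history", "poetry"]

# Hypothetical actual and predicted categories for six examples.
actual    = ["drama", "drama", "history", "poetry", "poetry", "poetry"]
predicted = ["drama", "history", "history", "poetry", "poetry", "drama"]

index = {label: i for i, label in enumerate(labels)}
matrix = [[0] * len(labels) for _ in labels]
for a, p in zip(actual, predicted):
    matrix[index[a]][index[p]] += 1   # row = actual, column = predicted

# Diagonal cells count correct predictions; their share is the accuracy.
correct = sum(matrix[i][i] for i in range(len(labels)))
accuracy = correct / len(actual)
```

Off-diagonal cells tell you which pairs of categories the model confuses, which is exactly what clicking a cell in the platform lets you inspect.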

Enable the deployment

Finally, when training is complete, click Create deployment in the Evaluation view.

Create deployment button

Set up your deployment and click Enable to set your model live in the cloud.

Using the model’s URL and secret token, you can now request predictions from your model anywhere over the Internet. Use the Deployment API to let your applications call the model.
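A typical pattern is a JSON POST to the deployment URL with the token in an authorization header. The sketch below is a minimal illustration: the URL, token, and payload shape (a "rows" list keyed by the input feature name) are assumptions; check the Deployment API documentation for the exact format your deployment expects.

```python
import json

# Hypothetical values -- copy the real URL and token from the Deployment view.
DEPLOYMENT_URL = "https://example.peltarion.com/deployment/forward"
TOKEN = "your-secret-token"

def build_request(text):
    """Build headers and JSON body for one prediction request.

    Assumes one input feature named "sentence", matching the model above.
    """
    headers = {
        "Authorization": "Bearer " + TOKEN,
        "Content-Type": "application/json",
    }
    body = json.dumps({"rows": [{"sentence": text}]})
    return headers, body

# Sending the request (requires the third-party `requests` package):
# import requests
# headers, body = build_request("Il était une fois ...")
# response = requests.post(DEPLOYMENT_URL, headers=headers, data=body)
# print(response.json())
```

Because the model is multilingual, the same request works no matter which language the text is written in.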

Test with deployment web app

Once the model is deployed, you can test it by clicking on Open web app above the API information.

Open web app button

This will open the Deployment web app. Enter a text string in the app and click Get your result.

The deployment is private by default. If you toggle the switch to make it public, you can share the app with anyone you like.

Tutorial recap

Congratulations! In this tutorial, you have:

  • Used a corpus of text written in several languages.

  • Trained a model to classify pieces of text into one of several possible topics.
    Not every category was available in every language.

  • Deployed a model for production usage.

  • Requested classifications of new text from your model over the Internet.

A true end-to-end project, and no coding involved!

Next steps

Learn how to improve a sentiment analysis model

In the tutorial Improve sentiment analysis, you run several experiments to solve a text classification problem using Multilingual BERT. The input is an IMDB dataset consisting of movie reviews, tagged with either positive or negative sentiment – i.e., how a user or customer feels about the movie.

We suggest Find similar Google questions as your next tutorial. In it, you will work with text similarity, a technique for building models that compare and find texts that are similar in context and meaning without sharing a single common word. For a given set of texts, the model gives you a quantifiable measure of how alike they are and returns the best matches, in any of 100 languages.

You will learn to:

  • Build and deploy a model for text similarity

  • Use the output from a specific block in the model

Find similar Google questions