Classify text in any language
Multilingual AI at your fingertips
Learn how you can use the Peltarion Platform and its Multilingual BERT to create a model that works with multiple languages simultaneously!
AI has made great strides in automatically identifying relationships and context in text data, and in extracting the critical information locked within it.
However, many of these capabilities were available only for text written in English. But not anymore!
- Target audience: Beginners
- Tutorial type: Get started tutorial
- Problem type: Text classification
You will learn to
- Use a Multilingual BERT model, a state-of-the-art AI model for language processing.
- Automatically classify text extracts depending on their topic.
- Train the model on a mix of the available languages, and test it in any language.
Create a project
Log in to platform.peltarion.com, and click New project to start working on this problem.
Import the data
After creating the project, you will be taken to the Datasets view, where you can import data.
Click the Import free datasets button.
Look for the Multilingual book corpus - tutorial data dataset in the list. Click on it to get more information.
Click Accept and import. This will import the dataset into your project, and you will be taken to the dataset's details, where you can edit features and subsets.
The Multilingual book corpus dataset
The training data is taken from Project Gutenberg, a library of over 60,000 free ebooks. There, you'll find the world's greatest literature, with a focus on older works for which U.S. copyright has expired.
We selected 1,779 books that were marked with one of the following topics: biography, children's stories, drama, history, mystery, and poetry.
From every book, we extracted 10 passages of about 100 words each, chosen at random and without any overlap.
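The extraction step above can be sketched in a few lines of Python. This is not the actual preprocessing code used for the dataset, just a minimal illustration of picking non-overlapping random passages, assuming simple whitespace tokenization:

```python
import random

def extract_passages(text, n_passages=10, passage_words=100, seed=0):
    """Pick up to n_passages non-overlapping windows of passage_words words,
    chosen at random from the text."""
    words = text.split()
    n_slots = len(words) // passage_words  # disjoint windows, so no overlap
    rng = random.Random(seed)
    chosen = rng.sample(range(n_slots), min(n_passages, n_slots))
    return [
        " ".join(words[i * passage_words:(i + 1) * passage_words])
        for i in sorted(chosen)
    ]

book = "word " * 1500  # stand-in for one book's full text
passages = extract_passages(book)
```

Because the candidate windows are disjoint by construction, no two passages can share any words from the source text.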
Build a multilingual model in the Experiment wizard
Click Use in new experiment to open the Experiment wizard.
Name the experiment in the Experiment wizard.
Make sure that the Multilingual book corpus dataset is selected.
Inputs / target tab
Select the features that you want the model to use as input and as training target.
Inputs column: select only sentence.
Target column: select genre.
Problem type tab
The platform recommends an appropriate problem type based on the input and target features. Select Single-label text classification, since we want to classify each text example into a single topic.
Click Create, and all the blocks needed will be added to the Modeling canvas, building a model around Multilingual BERT that is adapted to your features.
You can find the shape of each feature in the Datasets view.
Run the experiment
Click Run to fine-tune Multilingual BERT on your dataset!
BERT is a large model, so training will take some time. Expect about 30 minutes per epoch.
Evaluate your model
Navigate to the Evaluation view by clicking the black Evaluation tab at the top of the page.
Once at least 1 epoch has been completed, you can also click on Predictions inspection to see how individual examples are classified.
The confusion matrix shows, by default, the predictions for the validation subset from the best epoch.
If the model is good, the diagonal cells have the highest counts: higher values on the diagonal of the confusion matrix indicate that more predictions correctly match the actual category.
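To make the diagonal reading concrete, here is a toy confusion matrix computed in plain Python. The genre labels follow the dataset, but the (actual, predicted) pairs are invented for illustration:

```python
from collections import Counter

genres = ["biography", "drama", "history"]

# Hypothetical (actual, predicted) pairs for a handful of validation examples
pairs = [("drama", "drama"), ("drama", "history"), ("history", "history"),
         ("biography", "biography"), ("biography", "drama"), ("drama", "drama")]

counts = Counter(pairs)
# Row = actual genre, column = predicted genre
matrix = [[counts[(actual, pred)] for pred in genres] for actual in genres]

# Accuracy is the diagonal (correct predictions) over all examples
accuracy = sum(matrix[i][i] for i in range(len(genres))) / len(pairs)
```

Here 4 of the 6 toy examples land on the diagonal, so the accuracy is about 0.67.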
Enable the deployment
Finally, when training is complete, click Create deployment in the Evaluation view.
Set up your deployment and click Enable to set your model live in the cloud.
Using the model URL and secret token, you can now get predictions from your model anywhere over the Internet. Use the Deployment API to let your applications call the model.
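As a sketch, a prediction request could be assembled with Python's standard library. The URL and token below are placeholders you copy from the Deployment view, and the JSON layout is an assumed example; check the Deployment API documentation for the exact schema:

```python
import json
import urllib.request

# Placeholders: copy the real values from your Deployment view
DEPLOYMENT_URL = "https://<your-deployment-url>"
TOKEN = "<your-secret-token>"

def build_request(sentence):
    """Build an HTTP POST request carrying one input sentence.
    The {"rows": [...]} payload shape is an assumption for illustration."""
    payload = json.dumps({"rows": [{"sentence": sentence}]}).encode("utf-8")
    return urllib.request.Request(
        DEPLOYMENT_URL,
        data=payload,
        headers={
            "Authorization": "Bearer " + TOKEN,
            "Content-Type": "application/json",
        },
        method="POST",
    )

req = build_request("Det var en mörk och stormig natt.")
# urllib.request.urlopen(req) would then return the prediction as JSON.
```

Note that the input text here is Swedish; since the model is multilingual, the request looks the same in any language.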
Test with deployment web app
Once the model is deployed, you can test it by clicking on Open web app above the API information.
This will open the Deployment web app. Enter a text string in the app and click Get your result.
The deployment is private by default, but if you toggle the switch, it becomes public, which means you can share the app with anyone you like.
Congrats, in this tutorial, you have:
- Used a corpus of text written in several languages.
- Trained a model to classify pieces of text into one of several possible topics, even though not every category was available in every language.
- Deployed a model for production usage.
- Requested classifications of new text from your model over the Internet.
A true end-to-end project, and no coding involved!
Learn how to improve a sentiment analysis model
In the tutorial Improve sentiment analysis, you run several experiments to solve a text classification problem using Multilingual BERT. The input is an IMDB dataset consisting of movie reviews, tagged with either positive or negative sentiment – i.e., how a user or customer feels about the movie.
Get started with text similarity search
We suggest Find similar Google questions as your next tutorial. In it, you will work with text similarity, a technique that helps you build models that compare texts and find those that are similar in context and meaning, even without sharing a single common word. For a given set of texts, the model gives you a quantifiable measure of how alike they are and returns the best matches, in 100 languages.
You will learn to:
- Build and deploy a model for text similarity
- Use the output from a specific block in the model
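The core idea behind text similarity can be previewed with a toy example: a model maps each text to a vector, and similarity is measured as the cosine of the angle between vectors. The three-dimensional "embeddings" below are invented stand-ins; a real model produces vectors with hundreds of dimensions:

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity: dot product divided by the product of norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Toy embeddings for three texts (hypothetical values for illustration)
emb_a = [0.9, 0.1, 0.2]   # "How do I reset my password?"
emb_b = [0.8, 0.2, 0.3]   # "I forgot my login credentials"
emb_c = [0.1, 0.9, 0.1]   # "Best pizza places nearby"

sim_related = cosine_similarity(emb_a, emb_b)
sim_unrelated = cosine_similarity(emb_a, emb_c)
```

The first two texts share no words, yet their vectors point in nearly the same direction, so their similarity score is much higher than that of the unrelated pair.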