Movie review feelings

Solve a text classification problem with BERT

In this tutorial, you will solve a text classification problem using Multilingual BERT (Bidirectional Encoder Representations from Transformers). The input is an IMDB dataset consisting of movie reviews, tagged with either positive or negative sentiment – i.e., how a user or customer feels about the movie.

Text classification aims to assign text, e.g., tweets, messages, or reviews, to one or multiple categories. Such categories can be the author’s mood: is a review positive or negative?

Target audience: Beginners

You will learn to
  • Build and deploy a model based on BERT.
  • Work with single-label text classification.

Create a project

First, navigate to the Projects view.
Click New project to create a project and name it so you know what kind of project it is.

A project combines all of the steps in solving a problem, from the pre-processing of datasets to model building, evaluation, and deployment. Using projects makes it easy to collaborate with others.

What is BERT?

BERT pushed the state of the art in Natural Language Processing (NLP), the techniques that aim to automatically process, analyze, and manipulate (large amounts of) language data like speech and text.
BERT does this by combining two powerful technologies:

  • It is based on a deep Transformer network, a type of network that can efficiently process long texts by using attention.

  • It is bidirectional, meaning that it takes into account the whole text passage to understand the meaning of each word.

If you want, it could be a good idea to read about word embeddings and BERT’s attention mechanism, which are important concepts in NLP.

Add the IMDB data

After creating the project, you will be taken to the Datasets view, where you can import data.

There are several ways to import your data to the platform. This time we will use the Data library that is packed with free-to-use datasets.

Click the Import free datasets button.

Look for the IMDB - tutorial data dataset in the list. Click on it to get more information. You can also read this article for more info on the IMDB dataset.

Figure 1. The IMDB dataset includes a large number of movie reviews.

If you agree with the license, click Accept and import.

This will import the dataset into your project, and you can now edit it.

Feature encoding

The feature encoding determines the way in which example data is turned into numbers that a model can do calculations with. Verify that the default feature encoding is correct:

  • The review feature should use the Text encoding.

  • The sentiment feature should use the Binary encoding, and use positive as the Positive class.

If a feature uses the wrong settings, click the Wrench icon to change it.

Figure 2. Datasets view. Click on the Wrench icon to change a feature’s settings.

Subsets of the dataset

All samples in the dataset are split into Subsets.
By default, we put 80% of the dataset samples into a training subset, 10% into a validation subset, and 10% into a test subset.

Keep these default values in this project, but you can use whatever subset split you want later.
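
To make the 80/10/10 split concrete, here is a minimal sketch in plain Python. It is not how the platform does it internally; the function name split_indices, the shuffling, and the seed are assumptions made for illustration only.

```python
import random

def split_indices(n_samples, train=0.8, validation=0.1, seed=0):
    """Shuffle sample indices and cut them into training/validation/test lists.

    Hypothetical helper: the platform's own splitting logic is not shown
    in this tutorial and may differ.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_train = round(n_samples * train)
    n_val = round(n_samples * validation)
    return {
        "training": indices[:n_train],
        "validation": indices[n_train:n_train + n_val],
        "test": indices[n_train + n_val:],  # whatever is left over
    }

subsets = split_indices(1000)
print({name: len(rows) for name, rows in subsets.items()})
```

Each sample ends up in exactly one subset, so the model is never evaluated on examples it was trained on.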

Save the dataset

You’ve now created a dataset ready to be used in the platform.
Click Save version, then click Use in new experiment. The Experiment wizard will pop up.

Design a text classification model with the wizard

The Experiment wizard opens to help you set up your training experiment. We’ll now go over each tab.

  • Dataset tab
    The platform selects the correct subsets by default.

    • Training uses 80% of the available examples.

    • Validation uses the 10% validation examples to check how the model generalizes to unseen examples.

  • Inputs / target tab
    The platform should select the correct features by default.

    • Input column: make sure that only review is selected.

    • Target column: make sure that sentiment is selected.

  • Problem type tab

    • Make sure the Problem type is Single-label text classification, since we want to classify text into two possible categories (positive and negative). The platform recommends the appropriate Problem type based on the input and target features.

  • Click Create to build the model.

Idea for later experiments

Since all the reviews in the dataset are written in English, you could replace the Multilingual BERT block with an English BERT block and check how the performance changes.

To do that, simply delete the Multilingual BERT block from the modeling canvas and add an English BERT block by clicking on its name in the Build list.

Connect it like the original block was and you’re good to go.

You can make this change now, or first run the experiment with the Multilingual BERT block, then Duplicate it and make the change.
This will allow you to compare the performance of both models in the Evaluation view.

Tune model

The model is now built in the Modeling view, and you can change some settings.

BERT model

The model contains four blocks:

  • The Input block represents data coming into the model.

  • The Multilingual BERT block implements the BERT network in its base size.

  • The Dense block represents a fully connected layer of artificial nodes.
    It outputs a tensor of shape 1 since the target feature sentiment is one out of two values (negative or positive).

  • The Target block represents the output that we are trying to learn with our model.

Change sequence length in BERT block

Click on the Multilingual BERT block in the modeling canvas.

Set the Sequence length to 256, since this is a good compromise between speed and the amount of text preserved. Shorter sequences compute faster, but they might cut some words from your text.

Sequence length is the number of tokens (roughly the number of words) kept during text tokenization. If there are fewer tokens in an example than indicated by this parameter, the text will be padded. If there are more tokens, the text will be truncated, i.e., cut from the end to fit the sequence length.
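
The padding and truncation described above can be sketched in a few lines of Python. This is a simplified illustration, not the platform's tokenizer: splitting on whitespace stands in for BERT's real subword tokenization, and the helper name fit_to_sequence_length and the "[PAD]" filler are assumptions for this example.

```python
def fit_to_sequence_length(tokens, sequence_length, pad_token="[PAD]"):
    """Return exactly `sequence_length` tokens.

    Longer inputs are cut from the end; shorter inputs are padded on the right.
    """
    if len(tokens) >= sequence_length:
        return tokens[:sequence_length]
    return tokens + [pad_token] * (sequence_length - len(tokens))

# Whitespace splitting is a stand-in for real tokenization (assumption).
review = "this movie was really just ok".split()

print(fit_to_sequence_length(review, 4))  # truncated: the last words are lost
print(fit_to_sequence_length(review, 8))  # padded up to 8 tokens
```

With a sequence length of 4, the words "just ok" are dropped, which is exactly the kind of information loss a too-short Sequence length can cause.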

Figure 3. Distribution of the number of tokens for the movie reviews in the dataset.

In the dataset, a Sequence length of 512 will cut 8% of the review samples, a length of 256 will cut 30%, and 128 will cut 74%.

Check experiment settings & run the experiment

Click the Settings tab:

  • Set the Batch Size to 16.
    Batch size is how many samples are processed at the same time. A batch size larger than 16 would run out of memory. For text processing models, we recommend keeping the product Sequence length x Batch size below 3000 to avoid running out of memory.

  • Check that Epochs is 2. BERT models are already pretrained, and a light fine-tuning generally gives the best results.

  • Keep the default learning rate, which is the size of the update steps along the gradient.

  • Keep the loss function in the Target block.
    Loss is a number that indicates how well the model performs. If the model predictions are totally wrong, the loss will be a high number. If they’re pretty good, it will be close to zero.
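
To see how a loss behaves for good and bad predictions, here is a small sketch that assumes the loss is binary cross-entropy, a common choice for a two-class target. The tutorial does not name the loss function, so treat that choice, and the helper name binary_cross_entropy, as assumptions.

```python
import math

def binary_cross_entropy(y_true, y_pred, eps=1e-7):
    """Loss for one example: y_true is 0 or 1, y_pred is the predicted probability of class 1."""
    # Clip the prediction so that log(0) can never occur.
    y_pred = min(max(y_pred, eps), 1 - eps)
    return -(y_true * math.log(y_pred) + (1 - y_true) * math.log(1 - y_pred))

# The review is actually positive (1) in both cases.
good = binary_cross_entropy(1, 0.95)  # confident and right
bad = binary_cross_entropy(1, 0.05)   # confident and wrong

print(f"good prediction: {good:.3f}")
print(f"bad prediction:  {bad:.3f}")
```

The confident, correct prediction gives a loss close to zero, while the confident, wrong one gives a loss that is many times larger.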

By default, we’ve also checked the Early stopped check-box. This means that the training will automatically be stopped when a chosen metric has stopped improving. This helps make sure you don’t train for too long.
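
The idea behind stopping early can be sketched as follows. This is one common rule, not necessarily the one the platform applies: monitoring the validation loss, the patience parameter, and the function name should_stop are all assumptions for illustration.

```python
def should_stop(val_losses, patience=1):
    """Return True when the last `patience` epochs did not improve on the best earlier loss.

    val_losses holds one validation loss per finished epoch (lower is better).
    """
    if len(val_losses) <= patience:
        return False  # not enough history to compare yet
    best_before = min(val_losses[:-patience])
    best_recent = min(val_losses[-patience:])
    return best_recent >= best_before

print(should_stop([0.40, 0.31]))        # still improving
print(should_stop([0.40, 0.31, 0.33]))  # the third epoch got worse
```

A larger patience tolerates a few noisy epochs before giving up, at the cost of a longer training run.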

The next step is simply to click Run to start the training.

The training will take some time since BERT is a very large and complex model.
Expect about half an hour of training time per epoch with the Sequence length of 256.

Analyzing the first experiment

Click Evaluate to navigate to the Evaluation view to watch the model train.

While you wait for the first epoch to finish, you can read more about the loss and metrics, or grab some fika (a Swedish coffee and cake break).

Binary accuracy

To evaluate the performance of the model, you can look at the Binary accuracy by clicking on its name under the plot.

Binary accuracy gives the percentage of predictions that are correct. It should be about 85-90% by the end of training.

The precision gives the proportion of positive predictions, i.e., examples classified as positive, that were actually correct.

The recall gives the proportion of positive examples that are identified by the model.
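
The three metrics above can be computed by hand from a list of actual and predicted labels. The sketch below uses made-up labels (1 for positive, 0 for negative) and a hypothetical helper named binary_metrics; it is only meant to show how the definitions translate into arithmetic.

```python
def binary_metrics(actual, predicted):
    """Compute accuracy, precision, and recall for 0/1 labels."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)  # true positives
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)  # true negatives
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)  # false positives
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)  # false negatives
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }

# Made-up example: 4 positive and 4 negative reviews.
actual = [1, 1, 1, 1, 0, 0, 0, 0]
predicted = [1, 1, 0, 0, 1, 0, 0, 0]

metrics = binary_metrics(actual, predicted)
print(metrics)
```

Here the model finds only half of the positive reviews (recall 0.5), while two out of its three positive predictions are right (precision about 0.67).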

Predictions inspection

After at least the first epoch has finished, you can use the predictions inspection to see the confusion matrix and the predictions for individual examples.

Confusion matrix

The confusion matrix shows how often examples are classified correctly, and how often they are incorrectly classified as another category. Correct predictions fall on the diagonal.

Figure 4. The confusion matrix shows the frequency of every predicted/actual category combination.
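
A confusion matrix is simply a count of every actual/predicted combination. The following sketch builds one from made-up labels; the function name confusion_matrix and the row/column order (rows are actual categories, columns are predicted categories) are choices made for this example and may differ from how the platform lays out its plot.

```python
from collections import Counter

def confusion_matrix(actual, predicted, labels=("negative", "positive")):
    """Rows are actual categories, columns are predicted categories."""
    counts = Counter(zip(actual, predicted))
    return [[counts[(a, p)] for p in labels] for a in labels]

# Made-up example labels.
actual = ["positive", "positive", "negative", "negative", "positive"]
predicted = ["positive", "negative", "negative", "positive", "positive"]

matrix = confusion_matrix(actual, predicted)
for label, row in zip(("negative", "positive"), matrix):
    print(f"actual {label:>8}: {row}")
```

The two counts on the diagonal are the correct predictions; the off-diagonal counts are the two kinds of mistakes.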

ROC curve

Since the problem is a binary classification problem, the ROC curve (Receiver Operating Characteristic curve) will also be shown.

The ROC curve is a nice way to see how good the model generally is. The closer the ROC curve passes to the top left corner, the better the model is performing.

It also shows how changing the threshold will affect the recall and the true negative rate.

Figure 5. The ROC Curve shows how many positive examples are correctly identified, as a function of how many negative examples are wrongly classified as positive. The threshold determines the point on the ROC curve where the model operates.
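
To see where the points on a ROC curve come from, the sketch below sweeps a few thresholds over made-up model scores and computes the false positive rate and the true positive rate (the recall) at each one. The scores, thresholds, and the function name roc_points are assumptions for illustration.

```python
def roc_points(actual, scores, thresholds):
    """Return (threshold, false positive rate, true positive rate) for each threshold."""
    n_pos = sum(actual)
    n_neg = len(actual) - n_pos
    points = []
    for threshold in thresholds:
        # An example is classified as positive when its score reaches the threshold.
        predicted = [1 if s >= threshold else 0 for s in scores]
        tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
        fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
        points.append((threshold, fp / n_neg, tp / n_pos))
    return points

# Made-up example: 3 positive and 3 negative reviews with model scores.
actual = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.1]

points = roc_points(actual, scores, [0.25, 0.5, 0.75])
for threshold, fpr, tpr in points:
    print(f"threshold {threshold:.2f}: FPR {fpr:.2f}, TPR {tpr:.2f}")
```

Raising the threshold moves the operating point toward the left of the curve: fewer negative reviews are wrongly classified as positive, but some positive reviews are missed. The true negative rate is 1 minus the false positive rate.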

Deploy your trained experiment

In the Evaluation view, click Create deployment.

  1. Select the Experiment that you want to deploy for use in production.
    In this tutorial we only trained one model so there is only one experiment in the list, but if you train more models with different tweaks, they will become available for deployment.

  2. Select the Checkpoint marked with (best), since this is when the model had the best performance.
    The platform creates a checkpoint after every epoch of training. This is useful since performance can sometimes get worse when a model is trained for too many epochs.

  3. Click Create to create the new deployment from the selected experiment and checkpoint.

  4. Click on Enable to deploy the experiment.
    As soon as your deployment is enabled, you can start requesting predictions.
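
Once the deployment is enabled, a prediction request could be prepared from code roughly as sketched below. Everything specific here is an assumption, not the documented API: the URL and token are hypothetical placeholders to be replaced with the values of your own deployment, and the JSON layout (a list of rows keyed by the input feature name review) and the Bearer header are guesses at a typical request shape. The sketch only builds the request; the lines that would send it are left commented out.

```python
import json
import urllib.request

# Hypothetical placeholders: replace with the values of your own deployment.
DEPLOYMENT_URL = "https://example.com/deployment/your-deployment-id/forward"
DEPLOYMENT_TOKEN = "your-token"

def build_prediction_request(review_text, url=DEPLOYMENT_URL, token=DEPLOYMENT_TOKEN):
    """Build (but do not send) a POST request asking for one prediction."""
    # Assumed payload shape: one row per example, keyed by the input feature name.
    payload = json.dumps({"rows": [{"review": review_text}]}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",  # assumed authentication scheme
        },
        method="POST",
    )

request = build_prediction_request("Anyhow, I was just meh about this one.")
print(request.get_method(), request.full_url)

# To send it against a real deployment:
# with urllib.request.urlopen(request) as response:
#     print(json.load(response))
```

Check your deployment's own settings for the exact URL, token, and input format before relying on this layout.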

Test with our web app

Let’s test your model now that it’s deployed. Click Open web app, and you’ll open the Deployment web app.

Add review

Now write your own review, copy the example below, or simply copy a recent review from, e.g., IMDB:

I don’t want to complain about the movie, it was really just ok. I would consider it an epilogue to Endgame as opposed to a middle film in what I’m assuming is a trilogy (to match the other MCU character films). Anyhow, I was just meh about this one. I will say that the mid-credit scene was one of the best among the MCU movies.

Click Get your results

Tutorial recap and next steps

In this tutorial, you’ve created a text classification model that you first evaluated and then deployed. You have used all the tools you need to go from data to production — easily and quickly.

Continue experimenting

Continue to experiment with different hyperparameters and tweak your experiments to see if you can improve the accuracy of the model further. You may also want to experiment with datasets from different sources but note that the model in this tutorial works best with short text samples.

To make small modifications, go back to the Modeling view, and click on Iterate.

  • Continue training will let you train the same model for more epochs. Try to increase the batch size or to reduce the learning rate to see if performance improves.

  • Reuse part of model creates a new experiment with a single block that contains the model you just trained. This is useful to build another model around the current one.

To make more modifications to the model, go back to the Modeling view, and click on Duplicate. This will create a copy of your current model that you can edit, but training progress will be lost.

Next tutorial - Classify text in any language

We suggest Classify text in any language as your next tutorial. You will learn how to use the Multilingual BERT block to create a model that is able to work with multiple languages simultaneously!

This will let you automatically identify relationships and context in text data in 100 languages.

You will learn to:

  • Build, train, and deploy a Multilingual BERT model, the state-of-the-art AI in language processing.

  • Automatically classify text extracts depending on their topic.

  • Mix the available languages for training the model, and test it in any language.

Figure 6. Next tutorial → Classify text in any language