Movie review feelings

Solve a text classification problem with BERT

In this tutorial, you will solve a text classification problem using BERT (Bidirectional Encoder Representations from Transformers). The input is an IMDB dataset consisting of movie reviews, tagged with either positive or negative sentiment – i.e., how a user or customer feels about the movie.

Target audience: Beginners

Preread: Before following this tutorial, it is strongly recommended that you complete the tutorial Deploy an operational AI model to get a feel for how the platform works.

If you want, it is also a good idea to read about word embeddings, an important concept in NLP (Natural Language Processing). For an introduction and overview of different types of word embeddings, check out the links below:

The problem - predicting sentiment

Text classification aims to assign text, e.g., tweets, messages, or reviews, to one or multiple categories. One such category is the author’s mood: is a review positive or negative?

You will learn to

This tutorial will teach you how to build and deploy a model based on BERT.
BERT pushed the state of the art in Natural Language Processing (NLP) by combining two powerful technologies:

  • It is based on a deep Transformer network, a type of network that can process long texts efficiently by using attention.

  • It is bidirectional, meaning that it takes the whole text passage into account to understand the meaning of each word.
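To make the two ideas above more concrete, here is a minimal sketch of scaled dot-product self-attention in plain Python. It is an illustration only, not BERT's actual implementation: it assumes made-up 2-dimensional word vectors and skips the learned query, key, and value projections and the multiple heads that a real Transformer uses. The function names are hypothetical.

```python
import math


def softmax(scores):
    """Turn a list of scores into weights that sum to 1."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(vectors):
    """Simplified self-attention where queries, keys, and values are the inputs."""
    dim = len(vectors[0])
    outputs, all_weights = [], []
    for query in vectors:
        # Compare this word with every word in the passage, before and after it.
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in vectors
        ]
        weights = softmax(scores)
        # The new representation is a weighted mix of all word vectors.
        mixed = [
            sum(w * vec[i] for w, vec in zip(weights, vectors))
            for i in range(dim)
        ]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights


# Three made-up 2-dimensional "word" vectors.
words = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(words)
for row in weights:
    print([round(w, 3) for w in row])
```

Each printed row holds one word's attention weights over all three positions, which is the sense in which the representation of a word depends on the whole passage.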

Dataset - The Large Movie Review Dataset v1.0

The content and format of the raw dataset
The raw dataset contains movie reviews along with their associated binary category: positive or negative. The dataset is intended to serve as a benchmark for sentiment classification.

The core dataset contains 50,000 reviews split evenly into a training and test subset. The overall distribution of labels is balanced, i.e., there are 25,000 positive and 25,000 negative reviews.

The raw dataset also includes 50,000 unlabeled reviews for unsupervised learning. These will not be used in this tutorial.

In the entire collection, no more than 30 reviews are allowed for any given movie because reviews for the same movie tend to have correlated ratings.

In the labeled train/test sets, a negative review has a score of 4 or lower out of 10, and a positive review has a score of 7 or higher. Reviews with more neutral ratings are not included in the dataset.
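The labeling rule can be sketched as a small function. This is an illustration, not the code used to build the dataset; the function name is hypothetical, and the thresholds assume the rule from the dataset's own description (4 or lower is negative, 7 or higher is positive).

```python
from typing import Optional


def label_from_score(score: int) -> Optional[str]:
    """Map a 1-10 rating to a sentiment label, or None for neutral ratings."""
    if not 1 <= score <= 10:
        raise ValueError("score must be between 1 and 10")
    if score <= 4:
        return "negative"
    if score >= 7:
        return "positive"
    # Ratings of 5 and 6 are treated as neutral and left out of the dataset.
    return None


for score in (2, 5, 9):
    print(score, label_from_score(score))
```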

Each review is stored in a separate text file, located in a folder named either “positive” or “negative.”

Note: For more information about the raw dataset, see the ACL 2011 paper "Learning Word Vectors for Sentiment Analysis".

Written by Maas, A., Daly, R., Pham, P., Huang, D., Ng, A. and Potts, C. (2011). Learning Word Vectors for Sentiment Analysis: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies. [online] Portland, Oregon, USA: Association for Computational Linguistics, pp.142–150. Available at:

You will use a preprocessed dataset
The dataset that you will upload in this tutorial has been preprocessed so that all the reviews and their respective sentiments are stored in a single CSV file with two fields, “review” and “sentiment.”

The review text may include commas, which will be interpreted as a field delimiter on the platform. To escape these commas, the text is surrounded by double-quotes.
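As a rough illustration of this quoting, the sketch below writes two made-up reviews with Python's standard csv module and reads them back. The rows are invented for the example, and this is not necessarily how the tutorial's file was produced.

```python
import csv
import io

# Two made-up rows; the first review contains commas.
rows = [
    {"review": "Great cast, sharp script, and a clever ending.", "sentiment": "positive"},
    {"review": "Dull and far too long.", "sentiment": "negative"},
]

buffer = io.StringIO()
writer = csv.DictWriter(
    buffer,
    fieldnames=["review", "sentiment"],
    quoting=csv.QUOTE_MINIMAL,  # quote only the fields that need it
)
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)

# Reading the text back shows that the commas inside the review survive.
parsed = list(csv.DictReader(io.StringIO(csv_text)))
print(parsed[0]["review"])
```

With QUOTE_MINIMAL, only the review that contains commas is wrapped in double quotes, so a CSV reader still sees exactly two fields per row.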

The processed dataset only includes the training data.

Explore the dataset with Python
If you are familiar with Python and want to learn how the raw data was processed, you may want to try to generate it yourself using this Jupyter notebook. Among other things, it will give you some insight into why certain values are used later on in the tutorial.

To run the notebook you must either save or clone the entire GitHub repository or save the file in raw format with the .ipynb extension.

Create a project

Let’s begin!

First, create a project and name it so you know what kind of project it is. Naming is important!

Add the data

After creating the project, you will be taken to the Datasets view, where you can import data.

Click the Data library button and look for the IMDB - tutorial data dataset in the list. Click on it to get more information.

If you agree with the license, click Accept and import. This will import the dataset into your project, and you will be taken to the dataset’s details where you can edit features and subsets.

Text encoding

Click the wrench icon in the review column and set:

  • Encoding to Text (Beta)

  • Sequence length to 512

  • Language model to English BERT uncased.

Figure 1. Datasets view

Sequence length

Sequence length corresponds to the expected number of words in each review sample. If a sample contains fewer words than this parameter indicates, the text will be padded. If it contains more, the text will be truncated, i.e., cut from the end to fit in the sequence.
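Padding and truncation can be sketched as follows. This is a simplified illustration that works on whitespace-separated words; the function name and the "[PAD]" placeholder are assumptions for the example, and the platform's actual preprocessing is not shown in this tutorial.

```python
def pad_or_truncate(words, sequence_length, pad="[PAD]"):
    """Return exactly sequence_length items: cut from the end, or pad at the end."""
    clipped = words[:sequence_length]
    return clipped + [pad] * (sequence_length - len(clipped))


short_review = "A fun movie".split()
long_review = "This film was far too long for its own good".split()

print(pad_or_truncate(short_review, 5))
print(pad_or_truncate(long_review, 5))
```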

The BERT block accepts any sequence length between 3 and 512. Smaller sequences compute faster, but they might cut some words from your text. You will want to set this value as low as possible to reduce training time and minimize memory use but still large enough to include enough information. In our dataset a Sequence length of 512 will cut 8% of the review samples, a length of 256 will cut 30%, and 128 will cut 74%.
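If you want to estimate this kind of percentage for your own data, a sketch like the one below may help. The review lengths here are made up for illustration, so the printed fractions do not reproduce the figures quoted above.

```python
def fraction_truncated(lengths, sequence_length):
    """Share of samples that are longer than sequence_length."""
    too_long = sum(1 for n in lengths if n > sequence_length)
    return too_long / len(lengths)


# Made-up word counts for eight reviews.
review_lengths = [90, 120, 200, 260, 300, 480, 530, 1000]

for length in (128, 256, 512):
    share = fraction_truncated(review_lengths, length)
    print(f"Sequence length {length}: {share:.0%} of samples would be cut")
```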

Language model

The Language model should match the language of the input data, in this case, English BERT uncased. Read here for more information about this parameter.

Subsets of the dataset

In the top right corner, you’ll see the subsets. All samples in the dataset are by default split into 20% validation and 80% training subsets. Keep these default values in this project.
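The effect of an 80/20 split can be sketched in a few lines. This is an illustration under two assumptions: that the split is random, and that the dataset holds 25,000 samples (the labeled training half of the raw dataset). How the platform assigns samples to subsets is not described here.

```python
import random


def split_dataset(samples, validation_fraction=0.2, seed=0):
    """Shuffle the samples and split them into training and validation lists."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_validation = round(len(shuffled) * validation_fraction)
    return shuffled[n_validation:], shuffled[:n_validation]


# Sample indices stand in for the reviews.
training, validation = split_dataset(range(25_000))
print(len(training), len(validation))
```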

Save the dataset

You’ve now created a dataset ready to be used in the platform. Click Save version and then click Use in new model and the Experiment wizard will pop up.

BERT - Design a text classification model

Define dataset tab

Make sure that the IMDB dataset is selected in the Experiment wizard.

Choose snippet tab

Click on the Choose snippet tab and select the BERT English uncased snippet.

Set the input feature to review.

Set the target feature to sentiment.

The BERT English uncased snippet includes the whole BERT network. The BERT block implements the base version of the BERT network. It is composed of 12 encoding layers from a Transformer network, each layer having 12 attention heads. The total number of parameters is 110 million. The snippet allows you to use this massive network with weights pre-trained to understand the text.
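As a sanity check on the 110 million figure, the parameter count can be worked out by hand. The sketch below assumes values from the published BERT-base configuration that are not stated in this tutorial (a 30,522-token vocabulary, 512 positions, 2 segment types, and a feed-forward size of 3,072); whether the platform's block matches this exactly is an assumption.

```python
# Assumed values from the published BERT-base (uncased) configuration.
VOCAB_SIZE = 30_522
MAX_POSITIONS = 512
SEGMENT_TYPES = 2
HIDDEN = 768
FEED_FORWARD = 3_072
LAYERS = 12


def dense(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out


layer_norm = 2 * HIDDEN  # one scale and one shift value per hidden unit

# Token, position, and segment embeddings, followed by a layer norm.
embeddings = (VOCAB_SIZE + MAX_POSITIONS + SEGMENT_TYPES) * HIDDEN + layer_norm
# Query, key, value, and output projections, followed by a layer norm.
attention = 4 * dense(HIDDEN, HIDDEN) + layer_norm
# Two dense layers, followed by a layer norm.
feed_forward = dense(HIDDEN, FEED_FORWARD) + dense(FEED_FORWARD, HIDDEN) + layer_norm
# The dense layer applied to the first token's output.
pooler = dense(HIDDEN, HIDDEN)

total = embeddings + LAYERS * (attention + feed_forward) + pooler
print(f"{total:,} parameters")
```

Note that the number of attention heads does not appear in the calculation: the 12 heads split the 768-dimensional projections between them rather than adding parameters.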

The BERT snippet includes:

  • An Input block.

  • A BERT Encoder block with pre-trained weights which gives BERT a general understanding of English. The BERT encoder block looks at the input sequence as a whole, producing an output that contains an understanding of the sentence. The block outputs a single vector of size 768.

  • A Dense block with pre-trained weights.

  • A Dense block that is untrained.

  • A Target block.

Initialize weights tab

Click on the Initialize weights tab. Select the pretrained weights English Wikipedia and BookCorpus and set all weights trainable by checking the Weights trainable (all blocks) box.

Create experiment

Click Create and the prepopulated BERT model will appear in the Modeling canvas.

Check experiment settings & run the experiment

Click the Settings tab and check that:

  • Batch Size is 6. If you set a larger batch size you will run out of memory since we’ve set sequence length to 512.

  • Epochs is 2. Training takes a long time, so don’t train for too long the first time when you check if your model is good.

  • Learning rate is 0.000001 (5 zeros). A low learning rate helps avoid catastrophic forgetting of the pre-trained weights.
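To get a feeling for what these settings mean in practice, the number of weight updates can be estimated as below. The sketch assumes 25,000 samples in the dataset, the default 80% training subset, and that the last, smaller batch of each epoch is still used; the platform may handle that last batch differently.

```python
import math

DATASET_SIZE = 25_000     # assumed: the labeled training half of the raw dataset
TRAINING_FRACTION = 0.8   # default training subset
BATCH_SIZE = 6
EPOCHS = 2

training_samples = round(DATASET_SIZE * TRAINING_FRACTION)
steps_per_epoch = math.ceil(training_samples / BATCH_SIZE)
total_steps = steps_per_epoch * EPOCHS

print(f"{training_samples} training samples")
print(f"{steps_per_epoch} steps per epoch, {total_steps} steps in total")
```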

Click Run.

Analyzing the first experiment

Navigate to the Evaluation view and watch the model train. The training will take quite a long time since BERT is a very large and complex model. Currently expect more than four hours for 2 epochs when you use Sequence length 512.

Go and grab some fika (a Swedish coffee and cake break) while you wait.

The training loss will decrease for each epoch, but the evaluation loss may start to increase. This means that the model is starting to overfit to the training data.
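The pattern described above can be spotted programmatically by picking the epoch with the lowest evaluation loss. The loss values below are made up for illustration, and the function name is hypothetical.

```python
def best_epoch(evaluation_losses):
    """Return the 1-based epoch number with the lowest evaluation loss."""
    best_index = min(range(len(evaluation_losses)), key=evaluation_losses.__getitem__)
    return best_index + 1


# Made-up losses: training loss keeps falling, evaluation loss turns upward.
training_loss = [0.42, 0.25, 0.15, 0.08]
evaluation_loss = [0.33, 0.29, 0.31, 0.38]

print("Best epoch:", best_epoch(evaluation_loss))
```

In this made-up run, the evaluation loss rises after the best epoch while the training loss keeps falling, which is the overfitting signal to look for.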

You can read more about the loss metrics here.

To evaluate the performance of the model, you can look at overall accuracy, which is displayed in the Experiment info section to the right. It should be approximately 85-90%. For comparison, a classifier that would predict the class randomly would have a 50% accuracy, since 50% of reviews in the dataset are positive and 50% are negative.

Confusion matrix
Since the model solves a classification problem, a confusion matrix is displayed. The top-left to bottom-right diagonal shows correct predictions. Everything outside this diagonal is an error.

Figure 2. Confusion matrix

The recall per class corresponds to the percentage values in the confusion matrix diagonal. You can display the same metric by hovering over the horizontal bars to the right of the confusion matrix. You can also view the precision per class by hovering over the vertical bars on top of the confusion matrix.

Figure 3. Percentage values in the confusion matrix
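The relation between the confusion matrix and these metrics can be sketched as follows. The counts are made up for illustration, and the sketch assumes that rows hold the actual classes and columns the predicted classes.

```python
labels = ["negative", "positive"]

# Made-up counts. Rows are actual classes, columns are predicted classes.
matrix = [
    [2200, 300],   # actual negative
    [250, 2250],   # actual positive
]

total = sum(sum(row) for row in matrix)
correct = sum(matrix[i][i] for i in range(len(labels)))
accuracy = correct / total

# Recall: correct predictions divided by the row total (all actual samples).
recall = {
    labels[i]: matrix[i][i] / sum(matrix[i])
    for i in range(len(labels))
}

# Precision: correct predictions divided by the column total (all predictions).
precision = {
    labels[j]: matrix[j][j] / sum(row[j] for row in matrix)
    for j in range(len(labels))
}

print(f"accuracy: {accuracy:.1%}")
for name in labels:
    print(f"{name}: recall {recall[name]:.1%}, precision {precision[name]:.1%}")
```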

Now that we know the accuracy of the model based on validation data, let’s deploy the model and try it with some new data.

Deploy your trained experiment

  1. In the Deployment view click New deployment.

  2. Select the experiment and checkpoint of your trained model that you want to test for predictions or enable for business product calls. Both the best epoch and the last epoch of each trained experiment are available for deployment.

  3. Click the Enable switch to deploy the experiment.

Test the text classifier in a browser

Let’s test your model. Click the Test deployment button, and you’ll open the Text classifier API tester with all relevant data copied from your deployment.

Figure 4. Text classifier API tester

Add review

Now, write your own review, copy the example below, or simply copy a recent review from, e.g., IMDB:

I don’t want to complain about the movie, it was really just ok. I would consider it an epilogue to Endgame as opposed to a middle film in what I’m assuming is a trilogy (to match the other MCU character films). Anyhow, I was just meh about this one. I will say that the mid-credit scene was one of the best across the MCU movies.

Click play

Click Play to get a result.

Figure 5. Example result

Test the text classifier in a terminal

To see what an actual request from the application and the response from the model may look like, you can run the example curl command that is provided in the Code examples section of the Deployment view. Replace the VALUE parameter with your review text and run the command in a terminal.

Figure 6. Deployment view - Input examples (Curl)
curl -X POST \
-F "review=A fun brain candy movie...good dialog. A genuinely good day" \
-u "<Token>" \

The output will look something like this:


The predicted result is positive since it gets the highest value, 0.82801074.
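Picking the predicted class from such a result can be sketched as below. The dictionary is a hypothetical stand-in for the response, not the API's actual format: only the 0.82801074 value comes from the example above, and the negative value is assumed to be its complement.

```python
# Hypothetical scores per class; the real response format may differ.
scores = {
    "positive": 0.82801074,
    "negative": 0.17198926,  # assumed: the two values sum to 1
}

# The predicted class is the one with the highest value.
predicted = max(scores, key=scores.get)
print(predicted, scores[predicted])
```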

Tutorial recap and next steps

In this tutorial, you’ve created a text classification model that you first evaluated and then deployed.

Continue to experiment with different hyperparameters and tweak your experiments to see if you can improve the accuracy of the model further.

You may also want to experiment with datasets from different sources, but note that the model in this tutorial works best with short text samples.

The web app you’ve just used can also be used to test other single-label text classification models.