# Movie review feelings

Solve a text classification problem with BERT

In this tutorial, you will solve a text classification problem using BERT (Bidirectional Encoder Representations from Transformers). The input is an IMDB dataset consisting of movie reviews, tagged with either positive or negative sentiment – i.e., how a user or customer feels about the movie.

Text classification aims to assign text, e.g., tweets, messages, or reviews, to one or multiple categories. Such categories can be the author’s mood: is a review positive or negative?

- Target audience: Beginners

You will learn:
- How to build and deploy a model based on BERT.

Before following this tutorial, it is strongly recommended that you complete the tutorial Deploy an operational AI model to get a feeling for how the platform works.

## Create a project

First, click New project to create a project, and give it a name that will help you recognize what it is about later.

A project combines all of the steps in solving a problem, from the pre-processing of datasets to model building, evaluation, and deployment. Using projects makes it easy to collaborate with others.

## What is BERT?

BERT pushed the state of the art in Natural Language Processing (NLP) by combining two powerful technologies:

- It is based on a deep Transformer network, a type of network that can efficiently process long texts by using attention.

- It is bidirectional, meaning that it takes the whole text passage into account to understand the meaning of each word.
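As a rough illustration of the attention mechanism (not the platform's internals), scaled dot-product self-attention over a few toy word vectors can be sketched as follows; all shapes and values here are made up:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Each output row is a context-aware mix of all value rows,
    weighted by how well the query matches every key. This is what
    lets a Transformer look at the whole passage at once."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # similarity of every query to every key
    # Softmax per row turns similarities into attention weights
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V

# Three toy 4-dimensional "word" vectors standing in for a short sentence
rng = np.random.default_rng(0)
X = rng.normal(size=(3, 4))
out = scaled_dot_product_attention(X, X, X)  # self-attention: Q = K = V
print(out.shape)  # (3, 4): one context-aware vector per word
```

In self-attention the queries, keys, and values all come from the same token vectors, which is why every word's output vector can depend on every other word in the passage.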

### Numerical method

If you want, it is a good idea to read about word embeddings and BERT's attention mechanism, which are important concepts in Natural Language Processing (NLP).

## Import the dataset

After creating the project, you will be taken to the Datasets view, where you can import data.

Figure 1. The IMDB dataset includes a huge amount of movie reviews.

If you agree with the license, click Accept and import. This will import the dataset into your project, and you will be taken to the dataset's details, where you can edit features and subsets.

### Feature encoding

Verify that the default feature encoding is correct:

- The review feature should use the Text encoding.

- The sentiment feature should use the Binary encoding, with positive as the Positive class.

If a feature uses the wrong settings, click the wrench icon to change it.
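In code terms, the Binary encoding with positive as the Positive class simply maps the two labels to 1 and 0. A minimal sketch (the label strings follow the dataset; everything else is illustrative):

```python
# Binary-encode the sentiment labels: the chosen Positive class becomes 1
labels = ["positive", "negative", "negative", "positive"]
positive_class = "positive"

encoded = [1 if label == positive_class else 0 for label in labels]
print(encoded)  # [1, 0, 0, 1]
```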

Figure 2. Datasets view

### Subsets of the dataset

On the left side of the Datasets view, you'll see the subsets. By default, all samples in the dataset are split into 80% training and 20% validation subsets. Keep these default values in this project.
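The split itself can be sketched in a few lines; the sample list and seed here are placeholders, not what the platform actually does internally:

```python
import random

samples = list(range(1000))  # stand-ins for the labeled reviews
random.seed(42)
random.shuffle(samples)      # shuffle so both subsets are representative

split = int(0.8 * len(samples))
training, validation = samples[:split], samples[split:]
print(len(training), len(validation))  # 800 200
```

Shuffling before splitting matters: if the file happened to list all positive reviews first, an unshuffled split would give the validation subset a very different label balance from the training subset.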

### Save the dataset

You’ve now created a dataset ready to be used in the platform. Click Save version, then click Use in new experiment; the Experiment wizard will pop up.

## BERT - Design a text classification model

The Experiment wizard opens to help you set up your training experiment and to recommend snippets as prebuilt models.

We’ll now go over the Experiment wizard tab by tab.

### Dataset tab

The platform selects the correct subsets by default.

- Training uses 80% of the available examples.

- Validation uses the remaining 20% of the examples to check how the model generalizes to unseen examples.

### Input(s) / target tab

The platform should select the correct features by default.

- Input(s) column: make sure that only review is selected.

- Target column: make sure that sentiment (1) is selected.

### Snippet tab

The platform recommends the appropriate Problem type and snippets based on the input and target features.

- Make sure the Problem type is Binary text classification, since we want to classify text into two possible categories (positive and negative).

- Select the English BERT uncased snippet. All the training examples in the dataset are written in English, and for this tutorial we plan to use the model only to process English text.

You can create another experiment later and use the Multilingual BERT cased snippet instead. Multilingual models are useful when you want to work with languages other than English. They are also useful when your training dataset is only available in languages different from the ones you want to use your model for.

## Create experiment

Click Create to build the model in the Modeling canvas.

Note that the last Dense block must output a tensor of shape 1. This matches the shape of the sentiment feature, which is the target feature that the model learns.
You can find the shape of features in the Datasets view.

### Sequence length

Sequence length is the number of tokens (roughly the number of words) kept during text tokenization.
If an example contains fewer tokens than this parameter indicates, the text is padded. If it contains more tokens, the text is truncated, i.e., cut from the end to fit the sequence length.
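The pad-or-truncate behavior can be sketched with plain token ID lists; note that a real BERT tokenizer also adds special tokens such as [CLS] and [SEP], which this sketch ignores:

```python
def fit_to_length(tokens, sequence_length, pad_token=0):
    """Pad short examples; truncate long ones from the end."""
    if len(tokens) < sequence_length:
        return tokens + [pad_token] * (sequence_length - len(tokens))
    return tokens[:sequence_length]

print(fit_to_length([5, 12, 7], 5))           # [5, 12, 7, 0, 0]
print(fit_to_length([5, 12, 7, 9, 3, 8], 5))  # [5, 12, 7, 9, 3]
```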

Figure 3. Distribution of the number of tokens for the movie reviews in the dataset.

In the dataset, a Sequence length of 512 will cut 8% of the review samples, a length of 256 will cut 30%, and 128 will cut 74%.

Smaller sequences compute faster, but they might cut some words from your text.

Click on the Tokenizer block in the modeling canvas. Set the Sequence length to 256, since this is a good compromise between speed and the amount of text preserved.

## Check experiment settings & run the experiment

Click the Settings tab:

- Set the Batch Size to 16; a larger batch size would run out of memory. For text processing models, we recommend keeping the product Sequence length x Batch size below 3000 to avoid running out of memory.

- Check that Epochs is set to 2. BERT models are already pretrained, and a delicate fine-tuning generally gives the best results.

Click Run.

The training will take some time since BERT is a very large and complex model.
Expect about half an hour of training time per epoch with the Sequence length of 256.

## Analyzing the first experiment

Navigate to the Evaluation view and watch the model train.

While you wait for at least the first epoch to finish, you can read more about the loss and metrics, or grab some fika (a Swedish coffee and cake break).

### Binary accuracy

To evaluate the performance of the model, you can look at the Binary accuracy by clicking on its name under the plot.

Binary accuracy gives the percentage of predictions that are correct. It should be about 85-90% by the end of training.
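For reference, binary accuracy can be computed from predicted probabilities and true labels as follows; the numbers are toy values, not the model's actual output:

```python
# Toy predicted probabilities and true labels (1 = positive sentiment)
probabilities = [0.92, 0.13, 0.78, 0.41, 0.66]
labels        = [1,    0,    1,    1,    0]

threshold = 0.5
predictions = [1 if p >= threshold else 0 for p in probabilities]

correct = sum(p == y for p, y in zip(predictions, labels))
accuracy = correct / len(labels)
print(f"{accuracy:.0%}")  # 60%
```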

### Precision

The precision gives the proportion of positive predictions, i.e., reviews classified as positive, that were actually correct.

### Recall

The recall gives the proportion of positive examples, i.e., reviews that are actually positive, that are identified by the model.
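Both precision and recall follow directly from the confusion counts; a small sketch with made-up counts:

```python
def precision_recall(tp, fp, fn):
    """Precision: of the examples predicted positive, how many are right.
    Recall: of the actually positive examples, how many are found."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return precision, recall

# Toy counts: 80 true positives, 20 false positives, 10 false negatives
p, r = precision_recall(tp=80, fp=20, fn=10)
print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.80 recall=0.89
```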

### Predictions inspection

After at least the first epoch has finished, you can use the predictions inspection to see the confusion matrix and the predictions for individual examples.

#### Confusion matrix

The confusion matrix shows how often examples from each actual category are classified into each predicted category. Correct predictions fall on the diagonal.

Figure 4. The confusion matrix shows the frequency of every predicted/actual category combination.

#### ROC curve

Since the problem is a binary classification problem, the ROC curve will also be shown.

The ROC curve is a good way to see how well the model performs overall. The closer the curve passes to the top-left corner, the better the model is performing.

It also shows how changing the threshold will affect the recall and the true negative rate.

Figure 5. The ROC Curve shows how many positive examples are correctly identified, as a function of how many negative examples are wrongly classified as positive. The threshold determines the point on the ROC curve where the model operates.
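To make the threshold's role concrete, here is a hand-rolled sketch that traces a few ROC points from toy scores and labels (in practice a library such as scikit-learn computes this for you):

```python
def roc_point(scores, labels, threshold):
    """True positive rate and false positive rate at one threshold."""
    tp = sum(s >= threshold and y == 1 for s, y in zip(scores, labels))
    fp = sum(s >= threshold and y == 0 for s, y in zip(scores, labels))
    pos = sum(labels)
    neg = len(labels) - pos
    return tp / pos, fp / neg

# Toy model scores and true labels
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
labels = [1,   1,   0,   1,   0,   0]

# Lowering the threshold moves the operating point along the curve:
# both TPR (recall) and FPR rise together
for t in (0.85, 0.5, 0.2):
    tpr, fpr = roc_point(scores, labels, t)
    print(f"threshold={t}: TPR={tpr:.2f} FPR={fpr:.2f}")
```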

## Deploy the model

1. In the Deployment view, click New deployment.

2. Select the experiment that you want to deploy for use in production.
In this tutorial we trained only one model, so there is only one experiment in the list; if you train more models with different tweaks, they will also become available for deployment.

3. Select the Checkpoint marked with (best), since this is when the model had the best performance.
The platform creates a checkpoint after every epoch of training. This is useful since performance can sometimes get worse when a model is trained for too many epochs.

4. Click Create to create the new deployment from the selected experiment and checkpoint.

5. Click Enable to deploy the experiment.

## Test the text classifier in a browser

Let’s test your model. Click the Test deployment button to open the Image & Text Classifier API tester with all relevant data copied from your deployment.

Figure 6. Text classifier API tester

Now write your own review, copy the example below, or simply copy a recent review from, e.g., IMDB:

Example:
I don’t want to complain about the movie, it was really just ok. I would consider it an epilogue to Endgame as opposed to a middle film in what I’m assuming is a trilogy (to match the other MCU character films). Anyhow, I was just meh about this one. I will say that the mid-credit scene was one of the best across the MCU movies.

### Click play

Click Play to get a result.

Figure 7. Example result

## Test the text classifier in a terminal

To see what an actual request from the application and the response from the model may look like, you can run the example CURL command that is provided in the Code examples section of the Deployment view. Replace the VALUE parameter with review text and run the command in a terminal.

Figure 8. Deployment view - Input examples (Curl)
```
curl -X POST \
-F "review=A fun brain candy movie...good action...fun dialog. A genuinely good day" \
-u "<Token>" \
<URL>
```

The output will look something like this:

```
{"sentiment":0.9956326}
```

The predicted sentiment is positive since the value is above the decision threshold.
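The returned value is the model's score for the positive class, so a client can apply the decision threshold itself. A sketch using the example response above (the 0.5 threshold is an assumption; check your deployment's settings):

```python
import json

response_body = '{"sentiment":0.9956326}'  # example response from the deployment
score = json.loads(response_body)["sentiment"]

threshold = 0.5  # assumed default decision threshold
label = "positive" if score >= threshold else "negative"
print(label)  # positive
```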

## Tutorial recap and next steps

In this tutorial, you’ve created a text classification model that you first evaluated and then deployed.

Continue to experiment with different hyperparameters and tweak your experiments to see if you can improve the accuracy of the model further.

You may also want to experiment with datasets from different sources but note that the model in this tutorial works best with short text samples.

The web app you’ve just used can also be used to test other single-label text classification models.