Estimated time: 60 min

BERT movie review sentiment analysis

Solve a text classification problem with BERT

In this tutorial, you will solve a text classification problem using BERT (Bidirectional Encoder Representations from Transformers). The input is an IMDB dataset consisting of movie reviews, tagged with either positive or negative sentiment – i.e., how a user or customer feels about the movie.

Target audience: Data scientists and developers

Preread: 

Before following this tutorial, it is strongly recommended that you complete the tutorial Deploy an operational AI model to get a feel for how the platform works.

Embeddings

It may also be a good idea to read about word embeddings, an important concept in NLP (Natural Language Processing).

The problem - predicting sentiment

Text classification aims to assign text, e.g., tweets, messages, or reviews, to one or more categories. One such category could be the author's mood: is a review positive or negative?

Goal with this experiment

You will learn how to build and deploy a model based on BERT.
BERT pushed the state of the art in Natural Language Processing by combining two powerful technologies:

  • It is based on a deep Transformer network, a type of network that can efficiently process long texts by using attention.
  • It is bidirectional, meaning that it takes into account the whole text passage to understand the meaning of each word.

Dataset - The Large Movie Review Dataset v1.0

The content and format of the raw dataset

The raw dataset contains movie reviews along with their associated binary category: positive or negative. The dataset is intended to serve as a benchmark for sentiment classification.

The core dataset contains 50,000 reviews split evenly into a training and test subset. The overall distribution of labels is balanced, i.e., there are 25,000 positive and 25,000 negative reviews.

The raw dataset also includes 50,000 unlabeled reviews for unsupervised learning; these will not be used in this tutorial.

In the entire collection, no more than 30 reviews are allowed for any given movie because reviews for the same movie tend to have correlated ratings.

In the labeled train/test sets, a negative review has a score of 4 or lower out of 10, and a positive review has a score of 7 or higher. Reviews with more neutral ratings are not included in the dataset.

Each review is stored in a separate text file, located in a folder named either "pos" (positive) or "neg" (negative).

For more information about the raw dataset, see the ACL 2011 paper "Learning Word Vectors for Sentiment Analysis".

Maas, A., Daly, R., Pham, P., Huang, D., Ng, A. and Potts, C. (2011). Learning Word Vectors for Sentiment Analysis. In: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies. Portland, Oregon, USA: Association for Computational Linguistics, pp. 142–150. Available at: http://www.aclweb.org/anthology/P11-1015.

You will use a preprocessed dataset

The dataset that you will upload in this tutorial has been preprocessed so that all the reviews and their respective sentiments are stored in a single CSV file with two fields, “review” and “sentiment.”

The review text may include commas, which would otherwise be interpreted as field delimiters on the platform. To escape these commas, each review text is surrounded by double quotes.
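
For illustration, the file layout looks like this (the two rows below are invented examples, not actual reviews from the dataset):

review,sentiment
"One of the best films I have seen in years, a true classic.",positive
"Dull plot, wooden acting, and an ending that made no sense.",negative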

The processed dataset only includes the training data.

Explore the dataset with Python

If you are familiar with Python and want to learn how the raw data was processed, you may want to try to generate it yourself using this Jupyter notebook. Among other things, it will give you some insight into why certain values are used later on in the tutorial.

To run the notebook, you must either download or clone the entire GitHub repository, or save the file in raw format with the .ipynb extension.
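
For reference, the core of that preprocessing can be sketched in a few lines of Python. This is a simplified, hypothetical version, assuming the raw dataset has been extracted to a folder named aclImdb; the actual notebook may differ:

import csv
from pathlib import Path

# Collect the labeled training reviews from the raw dataset folders
# (aclImdb/train/pos and aclImdb/train/neg) into a single CSV file.
rows = []
for label in ("pos", "neg"):
    sentiment = "positive" if label == "pos" else "negative"
    for path in Path("aclImdb/train", label).glob("*.txt"):
        rows.append((path.read_text(encoding="utf-8"), sentiment))

with open("training_validation_data.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)  # csv.writer double-quotes fields that contain commas
    writer.writerow(["review", "sentiment"])
    writer.writerows(rows)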

Create a project

Let's begin!

First, create a project and name it so you know what kind of project it is. Naming is important!

Add the Large Movie Review Dataset v1.0 to the platform

  1. Navigate to the Datasets view and click New dataset.
  2. Expand the Import data section.
  3. Copy the link to the dataset below:

    https://storage.googleapis.com/bucket-8732/Large-Movie-Review-Dataset-1_0/training_validation_data.zip
  4. Paste the copied link and click Import. The zip includes the preprocessed training and validation data.

    If you have generated the preprocessed data using the notebook, you can upload the resulting CSV file directly. It is not required to create a zip file.
  5. When done, click Next. Name the dataset IMDB and click Done.

Datasets view

Text encoding

Click the review column and set:

  • Encoding to Text (Beta)
  • Sequence length to 512
  • Language model to English BERT uncased

Sequence length

Sequence length corresponds to the expected number of words in each review sample. If the actual number of words in a sample is fewer than indicated by this parameter, the text will be padded. If it is longer, the text will be truncated, i.e., cut from the end to fit the sequence.

The BERT block accepts any sequence length between 3 and 512. Smaller sequences compute faster but may cut some words from your text. You will want to set this value as low as possible to reduce training time and memory use, yet large enough to retain sufficient information.
In our dataset, a sequence length of 512 will cut 8% of the review samples, a length of 256 will cut 30%, and 128 will cut 74%.
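
You can check these percentages yourself. A minimal sketch, assuming the preprocessed CSV file is available locally and approximating BERT tokens by whitespace-separated words (WordPiece tokenization produces somewhat more tokens, so the true truncation rates are slightly higher):

import pandas as pd

df = pd.read_csv("training_validation_data.csv")
lengths = df["review"].str.split().str.len()  # word count per review

for seq_len in (512, 256, 128):
    cut = (lengths > seq_len).mean() * 100
    print(f"Sequence length {seq_len}: about {cut:.0f}% of samples truncated")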

Language model

The Language model should match the language of the input data, in this case, English BERT uncased.
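
To see what "uncased" means in practice, you can inspect the matching tokenizer from the Hugging Face transformers library (an outside-the-platform illustration, not the platform's internal tokenizer):

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
# The uncased model lowercases all input before splitting it into WordPieces.
print(tokenizer.tokenize("This Movie was GREAT!"))
# ['this', 'movie', 'was', 'great', '!']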

Subsets of the dataset

In the top right corner, you’ll see the subsets. All samples in the dataset are by default split into 20% validation and 80% training subsets. Keep these default values in this project.
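
For reference, the same 80/20 split could be reproduced outside the platform, for example with scikit-learn (a sketch, assuming the CSV file from earlier):

import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("training_validation_data.csv")
# 80% training, 20% validation; stratify to keep the label balance in both subsets
train_df, val_df = train_test_split(df, test_size=0.2, stratify=df["sentiment"])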

Save the dataset

You’ve now created a dataset ready to be used in the platform. Click Save version and navigate to the Modeling view.

BERT - Design a text classification model

In the Modeling view, add the BERT English uncased snippet. Select the pretrained weights English Wikipedia and BookCorpus.
Set all weights trainable by checking the Weights trainable (all blocks) box.

The BERT English uncased snippet includes the whole BERT network. The BERT block implements the base version of the BERT network: it is composed of 12 encoder layers from a Transformer network, each with 12 attention heads, for a total of 110 million parameters. The snippet allows you to use this massive network with weights pre-trained to understand English text. A rough code equivalent of its structure is sketched after the list below.

The BERT snippet includes:

  • An Input block.
  • A BERT Encoder block with pre-trained weights, which gives BERT a general understanding of English. The BERT Encoder block looks at the input sequence as a whole and produces an output that captures an understanding of the sentence. The block outputs a single vector of size 768.
  • A Dense block with pre-trained weights.
  • A Dense block that is untrained.
  • A Target block.
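
Outside the platform, a comparable architecture can be sketched with the Hugging Face transformers library. This is an illustrative approximation of the snippet's structure, not the platform's actual implementation:

import torch
from torch import nn
from transformers import BertModel

class BertSentimentClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        # BERT base: 12 encoder layers, 12 attention heads, ~110M parameters
        self.encoder = BertModel.from_pretrained("bert-base-uncased")
        self.hidden = nn.Linear(768, 768)    # pre-trained Dense block
        self.classifier = nn.Linear(768, 2)  # untrained Dense block, one output per class

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        pooled = out.pooler_output           # a single vector of size 768
        return self.classifier(torch.tanh(self.hidden(pooled)))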

Set input and target features

Set the input feature to review.

Set the target feature to sentiment.


Change experiment settings & run the experiment

Click the Settings tab and change:

  • Batch Size to 6. If you set a larger batch size, you will run out of memory, since we've set the sequence length to 512.
  • Epochs to 2.
  • Learning rate to 0.000001 (five zeros).

Keep the default settings for all other parameters.
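
For comparison, these settings roughly correspond to the following configuration with the Hugging Face Trainer API (a hedged sketch; dataset preparation and evaluation are omitted):

from transformers import BertForSequenceClassification, Trainer, TrainingArguments

model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

args = TrainingArguments(
    output_dir="bert-imdb",
    per_device_train_batch_size=6,  # larger batches run out of memory at sequence length 512
    num_train_epochs=2,
    learning_rate=1e-6,             # 0.000001
)
# trainer = Trainer(model=model, args=args, train_dataset=train_dataset)  # train_dataset assumed
# trainer.train()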

Click Run.

Analyzing the first experiment

Navigate to the Evaluation view and watch the model train. Training will take quite a long time since BERT is a very large and complex model; currently, expect more than four hours for 2 epochs with a sequence length of 512.
Go and grab some fika (a Swedish coffee and cake break) while you wait.

The training loss will decrease for each epoch, but the evaluation loss may start to increase. This means that the model is starting to overfit to the training data.

Accuracy

To evaluate the performance of the model, you can look at the overall accuracy, which is displayed in the Experiment info section to the right. It should be approximately 85-90%. For comparison, a classifier that predicted classes at random would achieve about 50% accuracy, since 50% of the reviews in the dataset are positive and 50% are negative.

Confusion matrix

Since the model solves a classification problem, a confusion matrix is displayed. The top-left to bottom-right diagonal shows correct predictions; everything outside this diagonal is an error.

Recall

The recall per class corresponds to the percentage values in the confusion matrix diagonal. You can display the same metric by hovering over the horizontal bars to the right of the confusion matrix. You can also view the precision per class by hovering over the vertical bars on top of the confusion matrix.
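
If you export the validation predictions, the same metrics can be reproduced with scikit-learn. A small sketch with made-up labels (replace them with your actual validation labels and predictions):

from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score

# Hypothetical example labels, for illustration only
y_true = ["positive", "negative", "positive", "negative", "positive"]
y_pred = ["positive", "negative", "negative", "negative", "positive"]

labels = ["positive", "negative"]
print(confusion_matrix(y_true, y_pred, labels=labels))
print("Accuracy:", accuracy_score(y_true, y_pred))
print("Recall per class:", recall_score(y_true, y_pred, labels=labels, average=None))
print("Precision per class:", precision_score(y_true, y_pred, labels=labels, average=None))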

Evaluation view - Confusion matrix

Now that we know the accuracy of the model based on validation data, let’s deploy the model and try it with some new data.

Create new deployment

  1. In the Deployment view click New deployment.
  2. Select the experiment and checkpoint of your trained model to test it for predictions or to enable it for business product calls. Both the best epoch and the last epoch of each trained experiment are available for deployment.
  3. Click the Enable switch to deploy the experiment.

Test the text classifier in a browser

Enter the address of our web app, Text classifier API tester, into your preferred browser: https://bit.ly/2LL73fm

Text classifier API tester

Copy the URL in the Deployment view and paste it into the URL field of the web app. 

Copy the Token in the Deployment view. The API is called by sending an HTTP POST to the endpoint shown in the interface. The token is required to authenticate the calls. Paste the copied token into the Token field in the web app.

Type review (case sensitive) in the Input name field. This value must match the name of the input feature in the Deployment view. 

Copying field values from Deployment view to Text classifier API tester

Let's test your model. Write your own review, copy the example below, or simply copy a recent review from, e.g., IMDB.

Click Play to get a result.

Example:

I don’t want to complain about the movie, it was really just ok. I would consider it an epilogue to Endgame as opposed to a middle film in what I’m assuming is a trilogy (to match the other MCU character films). Anyhow, I was just meh about this one. I will say that mid-credit scene was one of the best across the MCU movies.

Example result

Test the text classifier in a terminal

To see what an actual request from the application and the response from the model may look like, you can run the example CURL command provided in the Code examples section of the Deployment view. Replace the VALUE parameter with your review text and run the command in a terminal.

Deployment view - Input examples (Curl)

curl -X POST \
-F "review=A fun brain candy movie...good action...fun dialog. A genuinely good day" \
-u "<Token>" \
<URL>

The output will look something like this:

{"sentiment":{"negative":0.18836714,"positive":0.82801074}}

The predicted sentiment is positive, since that class has the highest value, 0.82801074.
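
The same call can be made from Python with the requests library (a sketch; replace <URL> and <Token> with the values from the Deployment view):

import requests

URL = "<URL>"      # endpoint copied from the Deployment view
TOKEN = "<Token>"  # token copied from the Deployment view

response = requests.post(
    URL,
    auth=(TOKEN, ""),  # the token is sent via HTTP basic auth, like curl's -u "<Token>"
    files={"review": (None, "A fun brain candy movie...good action...fun dialog.")},
)  # files= sends multipart/form-data, matching curl's -F flag
print(response.json())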

Tutorial recap and next steps

In this tutorial, you’ve created a text classification model that you first evaluated and then deployed. 

Continue to experiment with different hyperparameters and tweak your experiments to see if you can improve the accuracy of the model further.

You may also want to experiment with datasets from different sources, but note that the model in this tutorial works best with short text samples.

The web app you've used here can also be used to test other single-label text classification models.