Estimated time: 60 min

Movie review sentiment analysis

Solving a text classification problem with natural language processing (NLP)

In this tutorial you will solve a text classification problem using a convolutional neural network (CNN). The input is a dataset consisting of movie reviews and the classes represent either positive or negative sentiment – i.e., how a user or customer feels or reacts to a product, service or campaign.

Target audience: Data scientists and developers

Preread: 

Before following this tutorial, it is strongly recommended that you complete the tutorial Deploy an operational AI model if you have not done so already.

Word embeddings are an important concept in NLP. If you are not familiar with them, review an introduction and overview of the different types of word embeddings before continuing.

The problem - predicting sentiment

Text classification aims to assign text, e.g., tweets, messages or reviews, to one or multiple categories. Such categories can be linked to user sentiment. In this tutorial, the expressed review is analyzed as either positive or negative.

To build your classifier, you'll need labeled data, which consists of text and their corresponding labels.

CNNs are generally used in computer vision, but they are also suitable for specific NLP tasks such as sentiment analysis. Unlike CNNs designed to process image data, which use 2D layers, the CNN that you will build is based on 1D layers and depends on text embeddings, which provide a vector representation of the input words and how they are used.
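
To make this concrete, here is a minimal, hypothetical Keras sketch (not part of the platform workflow) showing how an embedding turns a sequence of word indices into a 2D tensor over which a 1D convolution slides along the word axis. The vocabulary size, sequence length and filter settings are example values only:

# Illustration of shapes only; all numbers are example values.
import tensorflow as tf
from tensorflow.keras import layers

inputs = layers.Input(shape=(650,))                                    # 650 word indices per review
embedded = layers.Embedding(input_dim=20000, output_dim=64)(inputs)    # -> (650, 64): one vector per word
features = layers.Conv1D(filters=256, kernel_size=3, activation="relu")(embedded)  # slides along the word axis
pooled = layers.GlobalMaxPooling1D()(features)                         # -> (256,): one value per filter
tf.keras.Model(inputs, pooled).summary()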

The Large Movie Review Dataset v1.0

The content and format of the raw dataset

This dataset contains movie reviews along with their associated binary sentiment polarity labels (positive or negative). It is intended to serve as a benchmark for sentiment classification.

The core dataset contains 50,000 reviews split evenly into a training and test subset.

The overall distribution of labels is balanced, i.e., there are 25,000 positive and 25,000 negative reviews.

The raw dataset also includes 50,000 unlabeled reviews for unsupervised learning; these are not used in this tutorial.

In the entire collection, no more than 30 reviews are allowed for any given movie because reviews for the same movie tend to have correlated ratings.

In the labeled train/test sets, a negative review has a score of 4 or less out of 10, and a positive review has a score of 7 or more. Reviews with more neutral ratings are not included in the dataset.
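
For illustration, the labeling rule can be sketched in Python. The sentiment_label function and rating variable are hypothetical; in the raw dataset the rating is encoded in each file name:

# Sketch of the labeling rule used in the raw dataset.
def sentiment_label(rating):
    """Map an IMDB score (1-10) to a sentiment label, or None for neutral scores."""
    if rating <= 4:
        return "negative"
    if rating >= 7:
        return "positive"
    return None  # ratings 5-6 are excluded from the labeled sets

print(sentiment_label(3))   # negative
print(sentiment_label(9))   # positive
print(sentiment_label(5))   # None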

Each review is stored in a separate text file, located in a folder named either “positive” or “negative.”

For more information about the raw dataset, see the ACL 2011 paper "Learning Word Vectors for Sentiment Analysis".

Maas, A., Daly, R., Pham, P., Huang, D., Ng, A. and Potts, C. (2011). Learning Word Vectors for Sentiment Analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies. Portland, Oregon, USA: Association for Computational Linguistics, pp. 142–150. Available at: http://www.aclweb.org/anthology/P11-1015.

The preprocessed dataset

The dataset that you will upload has been preprocessed so that all the reviews and their respective sentiments are stored in a single CSV file with two fields, “review” and “sentiment.”

The review text may include commas, which would otherwise be interpreted as field delimiters on the platform. To prevent this, each review is enclosed in double quotes.

The processed dataset only includes the training data.
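
If you build the CSV yourself, the quoting can be handled by pandas. A minimal sketch, assuming the reviews and their labels have already been collected into two lists (the example data below is made up):

# Sketch: write reviews and labels to a single CSV, quoting text fields that may contain commas.
import csv
import pandas as pd

reviews = ["A fun, fast-paced movie.", "Slow and predictable."]   # example data
sentiments = ["positive", "negative"]

df = pd.DataFrame({"review": reviews, "sentiment": sentiments})
# QUOTE_NONNUMERIC wraps the text fields in double quotes so embedded commas
# are not interpreted as field delimiters.
df.to_csv("training_validation_data.csv", index=False, quoting=csv.QUOTE_NONNUMERIC)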

Exploring the dataset

If you are familiar with Python and want to learn how the raw data was processed, you may want to try to generate it yourself using this Jupyter notebook.

To run the notebook, either clone the entire GitHub repository or save the file in raw format with the .ipynb extension.

Among other things, it will give you some insight into why certain values are used later on in the tutorial.

Create a project

Let's begin!

First, create a project and name it so you know what kind of project it is. 

Add the Large Movie Review Dataset v1.0 to the platform

  1. Navigate to the Datasets view and expand Import Data.
  2. Copy the link below:

    https://storage.googleapis.com/bucket-8732/Large-Movie-Review-Dataset-1_0/training_validation_data.zip
  3. Click Import and paste the copied link. The zip includes the preprocessed training and validation data.

    If you have generated the preprocessed data using the notebook, you can upload the resulting CSV file directly. It is not required to create a zip file.
  4. When done click Next, name the dataset IMDB and click Done.

Datasets view

Text encoding

Click the review column and set Encoding to Text (Beta). Two new parameter settings will appear.

Sequence length corresponds to the expected number of words in each review sample. If the actual number of words in a sample is fewer than this value, the text will be padded. If it is longer, the text will be truncated, i.e., cut from the end, to fit in the sequence.

If you run the preprocessing notebook or use other means to evaluate the text, you will notice that the longest text is approximately 2,500 words, but that is an outlier. A sequence length of approximately 600-650 makes it possible to process 95% of the data without any truncation. This is important because you will want to set this value as low as possible to reduce training time and minimize memory use. Based on the information that we have about this dataset, set the Sequence length to 650.
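
If you want to verify the 95% figure yourself, a quick check of the word-count distribution can be done with pandas and NumPy. A sketch, assuming the preprocessed CSV is available locally:

# Sketch: estimate the sequence length needed to cover 95% of the reviews without truncation.
import numpy as np
import pandas as pd

df = pd.read_csv("training_validation_data.csv")
word_counts = df["review"].str.split().str.len()

print("Longest review:", word_counts.max())                 # roughly 2,500 words (an outlier)
print("95th percentile:", np.percentile(word_counts, 95))   # roughly 600-650 words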

The Language model should match the language of the input data, in this case English. Read here for more information about this parameter.

Subsets of the Large movie review dataset 1.0

In the top right corner, you’ll see the subsets. All samples in the dataset are by default split into 20% validation and 80% training subsets. Keep these default values in this project.

Save the dataset

You’ve now created a dataset ready to be used in the platform. Click Save version and navigate to the Modeling view.

Design a text classification model

The model you will create is based on a text classification example provided in the Keras documentation.

The example suggests that you use a Text embedding block followed by a Dropout block and then a 1D Convolution block. The output from these blocks is then bridged to a fully connected layer, a Dense block, via a 1D Global max pooling block. Max pooling is required here since the input to the Dense block must have only 1 dimension.

You can see a representation of this model below, but don't start building it just yet.

Example model

Before you go ahead and create a new experiment, you may want to consider some improvements that will give you a better performing model.

In this contribution to a competition on Kaggle, a similar classification model is used, but with four parallel 1D Convolution blocks instead of one. Each 1D Convolution block has a different filter size and is followed by a 1D Global max pooling block. The outputs from the four “arms” are then concatenated. This "multi-channel CNN" allows the document to be processed at different resolutions and will give you somewhat better results.

This approach was first described by Yoon Kim in his 2014 paper titled Convolutional Neural Networks for Sentence Classification.

This is the model you are going to build:

Improved model

You can use either the default ReLU activation function or TanH in the 1D Convolution blocks. The TanH activation function will produce slightly better results.

Adding blocks to the model
Input and text embedding blocks
  1. Click the Build tab in the Inspector and then the Blocks section to expand it.
  2. Click the Input block in the Inspector. This will add an Input block to the Modeling canvas. The parameters for the currently selected block will be displayed in the Inspector. Set Feature to review.
  3. Add a Text Embedding (beta) block and change the Language model to English. This block contains a learned vector representation of the words in the input.

    The Output dimension (default 64) determines the size of the word representation. The optimum value is typically found empirically, but a common rule of thumb is to set it to roughly the 4th root of your vocabulary size (the number of distinct words in the input).
  4. Add a Dropout block and set the Rate to 0.2. This will mitigate overfitting to the training data.
Convolution blocks
  1. Add a 1D Convolution block:
    Filters: 256
    Size: 1
    Activation: TanH
  2. Add a 1D Global max pooling block.
  3. Select the 1D Convolution block and the 1D Global max pooling block and then click the Copy button.
  4. Click the Paste button.
  5. Select the pasted 1D Convolution block and set Size to 2.
  6. Click the Paste button again.
  7. Select the second pasted 1D Convolution block and set Size to 3.
  8. Click the Paste button a third time.
  9. Select the third pasted 1D Convolution block and set Size to 5.
  10. Connect the Dropout block to all the 1D Convolution blocks.
Dense and target blocks
  1. Add a Concatenate block and connect it to the output of the 1D Convolution blocks.
  2. Add a Dense block after the Concatenate block and set Nodes to 128.
  3. Add another Dropout block after the Dense block and set Rate to 0.2.
  4. Add another Dense block after the second Dropout block:
    Nodes: 2 (the number of classes in the dataset).
    Activation: Softmax
  5. Add a Target block:
    Feature: sentiment
    Loss: Categorical crossentropy
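
For reference, the architecture you have just built corresponds roughly to the following Keras sketch. This is an illustration only, not the platform's actual implementation; the vocabulary size, the optimizer and the activation of the first Dense block are assumptions:

# Rough Keras sketch of the multi-channel CNN.
import tensorflow as tf
from tensorflow.keras import layers

VOCAB_SIZE = 20000   # assumption; the platform derives the vocabulary from the data
SEQ_LEN = 650        # the Sequence length set in the Datasets view

inputs = layers.Input(shape=(SEQ_LEN,), name="review")
x = layers.Embedding(VOCAB_SIZE, 64)(inputs)   # Output dimension 64, the Text embedding default
x = layers.Dropout(0.2)(x)

# Four parallel "arms" with different filter sizes, each followed by global max pooling.
arms = []
for size in (1, 2, 3, 5):
    conv = layers.Conv1D(filters=256, kernel_size=size, activation="tanh")(x)
    arms.append(layers.GlobalMaxPooling1D()(conv))

x = layers.Concatenate()(arms)
x = layers.Dense(128, activation="relu")(x)    # activation assumed; the tutorial only sets the node count
x = layers.Dropout(0.2)(x)
outputs = layers.Dense(2, activation="softmax", name="sentiment")(x)

model = tf.keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
model.summary()
# Training would use the settings from the next step: model.fit(..., batch_size=1024, epochs=5)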

Running the experiment

Click the Settings tab and change Batch Size to 1024 and Epochs to 5. Keep the default settings for all other parameters.

Click Run.

Analyzing the first experiment

Navigate to the Evaluation view and watch the model train. Training only takes a few minutes, given the combination of data size, model complexity and number of epochs.

Evaluation view - Training overview

The training loss will continue to decrease for each epoch, but the evaluation loss will start to increase again after the first two epochs. This means that the model is starting to overfit to the training data.

You can read more about the loss metrics here.

To evaluate the performance of the model, you can look at the overall accuracy, which is displayed in the Experiment info section to the right. It should be approximately 85-90%. For comparison, a “baseline classifier,” which simply predicts that all samples belong to the most common class, would have an accuracy of approximately 50%, since positive and negative labels are evenly distributed in the validation subset.

Confusion matrix

Since the model solves a classification problem, a confusion matrix is displayed. The top-left to bottom-right diagonal shows correct predictions. Everything outside this diagonal represents errors.

 The recall per class corresponds to the percentage values in the confusion matrix diagonal. You can display the same metric by hovering over the horizontal bars to the right of the confusion matrix. You can also view the precision per class by hovering over the vertical bars on top of the confusion matrix.

Evaluation view - Confusion matrix

Key metrics

A model with high recall and low precision "catches" most of the examples in a class, but many of its predictions for that class are false positives.

High precision and low recall is the opposite: the model underpredicts the class but is accurate in the positive predictions it does make. In other words, it is too picky.

The key metric here depends on how you will apply the results in your deep learning application.
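
If you want to reproduce these metrics outside the platform, scikit-learn provides them directly. A sketch with hypothetical label arrays standing in for the validation labels and model predictions:

# Sketch: confusion matrix and per-class precision/recall from example label arrays.
from sklearn.metrics import classification_report, confusion_matrix

y_true = ["positive", "positive", "negative", "negative", "positive"]
y_pred = ["positive", "negative", "negative", "negative", "positive"]

print(confusion_matrix(y_true, y_pred, labels=["negative", "positive"]))
print(classification_report(y_true, y_pred, labels=["negative", "positive"]))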

The total number of correct vs. incorrect predictions is displayed in the error graph below the confusion matrix.

Evaluation view - Error graph

Now that we know the accuracy of the model based on validation data, let’s deploy the model and try it with some new data.

Create new deployment

  1. In the Deployment view click New deployment.
  2. Select the experiment and checkpoint of the trained model that you want to test for predictions or enable for business product calls. Both the best epoch and the last epoch of each trained experiment are available for deployment.
  3. Click the Enable switch to deploy the experiment.

Test the text classifier in a browser

Open the Text classifier API tester web app in your preferred browser: https://bit.ly/2LL73fm

Text classifier API tester

Copy the URL in the Deployment view and paste it into the URL field of the web app. 

Copy the Token in the Deployment view. The API is called by sending an HTTP POST to the endpoint shown in the interface. The token is required to authenticate the calls. Paste the copied token into the Token field in the web app.

Type review (case sensitive) in the Input name field. This value must match the name of the input feature in the Deployment view. 

Copying field values from Deployment view to Text classifier API tester

Copy a recent review from, for example, IMDB and paste it into the text field, or write your own, then click the Play button to get a result.

Example:

I don’t want to complain about the movie, it was really just ok. I would consider it an epilogue to Endgame as opposed to a middle film in what I’m assuming is a trilogy (to match the other MCU character films). Anyhow, I was just meh about this one. I will say that mid-credit scene was one of the best across the MCU movies.

Example result

Test the text classifier in a terminal

To see what an actual request from the application and the response from the model may look like, you can run the example curl command provided in the Code examples section of the Deployment view. Replace the VALUE parameter with review text and run the command in a terminal.

Deployment view - Input examples (Curl)

curl -X POST \
-F "review=A fun brain candy movie...good action...fun dialog. A genuinely good day" \
-u "<Token>" \
<URL>

The output will look something like this:

{"sentiment":{"negative":0.18836714,"positive":0.82801074}}

The predicted sentiment is positive, since that class has the highest value, 0.82801074.
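
The same call can also be made from Python. A sketch using the requests library, with the deployment URL and token left as placeholders; the basic-auth and multipart-form choices below simply mirror the curl command's -u and -F flags:

# Sketch: call the deployed model from Python (replace <URL> and <Token> with the
# values shown in the Deployment view).
import requests

url = "<URL>"
token = "<Token>"
review = "A fun brain candy movie...good action...fun dialog. A genuinely good day"

# The token is sent as the user name in HTTP basic authentication (curl's -u flag),
# and the review is sent as a multipart form field (curl's -F flag).
response = requests.post(url, auth=(token, ""), files={"review": (None, review)})
result = response.json()
print(result["sentiment"])
print("Predicted class:", max(result["sentiment"], key=result["sentiment"].get))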

Tutorial recap and next steps

In this tutorial, you’ve created a text classification model that you first evaluated and then deployed. 

Continue to experiment with different hyperparameters and tweak your experiments to see if you can improve the accuracy of the model further.

You may also want to experiment with datasets from different sources but note that the model in this tutorial works best with short text samples (approximately 1,000 words or less).

The web app you’ve just used can also be used to test other single-label text classification models.