Estimated time: 60 min

Book genre classification

Solve a text classification problem with BERT

Bookstores rarely split fantasy and science fiction apart, but we at Peltarion argue that they are clearly different things. To make the point, we decided to create an AI model that classifies the genre of a book solely from its summary.

In this tutorial, we’ll show you how to use the Peltarion Platform to build a model on your own and correct all major bookstores in your country!

Target audience: Data scientists and developers

Preread

Before following this tutorial, it is strongly recommended that you complete the tutorial Deploy an operational AI model, to get a feel for how the platform works.

Embeddings

If you want, it can also be a good idea to read about word embeddings, an important concept in NLP (Natural Language Processing). For an introduction and overview of different types of word embeddings, check out the links below:

The problem - sci-fi or centaurs?

Text classification aims to assign text, e.g., tweets, messages, or reviews, to one or multiple categories. One such category can be whether or not a book is considered science fiction.

Goal with this experiment

You will learn how to build and deploy a model based on BERT.
BERT pushed the state of the art in Natural Language Processing (NLP) by combining two powerful technologies:

  • It is based on a deep Transformer network, a type of network that can efficiently process long texts by using attention.
  • It is bidirectional, meaning that it takes into account the whole text passage to understand the meaning of each word (illustrated in the sketch below).
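To get a feel for what bidirectionality means in practice, here is a small, optional sketch using the Hugging Face transformers library (not part of the platform, and not needed for this tutorial): BERT predicts a masked word from the context on both sides of it.

```python
from transformers import pipeline

# Optional illustration with the Hugging Face transformers package:
# BERT fills in the masked word using context on both sides of it.
fill = pipeline("fill-mask", model="bert-base-uncased")

for prediction in fill("The spaceship landed on the [MASK] planet."):
    print(prediction["token_str"], round(prediction["score"], 3))
```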

Dataset - CMU book summary dataset

The content and format of the dataset

The data comes from the CMU Book Summary Dataset, a dataset of over 16,000 book summaries. For this project, we wanted a dataset of science fiction and non-science fiction book summaries, so we preprocessed the data for our task: it contains book summaries along with an associated binary category, Science Fiction or not.

The dataset is intended to serve as a benchmark for binary text classification. The overall distribution of labels is balanced, i.e., there are approximately 2,500 science fiction and 2,500 non-science fiction book summaries. Each summary is stored in one column, and its science fiction label (1 or 0) in another.
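If you are curious how such a preprocessing step could look, here is a sketch in Python. It is not the exact script we used: it assumes the raw booksummaries.txt file from the CMU dataset (tab-separated, with genres stored as a JSON dictionary), uses column names of our own choosing, and leaves out the subsampling that balances the two classes.

```python
import csv
import json
import pandas as pd

# Hypothetical sketch: build a binary sci-fi dataset from the raw CMU file.
cols = ["wiki_id", "freebase_id", "title", "author", "pub_date", "genres", "summary"]
df = pd.read_csv("booksummaries.txt", sep="\t", names=cols, quoting=csv.QUOTE_NONE)

def is_sci_fi(genres_json):
    if pd.isna(genres_json):
        return None  # drop books without genre information
    return 1 if "Science Fiction" in json.loads(genres_json).values() else 0

df["Science Fiction"] = df["genres"].apply(is_sci_fi)
df = df.dropna(subset=["Science Fiction"])
df["Science Fiction"] = df["Science Fiction"].astype(int)

# Crop long summaries so they fit a 512-token sequence length (see below).
df["SummaryCropped"] = df["summary"].str.split().str[:512].str.join(" ")

df[["SummaryCropped", "Science Fiction"]].to_csv("sci_fi_or_no_sci_fi.csv", index=False)
```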

Create a project

Let's begin!

First, create a project and name it so you know what kind of project it is. Naming is important!

  1. Navigate to the Datasets view and click New dataset.
  2. Locate the Import data from URL box and copy-paste the link to the dataset below:

    https://storage.cloud.google.com/bucket-8732/sci_fi_or_no_sci_fi.zip
  3. Click the arrow to start the import.
  4. When done, name the dataset BookSummaries and click Done.
  5. Note that the labels are binary: 1 indicates that the book is classified as science fiction, and 0 that it is not.

Text encoding

Click the SummaryCropped column and set:

  • Encoding to Text (Beta)
  • Sequence length to 512
  • Language model to English BERT uncased.

Datasets view with the BookSummaries dataset.

Sequence length

Sequence length corresponds to the expected number of words in each summary sample. If the actual number of words in a sample is fewer than indicated by this parameter, the text will be padded. If it is longer, the text will be truncated, i.e., cut from the end to fit in the sequence.

The BERT block accepts any sequence length between 3 and 512. Smaller sequences compute faster, but they might cut some words from your text. Set this value as low as possible to reduce training time and memory use, but large enough to include the information the model needs.
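This is not how the platform processes text internally, but if you want to see what padding and truncation to a fixed sequence length mean in practice, here is a sketch using the Hugging Face transformers tokenizer (an assumption on our side, not something the tutorial requires):

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

summary = "A lone astronaut wakes up on a derelict ship orbiting a dying star."

encoded = tokenizer(
    summary,
    max_length=512,        # the sequence length chosen above
    padding="max_length",  # shorter texts are padded with [PAD] tokens
    truncation=True,       # longer texts are cut from the end
)
print(len(encoded["input_ids"]))  # always 512
```

Note that BERT counts subword tokens rather than whole words, so a summary may use slightly more tokens than it has words.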

Language model

The Language model should match the language of the input data, in this case, English BERT uncased. Read here for more information about this parameter.

Subsets of the dataset

In the top right corner, you’ll see the subsets. All samples in the dataset are by default split into 20% validation and 80% training subsets. Keep these default values in this project.
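For reference, an 80/20 split could look like the sketch below, where the file name refers to the hypothetical output of the earlier preprocessing sketch. On the platform, the split is handled for you.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Illustration only: the platform applies the 80/20 split automatically.
df = pd.read_csv("sci_fi_or_no_sci_fi.csv")
train_df, val_df = train_test_split(df, test_size=0.2, random_state=42)
print(len(train_df), len(val_df))
```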

Save the dataset

You’ve now created a dataset ready to be used in the platform. Click Save version, then click Use in new model; the Experiment wizard will pop up.

BERT - Design a text classification model

Define dataset tab

Make sure that the BookSummaries dataset is selected in the Experiment wizard.

Choose snippet tab

Click on the Choose snippet tab and select the BERT English uncased snippet.

Set the input feature to SummaryCropped.

Set the target feature to Science Fiction.

The BERT English uncased snippet includes the whole BERT network. The BERT block implements the base version of the BERT network: it is composed of 12 encoding layers from a Transformer network, each with 12 attention heads, for a total of 110 million parameters. The snippet lets you use this massive network with weights pre-trained to understand English text (a rough code equivalent is sketched after the list below).

The BERT snippet includes:

  • An Input block.
  • A BERT Encoder block with pre-trained weights, which give BERT a general understanding of English. The BERT Encoder block looks at the input sequence as a whole and produces an output that captures an understanding of the sentence. The block outputs a single vector of size 768.
  • A Dense block with pre-trained weights.
  • A Dense block that is untrained.
  • A Target block.
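For intuition, the snippet corresponds roughly to the Keras sketch below, built with the TensorFlow and Hugging Face transformers packages (neither is needed for this tutorial). The layer sizes follow BERT base, but the exact blocks on the platform may differ.

```python
import tensorflow as tf
from transformers import TFBertModel

# Rough code equivalent of the BERT English uncased snippet, for intuition only.
bert = TFBertModel.from_pretrained("bert-base-uncased")  # pre-trained encoder

input_ids = tf.keras.Input(shape=(512,), dtype=tf.int32, name="input_ids")
attention_mask = tf.keras.Input(shape=(512,), dtype=tf.int32, name="attention_mask")

# The encoder looks at the whole sequence and returns a single
# 768-dimensional vector per sample (the pooled output).
pooled = bert(input_ids, attention_mask=attention_mask).pooler_output

x = tf.keras.layers.Dense(768, activation="tanh")(pooled)   # stands in for the pre-trained Dense block
output = tf.keras.layers.Dense(1, activation="sigmoid")(x)  # the untrained Dense block feeding the target

model = tf.keras.Model([input_ids, attention_mask], output)
```
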
Initialize weights tab

Click on the Initialize weights tab. Select the pre-trained weights English Wikipedia and BookCorpus and make all weights trainable by checking the Weights trainable (all blocks) box.
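In the Keras sketch above, making all weights trainable roughly corresponds to leaving the pre-trained encoder unfrozen:

```python
# Rough equivalent of checking "Weights trainable (all blocks)" in the sketch above.
bert.trainable = True    # fine-tune all BERT weights together with the dense layers
# bert.trainable = False # alternative: freeze BERT and train only the layers on top
```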

Create experiment

Click Create and the prepopulated BERT model will appear in the Modeling canvas.

Check experiment settings & run the experiment

Click the Settings tab and check that:

  • Batch size is 6. A larger batch size will run out of memory, since we've set the sequence length to 512.
  • Epochs is 2. Training takes a long time, so don't train for too long the first time you check if your model is good.
  • Learning rate is 0.000001 (5 zeros). This low value helps avoid catastrophic forgetting of the pre-trained weights.

Click Run.
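These settings correspond roughly to the following continuation of the Keras sketch, where train_ds and val_ds are hypothetical tf.data datasets of tokenized summaries, already batched with .batch(6):

```python
import tensorflow as tf

# Continuation of the earlier sketch; train_ds and val_ds are assumed to be
# tokenized tf.data datasets, already batched with .batch(6).
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-6),  # 0.000001, to avoid catastrophic forgetting
    loss="binary_crossentropy",
    metrics=["accuracy"],
)
model.fit(train_ds, validation_data=val_ds, epochs=2)
```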

Analyzing the first experiment

Navigate to the Evaluation view and watch the model train. The training will take quite a long time since BERT is a very large and complex model.


Go and grab some fika (a Swedish coffee and cake break) while you wait.

The training loss will decrease for each epoch, but the evaluation loss may start to increase. This means that the model is starting to overfit to the training data.

You can read more about the loss metrics here.

Accuracy

To evaluate the performance of the model, you can look at overall accuracy, which is displayed in the Experiment info section to the right. It should be approximately 85-90%.

Confusion matrix

Since the model solves a classification problem, a confusion matrix is displayed. The top-left to bottom-right diagonal shows correct predictions. Everything outside this diagonal is an error.

Recall

The recall per class corresponds to the percentage values in the confusion matrix diagonal. You can display the same metric by hovering over the horizontal bars to the right of the confusion matrix. You can also view the precision per class by hovering over the vertical bars on top of the confusion matrix.

Evaluation view - Confusion matrix
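These metrics are standard, so if you want to see how they relate to each other outside the platform, here is a minimal scikit-learn sketch with made-up labels (1 = science fiction, 0 = not):

```python
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Made-up predictions, for illustration only.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

print(accuracy_score(y_true, y_pred))         # overall accuracy
print(confusion_matrix(y_true, y_pred))       # diagonal = correct predictions
print(classification_report(y_true, y_pred))  # precision and recall per class
```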

Now that we know the accuracy of the model based on validation data, let’s deploy the model and try it with some new data.

Create new deployment

  1. In the Deployment view click New deployment.
  2. Select the experiment and checkpoint of your trained model that you want to test for predictions or enable for calls from your product. Both the best epoch and the last epoch of each trained experiment are available for deployment.
  3. Click the Enable switch to deploy the experiment.

Test the text classifier in a browser

Let's test your model. Click the Test deployment button to open the Text classifier API tester, with all relevant data copied from your deployment.

Text classifier API tester

Add summary

Now, write your own summary, copy the example below, or simply copy a recent summary from, e.g., Amazon:

Example:

Harry Potter has never been the star of a Quidditch team, scoring points while riding a broom far above the ground. He knows no spells, has never helped to hatch a dragon, and has never worn a cloak of invisibility.

All he knows is a miserable life with the Dursleys, his horrible aunt and uncle, and their abominable son, Dudley -- a great big swollen spoiled bully. Harry's room is a tiny closet at the foot of the stairs, and he hasn't had a birthday party in eleven years.

But all that is about to change when a mysterious letter arrives by owl messenger: a letter with an invitation to an incredible place that Harry -- and anyone who reads about him -- will find unforgettable.

Click play

Click Play to get a result.

Example result. 0 indicates that the model predicts that Harry Potter is not a science fiction book.
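You can also call the deployment programmatically instead of using the browser tester. The sketch below is hypothetical: the URL, token, and payload format are placeholders, so copy the real values and the exact request format from the Deployment view.

```python
import requests

# Hypothetical sketch: URL, token, and payload format are placeholders.
DEPLOYMENT_URL = "https://<your-deployment-url>"
TOKEN = "<your-deployment-token>"

summary = "Harry Potter has never been the star of a Quidditch team..."

response = requests.post(
    DEPLOYMENT_URL,
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={"rows": [{"SummaryCropped": summary}]},
)
print(response.json())  # expect a score close to 0 for a non-sci-fi summary
```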

Tutorial recap and next steps

Congrats, you’ve completed an end-to-end data science project! In this tutorial, you’ve learned how to build, train, and deploy a model on the Peltarion Platform to analyze text data.

It was not that hard, right?

Next steps

A next step could be to run the experiment for more epochs and see if that improves the result, or to try a different learning rate.

Linked ideas

This is, of course, only a demo of what you can do with language data on the platform. Knowing if a book is sci-fi or not is just for fun. However, this tutorial is an example of how deep learning can be used to automate tasks far faster and more accurately than manual work. Business cases similar to this just need some extra thinking.