Book genre classification

Solve a text classification problem with BERT

Bookstores rarely split them apart, but we at Peltarion argue that fantasy and science fiction clearly are different things. To make the point, we decided to create an AI model that classifies the genre of a book based solely on its summary.

In this tutorial, we’ll show you how to use the Peltarion Platform to build a model on your own and correct all major bookstores in your country!

- Target audience: Beginners

Figure 1. What kind of book is it?

Before following this tutorial, it is strongly recommended that you complete the tutorial Deploy an operational AI model, just to get a feel for how the platform works.

Embeddings

If you want, it could be a good idea to read about word embeddings and BERT’s attention mechanism, which are important concepts in NLP (Natural Language Processing). For introductory and overview material, check out the links below:

The problem - sci-fi or centaurs?

Text classification aims to assign text, e.g., tweets, messages, or reviews, to one or multiple categories. A category can be, for example, whether or not a book is considered science fiction.

Goal with this experiment

You will learn how to build and deploy a model based on BERT.

BERT pushed the state of the art in Natural Language Processing (NLP) by combining two powerful technologies:

• It is based on a deep Transformer network, a type of network that can process long texts efficiently by using attention.

• It is bidirectional, meaning that it takes into account the whole text passage to understand the meaning of each word.

Dataset - CMU book summary dataset

The content and format of the dataset

The data comes from the CMU Book Summary Dataset, a dataset of over 16 000 book summaries. For this project, we wanted science fiction book summaries, so we preprocessed the data for our task: the resulting dataset contains book summaries along with their binary category, Science Fiction or not.

The dataset is intended to serve as a benchmark for binary text classification. The overall distribution of labels is balanced, i.e., there are approximately 2 500 science fiction and 2 500 non-science fiction book summaries. The summaries are stored in one column, and each has a science fiction classification of either "yes" or "no".

Create a project

Let’s begin!

First, create a project and name it so you know what kind of project it is. Naming is important!

After creating the project, you will be taken to the Datasets view, where you can import data.

Click on Data library and select Book summaries dataset.
Click Accept and import.

Feature encoding

Inside a dataset, you can switch views by clicking on the Features or Table button.

Verify that the default feature encoding is correct:

• The Summary feature should use the Text encoding.

• The Science Fiction feature should use the Binary encoding, and use 1 as the Positive class.
This tells the platform that 1 means an example is Science Fiction, and that 0 means it isn’t.

If a feature uses the wrong settings, click the wrench icon to change it.
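To make the idea of a positive class concrete, here is a minimal sketch in plain Python. It is not platform code: the sample labels are made up, and the choice of "yes" as the positive value is an assumption based on the dataset description above.

```python
from collections import Counter

# Hypothetical raw labels, written the way the dataset section describes them.
raw_labels = ["yes", "no", "no", "yes", "no", "yes"]

POSITIVE_CLASS = "yes"  # assumption: "yes" marks a science fiction summary


def encode_binary(label: str, positive: str = POSITIVE_CLASS) -> int:
    """Return 1 for the positive class and 0 for everything else."""
    return 1 if label == positive else 0


encoded = [encode_binary(label) for label in raw_labels]
balance = Counter(encoded)  # how many examples fall in each class

print(encoded)
print(balance[1], "positive /", balance[0], "negative")
```

Counting the encoded values is also a quick way to confirm that the two classes are roughly balanced.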

Figure 2. Datasets view with the BookSummaries dataset.

Text word count

Note the Word count histogram of the SummaryCropped feature. This histogram shows how many examples of text have a certain length, given in number of words.

If you pass your mouse over this histogram, you will see that many examples have over 500 words.

Figure 3. The word count histogram shows the distribution of the examples’ lengths. 1163 examples have between 576 and 600 words. The longest example has 852 words.

BERT models can process at most 512 tokens (roughly equivalent to words) per example.

While we usually want to train models using fewer tokens to speed up calculations, in this case we will remember to use as many tokens as possible when we design the model, so that our text summaries get truncated as little as possible.
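If you want to reproduce this kind of check outside the platform, the sketch below counts words per text, groups the counts into bins, and reports how many texts exceed the 512 limit. The sample texts and the bin width are made up for illustration. Note also that it counts whitespace-separated words, while BERT counts tokens, and a text usually has somewhat more tokens than words.

```python
from collections import Counter

MAX_TOKENS = 512  # BERT's limit, from the text above
BIN_WIDTH = 25    # assumption: bin width chosen for illustration only


def word_count(text: str) -> int:
    """Count whitespace-separated words (an approximation of the token count)."""
    return len(text.split())


def histogram(counts, bin_width=BIN_WIDTH):
    """Map the lower edge of each bin to the number of texts in that bin."""
    bins = Counter((n // bin_width) * bin_width for n in counts)
    return dict(sorted(bins.items()))


# Made-up stand-ins for book summaries of different lengths.
summaries = ["word " * 40, "word " * 510, "word " * 590, "word " * 852]
counts = [word_count(s) for s in summaries]
too_long = sum(n > MAX_TOKENS for n in counts)

print(histogram(counts))
print(f"{too_long} of {len(counts)} summaries exceed {MAX_TOKENS} words")
```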

Subsets of the dataset

On the left side of the Datasets view, you’ll see the subsets. All samples in the dataset are by default split into 10% validation, 10% test, and 80% training subsets. Keep these default values in this project.
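The platform does this split for you. For intuition, an 80/10/10 split can be sketched as a shuffle followed by slicing. The function name, the fixed seed, and the use of integer percentages are our own choices for this example, not a description of how the platform implements it.

```python
import random


def split_dataset(examples, val_percent=10, test_percent=10, seed=0):
    """Shuffle the examples, then slice them into training, validation and test lists."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_val = len(shuffled) * val_percent // 100
    n_test = len(shuffled) * test_percent // 100
    validation = shuffled[:n_val]
    test = shuffled[n_val:n_val + n_test]
    training = shuffled[n_val + n_test:]  # everything that is left over
    return training, validation, test


# 5000 stand-in examples, roughly the size of the dataset used here.
training, validation, test = split_dataset(range(5000))
print(len(training), len(validation), len(test))
```

Each example ends up in exactly one subset, so the validation and test examples are never seen during training.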

Save the dataset

You’ve now created a dataset ready to be used in the platform. Click Save version and then click Use in new experiment and the Experiment wizard will pop up.

BERT - Design a text binary classification model

The Experiment wizard opens to help you set up your training experiment and to recommend snippets as prebuilt models.

We’ll now go over the Experiment wizard tab by tab.

Dataset tab

The platform selects the correct subsets by default.

• Training uses 80% of the available examples.

• Validation uses the 10% validation examples to check how the model generalizes to unseen examples.

Input(s) / target tab

• Input(s) column: make sure that only SummaryCropped () is selected.

• Target column: make sure that Science Fiction (1) is selected.

Snippet tab

The platform recommends the appropriate Problem type and snippets based on the input and target features.

• Make sure the Problem type is Binary text classification, since we want to classify text into two possible categories (Science Fiction or not).

• Select the English BERT uncased snippet.
All the training examples in the dataset are written in English, and we plan to only ever use the model with English text.

If you want to make the model work with text in any language, you can create another experiment later on and use the Multilingual BERT cased snippet instead.
You won’t necessarily need training data for every language that you plan to use. See for example the Classify text in any language tutorial.

Create experiment

Click Create to build the model in the Modeling canvas.

Note that the last Dense block must output a tensor of shape 1. This matches the shape of the Science Fiction feature, which is the target feature that the model learns.
You can find the shape of features in the Datasets view.

Sequence length

Sequence length is the number of tokens (roughly the number of words) kept during text tokenization.
If there are fewer tokens in an example than indicated by this parameter, the text will be padded. If there are more tokens, the text will be truncated, i.e., cut from the end to fit the sequence length.

Smaller sequences compute faster, but as we saw in the Datasets view many summaries have 500 words or more. To avoid ignoring parts of the summaries that could potentially be rich in information, we’ll use the maximum sequence length of 512 tokens.

Click on the BERT Tokenizer block in the modeling canvas. Set the Sequence length to 512.
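The padding and truncation behavior described above can be sketched in a few lines. This is a simplified illustration, not the platform's tokenizer: it works on lists of token ids, assumes 0 as the padding id, and ignores the special tokens that a real BERT tokenizer adds.

```python
def fit_to_length(token_ids, sequence_length, pad_id=0):
    """Cut from the end or pad at the end so the result has exactly sequence_length ids."""
    truncated = token_ids[:sequence_length]  # drops everything past the limit
    padding = [pad_id] * (sequence_length - len(truncated))  # empty if already full
    return truncated + padding


short = fit_to_length([7, 8, 9], sequence_length=5)          # gets padded
long = fit_to_length(list(range(10)), sequence_length=5)     # gets truncated

print(short)
print(long)
```

With a sequence length of 512, a summary of 852 words loses its ending, which is why we pick the largest sequence length available.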

Check experiment settings & run the experiment

Click the Settings tab and check the following:

• Batch size is set to 6. A larger batch size would run out of memory. For text processing models, we recommend keeping the product Sequence length x Batch size at about 3000 or below to avoid running out of memory. Here, 512 x 6 = 3072, which is right at that limit.

• Epochs is set to 2. BERT models are already pretrained, and light fine-tuning generally gives the best results.

Click Run.

The training will take some time since BERT is a very large and complex model.
Expect about 20 minutes of training time per epoch.

Analyzing the first experiment

Navigate to the Evaluation view and watch the model train.

While the first epoch runs, you can read more about the loss and metrics, or grab some fika (a Swedish coffee and cake break).

Binary accuracy

To evaluate the performance of the model, you can look at the Binary accuracy by clicking on its name under the plot.

Binary accuracy gives the percentage of predictions that are correct. It should be about 85-90% by the end of training.

Precision

The precision gives the proportion of positive predictions, i.e., examples classified as Science Fiction, that were actually correct.

Recall

The recall gives the proportion of positive examples, i.e., actual Science Fiction texts, that are identified by the model.
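The three metrics above follow directly from four counts: true positives (tp), false positives (fp), false negatives (fn), and true negatives (tn). Here is a small worked sketch. The counts are invented for illustration and are not results from this experiment.

```python
def binary_metrics(tp, fp, fn, tn):
    """Compute binary accuracy, precision and recall from the four outcome counts."""
    total = tp + fp + fn + tn
    return {
        # Share of all predictions that are correct.
        "accuracy": (tp + tn) / total,
        # Share of predicted positives that are actually positive.
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        # Share of actual positives that the model found.
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Made-up counts for 500 validation examples.
metrics = binary_metrics(tp=220, fp=30, fn=40, tn=210)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

A model can have high precision and low recall at the same time (it rarely cries "science fiction", but is right when it does), so it is worth looking at both.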

Predictions inspection

After at least the first epoch has finished, you can use the predictions inspection to see the confusion matrix and the predictions for individual examples.

Confusion matrix

The confusion matrix shows how often examples of each actual category are assigned to each predicted category. Correct predictions fall on the diagonal.

Figure 4. The confusion matrix shows the frequency of every predicted/actual category combination.
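To see where the numbers in a confusion matrix come from, you can count every actual/predicted pair yourself. The sketch below does this for a binary problem. The labels are invented, and the row and column order (actual class as rows, predicted class as columns) is a choice made for this example.

```python
from collections import Counter


def confusion_matrix(actual, predicted):
    """Rows are actual classes (0, 1); columns are predicted classes (0, 1)."""
    pairs = Counter(zip(actual, predicted))
    return [[pairs[(a, p)] for p in (0, 1)] for a in (0, 1)]


# Made-up labels: 1 means Science Fiction, 0 means not Science Fiction.
actual = [1, 1, 1, 0, 0, 0, 1, 0]
predicted = [1, 0, 1, 0, 1, 0, 1, 0]

matrix = confusion_matrix(actual, predicted)
for row in matrix:
    print(row)
```

The diagonal entries, matrix[0][0] and matrix[1][1], hold the correct predictions. The off-diagonal entries are the two kinds of mistakes.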

ROC curve

Since the problem is a binary classification problem, the ROC curve will also be shown.

The ROC curve is a nice way to see how good the model generally is. The closer the ROC curve passes to the top left corner, the better the model is performing.

It also shows how changing the threshold will affect the recall and the true negative rate.

Figure 5. The ROC Curve shows how many positive examples are correctly identified, as a function of how many negative examples are wrongly classified as positive. The threshold determines the point on the ROC curve where the model operates.
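The points on a ROC curve can be computed by sweeping the threshold over the model's output scores and recording, for each threshold, the false positive rate and the recall. The sketch below is a minimal illustration with invented scores and labels. It assumes both classes are present in the labels.

```python
def roc_points(scores, labels):
    """Return (false positive rate, recall) pairs, one per candidate threshold."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]  # a threshold above every score predicts no positives
    for threshold in sorted(set(scores), reverse=True):
        predicted = [score >= threshold for score in scores]
        tp = sum(p and y == 1 for p, y in zip(predicted, labels))
        fp = sum(p and y == 0 for p, y in zip(predicted, labels))
        points.append((fp / negatives, tp / positives))
    return points


# Made-up model scores and true labels (1 means Science Fiction).
scores = [0.9, 0.8, 0.6, 0.4, 0.3, 0.1]
labels = [1, 1, 0, 1, 0, 0]

curve = roc_points(scores, labels)
for fpr, recall in curve:
    print(f"false positive rate {fpr:.2f}, recall {recall:.2f}")
```

Lowering the threshold moves you along the curve toward the top right: recall goes up, but so does the false positive rate, which is one minus the true negative rate.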

Create new deployment

1. In the Deployment view, click New deployment.

2. Select the Experiment that you want to deploy for use in production.
In this tutorial we only trained one model so there is only one experiment in the list, but if you train more models with different tweaks, they will become available for deployment.

3. Select the Checkpoint marked with (best), since this is when the model had the best performance.
The platform creates a checkpoint after every epoch of training. This is useful since performance can sometimes get worse when a model is trained for too many epochs.

4. Click Create to create the new deployment from the selected experiment and checkpoint.

5. Click on Enable to deploy the experiment.

Test the text classifier in a browser

Let’s test your model. Click the Test deployment button, and you’ll open the Text classifier API tester with all relevant data copied from your deployment.

Figure 6. Text classifier API tester

Now, write your own summary, copy the example below, or simply copy a recent summary from, e.g., Amazon:

Example:
Harry Potter has never been the star of a Quidditch team, scoring points while riding a broom far above the ground. He knows no spells, has never helped to hatch a dragon, and has never worn a cloak of invisibility.

All he knows is a miserable life with the Dursleys, his horrible aunt and uncle, and their abominable son, Dudley -- a great big swollen spoiled bully. Harry's room is a tiny closet at the foot of the stairs, and he hasn't had a birthday party in eleven years.

But all that is about to change when a mysterious letter arrives by owl messenger: a letter with an invitation to an incredible place that Harry — and anyone who reads about him — will find unforgettable.

Click play

Click Play to get a result.

Figure 7. Example result. 0 indicates that the model predicts that Harry Potter is not a science fiction book.

Tutorial recap and next steps

Congrats, in this tutorial you’ve learned how to build, train and deploy a model on the Peltarion Platform to analyze text data. A complete end-to-end data science project!

It wasn’t that hard, right?

Next steps

As a next step, you could run the experiment for more epochs, or change the learning rate, and see if that improves the result.

This is, of course, only a demo of what you can do with language data on the platform. Knowing if a book is sci-fi or not is just fun. However, the same approach can automate text classification tasks that would be far slower to do by hand. Business cases similar to this just need some extra thinking.