Peltarion

# Book genre classification

Solve a text classification problem with BERT

Bookstores rarely split them apart, but we at Peltarion argue that fantasy and science fiction are clearly different things. To make the point, we decided to create an AI model that classifies the genre of a book based solely on its summary. A model that puts its foot down and states whether the book is sci-fi or not.

In this tutorial, we’ll show you how to use the Peltarion Platform to build a model on your own and correct all major bookstores in your country!

- Target audience: Beginners
- Tutorial type: Learn AI & platform
- Problem type: Text classification

You will learn to
- Build and deploy a model based on BERT

Figure 1. What kind of book is it?

Before following this tutorial, we strongly recommend that you complete the tutorial Deploy an operational AI model, just to get a feel for how the platform works.

## Embeddings

If you want, it can be a good idea to read about word embeddings and BERT’s attention mechanism, which are important concepts in NLP (Natural Language Processing). For an introduction and overview, check out the links below:

## The problem - sci-fi or not?

Text classification aims to assign text, e.g., tweets, messages, or reviews, to one or multiple categories. Such a category can be whether or not a book is considered science fiction.

## Create a project

First, create a project and name it so you know what kind of project it is. Naming is important!

After creating the project, you will be taken to the Datasets view, where you can import data.

Click the Import free datasets button.

Select Book summaries dataset.
Click Accept and import.

This will import the dataset into your project, and you can now edit it.

### CMU book summary dataset

The data comes from the CMU Book Summary Dataset, a dataset of over 16 000 book summaries. For this project, we wanted a dataset with science fiction book summaries, so we preprocessed the data for our task: it contains book summaries along with their associated binary category, Science Fiction or not.

The dataset is intended to serve as a benchmark for binary text classification. The overall distribution of labels is balanced, i.e., there are approximately 2 500 science fiction and 2 500 non-science fiction book summaries. Each row stores a summary along with its binary Science Fiction label.

### Feature encoding

Inside a dataset, you can switch views by clicking on the Features or Table button.

Verify that the default feature encoding is correct:

• The Summary feature should use the Text encoding.

• The Science Fiction feature should use the Binary encoding, with 1 as the Positive class.
This tells the platform that 1 means an example is Science Fiction, and 0 means it isn’t.

If a feature uses the wrong settings, click the wrench icon to change it.

Figure 2. Datasets view with the BookSummaries dataset.

### Text word count

Note the Word count histogram of the SummaryCropped feature. This histogram shows how many examples of text have a certain length, given in number of words.

If you pass your mouse over this histogram, you will see that many examples have over 500 words.

Figure 3. The word count histogram shows the distribution of the examples’ lengths. 1163 examples have between 576 and 600 words. The longest example has 852 words.

BERT models can process at most 512 tokens (roughly equivalent to words) per example.

While we usually want to train models using fewer tokens to speed up calculations, in this case we will use as many tokens as possible when we design the model, so that our text summaries are truncated as little as possible.
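To make the histogram concrete, here is a minimal Python sketch that buckets summaries by word count, the same idea the platform’s histogram visualizes. The bucket size and the sample summaries are illustrative assumptions, not how the platform computes it.

```python
from collections import Counter

def word_count_histogram(summaries, bucket_size=25):
    """Count how many summaries fall into each word-count bucket."""
    buckets = Counter()
    for text in summaries:
        n_words = len(text.split())
        lo = (n_words // bucket_size) * bucket_size
        buckets[(lo, lo + bucket_size)] += 1
    return dict(sorted(buckets.items()))

summaries = [
    "A short summary.",
    "Another slightly longer book summary about space travel.",
]
print(word_count_histogram(summaries))  # {(0, 25): 2}
```

Summaries longer than 512 words would land in buckets past the BERT limit, which is exactly what Figure 3 shows for this dataset.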

### Subsets of the dataset

On the left side of the Datasets view, you’ll see the subsets. All samples in the dataset are by default split into 10% validation, 10% test, and 80% training subsets. Keep these default values in this project.
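The default split can be sketched in plain Python. The helper below is illustrative only (the shuffling, seed, and function name are assumptions, not how the platform implements it), but it shows the idea: shuffle once, then slice off the validation and test subsets.

```python
import random

def split_dataset(examples, val_frac=0.10, test_frac=0.10, seed=42):
    """Shuffle and split examples into training/validation/test subsets."""
    rng = random.Random(seed)
    shuffled = examples[:]
    rng.shuffle(shuffled)
    n_val = int(len(shuffled) * val_frac)
    n_test = int(len(shuffled) * test_frac)
    validation = shuffled[:n_val]
    test = shuffled[n_val:n_val + n_test]
    training = shuffled[n_val + n_test:]
    return training, validation, test

train, val, test = split_dataset(list(range(100)))
print(len(train), len(val), len(test))  # 80 10 10
```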

### Save the dataset

You’ve now created a dataset ready to be used in the platform. Click Use in new experiment and the Experiment wizard will pop up.

## Design a text binary classification model

We’ll now go over the Experiment wizard tab by tab.

• Dataset tab.
The platform selects the correct subsets by default.

• Training uses 80% of the available examples.

• Validation uses the 10% validation examples to check how the model generalizes to unseen examples.

• Inputs / target tab.

• Input column: make sure that only Summary is selected.

• Target column: make sure that Science Fiction is selected.

• Problem type tab.
The platform recommends the appropriate Problem type based on the input and target features.
Make sure the Problem type is Single-label text classification, since we want to classify text into two possible categories (Science Fiction or not).

## Inspect the BERT model

Click Create to build the model in the Modeling canvas. A BERT-based model will appear in the Modeling canvas.

BERT pushed the state of the art in Natural Language Processing (NLP) by combining two powerful technologies:

• It is based on a deep Transformer network, a type of network that can efficiently process long texts by using attention.

• It is bidirectional, meaning that it takes into account the whole text passage to understand the meaning of each word.

### Shape of last Dense block is 1

Note that the last Dense block must output a tensor of shape 1. This matches the shape of the Science Fiction feature, which is the target feature that the model learns.
You can find the shape of features in the Datasets view.

### Sequence length

Sequence length is the number of tokens (roughly the number of words) kept during text tokenization.
If there are fewer tokens in an example than indicated by this parameter, the text will be padded. If there are more tokens, the text will be truncated, i.e., cut from the end to fit the sequence length.
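The padding/truncation rule can be sketched in a few lines of Python. This operates on token IDs rather than raw text, and the function name and pad ID are illustrative assumptions:

```python
def fit_to_sequence_length(token_ids, sequence_length=512, pad_id=0):
    """Truncate from the end, or pad, so the sequence has a fixed length."""
    if len(token_ids) >= sequence_length:
        return token_ids[:sequence_length]  # too long: cut from the end
    # too short: pad up to the sequence length
    return token_ids + [pad_id] * (sequence_length - len(token_ids))

print(fit_to_sequence_length([5, 6, 7], sequence_length=5))        # [5, 6, 7, 0, 0]
print(fit_to_sequence_length([1, 2, 3, 4, 5, 6], sequence_length=5))  # [1, 2, 3, 4, 5]
```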

Smaller sequences compute faster, but as we saw in the Datasets view many summaries have 500 words or more. To avoid ignoring parts of the summaries that could potentially be rich in information, we’ll use the maximum sequence length of 512 tokens.

Click on the Multilingual BERT block in the modeling canvas. Set the Sequence length to 512.

## Check experiment settings & run the experiment

Click the Settings tab and check that:

• Batch size is set to 6. A larger batch size would run out of memory. For text processing models, we recommend keeping the product Sequence length × Batch size around 3 000 to avoid running out of memory.

• Epochs is set to 2. BERT models are already pretrained, and a delicate fine-tuning generally gives the best results.

Click Run.

The training will take some time since BERT is a very large and complex model.
Expect about 20 minutes of training time per epoch.

## Analyzing the first experiment

Navigate to the Evaluation view and watch the model train.

While you wait for at least the first epoch to finish, you can read more about the loss and metrics, or grab some fika (a Swedish coffee and cake break).

### Binary accuracy

To evaluate the performance of the model, you can look at the Binary accuracy by clicking on its name under the plot.

Binary accuracy gives the percentage of predictions that are correct. It should be about 85-90% by the end of training.
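As a concrete illustration, binary accuracy can be computed like this, assuming the model outputs a probability per example and the default decision threshold of 0.5 (the function name and data are illustrative):

```python
def binary_accuracy(probabilities, labels, threshold=0.5):
    """Fraction of thresholded predictions that match the true label."""
    predictions = [1 if p >= threshold else 0 for p in probabilities]
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)

probs = [0.9, 0.2, 0.7, 0.4]   # model outputs
labels = [1, 0, 0, 0]          # 1 = Science Fiction, 0 = not
print(binary_accuracy(probs, labels))  # 0.75
```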

### Precision

The precision gives the proportion of positive predictions, i.e., examples classified as Science Fiction, that are actually correct.

### Recall

The recall gives the proportion of positive examples, i.e., actual Science Fiction texts, that the model identifies.
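The two metrics can be computed from the same counts of true positives, false positives, and false negatives. A minimal sketch (names and data are illustrative):

```python
def precision_recall(predictions, labels):
    """Precision and recall for the positive (Science Fiction) class."""
    pairs = list(zip(predictions, labels))
    tp = sum(1 for p, y in pairs if p == 1 and y == 1)  # true positives
    fp = sum(1 for p, y in pairs if p == 1 and y == 0)  # false positives
    fn = sum(1 for p, y in pairs if p == 0 and y == 1)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

preds  = [1, 1, 0, 1, 0]
labels = [1, 0, 0, 1, 1]
print(precision_recall(preds, labels))
```

Here two of the three positive predictions are correct (precision 2/3), and two of the three actual Science Fiction texts are found (recall 2/3).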

### Predictions inspection

After at least the first epoch has finished, you can use the predictions inspection to see the confusion matrix and the predictions for individual examples.

#### Confusion matrix

The confusion matrix shows how often examples from each actual category are classified into each predicted category. Correct predictions fall on the diagonal.

Figure 4. The confusion matrix shows the frequency of every predicted/actual category combination.

#### ROC curve

Since the problem is a binary classification problem, the ROC curve (Receiver Operating Characteristic curve) will also be shown.

The ROC curve is a convenient way to see how good the model is overall. The closer the ROC curve passes to the top left corner, the better the model is performing.

It also lets you see how changing the threshold will affect the recall and the true negative rate.

Figure 5. The ROC Curve shows how many positive examples are correctly identified, as a function of how many negative examples are wrongly classified as positive. The threshold determines the point on the ROC curve where the model operates.
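The points of a ROC curve come from sweeping the decision threshold and recording, at each value, the true positive rate and false positive rate. A minimal sketch of that idea (the function name, threshold grid, and data are illustrative assumptions):

```python
def roc_points(probabilities, labels, thresholds=None):
    """(false positive rate, true positive rate) at each threshold."""
    if thresholds is None:
        thresholds = [i / 10 for i in range(11)]
    positives = sum(labels)
    negatives = len(labels) - positives
    points = []
    for t in thresholds:
        tp = sum(1 for p, y in zip(probabilities, labels) if p >= t and y == 1)
        fp = sum(1 for p, y in zip(probabilities, labels) if p >= t and y == 0)
        points.append((fp / negatives, tp / positives))
    return points

# A perfect separator hits the top left corner (0.0, 1.0) at threshold 0.5:
print(roc_points([0.9, 0.1], [1, 0], thresholds=[0.0, 0.5, 1.0]))
```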

## Deploy the trained model

In the Evaluation view, click Create deployment.

1. Select the Experiment that you want to deploy for use in production.
In this tutorial we only trained one model, so there is only one experiment in the list, but if you train more models with different tweaks, they will become available for deployment.

2. Select the Checkpoint marked with (best), since this is when the model had the best performance.
The platform creates a checkpoint after every epoch of training. This is useful since performance can sometimes get worse when a model is trained for too many epochs.

3. Click Create to create the new deployment from the selected experiment and checkpoint.

4. Click Enable to deploy the experiment.

## Test the text classifier in a browser

Let’s test your model. Click the Open web app button, and you’ll open the Deployment web app.

Now, write your own summary, copy the example below, or simply copy a recent summary from, e.g., Amazon:

Example:
Harry Potter has never been the star of a Quidditch team, scoring points while riding a broom far above the ground. He knows no spells, has never helped to hatch a dragon, and has never worn a cloak of invisibility.

All he knows is a miserable life with the Dursleys, his horrible aunt and uncle, and their abominable son, Dudley -- a great big swollen spoiled bully. Harry's room is a tiny closet at the foot of the stairs, and he hasn't had a birthday party in eleven years.

But all that is about to change when a mysterious letter arrives by owl messenger: a letter with an invitation to an incredible place that Harry — and anyone who reads about him — will find unforgettable.

Click Get your prediction to get a result.

Figure 6. Example result. The prediction of 18%, which is close to 0, indicates that the model predicts that Harry Potter is not a science fiction book. Potter fans agree.
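Besides the web app, a deployment can also be called programmatically over HTTP. The sketch below uses only the Python standard library; the URL and token are placeholders you copy from your Deployment view, and the payload shape (`rows` / `Summary`) is an assumption for illustration, so check your deployment’s API page for the exact format.

```python
import json
import urllib.request

DEPLOYMENT_URL = "https://<your-deployment-url>"  # copy from the Deployment view
TOKEN = "<your-deployment-token>"                 # copy from the Deployment view

def build_request(summary):
    """Build the JSON payload; the exact format is an assumption."""
    return {"rows": [{"Summary": summary}]}

def classify_summary(summary, url=DEPLOYMENT_URL, token=TOKEN):
    """POST a book summary to the deployed model and return the parsed JSON."""
    data = json.dumps(build_request(summary)).encode("utf-8")
    request = urllib.request.Request(
        url,
        data=data,
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```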

## Tutorial recap

Congrats, in this tutorial you’ve learned how to build, train and deploy a model on the Peltarion Platform to analyze text data. A complete end-to-end data science project!

It wasn’t that hard, right?

This is, of course, only a demo of what you can do with language data on the platform. Knowing whether a book is sci-fi or not is just for fun. However, this tutorial is an example of how deep learning can automate tasks much faster, and often more consistently, than manual work. Similar business cases just need some extra thinking.

## Next steps

Navigate back to the Modeling view, and click on Iterate.

• Continue training lets you train the same model for more epochs to see if that improves the result.
Try increasing the batch size or reducing the learning rate to see if performance improves.

• Reuse part of model creates a new experiment with a single block that contains the model you just trained. This is useful for building another model around the current one.