Book genre classification
Solve a text classification problem with BERT
Bookstores rarely split them apart, but we at Peltarion argue that fantasy and science fiction are clearly different things. To make the point, we decided to create an AI model that classifies the genre of a book based solely on its summary.
In this tutorial, we’ll show you how to use the Peltarion Platform to build a model on your own and correct all major bookstores in your country!
Target audience: Beginners
Before following this tutorial, we strongly recommend that you complete the tutorial Deploy an operational AI model, to get a feel for how the platform works.
If you want, it can also be a good idea to read about word embeddings, an important concept in NLP (Natural Language Processing). For an introduction and an overview of different types of word embeddings, check out the links below:
The problem - sci-fi or centaurs?
Text classification aims to assign text, e.g., tweets, messages, or reviews, to one or more categories. Such a category can be whether or not a book is considered science fiction.
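To see why this is harder than it looks, consider the simplest possible classifier: keyword matching. The sketch below (plain Python; the keyword list is an illustrative assumption, not part of this tutorial's dataset) labels a summary as science fiction if it contains any of a handful of tell-tale words:

```python
# A naive keyword baseline for "science fiction or not".
# The keyword list is a made-up illustration, not derived from the dataset.
SCIFI_KEYWORDS = {"spaceship", "galaxy", "android", "alien", "terraform"}

def keyword_baseline(summary: str) -> int:
    """Return 1 (science fiction) if any keyword appears, else 0."""
    words = {w.strip(".,!?").lower() for w in summary.split()}
    return int(bool(words & SCIFI_KEYWORDS))
```

A baseline like this misses paraphrases ("an interstellar craft") and context entirely, which is why a contextual model such as BERT performs so much better on this task.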
Goal with this experiment
You will learn how to build and deploy a model based on BERT.
BERT pushed the state of the art in Natural Language Processing (NLP) by combining two powerful technologies:
It is based on a deep Transformer network, a type of network that can efficiently process long texts by using attention.
It is bidirectional, meaning that it takes the whole text passage into account when working out the meaning of each word.
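The attention mechanism at the heart of the Transformer can be sketched in a few lines. This is not the platform's implementation, just the standard scaled dot-product formula, softmax(QK\u1d40/\u221ad)V, written out in plain Python for a single query over toy-sized vectors:

```python
import math

def attention(q, ks, vs):
    """Scaled dot-product attention for a single query vector.
    q: query vector; ks, vs: lists of key/value vectors (toy sizes)."""
    d = len(q)
    # Similarity of the query to every position, scaled by sqrt(d)
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in ks]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]  # softmax: how much each position matters
    # Weighted sum of the values -- every position contributes, in both directions,
    # which is what makes the resulting representation bidirectional.
    out = [sum(w * v[i] for w, v in zip(weights, vs)) for i in range(len(vs[0]))]
    return out, weights
```

Because the weights are computed over all positions at once, a word's representation can draw on context both before and after it.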
Dataset - CMU book summary dataset
The content and format of the dataset
The data comes from the CMU Book Summary Dataset, a dataset of over 16 000 book summaries. For this project we needed science fiction labels, so we preprocessed the data for our task: it now contains book summaries along with an associated binary category, science fiction or not.
The dataset is intended to serve as a benchmark for text classification. The overall distribution of labels is balanced, i.e., there are approximately 2 500 science fiction and 2 500 non-science fiction book summaries. Each row stores one summary along with its science fiction label.
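The preprocessing described above can be sketched roughly as follows. The function names, field layout, and per-class size are assumptions for illustration; they are not the exact script used to build the tutorial dataset:

```python
import random

def make_balanced_dataset(books, n_per_class=2500, seed=0):
    """books: list of (summary, genres) pairs.
    Returns a balanced, shuffled list of (summary, label) pairs,
    where label 1 means the book is tagged science fiction.
    Sizes and field names are illustrative assumptions."""
    pos = [(s, 1) for s, g in books if "Science Fiction" in g]
    neg = [(s, 0) for s, g in books if "Science Fiction" not in g]
    rng = random.Random(seed)
    rng.shuffle(pos)
    rng.shuffle(neg)
    n = min(n_per_class, len(pos), len(neg))  # balance the two classes
    data = pos[:n] + neg[:n]
    rng.shuffle(data)
    return data
```

Capping both classes at the same size is what gives the balanced label distribution mentioned above.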
Create a project
Let's begin! First, create a project and name it so you know what kind of project it is. Naming is important!
Navigate to the Datasets view and click New dataset
Locate the Import data from URL box and copy-paste the link to the dataset below:
Click the arrow to start the import.
When done, name the dataset Book Summaries and click Done.
The labels are binary: 1 indicates that the book is classified as science fiction, and 0 that it is not.
Click the Summary Cropped column and set:
Encoding to Text (Beta)
Sequence length to 512
Language model to English BERT uncased
Sequence length corresponds to the expected number of words in each summary sample. If the actual number of words in a sample is fewer than indicated by this parameter, the text will be padded. If it is longer, the text will be truncated, i.e., cut from the end to fit the sequence.
The BERT block accepts any sequence length between 3 and 512. Shorter sequences compute faster but may cut words from your text. You want to set this value as low as possible to reduce training time and memory use, but still large enough to include sufficient information.
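The padding and truncation behaviour described above can be illustrated with a small sketch. The platform handles this internally; the pad token name here is just the conventional one:

```python
PAD = "[PAD]"  # conventional BERT padding token, shown for illustration

def fit_to_length(tokens, seq_len=512, pad=PAD):
    """Pad with a pad token, or truncate from the end, so the result
    is exactly seq_len tokens long."""
    if len(tokens) >= seq_len:
        return tokens[:seq_len]                      # truncate: cut from the end
    return tokens + [pad] * (seq_len - len(tokens))  # pad to full length
```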
The Language model should match the language of the input data, in this case, English BERT uncased. Read here for more information about this parameter.
Subsets of the dataset
In the top right corner, you’ll see the subsets. All samples in the dataset are by default split into 20% validation and 80% training subsets. Keep these default values in this project.
Save the dataset
You’ve now created a dataset ready to be used in the platform. Click Save version and then click Use in new model and the Experiment wizard will pop up.
BERT - Design a text classification model
Define dataset tab
Make sure that the Book Summaries dataset is selected in the Experiment wizard.
Choose snippet tab
Click on the Choose snippet tab and select the BERT English uncased snippet.
Set the input feature to Summary Cropped.
Set the target feature to Science Fiction.
The BERT English uncased snippet includes the whole BERT network. The BERT block implements the base version of BERT: 12 encoding layers from a Transformer network, each with 12 attention heads, for a total of 110 million parameters. The snippet lets you use this massive network with weights pre-trained for general language understanding.
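The 110 million figure can be sanity-checked with some back-of-the-envelope arithmetic using the standard published BERT-base hyperparameters (hidden size 768, feed-forward size 3 072, vocabulary of 30 522 WordPiece tokens):

```python
# Approximate parameter count for BERT-base from its standard hyperparameters.
hidden, ffn, vocab, max_pos, layers = 768, 3072, 30522, 512, 12

embeddings = vocab * hidden + max_pos * hidden + 2 * hidden   # token + position + segment
attention = 4 * (hidden * hidden + hidden)                    # Q, K, V and output projections
feed_forward = (hidden * ffn + ffn) + (ffn * hidden + hidden)
per_layer = attention + feed_forward + 4 * hidden             # + two layer norms
total = embeddings + layers * per_layer                       # pooler etc. omitted, so slightly low
print(f"{total / 1e6:.0f}M parameters")                       # prints: 109M parameters
```

The estimate lands at roughly 109 million, in line with the commonly quoted 110 million once the pooler and embedding layer norm are included.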
The BERT snippet includes:
An Input block.
A BERT Encoder block with pre-trained weights, which gives BERT a general understanding of English. The BERT Encoder block looks at the input sequence as a whole, producing an output that contains an understanding of the sentence. The block outputs a single vector of size 768.
A label block with pre-trained weights.
A Dense block that is untrained.
A Target block.
Initialize weights tab
Click on the Initialize weights tab. Select the pre-trained weights English Wikipedia and BookCorpus and make all weights trainable by checking the Weights trainable (all blocks) box.
Click Create and the prepopulated BERT model will appear in the Modeling canvas.
Check experiment settings & run the experiment
Click the Settings tab and check that:
Batch size is 6. If you set a larger batch size, you may run out of memory since the sequence length is set to 512.
Epochs is 2. Training takes a long time, so don't train for too long the first time, while you're checking whether your model is any good.
Learning rate is 0.000001 (five zeros), to avoid catastrophic forgetting.
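A quick bit of arithmetic shows what these settings mean for training time. Assuming the dataset sizes from earlier sections (~5 000 summaries, 80% of which go to training):

```python
# Rough training-time arithmetic under the settings above.
# The dataset size (~5 000 summaries) and 80/20 split come from earlier sections.
n_samples = 5000
train_samples = int(n_samples * 0.8)   # 80% training subset
batch_size = 6
epochs = 2

steps_per_epoch = train_samples // batch_size
total_steps = steps_per_epoch * epochs
print(steps_per_epoch, total_steps)    # prints: 666 1332
```

So even at only 2 epochs, the model runs well over a thousand gradient steps, each over 512-token sequences, which is why training takes a while.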
Analyzing the first experiment
Navigate to the Evaluation view and watch the model train. The training will take quite a long time since BERT is a very large and complex model.
Go and grab some fika (a Swedish coffee and cake break) while you wait.
The training loss will decrease for each epoch, but the evaluation loss may start to increase. This means that the model is starting to overfit to the training data.
You can read more about the loss metrics here.
To evaluate the performance of the model, you can look at overall accuracy, which is displayed in the Experiment info section to the right. It should be approximately 85-90%.
Since the model solves a classification problem, a confusion matrix is displayed. The top-left to bottom-right diagonal shows correct predictions; everything outside this diagonal is an error.
The recall per class corresponds to the percentage values in the confusion matrix diagonal. You can display the same metric by hovering over the horizontal bars to the right of the confusion matrix. You can also view the precision per class by hovering over the vertical bars on top of the confusion matrix.
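These metrics are simple to compute from the matrix itself. As a sketch (using the common layout of rows = actual class, columns = predicted class; check the Evaluation view's own labels for its exact orientation):

```python
def recall_precision(cm):
    """cm: square confusion matrix, rows = actual class, columns = predicted class.
    Returns per-class (recalls, precisions)."""
    n = len(cm)
    # Recall: correct predictions for a class divided by its row total (all actual)
    recalls = [cm[i][i] / sum(cm[i]) for i in range(n)]
    # Precision: correct predictions divided by the column total (all predicted)
    col_sums = [sum(cm[r][c] for r in range(n)) for c in range(n)]
    precisions = [cm[i][i] / col_sums[i] for i in range(n)]
    return recalls, precisions
```

For example, with 40 of 50 non-sci-fi books classified correctly but 5 sci-fi books misclassified as non-sci-fi, recall for the first class is 40/50 = 0.8 while its precision is 40/45.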
Now that we know the accuracy of the model based on validation data, let’s deploy the model and try it with some new data.
Create new deployment
In the Deployment view click New deployment.
Select the experiment and checkpoint of your trained model to test it for predictions, or enable it for production calls. Both the best epoch and the last epoch of each trained experiment are available for deployment.
Click the Enable switch to deploy the experiment.
Test the text classifier in a browser
Let's test your model. Click the Test deployment button to open the Text classifier API tester with all relevant data copied from your deployment.
Now, write your own summary, copy the example below, or simply copy a recent summary from, e.g., Amazon:
Harry Potter has never been the star of a Quidditch team, scoring points while riding a broom far above the ground. He knows no spells, has never helped to hatch a dragon, and has never worn a cloak of invisibility.
All he knows is a miserable life with the Dursleys, his horrible aunt and uncle, and their abominable son, Dudley -- a great big swollen spoiled bully. Harry's room is a tiny closet at the foot of the stairs, and he hasn't had a birthday party in eleven years.
But all that is about to change when a mysterious letter arrives by owl messenger: a letter with an invitation to an incredible place that Harry — and anyone who reads about him — will find unforgettable.
Click Play to get a result.
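You can also call the deployment programmatically. The sketch below uses only the standard library; the URL, token, request shape, and feature name are all placeholders, assumptions for illustration. Copy the real values from your deployment's API information page:

```python
import json
import urllib.request

# Placeholders -- replace with the URL, token, and input feature name
# shown on your own deployment's API information page.
DEPLOYMENT_URL = "https://your-instance.example.com/deployment/your-deployment-id"
API_TOKEN = "your-api-token"

def build_payload(summary: str) -> dict:
    """Wrap one summary in the JSON shape assumed here: one row, one input feature."""
    return {"rows": [{"Summary Cropped": summary}]}

def classify(summary: str) -> dict:
    """POST a summary to the deployment and return the parsed JSON response."""
    req = urllib.request.Request(
        DEPLOYMENT_URL,
        data=json.dumps(build_payload(summary)).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {API_TOKEN}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

The exact request and response format may differ from this sketch, so treat it as a starting point and verify against the API tester's output.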
Tutorial recap and next steps
Congrats, you've completed an end-to-end data science project! In this tutorial you've learned how to build, train, and deploy a model on the Peltarion Platform to analyze text data.
It was not that hard, right?
A good next step could be to run the project for more epochs and see if that improves the result, or perhaps to change the learning rate.
This is, of course, only a demo of what you can do with language data on the platform. Knowing whether a book is sci-fi or not is just fun, but this tutorial shows how deep learning can be used for task automation that is far faster and more accurate than manual labor. Similar business cases just need some extra thinking.