Book genre classification
Solve a text classification problem with BERT
Bookstores rarely split them apart, but we at Peltarion argue that fantasy and science fiction are clearly different things. To make the point, we decided to create an AI model that classifies the genre of a book based solely on its summary.
In this tutorial, we’ll show you how to use the Peltarion Platform to build a model on your own and correct all major bookstores in your country!
- Target audience: Beginners
Before following this tutorial, it is strongly recommended that you complete the tutorial Deploy an operational AI model, just to get a feel for how the platform works.
If you want, it could be a good idea to read about word embeddings and BERT’s attention mechanism, which are important concepts in NLP (Natural Language Processing). For introductory and overview material, check out the links below:
The problem - sci-fi or centaurs?
Text classification aims to assign text, e.g., tweets, messages, or reviews, to one or multiple categories. One such category could be whether or not a book is considered science fiction.
Goal of this experiment
You will learn how to build and deploy a model based on BERT.
BERT pushed the state of the art in Natural Language Processing (NLP) by combining two powerful technologies:
It is based on a deep Transformer network, a type of network that can efficiently process long texts by using attention.
It is bidirectional, meaning that it takes into account the whole text passage to understand the meaning of each word.
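To make the idea of attention a bit more concrete, here is a minimal sketch of scaled dot-product self-attention in plain Python. It is only an illustration: the toy vectors are made up, and a real BERT layer adds learned projections and multiple attention heads, which are left out here. Note how every token is scored against all other tokens, to its left and to its right, which is what makes the model bidirectional.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    highest = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors):
    """Scaled dot-product self-attention with queries = keys = values.

    Simplification: real BERT layers apply learned projections and use
    several attention heads. Both are omitted in this sketch.
    """
    dim = len(vectors[0])
    outputs = []
    for query in vectors:
        # Score this token against every position, left and right alike.
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in vectors]
        weights = softmax(scores)
        # The output is a weighted mix of all input vectors.
        outputs.append([sum(w * v[i] for w, v in zip(weights, vectors))
                        for i in range(dim)])
    return outputs

# Toy 2-dimensional "embeddings" for three tokens (made-up numbers).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
mixed = self_attention(tokens)
```

Each row of `mixed` is a blend of all three input vectors, weighted by how strongly the corresponding token attends to the others.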
Dataset - CMU book summary dataset
The content and format of the dataset
The data comes from the CMU Book Summary Dataset, a dataset of over 16 000 book summaries. For this project, we wanted science fiction book summaries, so we preprocessed the data to fit our task: each book summary comes with its associated binary category, Science Fiction or not.
The preprocessed dataset is set up as a binary genre classification task. The overall distribution of labels is balanced, i.e., there are approximately 2 500 science fiction and 2 500 non-science fiction book summaries. Each row stores a summary in one column, together with a science fiction classification of either "yes" or "no".
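If you want to check the label balance yourself before importing the file, a small script like the one below could do it. This is a sketch under assumptions: the inline CSV is a made-up stand-in for the real file, and the column names and label values are guessed from the feature names used in this tutorial, so adjust them to match the actual file.

```python
import csv
import io
from collections import Counter

# Hypothetical miniature stand-in for the real CSV file.
SAMPLE_CSV = """SummaryCropped,Science Fiction
"A crew wakes from cryosleep orbiting a dying star.",yes
"A detective investigates a murder in Victorian London.",no
"Colonists on Mars discover an ancient signal.",yes
"Two sisters inherit a vineyard in Tuscany.",no
"""

def label_distribution(csv_text, label_column="Science Fiction"):
    """Count how many rows carry each label value."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return Counter(row[label_column] for row in reader)

counts = label_distribution(SAMPLE_CSV)
print(counts)
```

A balanced dataset shows roughly equal counts for both labels, which makes binary accuracy a meaningful metric later on.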
Create a project
First, create a project and name it so you know what kind of project it is. Naming is important!
Add the data
After creating the project, you will be taken to the Datasets view, where you can import data.
Click on URL import. Paste in this URL: https://storage.googleapis.com/bucket-8732/datalibrary/sci-fi-or-not.csv and click on Import.
The URL points to a CSV-formatted file with the data.
Change the Dataset name to Book Summaries.
When uploading has finished, click Done.
Inside a dataset, you can switch views by clicking on the Features or Table button.
Verify that the default feature encoding is correct:
The SummaryCropped feature should use the Text encoding.
The Science Fiction feature should use the Binary encoding, and use 1 as the Positive class.
This tells the platform that 1 means an example is Science Fiction, and 0 means it isn’t Science Fiction.
If a feature uses the wrong settings, click the wrench icon to change it.
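Conceptually, a binary encoding with a chosen positive class boils down to the mapping sketched below. The function name is hypothetical and the platform does this for you, so the code is only meant to show what the Positive class setting means.

```python
def encode_binary(labels, positive_class):
    """Map the positive class to 1 and every other value to 0."""
    return [1 if label == positive_class else 0 for label in labels]

# Raw values as they might appear in the label column (made-up sample).
raw_labels = ["1", "0", "0", "1"]
encoded = encode_binary(raw_labels, positive_class="1")
print(encoded)  # [1, 0, 0, 1]
```

Picking the wrong positive class would not break training, but it would flip the meaning of precision and recall in the evaluation.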
Text word count
Note the Word count histogram of the SummaryCropped feature. This histogram shows how many examples of text have a certain length, given in number of words.
If you hover over this histogram, you will see that many examples have over 500 words.
BERT models can process at most 512 tokens (roughly equivalent to words) per example.
While we usually want to train models using fewer tokens to speed up calculations, in this case we will remember to use as many tokens as possible when we design the model, so that our text summaries get truncated as little as possible.
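The check that the histogram lets you do by eye can be sketched in a few lines. The example texts are synthetic, and counting whitespace-separated words is only a rough proxy: BERT splits rare words into several sub-word tokens, so the real token count is usually somewhat higher than the word count.

```python
def word_count(text):
    """Count whitespace-separated words, a rough proxy for tokens."""
    return len(text.split())

def share_over_limit(texts, limit=512):
    """Fraction of texts whose word count exceeds the limit."""
    if not texts:
        return 0.0
    too_long = sum(1 for text in texts if word_count(text) > limit)
    return too_long / len(texts)

# Synthetic summaries of 600, 100, 513 and 512 words.
summaries = ["word " * 600, "word " * 100, "word " * 513, "word " * 512]
print(share_over_limit(summaries))  # 0.5
```

The larger this share is, the more text gets cut off, which is why a long sequence length is worth its cost in this project.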
Subsets of the dataset
On the left side of the Datasets view, you’ll see the subsets. All samples in the dataset are by default split into 20% validation and 80% training subsets. Keep these default values in this project.
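For intuition, an 80/20 split can be sketched as a shuffle followed by a cut. This is not how the platform is implemented (we do not make any claim about that), just a minimal illustration with a hypothetical helper and a fixed seed so the result is reproducible.

```python
import random

def split_train_validation(rows, validation_fraction=0.2, seed=0):
    """Shuffle the rows and cut off a validation subset."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_validation = round(len(shuffled) * validation_fraction)
    return shuffled[n_validation:], shuffled[:n_validation]

# Row indices stand in for the roughly 5 000 book summaries.
train, validation = split_train_validation(range(5000))
print(len(train), len(validation))  # 4000 1000
```

Shuffling before cutting matters: if the file happened to be sorted by label, an unshuffled cut would give a validation subset with only one class.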
Save the dataset
You’ve now created a dataset ready to be used in the platform. Click Save version and then click Use in new experiment and the Experiment wizard will pop up.
BERT - Design a text binary classification model
The Experiment wizard opens to help you set up your training experiment and to recommend snippets as prebuilt models.
We’ll now go over the Experiment wizard tab by tab.
The platform selects the correct subsets by default.
Training uses 80% of the available examples.
Validation uses the remaining 20% of the examples to check how the model generalizes to unseen examples.
Input(s) / target tab
Input(s) column: make sure that only SummaryCropped () is selected.
Target column: make sure that Science Fiction (1) is selected.
The platform recommends the appropriate Problem type and snippets based on the input and target features.
Make sure the Problem type is Binary text classification, since we want to classify text into two possible categories (Science Fiction or not).
Select the English BERT uncased snippet.
All the training examples in the dataset are written in English, and we plan to only ever use the model with English text.
If you want to make the model work with text in any language, you can create another experiment later on and use the Multilingual BERT cased snippet instead.
You won’t necessarily need training data for every language that you plan to use. See for example the Classify text in any language tutorial.
Click Create to build the model in the Modeling canvas.
Note that the last Dense block must output a tensor whose shape matches the shape of the Science Fiction feature, which is the target feature that the model learns.
You can find the shape of features in the Datasets view.
Sequence length is the number of tokens (roughly the number of words) kept during text tokenization.
If there are fewer tokens in an example than indicated by this parameter, the text will be padded. If there are more tokens, the text will be truncated, i.e., cut from the end to fit the sequence length.
Smaller sequences compute faster, but as we saw in the Datasets view many summaries have 500 words or more. To avoid ignoring parts of the summaries that could potentially be rich in information, we’ll use the maximum sequence length of 512 tokens.
Click on the Tokenizer block in the modeling canvas. Set the Sequence length to 512.
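The padding and truncation behaviour described above can be sketched as follows. The helper name and the padding id of 0 are assumptions for illustration, and the special tokens that a real BERT tokenizer adds at the start and end of each sequence are left out.

```python
def fit_to_length(token_ids, sequence_length, pad_id=0):
    """Truncate from the end, or pad on the right, to a fixed length."""
    kept = token_ids[:sequence_length]
    return kept + [pad_id] * (sequence_length - len(kept))

print(fit_to_length([7, 8, 9], 5))           # [7, 8, 9, 0, 0]
print(fit_to_length([1, 2, 3, 4, 5, 6], 4))  # [1, 2, 3, 4]
```

With a sequence length of 512, a 600-token summary loses its last 88 tokens, while a 100-token summary is padded with 412 padding tokens.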
Check experiment settings & run the experiment
Click the Settings tab and check the following:
Batch size is set to 6. A larger batch size would run out of memory. For text processing models, we recommend keeping the product Sequence length x Batch size at about 3000 or below to avoid running out of memory.
Epochs is set to 2. BERT models are already pretrained, and light fine-tuning generally gives the best results.
The training will take some time since BERT is a very large and complex model.
Expect about 20 minutes of training time per epoch.
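A quick back-of-the-envelope calculation shows what these settings mean in practice. The numbers are taken from this tutorial (roughly 5 000 summaries, an 80% training subset, batch size 6, sequence length 512, 2 epochs, about 20 minutes per epoch). The helper name is made up, and the results are estimates only.

```python
import math

def steps_per_epoch(n_training_examples, batch_size):
    """Number of weight updates needed to see every training example once."""
    return math.ceil(n_training_examples / batch_size)

n_examples = 5000                    # approximate dataset size
n_training = n_examples * 80 // 100  # 80% training subset
batch_size = 6
sequence_length = 512
epochs = 2
minutes_per_epoch = 20               # rough figure quoted above

print(steps_per_epoch(n_training, batch_size))  # 667
print(sequence_length * batch_size)             # 3072
print(epochs * minutes_per_epoch)               # 40
```

So one run amounts to roughly 667 batches per epoch, about 3 000 tokens per batch, and around 40 minutes of training in total.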
Analyzing the first experiment
Navigate to the Evaluation view and watch the model train.
To evaluate the performance of the model, you can look at the Binary accuracy by clicking on its name under the plot.
Binary accuracy gives the percentage of predictions that are correct. It should be about 85-90% by the end of training.
The precision gives the proportion of positive predictions, i.e., examples classified as Science Fiction, that were actually correct.
The recall gives the proportion of positive examples, i.e., actual Science Fiction texts, that are identified by the model.
The confusion matrix shows how often examples are classified correctly and how often they are mistaken for the other category. Correct predictions fall on the diagonal.
Since the problem is a binary classification problem, the ROC curve will also be shown.
The ROC curve is a nice way to see how good the model generally is. The closer the ROC curve passes to the top left corner, the better the model is performing.
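The accuracy, precision, recall and confusion matrix described above can all be derived from four counts. Below is a minimal sketch with made-up labels and predicted probabilities. The threshold of 0.5 for turning a probability into a class is an assumption, and the function names are ours, not the platform’s.

```python
def confusion_counts(y_true, y_pred):
    """Return (true positives, false positives, false negatives, true negatives)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn

def binary_metrics(y_true, probabilities, threshold=0.5):
    """Compute accuracy, precision and recall at a given threshold."""
    y_pred = [1 if p >= threshold else 0 for p in probabilities]
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }

# Made-up ground truth (1 = Science Fiction) and model outputs.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
probabilities = [0.9, 0.8, 0.3, 0.2, 0.6, 0.1, 0.7, 0.4]
metrics = binary_metrics(y_true, probabilities)
print(metrics)  # accuracy, precision and recall are all 0.75 here
```

Sweeping the threshold from 0 to 1 and recording the true positive rate against the false positive rate at each step is what produces the ROC curve.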
Deploy your trained experiment
In the Deployment view click New deployment.
Select the Experiment that you want to deploy for use in production.
In this tutorial we only trained one model so there is only one experiment in the list, but if you train more models with different tweaks, they will become available for deployment.
Select the Checkpoint marked with (best), since this is when the model had the best performance.
The platform creates a checkpoint after every epoch of training. This is useful since performance can sometimes get worse when a model is trained for too many epochs.
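Picking the best checkpoint can be thought of as choosing the epoch with the best validation result. The sketch below assumes that "best" means lowest validation loss, which is our assumption for illustration and not a statement about how the platform ranks checkpoints. The loss values are made up.

```python
def best_checkpoint(validation_losses):
    """Return the 1-based epoch number with the lowest validation loss."""
    best_index = min(range(len(validation_losses)),
                     key=lambda i: validation_losses[i])
    return best_index + 1

# Hypothetical validation loss after each of three epochs.
losses = [0.41, 0.33, 0.38]
print(best_checkpoint(losses))  # 2
```

In this made-up run the third epoch is slightly worse than the second, which is exactly the situation where keeping a checkpoint per epoch pays off.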
Click Create to create the new deployment from the selected experiment and checkpoint.
Click on Enable to deploy the experiment.
Test the text classifier in a browser
Let’s test your model. Click the Test deployment button, and you’ll open the Text classifier API tester with all relevant data copied from your deployment.
Now, write your own summary, copy the example below, or simply copy a recent summary from, e.g., Amazon:
Harry Potter has never been the star of a Quidditch team, scoring points while riding a broom far above the ground. He knows no spells, has never helped to hatch a dragon, and has never worn a cloak of invisibility.
All he knows is a miserable life with the Dursleys, his horrible aunt and uncle, and their abominable son, Dudley -- a great big swollen spoiled bully. Harry's room is a tiny closet at the foot of the stairs, and he hasn't had a birthday party in eleven years.
But all that is about to change when a mysterious letter arrives by owl messenger: a letter with an invitation to an incredible place that Harry — and anyone who reads about him — will find unforgettable.
Click Play to get a result.
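If you would rather call the deployment from code than from the browser, the request could be prepared roughly as sketched below. Everything specific here is an assumption: the URL and token are placeholders, and the JSON layout (a list of rows keyed by feature name) and the bearer-token header are our guess at a typical setup. Copy the real URL, token and request format from your deployment page. The code only builds the request and does not send it.

```python
import json
import urllib.request

def build_prediction_request(url, token, summary,
                             feature_name="SummaryCropped"):
    """Prepare (but do not send) a POST request with one text example.

    Assumption: the payload layout and the Authorization scheme are
    illustrative. Check your deployment page for the real format.
    """
    payload = {"rows": [{feature_name: summary}]}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )

request = build_prediction_request(
    "https://example.com/deployment",  # placeholder URL
    "YOUR_TOKEN",                      # placeholder token
    "Harry Potter has never been the star of a Quidditch team.",
)
# To actually send it: urllib.request.urlopen(request)
```

Building the request separately from sending it makes it easy to inspect the payload before any data leaves your machine.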
Tutorial recap and next steps
Congrats, in this tutorial you’ve learned how to build, train and deploy a model on the Peltarion Platform to analyze text data. A complete end-to-end data science project!
It wasn’t that hard, right?
As a next step, you could run the experiment for more epochs, or change the learning rate, and see if that improves the result.
This is, of course, only a demo of what you can do with language data on the platform. Knowing if a book is sci-fi or not is just fun. However, this tutorial is an example of how deep learning can be used for task automation that is way faster and more accurate than any manual labor. Business cases similar to this just need some extra thinking.