Writing style tutor
Text similarity end-to-end project
Since we are heading into the Nobel Prize season, we decided to create an app to figure out which classic author’s writing style you mirror (if any!). We found tons of free books on Project Gutenberg by Wilde, Nietzsche, Ibsen, Voltaire, Austen, Thoreau, Brontë, Tolstoy, and more, so we built an app that identifies our colleagues’ author styles (to check if it’s anything but Wilde!).
In this tutorial, we’ll show you how to use the Peltarion Platform to build the model on your own and figure out which great author you were in your past life!
- Target audience: Intermediate users
Test the web app first
First, get into the mood with hot chocolate and a candle … then open the app author-style.demo.peltarion.com and write a few off-the-top-of-your-head sentences (ok, if creativity is zero it’s perfectly fine to copy something from Google). Then, click Analyze text and check your writing. Surprised? Honored? Terrified?
Next, let’s see how you can build this Author style predictor web app on your own.
You will learn to
Build, train, and deploy a BERT model in the Peltarion Platform.
Create a web app that can identify which author a piece of text could have been written by.
The data – 100 most downloaded ebooks
The data comes from Project Gutenberg, a fantastic library of over 60,000 free ebooks. There, you’ll find the world’s greatest literature, with a focus on older works for which U.S. copyright has expired.
For this project, we wanted a dataset with popular books, so we chose to build the dataset with Project Gutenberg’s top 100 most downloaded ebooks over the last 30 days.
Download our ready-made dataset with books
It’s a bit cumbersome to get the data out of Project Gutenberg, so we provide a ready-made dataset for you. Download the dataset here.
Create your own preprocessed dataset
If you want to preprocess the data yourself, cumbersome as it is, it helps to know how we did it. Below, we outline our process.
If you have any questions, don’t hesitate to reach out to us at email@example.com.
Using all 60,000 books is impractical. Just downloading them would take a long time, since the plaintext files add up to over 50 GB, and it is frankly unnecessary, since BERT is already pretrained on a large corpus of books. Therefore, we decided to download only the books we really wanted.
Acquiring the data
We downloaded the books using Project Gutenberg’s instructions and this Python package. We also used this pre-downloaded list of metadata to be able to remove all books that had any of the following characteristics:
Not in English. There is a Multilingual BERT snippet that can handle multilingual datasets, but in this tutorial we’ll stay with English-language texts (including English translations).
A non-public domain license.
Not listed with a clear author, because we wanted famous authors.
Covering NSFW (not safe for work) topics, because we don’t want this kind of content in the data (for example, we manually removed the Kama Sutra, which sat between 40th and 50th place in the number of downloads ranking).
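The four exclusion rules above can be sketched as a simple filter. This is only an illustration: the field names (`language`, `rights`, `author`, `nsfw`) and the sample records are hypothetical and do not reflect the real metadata format, and the `nsfw` flag stands in for the manual review we did.

```python
# Hypothetical metadata records. The field names are assumptions made for
# this sketch, not the actual Project Gutenberg metadata schema.
books = [
    {"title": "Pride and Prejudice", "author": "Austen, Jane",
     "language": "en", "rights": "Public domain in the USA.", "nsfw": False},
    {"title": "Also sprach Zarathustra", "author": "Nietzsche, Friedrich",
     "language": "de", "rights": "Public domain in the USA.", "nsfw": False},
    {"title": "Some Copyrighted Book", "author": "Doe, Jane",
     "language": "en", "rights": "Copyrighted.", "nsfw": False},
    {"title": "Beowulf", "author": "",
     "language": "en", "rights": "Public domain in the USA.", "nsfw": False},
    {"title": "The Kama Sutra of Vatsyayana", "author": "Vatsyayana",
     "language": "en", "rights": "Public domain in the USA.", "nsfw": True},
]


def keep_book(book):
    """Return True only if the book passes all four exclusion rules."""
    if book["language"] != "en":
        return False  # not in English
    if not book["rights"].lower().startswith("public domain"):
        return False  # non-public domain license
    if not book["author"]:
        return False  # no clear author
    if book["nsfw"]:
        return False  # manually flagged as NSFW
    return True


selected = [book for book in books if keep_book(book)]
print([book["title"] for book in selected])
```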
Preprocessing for BERT – from whole books to rows in a table
BERT on the platform accepts short pieces of text as individual data examples (roughly 100 words or shorter), which means that each book had to be split into many blocks of text. We used a sentence tokenizer (an algorithm that splits text into sentences) from the NLTK Python library.
Merge small sentences
Since very short sentences are unlikely to work well as input to BERT, we used a simple rule of thumb: any sentence shorter than 100 characters is merged with the sentence that follows it in the same book.
We then rearranged the data into a pandas DataFrame with one row per sentence, annotating each row with the author’s name and book title.
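Here is a minimal sketch of the merging step and the row layout, using only the standard library. It assumes the sentences of one book are already in a list (for example, the output of NLTK’s `sent_tokenize`), and it interprets the rule of thumb as "keep merging until the block reaches 100 characters", which is our reading rather than a documented detail. The `title` column name is also an assumption; a list of dicts like this can be passed directly to `pandas.DataFrame`.

```python
MIN_CHARS = 100  # threshold from the rule of thumb above


def merge_short_sentences(sentences, min_chars=MIN_CHARS):
    """Merge each too-short sentence with the following one(s).

    Sentences are accumulated in a buffer until the buffer is at least
    `min_chars` long. A short leftover at the end of the book is kept as is.
    """
    blocks = []
    buffer = ""
    for sentence in sentences:
        buffer = f"{buffer} {sentence}".strip()
        if len(buffer) >= min_chars:
            blocks.append(buffer)
            buffer = ""
    if buffer:
        blocks.append(buffer)
    return blocks


def book_to_rows(author, title, sentences, min_chars=MIN_CHARS):
    """Turn one book into table rows annotated with author and title."""
    return [
        {"author": author, "title": title, "sentence": block}
        for block in merge_short_sentences(sentences, min_chars)
    ]


# Made-up sentences, with a low threshold so the effect is easy to see.
demo = ["Yes.", "She said nothing more that evening.", "He left at dawn."]
for row in book_to_rows("Example Author", "Example Title", demo, min_chars=30):
    print(row)
```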
Prepare for classification
To keep the data size small, we sampled only a small number of rows from the dataset we built. We chose an 80/20 training-validation split with at most 1,500 training examples (and 375 validation examples) per author. For some authors, we had less data than that, in which case we used all the rows available for that author.
We then saved the dataset as a csv file, using a new column to indicate if that row was included in the training or the validation subset.
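The sampling and split could look roughly like the sketch below. The column name `subset` and the values `training` and `validation` are placeholders we picked for illustration, and the fixed random seed is there only to make the example repeatable.

```python
import csv
import random
from collections import defaultdict

MAX_TRAIN = 1500  # at most 1,500 training examples per author
MAX_VAL = 375     # at most 375 validation examples per author


def assign_subsets(rows, max_train=MAX_TRAIN, max_val=MAX_VAL, seed=0):
    """Sample rows per author and label each one as training or validation."""
    rng = random.Random(seed)
    by_author = defaultdict(list)
    for row in rows:
        by_author[row["author"]].append(row)

    sampled = []
    for author_rows in by_author.values():
        shuffled = author_rows[:]  # copy, so the input list is left untouched
        rng.shuffle(shuffled)
        shuffled = shuffled[: max_train + max_val]
        # 80% of the kept rows go to training (integer arithmetic: 4/5).
        n_train = min(max_train, len(shuffled) * 4 // 5)
        for i, row in enumerate(shuffled):
            label = "training" if i < n_train else "validation"
            sampled.append({**row, "subset": label})
    return sampled


def write_dataset(rows, path):
    """Save the labeled rows as a csv file with a header line."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)
```

Calling `write_dataset(assign_subsets(rows), "books.csv")` would then produce a file with one extra column that the platform can filter on.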
Build a model on the platform
It’s time to use the Peltarion Platform. Open the platform here: https://platform.peltarion.com/.
If you don’t have an account yet, here’s where you sign up. It’s all free and you get instant access.
Import the books-dataset to the platform
Create a project on the platform by clicking New project, and navigate to the Datasets view.
Click Choose files and pick the Most_downloaded_public_domain_books_20191114.csv file (or your own preprocessed dataset, if you made one).
Click Done when the upload has finished, to close the dataset dialog.
Check the author feature and make sure the Encoding is Categorical.
Check the sentence feature and make sure the Encoding is Text.
You can click on the wrench icon to edit the encoding of each feature.
During preprocessing, we marked whether each sentence should be used for training or for validation. We’ll now use this information to create custom subsets on the platform.
Create a validation subset
Click New subset and name the subset Author validation.
Click Add conditional filter and set:
Operator: Is equal to
Click Create subset
Create a training subset
Click New subset and name the subset Author training.
Click Add conditional filter and set:
Operator: Is equal to
Click Create subset
Save the dataset
Click on Save version to save the dataset, then click on Use in new experiment.
Load a BERT snippet
The Experiment wizard opens to help you set up your training experiment and to recommend snippets as prebuilt models.
We’ll now go over the Experiment wizard tab by tab.
Set the Training subset to Author training.
Set the Validation subset to Author validation.
Make sure that sentence () is the only feature selected in the Input(s) list.
Make sure that author (63) is the feature selected in the Target list. That’s the feature we want the model to learn to predict.
The platform should automatically detect that the Problem type is Single-label text classification.
This means that we want to classify text examples based on their author. It’s single-label because a piece of text can only have been written by a single author.
Select English BERT uncased in the list of Recommended snippets.
If you want to go further and work with international authors, you could use the Multilingual BERT cased snippet instead (this snippet will also work with English).
Initialize weights tab
In the Initialize weights tab, tick the box Weights trainable (all blocks). The BERT snippet allows you to use this massive network with pretrained weights. You don’t have to build and train it yourself. BIG win!
Click Create and the Modeling canvas will appear and contain a prebuilt model.
Open the Settings tab and check that:
Batch size is 32. BERT models need a lot of memory to train. A rule of thumb is to keep the product Batch size x Sequence length (defined in the Tokenizer block) below 3000.
Epochs is 2.
Learning rate is 0.00002 (4 zeros).
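If you change the batch size or the sequence length, the rule of thumb above is easy to check with a few lines of arithmetic. The helper names below are our own; only the limit of 3000 comes from the text.

```python
MEMORY_BUDGET = 3000  # rule-of-thumb ceiling for batch size x sequence length


def fits_budget(batch_size, sequence_length, budget=MEMORY_BUDGET):
    """Return True if batch_size * sequence_length stays below the budget."""
    return batch_size * sequence_length < budget


def max_sequence_length(batch_size, budget=MEMORY_BUDGET):
    """Largest sequence length that keeps the product strictly below the budget."""
    return (budget - 1) // batch_size


print(max_sequence_length(32))  # longest sequence length for batch size 32
print(fits_budget(32, 128))     # does a sequence length of 128 fit?
```

With a batch size of 32, the product stays below 3000 for sequence lengths up to 93 (32 x 93 = 2976), so a longer sequence length calls for a smaller batch size.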
Click Run to fire up BERT!
Navigate to the Evaluation view (click the Evaluation icon at the top of the page) and watch the model train.
Because BERT is a very large and complex model, completing one epoch will take a little while. Expect about one hour of training time per epoch.
Enable the deployment
Finally, when the model has been trained, navigate to the Deployment view and create a new deployment. Click Enable to set the deployment live.
Keep this window open. You’ll soon need to copy information from the Deployment view.
Create a web app – use the model
You have now deployed your trained “author style” model, and you can create a web app that uses the model. The idea is that the app will display a simple webpage with a text area and a submit button. The aspiring writer submits some text, clicks the submit button, and gets a response page stating which author could have written the text. Simple, but effective!
Clone and download the author style repository
Everything you need to create the web app is available in the repo demo-author-style. Clone and download this repo.
Set up the configuration first
Create a 'config/' folder.
Create a configuration file based on our example config file: sample-config.json. Name this file app-config.json (it must be this name).
Save your new configuration file in the 'config/' folder. We have added the 'config/' folder to .gitignore, so it’s safe to put config files there.
Copy and paste deployment URL and token
Go to the Deployment view of your project and find the deployment’s URL and Token. Copy and paste the deployment URL and token into app-config.json.
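To give a feel for what the web app does with these two values, here is a rough Python sketch that builds (but does not send) a prediction request. The request shape is an assumption on our part: a JSON body with a `rows` list keyed by the input feature name `sentence`, and the token passed as a bearer token. Check the Deployment view for the exact format your deployment expects. The URL and token below are placeholders.

```python
import json
import urllib.request


def build_prediction_request(deployment_url, token, text):
    """Build a POST request carrying one piece of text to classify.

    Assumed format: {"rows": [{"sentence": "<text>"}]} with a bearer token.
    """
    payload = {"rows": [{"sentence": text}]}
    return urllib.request.Request(
        deployment_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_prediction_request(
    "https://example.com/your-deployment-url",  # placeholder URL
    "YOUR_TOKEN",                               # placeholder token
    "It was a dark and stormy night.",
)
# Sending is left out on purpose; urllib.request.urlopen(request) would
# perform the actual call.
print(request.get_method(), request.full_url)
```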
Start the app with npm
In a terminal, navigate to the demo-author-style repository directory.
Run (if needed):
(this will use app-config.json, the config file you created previously)
Test the classifier in a browser
Open a browser and enter the address http://127.0.0.1:3000.
Write your own text, or copy and paste some into the web app, and hit the submit button. Which author are you?
Share your results
We encourage you to play around with the app, change its scope and share your results. Maybe you have a use case you care for (or simply something funny)? Try to build it and share it with us @Peltarion. We would love that!
Congrats, you’re a data scientist now! And a writer! Even a Nobel Prize nominee (well, soon to be)! In this tutorial you’ve learned how to build, train and deploy a model on the Peltarion Platform, and created a web app to figure out which author you write like. A complete end-to-end data science project!
Next steps and linked ideas
As a next step, try running the experiment for more epochs, or changing the learning rate, and see if that improves the result.
This is, of course, only a demo of what you can do with BERT on the platform and how you can build a demo app. If you want to dig more into BERT, you should do our BERT movie review sentiment analysis tutorial or Build a besserwisser bot for Slack with BERT.
Knowing your writing style may not be very helpful, just fun, but ideas from this project can be used to find out if your company communicates in a nice, kind, evil, hostile, fun, or super boring way, which can actually be really useful. More ideas on what you can do with BERT include: joke classifier, novel categorization, political sentiment, and the list goes on and on.
The Peltarion Platform makes it easy to realize these ideas, but remember, it’s key to have good data.