Writing style tutor
Figure out which of the old classic writing styles you mirror (if any!)
We found tons of books in Project Gutenberg that are available for free (Wilde, Nietzsche, Ibsen, Voltaire, Austen, Thoreau, Brontë, Tolstoy), so we built an app that identifies our colleagues’ author styles (to check if it’s anything but Wilde!).
In this tutorial, we’ll show you how to use the Peltarion Platform to build the model on your own and figure out which great author you were in your past life!
- Target audience: Intermediate users
You will learn to
- Build, train, and deploy a BERT model in the Peltarion Platform.
- Create a web app that can identify which author a piece of text could have been written by.
Test the web app first
First, get into the mood with hot chocolate and a candle … then open the app author-style.demo.peltarion.com and write a few off-the-top-of-your-head sentences (ok, if creativity is zero it’s perfectly fine to copy something from Google). Then, click Analyze text and check your writing. Surprised? Honored? Terrified?
Next, let’s see how you can build this Author style predictor web app on your own.
The data – 100 most downloaded ebooks
The data comes from Project Gutenberg, a fantastic library of over 60,000 free ebooks. There, you’ll find the world’s greatest literature, with a focus on older works for which U.S. copyright has expired.
For this project, we wanted a dataset with popular books, so we chose to build the dataset with Project Gutenberg’s top 100 most downloaded ebooks over the last 30 days.
Create your own preprocessed dataset
What we do here to get the dataset is a bit cumbersome so we’ve prepared a ready-made dataset for you. It’s available in the Platform Data library and we’ll show you how to access the library in a bit.
But… if you want to try to preprocess the data yourself, even if it is cumbersome, it can be helpful to know how we did it. Below, we’ll outline our process.
If you have any questions, don’t hesitate to reach out to us at email@example.com.
Using all 60,000 books is impractical. Just downloading them would take a long time, since the plaintext files add up to over 50 GB, and it is frankly unnecessary, since BERT is already trained on a large corpus of books. Therefore, we decided to download only the books we really wanted.
Acquiring the data
We downloaded the books using Project Gutenberg’s instructions and this Python package. We also used this pre-downloaded list of metadata to be able to remove all books that had any of the following characteristics:
- Not in English. There is a Multilingual BERT block that can handle multilingual datasets, but in this tutorial we’ll stay with English-language texts.
- A non-public domain license.
- Not listed with a clear author, because we wanted famous authors.
- Covering NSFW (not safe for work) topics, because we don’t want this kind of content in the data (for example, we manually removed the Kama Sutra, which sat between 40th and 50th place in the download ranking).
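The four filters above can be sketched as a single check on one book's metadata. This is a minimal sketch, not necessarily the script behind the ready-made dataset: the field names (`language`, `rights`, `author`, `title`), the "public domain" rights prefix, and the `NSFW_TITLES` blocklist are all assumptions made for illustration, so adapt them to the metadata file you actually use.

```python
# Hypothetical manual blocklist; the exact title string is an assumption.
NSFW_TITLES = {"Kama Sutra"}


def keep_book(meta):
    """Return True if a book's metadata passes all four filters."""
    language_ok = meta.get("language") == "en"
    # Assumed convention: public-domain books have a rights string
    # that starts with "Public domain".
    rights_ok = (meta.get("rights") or "").lower().startswith("public domain")
    author_ok = bool((meta.get("author") or "").strip())
    sfw_ok = meta.get("title") not in NSFW_TITLES
    return language_ok and rights_ok and author_ok and sfw_ok


# Made-up metadata records, one per filter, plus one that passes.
books = [
    {"title": "Emma", "author": "Austen, Jane", "language": "en",
     "rights": "Public domain in the USA."},
    {"title": "Candide", "author": "Voltaire", "language": "fr",
     "rights": "Public domain in the USA."},
    {"title": "Beowulf", "author": None, "language": "en",
     "rights": "Public domain in the USA."},
    {"title": "Kama Sutra", "author": "Vatsyayana", "language": "en",
     "rights": "Public domain in the USA."},
    {"title": "Some Modern Book", "author": "Doe, Jane", "language": "en",
     "rights": "Copyrighted."},
]

kept = [book["title"] for book in books if keep_book(book)]
print(kept)
```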
Preprocessing for BERT – from whole books to rows in a table
BERT on the platform accepts short pieces of text as individual data examples (roughly 100 words or shorter), which means that each book had to be split into many blocks of text. We used a sentence tokenizer (an algorithm that splits text into sentences) from the NLTK Python library.
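To give a feel for what sentence splitting does, here is a minimal sketch that uses a naive regular expression as a stand-in for the NLTK sentence tokenizer (`nltk.sent_tokenize`). The stand-in is our simplification so the example runs without extra downloads. It will break on abbreviations such as "Mr." that a real tokenizer handles, so treat it as an illustration only.

```python
import re


def naive_sentence_split(text):
    """Split text after '.', '!' or '?' when followed by whitespace.

    A crude stand-in for a real sentence tokenizer: it does not
    understand abbreviations, quotes, or ellipses.
    """
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


sentences = naive_sentence_split("It was late. Who knocked? Nobody!")
print(sentences)
```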
Merge small sentences
Since very short sentences are unlikely to work well as input to BERT, we used the simple rule of thumb of merging sentences shorter than 100 characters with the following one within the same book.
We then rearranged the data into a pandas DataFrame with one row per sentence, annotating each row with the author’s name and book title.
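The merging rule and the row layout can be sketched as follows. This is one possible reading of the rule of thumb, under the assumption that merging continues until a block reaches 100 characters and that a short leftover at the end of a book is kept as it is. The function name and the example book are hypothetical. The rows are plain dictionaries here; passing the list to `pandas.DataFrame(rows)` would give the table described above.

```python
def merge_short_sentences(sentences, min_chars=100):
    """Merge each too-short sentence with the one(s) that follow it."""
    merged = []
    buffer = ""
    for sentence in sentences:
        # Append the next sentence to whatever is still too short.
        buffer = f"{buffer} {sentence}".strip()
        if len(buffer) >= min_chars:
            merged.append(buffer)
            buffer = ""
    if buffer:
        # Assumption: keep a short leftover at the end of the book.
        merged.append(buffer)
    return merged


# Made-up sentences from a single book; merging never crosses books.
book_sentences = ["Short one.", "x" * 120, "Tiny."]
blocks = merge_short_sentences(book_sentences)

# One row per block, annotated with author and title.
rows = [
    {"author": "Austen", "title": "Emma", "sentence": block}
    for block in blocks
]
print(len(rows))
```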
Prepare for classification
To keep the data size small, we sampled only a small number of rows from the dataset we built. We chose an 80/20 training-validation split with at most 1,500 training examples (and 375 validation examples) per author. For some authors, we had less data than that, in which case we used all of it.
We then saved the dataset as a CSV file, using a new column to indicate whether each row was included in the training or the validation subset.
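The sampling and split described above could look roughly like this. It is a sketch under stated assumptions: the function name, the fixed random seed, and the row layout are ours, while the `subset` column with the values `train` and `val` matches what the platform steps later in this tutorial expect.

```python
import random
from collections import Counter, defaultdict


def assign_subsets(rows, max_train=1500, max_val=375, seed=0):
    """Sample rows per author and label each one 'train' or 'val' (80/20)."""
    rng = random.Random(seed)
    by_author = defaultdict(list)
    for row in rows:
        by_author[row["author"]].append(row)

    labelled = []
    for author_rows in by_author.values():
        sample = list(author_rows)
        rng.shuffle(sample)
        # Cap the sample; authors with less data keep everything they have.
        sample = sample[: max_train + max_val]
        n_train = min(max_train, len(sample) * 4 // 5)
        for index, row in enumerate(sample):
            subset = "train" if index < n_train else "val"
            labelled.append({**row, "subset": subset})
    return labelled


# Made-up data: one author with plenty of rows, one with very few.
rows = [{"author": "Austen", "sentence": f"a{i}"} for i in range(3000)]
rows += [{"author": "Wilde", "sentence": f"w{i}"} for i in range(10)]

labelled = assign_subsets(rows)
counts = Counter((row["author"], row["subset"]) for row in labelled)
print(counts)
```

To save the result, `csv.DictWriter` from the standard library (or `DataFrame.to_csv` if you work in pandas) can write the labelled rows to a CSV file.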
Build a model on the platform
It’s time to use the Peltarion Platform. Open the platform here: https://platform.peltarion.com/.
If you don’t have an account yet, here’s where you sign up. It’s all free and you get instant access.
Import the books-dataset to the platform
Create a project on the platform by clicking New project, and navigate to the Datasets view.
Click Import free datasets and select the Most downloaded books - tutorial data (or click Choose files and select your own preprocessed dataset, if you made one).
This will import the dataset into your project, and you can now edit it.
Inside a dataset, you can switch views by clicking on the Features or Table button.
Verify that the default feature encoding is correct:
Check the author feature and make sure the Encoding is Categorical.
Check the sentence feature and make sure the Encoding is Text.
You can click on the wrench icon to edit the encoding of each feature.
During preprocessing, we marked whether each sentence should be used for training or for validation. We’ll now use this information to create custom subsets on the platform.
If you haven’t used the advanced settings before, click on Show advanced settings to see the dataset’s Versions and Subsets.
Click New subset and name the subset Custom data split.
We don’t want to filter out any data from the dataset, we just want to organize it. So click Next to get to the Split subset tab.
Set the Type of split to Categorical to split according to the feature we’ve marked.
Select subset as the Feature. This is the column that marks each row as training or validation.
Set the Category to train for the Training split.
Set the Category to val for the Validation split.
Click Create to create the split.
Save the dataset
You’ve now created a dataset ready to be used in the platform.
Click Save version and then click Use in new experiment.
The Experiment wizard opens to help you set up your training experiment.
We’ll now go over the Experiment wizard tab by tab.
Set the Split to Custom data split.
Inputs / Target tab
Make sure that sentence is the only feature selected in the Inputs list.
Select author in the Target list. That’s the feature we want the model to learn to predict.
Problem type tab
The platform should automatically detect that the Problem type is Single-label text classification.
This means that we want to classify text examples based on their author. It’s single-label because a piece of text can only have been written by a single author.
Click Create and the Modeling canvas will appear and contain a prebuilt model.
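To make "single-label" concrete: a classifier of this kind typically produces one score per author, turns the scores into probabilities that sum to 1 with a softmax, and reports the single most probable author. The sketch below illustrates that idea with made-up scores. It is not the platform's internal code, and the function names are hypothetical.

```python
import math


def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(score - peak) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]


def predict_author(scores_by_author):
    """Return the most probable author and all class probabilities."""
    authors = list(scores_by_author)
    probs = softmax([scores_by_author[author] for author in authors])
    best = max(range(len(authors)), key=probs.__getitem__)
    return authors[best], dict(zip(authors, probs))


# Made-up raw scores for one piece of text.
label, probabilities = predict_author(
    {"Austen": 2.0, "Wilde": 0.5, "Tolstoy": -1.0}
)
print(label)
```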
Open the Settings tab and check that:
Batch size is 32.
BERT models need a lot of memory to train. A rule of thumb is to keep the product Batch size × Sequence length (Sequence length is defined in the Multilingual BERT block) below 3000.
Epochs is 2.
Learning rate is 0.00002 (4 zeros).
The small values for Epochs and Learning rate make sure that we only do a gentle fine-tuning of the Multilingual BERT block and avoid catastrophic forgetting.
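The memory rule of thumb above is simple arithmetic, and a small helper makes it easy to check a combination before you hit Run. The helper name and the `budget` parameter are ours; only the limit of 3000 and the batch size of 32 come from the settings above.

```python
def max_sequence_length(batch_size, budget=3000):
    """Largest sequence length that keeps batch_size * length below budget."""
    return (budget - 1) // batch_size


longest = max_sequence_length(32)
print(longest, 32 * longest)
```

With a batch size of 32, this gives a sequence length of at most 93 (32 × 93 = 2976), while 94 would already put the product at 3008.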
Click Run to fire up BERT!
Navigate to the Evaluation view (click the Evaluation icon at the top of the page) and watch the model train.
Because BERT is a very large and complex model, completing one epoch will take a little while. Expect about one hour of training time per epoch.
Enable the deployment
Finally, the model has been trained.
In the Evaluation view click Create deployment.
Select experiment and checkpoint and then click Enable to set the deployment live.
Keep this window open. You’ll soon need to copy information from the Deployment view.
Create a web app – use the model
You have now deployed your trained “author style” model, and you can create a web app that uses the model. The idea is that the app will display a simple webpage with a text area and a submit button. The aspiring writer submits some text, clicks the submit button, and gets a response page stating which author could have written the text. Simple, but effective!
Clone and download the author style repository
Everything you need to create the web app is available in the repo demo-author-style. Clone and download this repo.
Set up the configuration first
Create a 'config/' folder.
Create a configuration file based on our example config file: sample-config.json. Name this file app-config.json (it must be this name).
Save your new configuration file in the 'config/' folder. We have entered the folder 'config/' in .gitignore, so it’s safe to put config files there.
Copy and paste deployment URL and token
Go to the Deployment view of your project and find the deployment’s URL and Token. Copy both and paste them into app-config.json.
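If you want to try the deployment from Python before starting the web app, a request could be built roughly like this. Treat it as a sketch under assumptions: we assume the deployment accepts a JSON body with a `rows` list keyed by the input feature name (`sentence`) and a bearer token in the `Authorization` header, so check the Deployment view for the exact format. The URL and token below are placeholders, and the function name is hypothetical. The example only builds the request and does not send it.

```python
import json
import urllib.request


def build_prediction_request(url, token, text):
    """Build (but do not send) a POST request for one piece of text."""
    # Assumed payload shape: one row per example, keyed by feature name.
    payload = {"rows": [{"sentence": text}]}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Placeholder values; use the URL and Token from your Deployment view.
request = build_prediction_request(
    "https://example.com/deployment",
    "YOUR_TOKEN",
    "It is a truth universally acknowledged.",
)
print(request.get_method(), request.full_url)
```

To send it, pass the request to `urllib.request.urlopen` and parse the response body with `json.load`.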
Start the app with npm
In a terminal, navigate to the demo-author-style repository directory.
Run (if needed):
(This uses app-config.json, the config file you created previously.)
Test the classifier in a browser
Open a browser and enter the address http://127.0.0.1:3000.
Write your own text, or paste some into the web app, and hit the submit button. Which author are you?
Share your results
We encourage you to play around with the app, change its scope and share your results. Maybe you have a use case you care for (or simply something funny)? Try to build it and share it with us @Peltarion. We would love that!
Congrats, you’re a data scientist now! And a writer! Even a Nobel Prize nominee (well, soon to be)! In this tutorial you’ve learned how to build, train and deploy a model on the Peltarion Platform, and created a web app to figure out which author you write like. A complete end-to-end data science project!
Next steps and linked ideas
A next step could be to improve the model’s performance by making small or big changes.
To do that, go back to the Modeling view, and click on Iterate.
Continue training will let you train the same model for more epochs. Try to increase the batch size or to reduce the learning rate to see if performance improves.
Reuse part of model creates a new experiment with a single block that contains the model you just trained. This is useful to build another model around the current one.
To make more modifications to the model, go back to the Modeling view, and click on Duplicate. This will create a copy of your current model that you can edit, but training progress will be lost.
This is, of course, only a demo of what you can do with BERT on the platform and how you can build a demo app. If you want to dig more into BERT, you should do our BERT movie review sentiment analysis tutorial or Build a besserwisser bot for Slack with BERT.
Knowing your writing style may not be very helpful, just fun, but ideas from this project can be used to find out if your company communicates in a nice, kind, evil, hostile, fun, or super boring way, which can actually be really useful. More ideas on what you can do with BERT include: joke classifier, novel categorization, political sentiment, and the list goes on and on.
The Peltarion Platform makes it easy to realize these ideas, but remember, it’s key to have good data.