Writing style tutor
Text similarity end-to-end project
Since we're heading into Nobel Prize season, we decided to create an app that figures out which classic author's writing style you mirror (if any!). We found tons of free books on Project Gutenberg – Wilde, Nietzsche, Ibsen, Voltaire, Austen, Thoreau, Brontë, Tolstoy – so we built an app that identifies our colleagues' author styles (to check if it's anything but Wilde!).
In this tutorial, we’ll show you how to use the Peltarion Platform to build the model on your own and figure out which great author you were in your past life!
Target audience: Intermediate users
Test the web app first
First, get into the mood with hot chocolate and a candle … then open the app author-style.demo.peltarion.com and write a few off-the-top-of-your-head sentences (ok, if creativity is zero it’s perfectly fine to copy something from Google). Then, click Analyze text and check your writing. Surprised? Honored? Terrified?
Next, let’s see if you can build this Author style predictor web app on your own.
You will learn to
Build, train and deploy a BERT model in the Peltarion Platform.
Create a web app that can identify which author a piece of text could have been written by.
The data – 100 most downloaded ebooks
The data comes from Project Gutenberg: a fantastic library of over 60,000 free ebooks. There you'll find the world's greatest literature, with a focus on older works for which U.S. copyright has expired.
For this project, we wanted a dataset with popular books, so we chose to build the dataset with Project Gutenberg’s top 100 most downloaded ebooks over the last 30 days.
Download our ready-made dataset with books
It’s a bit cumbersome to get the data out of the Project Gutenberg project, so we’ll provide a ready-made dataset for you. Download the dataset here.
Create your own preprocessed dataset
If you'd rather preprocess the data yourself, cumbersome as it is, it helps to know how we did it. Below, we outline our process.
If you have any questions, don’t hesitate to reach out to us at firstname.lastname@example.org.
Using all 60,000 books is impractical: just downloading them would take a long time, since the plaintext files add up to over 50 GB, and it's frankly unnecessary, since BERT is already pretrained on a large corpus of books. We therefore decided to download only the books we really wanted.
Acquiring the data
We downloaded the books using Project Gutenberg’s instructions and this Python package. We also used this pre-downloaded list of metadata to be able to remove all books that had any of the following characteristics:
A non-public domain license.
Not listed as in English, because the English BERT model wouldn’t make any sense with other languages.
Not listed with a clear author, because we wanted famous authors.
Covering NSFW (not safe for work) topics, because we don’t want this kind of content in the data (for example, we manually removed the Kama Sutra, which sat between 40th and 50th place in the number of downloads ranking).
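The filtering above can be sketched with pandas. Note that the column names and license strings below are placeholders for illustration; the real Project Gutenberg metadata uses its own field names, and the NSFW step was done by hand.

```python
import pandas as pd

# Hypothetical metadata table; treat these columns as placeholders
# for whatever fields the real Project Gutenberg metadata exposes.
meta = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "rights": ["Public domain in the USA."] * 3 + ["Copyrighted."],
    "language": ["en", "en", "fr", "en"],
    "author": ["Wilde, Oscar", None, "Voltaire", "Austen, Jane"],
})

# Keep only public-domain, English books with a named author.
keep = meta[
    meta["rights"].str.startswith("Public domain")
    & (meta["language"] == "en")
    & meta["author"].notna()
]

print(keep["id"].tolist())  # -> [1]
```

Books covering NSFW topics were removed manually, so that step is not shown here.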
Preprocessing for BERT – from whole books to rows in a table
BERT on the platform accepts short pieces of text as individual data examples (roughly 100 words or shorter), which means that each book had to be split into many blocks of text. We used a sentence tokenizer (an algorithm that splits text into sentences) from the NLTK Python library.
Merge small sentences
Since very short sentences are unlikely to work well as input to BERT, we used the simple rule of thumb of merging sentences shorter than 100 characters with the following one within the same book.
We then rearranged the data into a Pandas DataFrame with one row per sentence, annotating each row with the author’s name and book title.
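The splitting and merging steps above can be sketched as follows. We used `nltk.sent_tokenize` in the real pipeline; a naive regex split stands in here so the sketch has no extra dependencies, and the column names match the features used later on the platform.

```python
import re
import pandas as pd

def merge_short(sentences, min_len=100):
    """Merge sentences shorter than min_len characters with the
    following one, a rough rule of thumb for BERT-sized inputs."""
    merged, buf = [], ""
    for sent in sentences:
        buf = f"{buf} {sent}".strip()
        if len(buf) >= min_len:
            merged.append(buf)
            buf = ""
    if buf:  # keep any trailing short text at the end of the book
        merged.append(buf)
    return merged

def book_to_rows(text, author, title):
    # In the real pipeline this was nltk.sent_tokenize(text).
    sentences = re.split(r"(?<=[.!?])\s+", text)
    return pd.DataFrame(
        {"sentence": merge_short(sentences), "author": author, "title": title}
    )
```

Each book then yields one DataFrame, and concatenating them gives the table with one row per (merged) sentence.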
Prepare for classification
In order to keep the data size small, we sampled only a small number of rows from the dataset we built. We chose an 80/20 training/validation split, with at most 1,500 training examples (and 375 validation examples) per author. For some authors we had less data than that, in which case we used all of it.
We then saved the dataset as a csv file, using a new column to indicate if that row was included in the training or the validation subset.
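A sketch of the per-author sampling and split, assuming the DataFrame built above. The subset labels ("train" / "validation") are illustrative; any pair of values works as long as the conditional filters on the platform match them.

```python
import pandas as pd

def split_per_author(df, n_train=1500, n_val=375, seed=0):
    """Sample at most n_train training and n_val validation rows per
    author, marking each row with its subset in a new column."""
    parts = []
    for _, group in df.groupby("author"):
        group = group.sample(frac=1, random_state=seed)  # shuffle
        # 80/20 split, capped at 1,500/375 rows per author.
        k_train = min(n_train, int(len(group) * 0.8))
        k_val = min(n_val, len(group) - k_train)
        parts.append(group.iloc[:k_train].assign(subset="train"))
        parts.append(
            group.iloc[k_train:k_train + k_val].assign(subset="validation")
        )
    return pd.concat(parts, ignore_index=True)

# dataset = split_per_author(rows)
# dataset.to_csv("books.csv", index=False)
```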
Build a model on the platform
It’s time to use the Peltarion Platform. Open the platform here: https://platform.peltarion.com/.
If you don’t have an account yet, here’s where you sign up. It’s all free and you get instant access.
Import the books dataset to the platform
Create a project on the platform by clicking New project, and navigate to the Datasets view.
Click New dataset and upload the Most_downloaded_public_domain_books_20191114.csv (or your own preprocessed dataset, if you made one).
Click Next, then Done, to close the dataset dialog.
Select the author feature and make sure the Encoding is Categorical.
Select the sentence feature, click the spanner and make sure the Encoding is Text.
Set Sequence length to 100
Set Language model to English BERT uncased.
Delete the existing subsets
Create a validation subset
Click New subset and name the subset Author validation
Click Add conditional filter and set it to match the rows marked as validation in the subset column created during preprocessing:
Operator: Is equal to
Create a training subset
Click New subset and name the subset Author training
Click Add conditional filter and set it to match the rows marked as training in the subset column created during preprocessing:
Operator: Is equal to
Normalize on the Author training subset (there is a dropdown for this under the subsets). If you want to understand the importance normalization can have, check out Impact of standardization – create different versions of a dataset / Example workflow.
Save the dataset
Save this version of the dataset and click Use in new experiment.
Use the BERT snippet
Define dataset tab
Make sure that your newly created dataset is selected in the Experiment wizard (i.e., the Training subset points to Author training and the Validation subset points to Author validation).
Choose snippet tab
Select the BERT English uncased snippet in the Choose snippet tab.
Set Input feature to sentence and Target feature to author.
Initialize weights tab
In the Initialize weights tab, tick the box Weights trainable (all blocks). The BERT snippet allows you to use this massive network with pretrained weights. You don’t have to build and train it yourself. BIG win!
Click Create and the prepopulated BERT model will appear in the Modeling canvas.
Open the Settings tab and check that:
Batch size is 36. Due to the sequence length of 100, the batch size needs to be this small.
Epochs is 2. Don’t run it too long.
Learning rate is 0.000003 (5 zeros). This will allow the model to take small steps to avoid catastrophic forgetting.
Click Run to fire up BERT!
Navigate to the Evaluation view (click the Evaluation icon at the top of the page) and watch the model train. Because BERT is a very large and complex model, training takes a long time, so not much will show up at the beginning.
Enable the deployment
Finally, when the model has been trained, navigate to the Deployment view and create a new deployment. Click Enable to set the deployment live.
Keep this window open. You’ll soon need to copy information from the Deployment view.
Create a web app – use the model
You have now deployed your trained “author style” model, and you can create a web app that uses the model. The idea is that the app will display a simple webpage with a text area and a submit button. The aspiring writer submits some text, clicks the submit button, and gets a response page stating which author could have written the text. Simple, but effective!
Clone and download the author style repository
Everything you need to create the web app is available in the repo demo-author-style. Clone and download this repo.
Set up the configuration first
Create a 'config/' folder.
Create a configuration file based on our example config file: sample-config.json. Name this file app-config.json (it must be this name).
Save your new configuration file in the 'config/' folder. The 'config/' folder is listed in .gitignore, so it's safe to keep config files there.
Copy and paste deployment URL and token
Go to the Deployment view of your project and find the deployment's URL and Token. Copy-paste them into app-config.json.
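If you'd rather query the deployment yourself instead of through the web app, the request can be sketched in Python. The exact wire format is shown in the Deployment view; this sketch assumes Bearer-token auth and a JSON body with a "rows" list, where the feature name matches the sentence input feature.

```python
import json

def build_request(url, token, text):
    """Build the headers and JSON body for querying the deployment."""
    headers = {
        "Authorization": f"Bearer {token}",
        "Content-Type": "application/json",
    }
    payload = {"rows": [{"sentence": text}]}
    return headers, json.dumps(payload)

# With the requests library, the call would then look roughly like:
#   headers, body = build_request(url, token, "It was the best of times.")
#   response = requests.post(url, headers=headers, data=body)
#   print(response.json())  # per-author scores
```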
Start the app with npm
In a terminal, navigate to the demo-author-style repository directory.
Run npm install if needed, then start the app with npm start (it will use app-config.json, the config file you created previously).
Test the classifier in a browser
Open a browser and enter the address http://127.0.0.1:3000.
Write your own text, or copy-paste something into the web app, and hit the submit button. Which author are you?
Share your results
We encourage you to play around with the app, change its scope, and share your results. Maybe you have a use case you care about (or simply something funny)? Build it and share it with us @Peltarion – we would love that!
Congrats, you’re a data scientist now! And a writer! Even a Nobel Prize nominee (well, soon to be)! In this tutorial you’ve learned how to build, train and deploy a model on the Peltarion Platform, and created a web app to figure out which author you write like. A complete end-to-end data science project!
Next steps and linked ideas
A next step could be to train for more epochs and see if that improves the result, or to experiment with the learning rate.
This is, of course, only a demo of what you can do with BERT on the platform and how you can build a demo app. If you want to dig more into BERT, you should do our BERT movie review sentiment analysis tutorial or Build a besserwisser bot for Slack with BERT.
Knowing your writing style may not be very helpful, just fun, but ideas from this project can be used to find out if your company communicates in a nice, kind, evil, hostile, fun, or super boring way, which can actually be really useful. More ideas on what you can do with BERT include: joke classifier, novel categorization, political sentiment, and the list goes on and on.
The Peltarion Platform makes it easy to realize these ideas, but remember, it’s key to have good data.