Author style predictor

Text similarity end-to-end project

Since we are heading into the Nobel Prize season, we decided to create an app that figures out which classic author's writing style you mirror (if any!). We found tons of books on Project Gutenberg that are available for free – Wilde, Nietzsche, Ibsen, Voltaire, Austen, Thoreau, Brontë, Tolstoy – so we built an app that identifies our colleagues’ author styles (to check if it’s anything but Wilde!).

In this tutorial, we’ll show you how to use the Peltarion Platform to build the model on your own and figure out which great author you were in your past life!

Which author are you? Voltaire, Austen, Carroll, Dickens or Brontë?

Test the web app first

First, get into the mood with hot chocolate and a candle ... then open the app at author-style.demo.peltarion.com and write a few off-the-top-of-your-head sentences (OK, if creativity is at zero it's perfectly fine to copy something from Google). Then, click Analyze text and check your writing.
Surprised? Honored? Terrified?

Next, let's see if you can build this Author style predictor web app on your own.

Someone wanted to write the new Anne of Green Gables.

You will learn how to

  • Build, train and deploy a BERT model on the Peltarion Platform.
  • Create a web app that can identify which author a piece of text could have been written by.

The data – 100 most downloaded ebooks

The data comes from Project Gutenberg, a fantastic library of over 60,000 free ebooks. There, you’ll find the world's greatest literature, with a focus on older works for which U.S. copyright has expired.

For this project, we wanted a dataset with popular books, so we chose to build the dataset with Project Gutenberg’s top 100 most downloaded ebooks over the last 30 days.

Download our ready-made dataset with books

It’s a bit cumbersome to get the data out of Project Gutenberg, so we provide a ready-made dataset for you. Download the dataset here.

Create your own preprocessed dataset

If you want to preprocess the data yourself, cumbersome as it is, it can be helpful to know how we did it. Below, we outline our process.
If you have any questions, don’t hesitate to reach out to us at support@peltarion.com.

Using 60,000 books is impractical. Just downloading them would take a long time, since the plaintext files sum to over 50 GB, and it is frankly unnecessary since BERT is already pretrained on a large corpus of books. Therefore, we decided to download only the books we really wanted.

Acquiring the data

We downloaded the books using Project Gutenberg’s instructions and this Python package. We also used this pre-downloaded list of metadata to remove all books that had any of the following characteristics (a small filtering sketch follows the list):

  • A non-public domain license.
  • Not listed as being in English, because the English BERT model wouldn’t make sense for other languages.
  • Not listed with a clear author, because we wanted famous authors.
  • Covering NSFW (not safe for work) topics, because we don’t want this kind of content in the data (for example, we manually removed the Kama Sutra, which ranked between 40th and 50th place by number of downloads).
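
If you load the metadata into a Pandas DataFrame, the filtering boils down to a few boolean masks. Here's a minimal sketch; the file name, the column names (rights, language, author, title) and the exact values are placeholders rather than actual Project Gutenberg fields, so adapt them to your metadata file.

import pandas as pd

# Placeholder file and column names; adapt them to your metadata dump.
metadata = pd.read_csv("gutenberg_metadata.csv")

# Keep only books that are public domain, in English, and have a named author.
keep = (
    (metadata["rights"] == "Public domain in the USA.")
    & (metadata["language"] == "en")
    & metadata["author"].notna()
)
metadata = metadata[keep]

# NSFW titles were removed by hand; here we simply drop one known example.
metadata = metadata[metadata["title"] != "The Kama Sutra of Vatsyayana"]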

Preprocessing for BERT – from whole books to rows in a table

BERT on the platform accepts short pieces of text as individual data examples (roughly 100 words or shorter), which means that each book had to be split into many blocks of text. We used a sentence tokenizer (an algorithm that splits text into sentences) from the NLTK Python library. 
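
As a rough sketch of this step (the file name is just an example, and you would repeat it for every book you kept):

import nltk
from nltk.tokenize import sent_tokenize

nltk.download("punkt")  # the pretrained tokenizer data used by sent_tokenize

# The file name is just an example; loop over every book you kept.
with open("book.txt", encoding="utf-8") as f:
    text = f.read()

sentences = sent_tokenize(text)
print(len(sentences), "sentences found")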

Merge small sentences

Since very short sentences are unlikely to work well as input to BERT, we used the simple rule of thumb of merging sentences shorter than 100 characters with the following one within the same book. 
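
One simple way to implement that rule of thumb, continuing the sketch above, is to keep appending sentences to a buffer until it reaches 100 characters (our exact implementation may differ in the details):

# Keep appending sentences to a buffer until it is at least 100 characters
# long, then start a new block of text.
MIN_CHARS = 100

blocks = []
buffer = ""
for sentence in sentences:  # from the tokenization step above
    buffer = (buffer + " " + sentence).strip()
    if len(buffer) >= MIN_CHARS:
        blocks.append(buffer)
        buffer = ""
if buffer:  # keep whatever is left at the end of the book
    blocks.append(buffer)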

Create Pandas

We then rearranged the data into a Pandas DataFrame with one row per sentence, annotating each row with the author’s name and book title. 
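
Continuing the sketch, and assuming the merged blocks have been collected per book in a dict keyed by (author, title) – a structure we use purely for illustration – the DataFrame can be built like this:

import pandas as pd

# blocks_per_book is a name we use for illustration: a dict mapping
# (author, title) to the list of merged text blocks for that book.
rows = []
for (author, title), book_blocks in blocks_per_book.items():
    for block in book_blocks:
        rows.append({"author": author, "title": title, "sentence": block})

df = pd.DataFrame(rows)
print(df.head())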

Prepare for classification 

In order to keep the data size small, we sampled only a small number of rows from the dataset we built. We chose an 80/20 training-validation split with at most 1,500 training examples (and 375 validation examples) per author. For some authors we had fewer examples than that, in which case we used everything we had.
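
Here's an illustrative sketch of the per-author sampling, continuing from the DataFrame above; the numbers match the ones we used, but the exact sampling code is ours:

# Shuffle each author's rows, then take at most 1,500 for training and
# 375 for validation (an 80/20 split of up to 1,875 rows per author).
train_parts, val_parts = [], []
for author, group in df.groupby("author"):
    group = group.sample(frac=1.0, random_state=42)
    n_train = min(1500, int(len(group) * 0.8))
    n_val = min(375, len(group) - n_train)
    train_parts.append(group.iloc[:n_train])
    val_parts.append(group.iloc[n_train:n_train + n_val])

train_df = pd.concat(train_parts)
val_df = pd.concat(val_parts)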

Save dataset

We then saved the dataset as a CSV file, using a new column to indicate whether each row belongs to the training or the validation subset.
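
And finally, a sketch of this last step; the output file name here is just an example:

# Label each row with its subset and write everything to a single CSV file.
# The values "train" and "val" match the subset filters used on the platform.
dataset = pd.concat([
    train_df.assign(subset="train"),
    val_df.assign(subset="val"),
])
dataset.to_csv("Most_downloaded_public_domain_books.csv", index=False)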

Build a model on the platform

It’s time to use the Peltarion Platform. Open the platform here: https://platform.peltarion.com/.

Sign up
If you don't have an account yet, here's where you sign up. It's all free and you get instant access.

Add the books dataset to the platform

  1. Create a project on the platform by clicking New project, and navigate to the Datasets view.
  2. Click New dataset and upload the Most_downloaded_public_domain_books_20191114.csv (or your own preprocessed dataset, if you made one).
  3. Click Next, then Done, to close the dataset dialog.

Set encodings

  1. Select the author feature and make sure the Encoding is Categorical.
  2. Select the sentence feature, click the spanner and make sure the Encoding is Text.
    Set Sequence length to 100.
    Set Language model to English BERT uncased.

Create subsets

  1. Delete the existing subsets.
  2. Create a validation subset
    Click New subset and name the subset Author validation
    Click Add conditional filter and set:
    Feature: subset
    Operator: Is equal to
    Value: val
  3. Create a training subset
    Click New subset and name the subset Author training
    Click Add conditional filter and set:
    Feature: subset
    Operator: Is equal to
    Value: train
  4. Normalize on the Author training subset (there is a dropdown for this under the subsets). If you want to understand the importance normalization can have, check out Impact of standardization – create different versions of a dataset / Example workflow.

Save the dataset

Save this version of the dataset and navigate to the Modeling view (click the Modeling icon at the top of the page).

Use the BERT snippet

In the Modeling view, click New experiment.

Define dataset tab

In the Experiment wizard, make sure that your newly created dataset is selected, i.e., that the Training subset points to Author training and the Validation subset points to Author validation.

Choose snippet tab

In the Choose snippet tab, select the BERT English uncased snippet.

Set Input feature to sentence and Target feature to author.

Initialize weights tab

In the Initialize weights tab, tick the box Weights trainable (all blocks).
The BERT snippet allows you to use this massive network with pretrained weights. You don’t have to build and train it yourself. BIG win!

Create experiment

Click Create and the prepopulated BERT model will appear in the Modeling canvas.

In the last Dense block, click the Drop pretrained weights button.
Change the number of nodes to 63. That is the number of authors in the dataset.

Open the Settings tab and check that:

  • Batch size is 36. Due to the sequence length of 100, the batch size needs to be this small.
  • Epochs is 2. Don’t run it too long.
  • Learning rate is 0.000003 (5 zeros). This will allow the model to take small steps to avoid catastrophic forgetting.

Run experiment

Click Run to fire up BERT!

Navigate to the Evaluation view (click the Evaluation icon at the top of the page) and watch the model train. Because BERT is a very large and complex model, training takes a long time, so not much will show up in the beginning.

Enable the deployment

Finally, when the model has been trained, navigate to the Deployment view and create a new deployment. Click Enable to set the deployment live. 

Keep this window open. You'll soon need to copy information from the Deployment view.

Create a web app – use the model

You have now deployed your trained “author style” model, and you can create a web app that uses the model. The idea is that the app will display a simple webpage with a text area and a submit button. The aspiring writer submits some text, clicks the submit button, and gets a response page stating which author could have written the text. Simple, but effective! 

Clone and download the author style repository

Everything you need to create the web app is available in the repo demo-author-style. Clone and download this repo.

Set up the configuration first

Create a config/ folder.

Create a configuration file based on our example config file: sample-config.json. Name this file app-config.json (it must be this name).

Save your new configuration file in the config/ folder. We have added config/ to .gitignore, so it's safe to put config files there.

Copy and paste deployment URL and token

Go to the Deployment view of your project and find the deployment's URL and Token. Copy and paste them into app-config.json.

Start the app with npm

In a terminal, navigate to the directory where you cloned the demo-author-style repository. 

Run:

npm install (if needed)

npm start (uses app-config.json, the config file you created previously)

 

Test the classifier in a browser

Open a browser and enter the address http://127.0.0.1:3000.

Write your own text or copy/paste something into the web app, then hit the submit button. Which author are you?

Share your results

We encourage you to play around with the app, change its scope and share your results. Maybe you have a use case you care about (or simply something fun)? Try to build it and share it with us @Peltarion. We would love that!

Recap

Congrats, you're a data scientist now! And a writer! Even a Nobel Prize nominee (well, soon to be)! In this tutorial you’ve learned how to build, train and deploy a model on the Peltarion Platform, and created a web app to figure out which author you write like. A complete end-to-end data science project!

Next steps and linked ideas

A natural next step would be to run the experiment for more epochs and see if that improves the result, or to try a different learning rate.

Linked ideas

This is, of course, only a demo of what you can do with BERT on the platform and how you can build a demo app. If you want to dig more into BERT, you should do our BERT movie review sentiment analysis tutorial or Build a besserwisser bot for Slack with BERT.

Knowing your writing style may not be very helpful, just fun, but the ideas from this project can be used to find out whether your company communicates in a nice, kind, evil, hostile, fun, or super boring way, which can actually be really useful. More ideas on what you can do with BERT include: a joke classifier, novel categorization, political sentiment analysis, and the list goes on and on.

The Peltarion Platform makes it easy to realize these ideas, but remember, it’s key to have good data.