BERT - Text classification / cheat sheet

BERT explained

Use this cheat sheet

This cheat sheet is for you if you want to use BERT and your input data consists of English text with a classification tag.

What is BERT?

BERT is a deep learning model for natural language processing. When it was released in 2018, it outperformed earlier language models on a wide range of language understanding benchmarks.

There are 2 BERT models that you can use in your experiments:

  • English BERT is the original model, pretrained to work exclusively with texts in English. We recommend this model if you work only with English texts.

  • Multilingual BERT is the same model, but pretrained to work with texts in about 100 languages, including English. We recommend this model if you might process texts in various languages.

Example use cases

Sentiment analysis

(😃 or 😱 or 😡 or 😍 or …​)

Try it yourself: HappyDB - happy moments is a corpus of more than 100,000 happy moments.

Ticket classification

Save time and enable a quicker response by classifying incoming customer service requests as a first step, e.g., by topic, or urgency.

Positive vs. negative


Try it yourself: This Quora dataset consists of answers labeled as sincere or insincere.

Data preparation

Dataset requirements

  • CSV file saved in UTF-8.

  • Features (e.g. text, label) must be separated with commas.

  • The first row must contain the feature names, i.e., a heading for each feature column.

  • The text feature will be used in the Input block.

    • Put the text feature inside double quotes "…​".

    • If you have a double quote (") in the text feature, replace it with 2 double-quotes ("").

  • At least one categorical feature to be used as target.

  • A line should be less than 20k characters.
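
The requirements above can be met with Python's standard csv module. The following is a minimal sketch, not an official tool: the column names text and label, the sample row, and the helper names csv_line, rows_to_csv, and save_csv are made up for illustration. csv.QUOTE_ALL wraps every field in double quotes, and the writer doubles any double quote inside a field by default.

```python
import csv
import io

MAX_LINE_CHARS = 20_000  # line limit taken from the dataset requirements


def csv_line(values):
    """Serialize one row with every field quoted and inner quotes doubled."""
    buffer = io.StringIO(newline="")
    writer = csv.writer(buffer, quoting=csv.QUOTE_ALL, lineterminator="\n")
    writer.writerow(values)
    return buffer.getvalue()


def rows_to_csv(rows, header=("text", "label")):
    """Build CSV text: a header row first, then one comma-separated line per row."""
    lines = [csv_line(header)]
    for row in rows:
        line = csv_line(row)
        if len(line) >= MAX_LINE_CHARS:
            raise ValueError("row is too long: %d characters" % len(line))
        lines.append(line)
    return "".join(lines)


def save_csv(path, rows):
    """Write the CSV to disk encoded as UTF-8."""
    with open(path, "w", encoding="utf-8", newline="") as handle:
        handle.write(rows_to_csv(rows))


if __name__ == "__main__":
    # Hypothetical example row containing both a comma and double quotes.
    print(rows_to_csv([('He said "wow", twice.', "happy")]))
```

Quoting every field is stricter than required, but it means commas and quotes inside the text never break the column layout.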

Figure 1. Example CSV.

Requirements for BERT

  • Most text examples should be shorter than 512 tokens. Longer texts are truncated at the end to fit the sequence length specified in the model block.

  • If you use English BERT, text examples should be mostly in English so that the model can understand them.

  • Different text examples can be in different languages if you use Multilingual BERT.

Figure 2. Example of sentiment classification. The training data combines examples from English and French which are easily available. The model predicts the sentiment of a sentence in any language.

Datasets view - Update text feature

In the Datasets view, make sure that:

  • The target feature Encoding is set to Categorical.

  • The input text feature Encoding is set to Text.

Modeling view

Connect an Input block with the text feature to either an English BERT or a Multilingual BERT block.

Check the Sequence length of the BERT block.
The BERT block accepts any integer input size from 3 to 512.
For the best performance, use the smallest size that does not cut off too much of your text (this is difficult to estimate).
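
One rough way to estimate a sequence length is to pick a value that covers most of your texts. The sketch below is only a heuristic under stated assumptions: it counts whitespace-separated words instead of running the real BERT tokenizer, multiplies by an assumed ratio of 1.3 tokens per word (a guess, not a measured value), and assumes the two special tokens [CLS] and [SEP] count toward the sequence length. The function name suggest_sequence_length and its parameters are hypothetical.

```python
import math

BERT_MIN_LEN, BERT_MAX_LEN = 3, 512  # range accepted by the BERT block
SPECIAL_TOKENS = 2  # BERT wraps each text in [CLS] and [SEP]


def suggest_sequence_length(texts, coverage=0.95, tokens_per_word=1.3):
    """Estimate a sequence length that leaves about `coverage` of the texts uncut.

    tokens_per_word is an assumed ratio; real token counts depend on the tokenizer.
    """
    if not texts:
        raise ValueError("need at least one text")
    # Estimated token count per text, sorted from shortest to longest.
    estimates = sorted(
        math.ceil(len(text.split()) * tokens_per_word) + SPECIAL_TOKENS
        for text in texts
    )
    # Index of the text at the requested coverage quantile.
    index = math.ceil(coverage * len(estimates)) - 1
    index = max(0, min(len(estimates) - 1, index))
    # Clamp to the range the BERT block accepts.
    return max(BERT_MIN_LEN, min(BERT_MAX_LEN, estimates[index]))


if __name__ == "__main__":
    reviews = ["Great product", "Arrived late and the box was damaged", "OK"]
    print(suggest_sequence_length(reviews))
```

Treat the result as a starting point and check a few of your longest examples by hand before committing to a value.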

Connect the BERT block to at least one Dense block followed by a Target block to perform the classification.

Input & Target block

Set the Input feature to the text feature in your dataset, e.g., text, tweet, doc, or review.

Set the Target feature to the categorical feature, e.g., sentiment or label.

Figure 3. Example of NLP model using multilingual BERT.

Change experiment settings

Click the Settings tab and change batch size, learning rate, and the number of epochs.

Batch size - as big as possible to train fast

The batch size should be as big as possible, simply because the model then trains faster. If the batch size is too big, the experiment will fail because it runs out of memory.


\[\text{Sequence length} \times \text{Batch size} < 3000 \text{ to } 4000\]
Table 1. Recommended max batch size based on sequence length (credit Google)
Sequence length Recommended max batch size
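
The rule of thumb above can be turned into a small calculation. This is a sketch, not part of the platform: the function name max_batch_size and the token_budget parameter are hypothetical, and the default of 3000 is simply the conservative end of the 3000 to 4000 range. The actual memory limit depends on your hardware.

```python
def max_batch_size(sequence_length, token_budget=3000):
    """Largest batch size with sequence_length * batch_size < token_budget."""
    if not 3 <= sequence_length <= 512:
        raise ValueError("the BERT block accepts sequence lengths from 3 to 512")
    # Strict inequality: subtract 1 before the integer division.
    return max(1, (token_budget - 1) // sequence_length)


if __name__ == "__main__":
    for length in (64, 128, 256, 512):
        low = max_batch_size(length, token_budget=3000)
        high = max_batch_size(length, token_budget=4000)
        print(f"sequence length {length}: batch size between {low} and {high}")
```

If an experiment fails with an out-of-memory error, lower the batch size and try again.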

Epochs - 1 or 2

Set the number of epochs to 1 or 2.
BERT already knows a lot from pretraining, but not quite what you need, so it's good to fine-tune it. Give the model a taste of your data, but don't brainwash it: with more epochs it would just memorize your training set.

Learning rate - 0.000001 (5 zeros)

Set the learning rate to 0.000001 (5 zeros after the decimal point). Take tiny steps so that you don't trample the existing pre-trained weights.

Run the experiment

You’re all set. Hit the Run button.

Evaluation view

If you have a larger sequence length and many examples, the training will take quite a long time since BERT is a very large and complex model. Currently, expect more than four hours for 2 epochs when you use sequence length 512.
Go and grab some fika (a Swedish coffee and cake break) while you wait.

Create deployment

Click New deployment.

Select your experiment and epoch for deployment.

Click the Enable switch to deploy the experiment.

Text Classifier web app

To make it easier to test your model, we have provided a Text Classifier web app.

Open the Text Classifier web app and copy-paste the following information from the Deployment view:

  • Deployment URL

  • Token

  • Input parameter feature name.

Then enter a new text and press the Run button ▶️ to predict!

