BERT - pretrained
The BERT (Bidirectional Encoder Representations from Transformers) network redefines the state of the art for Natural Language Processing (NLP).
The BERT snippet allows you to use this massive network with weights pre-trained to understand text, currently in English.
The BERT English uncased snippet
The BERT snippet includes:
An Input block.
To use the pre-trained weights of this snippet, you must configure the input feature with Encoding: Text (beta), and select Language model: English BERT uncased.
The Sequence length of the text feature can be between 3 and 512 tokens. Smaller sequences compute faster, but the end of your text will be discarded if it doesn’t fit (see the tokenization sketch after this list).
A BERT Encoder block with pre-trained weights.
Two Dense blocks with pre-trained weights.
These blocks are pre-trained on a next-sentence-prediction task, but can be fine-tuned for similar tasks with minimal effort. Resize (or replace) these blocks freely to fit your target.
A Target block, which may be linked to any categorical or numeric feature.
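The snippet itself is assembled graphically in the platform, but as a point of reference, here is a minimal sketch of the equivalent text preprocessing using the Hugging Face transformers library (an illustrative assumption; the bert-base-uncased checkpoint stands in for the English BERT uncased language model). It shows how a fixed Sequence length truncates text that does not fit:

```python
# Illustrative only: tokenizing English text for an uncased BERT model with a
# fixed sequence length, using the Hugging Face "transformers" library.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

encoded = tokenizer(
    "A review that may well be longer than the chosen sequence length.",
    max_length=128,        # any value from 3 to 512 tokens
    truncation=True,       # text beyond max_length is discarded
    padding="max_length",  # shorter texts are padded up to max_length
    return_tensors="pt",
)
print(encoded["input_ids"].shape)  # torch.Size([1, 128])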
How to train the BERT snippet
The weights provided were pre-trained for a specific task, which gives BERT a general understanding of English. You could use these weights as-is and train only the blocks that come after the BERT Encoder block. However, the recommended practice is to fine-tune your entire model, including the BERT Encoder block, for your task.
There is a general procedure for fine-tuning pre-trained snippets.
For BERT models, however, we simply recommend fine-tuning the whole model on your problem:
Set the BERT Encoder block to Trainable, like all the other blocks, set the Learning rate very low, and train the whole model until the results are satisfactory.
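In the platform, this is done by toggling the block’s Trainable setting and lowering the Learning rate in the experiment settings. For readers who prefer code, the following is a minimal sketch of the same idea using the Hugging Face transformers library and PyTorch (the model name, two-class head, and exact learning rate are illustrative assumptions):

```python
# Illustrative only: fine-tune the whole model, encoder included, at a very
# low learning rate. Assumes a binary classification target.
import torch
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)

# Keep every parameter trainable, including the BERT encoder.
for param in model.parameters():
    param.requires_grad = True

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)  # very low learning rate

# One illustrative training step; `batch` would come from your own DataLoader:
# outputs = model(**batch)   # batch includes input_ids, attention_mask, labels
# outputs.loss.backward()
# optimizer.step()
# optimizer.zero_grad()
```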
Memory consumption of BERT
The BERT Encoder is a very large model, which requires a large amount of memory to train.
The memory consumption estimate displayed when using a BERT model is unfortunately not accurate at the moment.
As a rule of thumb, keep the product Batch size * Sequence length lower than 3000 to avoid memory issues.
If an experiment fails because the model requires too much memory, try reducing the Batch size in the experiment’s settings.
You can consider reducing the Sequence length of the input feature as well, as long as this doesn’t remove significant information.
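The rule of thumb above is easy to sanity-check before launching an experiment. The helper below is a sketch of that check only; the limit of 3000 is the rough guideline from this section, not an exact memory estimate:

```python
# Sketch of the rule of thumb only; it does not measure actual memory use.
def within_memory_rule_of_thumb(batch_size: int, sequence_length: int,
                                limit: int = 3000) -> bool:
    """Return True when Batch size * Sequence length stays below the limit."""
    return batch_size * sequence_length < limit

print(within_memory_rule_of_thumb(8, 256))   # True:  8 * 256 = 2048
print(within_memory_rule_of_thumb(16, 512))  # False: 16 * 512 = 8192
```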
Fine-tuning a BERT model
BERT is also a powerful model that can fit most fine-tuning datasets very easily. This makes it prone to overfitting the new dataset, and to catastrophic forgetting of its pre-trained knowledge, when trained with inappropriate settings.
To avoid these issues, train your model with a very low Learning rate, of the order of 10⁻⁵ to 10⁻⁷.
In addition, only train for a few Epochs, between 1 and 5. Good starting values are:
Batch size: 6, when Sequence length is 512
Epochs: 5 or less.
Learning rate: 0.00001 or lower.
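Expressed as code, these starting values would look roughly like the following, here written as Hugging Face TrainingArguments purely as a stand-in (the output directory and exact epoch count are illustrative assumptions; in the platform the same settings live in the experiment’s settings):

```python
# Illustrative only: the recommended starting values expressed as Hugging Face
# TrainingArguments; the platform exposes equivalent settings in its experiment UI.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="bert-finetuned",     # illustrative output path
    per_device_train_batch_size=6,   # Batch size: 6 when Sequence length is 512
    num_train_epochs=3,              # Epochs: 5 or less
    learning_rate=1e-5,              # Learning rate: 0.00001 or lower
)
```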
Reference
Jacob Devlin et al.: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, 2019.