Multilingual BERT cased

The BERT (Bidirectional Encoder Representations from Transformers) network redefines the state of the art for Natural Language Processing (NLP).
The Multilingual BERT cased snippet allows you to quickly get started with your language-based model.

Why use a multilingual model?

A multilingual model allows you to deploy a single model able to work with any of the more than 100 languages it was pre-trained on.

Multilingual training dataset and predictions
Figure 1. Example of sentiment classification. The training data combines examples from English and French, which are easily available. The model predicts the sentiment of a sentence in any language.

More than a simple convenience, multilingual models often perform better than monolingual models.
One reason is that the training data available is generally more limited in any single language. In addition, many languages share common patterns that the model can pick up more easily when it is trained with a variety of languages.

The Multilingual BERT cased snippet

The Multilingual BERT snippet includes:

How to train the BERT snippet

Please note that datasets, machine-learning models, weights, topologies, research papers and other content, including open source software, (collectively referred to as “Content”) provided and/or suggested by Peltarion for use in the Platform and otherwise, may be subject to separate third party terms of use or license terms. You are solely responsible for complying with the applicable terms. Peltarion makes no representations or warranties about Content. You expressly relieve us from any and all liability, loss or risk arising (directly or indirectly) from Your use of any third party content.

The weights provided were pre-trained for a specific task, which gives BERT a general understanding of over 100 languages. You could use these weights as-is and train only the blocks that come after the Multilingual BERT encoder block. However, the recommended practice is to fine-tune your entire model, including the Multilingual BERT encoder block, for your task.

There is a general procedure for fine-tuning pre-trained snippets.
However, for BERT models, we simply recommend fine-tuning the whole model on your problem:

  • Set the Multilingual BERT encoder block as Trainable, like all the other blocks, set the Learning rate very low, and train the whole model until the results are satisfactory.
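On the Platform this is done through the GUI, but the recommended setup can be sketched in plain Python. The block names and the settings dictionary below are illustrative only, not the Platform's actual API:

```python
# Illustrative sketch of the recommended fine-tuning setup: every block,
# including the Multilingual BERT encoder, is marked Trainable, and the
# learning rate is set very low. All names here are hypothetical.
def configure_fine_tuning(block_names, learning_rate=1e-5):
    """Return experiment settings with all blocks set to trainable."""
    return {
        "blocks": {name: {"trainable": True} for name in block_names},
        "learning_rate": learning_rate,
    }

settings = configure_fine_tuning(
    ["Input", "Multilingual BERT encoder", "Dense", "Target"]
)
print(settings["blocks"]["Multilingual BERT encoder"]["trainable"])  # True
print(settings["learning_rate"])  # 1e-05
```

The point of the sketch is simply that no block is frozen: the encoder is trained together with the rest of the model, only much more gently, thanks to the low learning rate.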

Memory consumption of BERT

The Multilingual BERT encoder is a very large model, which requires a large amount of memory to train.

The memory consumption estimate displayed when using a BERT model is unfortunately not accurate at the moment.
As a rule of thumb, keep the product of Batch size and Sequence length below 3000 to avoid memory issues.

If an experiment fails because the model requires too much memory, try reducing the Batch size in the experiment’s settings.
You can consider reducing the Sequence length of the input feature as well, as long as this doesn’t remove significant information.
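The rule of thumb above can be expressed as a small helper. This is only a sketch: the limit of 3000 is the rough guideline from this page, not an exact measurement.

```python
def fits_rule_of_thumb(batch_size, sequence_length, limit=3000):
    """Rule of thumb: keep Batch size * Sequence length below ~3000."""
    return batch_size * sequence_length < limit

# A batch size of 32 with sequence length 128 exceeds the limit
# (32 * 128 = 4096 > 3000), so reduce the batch size or shorten
# the sequences:
print(fits_rule_of_thumb(32, 128))  # False
print(fits_rule_of_thumb(16, 128))  # True  (16 * 128 = 2048)
```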

Fine-tuning a BERT model

BERT is also a very powerful model that can fit most fine-tuning datasets easily. This makes it prone to catastrophic forgetting and to overfitting the new dataset when trained with inappropriate settings.

To avoid these issues, train your model with a very low Learning rate, on the order of 10⁻⁵ to 10⁻⁶.
In addition, train for only a few Epochs, between 1 and 3.
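As a concrete, hypothetical example, experiment settings that follow this guidance might look like the following (the dictionary keys are illustrative, not the Platform's actual configuration format):

```python
# Hypothetical fine-tuning settings following the guidance above:
# a learning rate on the order of 1e-5 to 1e-6, and 1 to 3 epochs.
finetune_settings = {
    "learning_rate": 1e-5,
    "epochs": 2,
}

# Sanity-check that the settings stay inside the recommended ranges:
assert 1e-6 <= finetune_settings["learning_rate"] <= 1e-5
assert 1 <= finetune_settings["epochs"] <= 3
```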

Available weights

The Multilingual BERT encoder block of this snippet uses the BERT-Base, Multilingual Cased weights, pre-trained by the Google AI Language Team on Wikipedia.


When using pretrained snippets, additional terms apply: BERT with weights licence.
