Multilingual BERT encoder

The Multilingual BERT Encoder block implements the BERT—​Bidirectional Encoder Representations from Transformers—​network in its base size, as published in BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.

BERT pushes the state of the art in Natural Language Processing by combining two powerful technologies:

  • It is based on a deep Transformer encoder network, a type of network that can process long texts efficiently by using self-attention.

  • It is bidirectional, meaning that it uses the whole text passage to understand the meaning of each word.

Please note that datasets, machine-learning models, weights, topologies, research papers and other content, including open source software, (collectively referred to as “Content”) provided and/or suggested by Peltarion for use in the Platform and otherwise, may be subject to separate third party terms of use or license terms. You are solely responsible for complying with the applicable terms. Peltarion makes no representations or warranties about Content. You expressly relieve us from any and all liability, loss or risk arising (directly or indirectly) from Your use of any third party content.

What’s more, the original authors have released pre-trained weights, so that you can use it with minimal training work.

Why use a multilingual model?

A multilingual model allows you to deploy a single model able to work with any of the 100 known languages.

More than a simple convenience, multilingual models often perform better than monolingual models.
One reason is that the training data available is generally more limited in any single language. In addition, many languages share common patterns that the model can pick up more easily when it is trained with a variety of languages.

Using the BERT Encoder

The Multilingual BERT encoder block is initialized with weights pretrained on Wikipedia.

Use the Multilingual BERT snippet to directly get a complete model for text classification or text regression that uses the Multilingual BERT encoder.


The input of the Multilingual BERT encoder must come from a Tokenizer block.

The tokenizer must use Multilingual cased as Vocabulary, so that the tokenized numerical values are compatible with the Multilingual BERT encoder block.


The Multilingual BERT encoder returns the so-called CLS output. This output is a vector that can be passed to other blocks to perform regression or classification.

BERT Structure

The BERT Encoder block implements the base version of the BERT network. It is composed of 12 successive transformer layers, each having 12 attention heads.
The total number of parameters is 110 million.

BERT Encoder
Figure 1. Structure of BERT

Every token in the input of the block is first embedded into a learned 768-long embedding vector.

Each embedding vector is then transformed progressively every time it traverses one of the BERT Encoder layers:

  • Through linear projections, every embedding vector creates a triplet of 64-long vectors, called the key, query, and value vectors

  • The key, query, and value vectors from all the embeddings pass through a self-attention head, which outputs one 64-long vector for each input triplet.
    Every output vector from the self-attention head is a function of the whole input sequence, which is what makes BERT context-aware.

  • A single embedding vector uses different linear projections to create 12 unique triplets of key, query, and value vectors, which all go through their own self-attention head.
    This allows each self-attention head to focus on different aspects of how the tokens interact with each other.

  • The output from all the self-attention heads are first concatenated together, then they go through another linear projection and a feed-forward layer, which helps to utilize deep non-linearity. Residual connections from previous states are also used to increase robustness.

The result is a sequence of transformed embedding vectors, which are sent through the same layer structure 11 more times.

After the 12th encoding layer, the embedding vectors have been transformed to contain more accurate information about each token. This block returns only the first one (corresponding to the [CLS] token), which is often sufficient for classification tasks.

Available weights

The Multilingual BERT encoder block uses the BERT-Base, Multilingual Cased weights, pre-trained by the Google AI Language Team on Wikipedia.


When using pretrained snippets, additional terms apply: BERT with weights licence.

Was this page helpful?