BERT Encoder (deprecated)

This block is deprecated and should not be used anymore. It exists only to support previously created experiments.

Check out the English BERT and Multilingual BERT blocks instead.

The BERT Encoder block implements the BERT (Bidirectional Encoder Representations from Transformers) network in its base size, as published in BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.

BERT pushes the state of the art in Natural Language Processing by combining two powerful technologies:

  • It is based on a deep Transformer encoder network, a type of network that can process long texts efficiently by using self-attention.

  • It is bidirectional, meaning that it uses the whole text passage to understand the meaning of each word.

What’s more, the original authors have released pre-trained weights, so that you can use it with minimal training work.

Using the BERT Encoder

The BERT Encoder block is initialized with random weights, so it takes a long time and a lot of data to train.


Generally speaking, the block’s input must be a 1-dimensional vector with a size between 3 and 512, containing integers between 0 and the Vocabulary size.
Keep in mind that the cost of self-attention grows quadratically with the input size, so don’t use the largest size if you don’t have to.
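
To get a feel for the quadratic growth, the following minimal sketch counts how many attention scores are computed in one forward pass. The function name is hypothetical, and the count covers only the self-attention scores: the rest of the network grows roughly linearly with the input size, so real training time will not scale exactly like this.

```python
def attention_score_entries(seq_len, num_heads=12, num_layers=12):
    """Count the attention scores computed in one forward pass.

    Every head in every layer compares each token with every other token,
    which gives seq_len * seq_len scores per head.
    """
    return seq_len * seq_len * num_heads * num_layers


short = attention_score_entries(128)
long = attention_score_entries(512)

# How many times more attention scores a 512-token input needs
# compared with a 128-token input.
ratio = long // short
print(ratio)
```

An input that is 4 times longer needs 16 times as many attention scores, which is why a shorter Sequence length is worth choosing when your texts allow it.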

You can get the proper input type from text data by connecting a feature that uses the Text (beta) encoding.

Use the Language model English BERT uncased for that feature, so that the text is encoded with BERT’s default Vocabulary size, and so that the special [CLS] and [SEP] tokens are added at the start and at the end of the text.
The Sequence length of the feature will determine the size of your text, in number of tokens, that is passed to your model.
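
The shape of the resulting input can be illustrated with a small sketch. Everything in it is an assumption made for illustration: the toy vocabulary and its IDs, the whitespace tokenizer (the real language model uses WordPiece tokens), and the truncation and right-padding behavior. It is not the platform's actual encoder.

```python
# Hypothetical toy vocabulary. The real English BERT uncased vocabulary is
# far larger and assigns different IDs.
TOY_VOCAB = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
             "the": 4, "movie": 5, "was": 6, "great": 7}


def encode(text, sequence_length, vocab=TOY_VOCAB):
    """Turn a text into a fixed-size list of integer token IDs."""
    if not 3 <= sequence_length <= 512:
        raise ValueError("sequence_length must be between 3 and 512")
    # Whitespace splitting stands in for BERT's WordPiece tokenizer.
    words = text.lower().split()
    # Leave room for the two special tokens, truncating if needed.
    words = words[: sequence_length - 2]
    ids = [vocab["[CLS]"]]
    ids += [vocab.get(word, vocab["[UNK]"]) for word in words]
    ids.append(vocab["[SEP]"])
    # Pad on the right (assumed behavior) up to the fixed sequence length.
    ids += [vocab["[PAD]"]] * (sequence_length - len(ids))
    return ids


token_ids = encode("The movie was great", sequence_length=8)
print(token_ids)
```

The result is a 1-dimensional vector of exactly Sequence length integers, each smaller than the vocabulary size, which is the kind of input the block expects.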

Currently, you can only give a single text passage to the BERT Encoder block.


For each input token, the BERT Encoder block calculates a 768-long vector representing an embedding of this token.

You can choose to return only the first vector (CLS option), or all of them (Sequence option).

  • When the input is encoded using English BERT uncased as the Language model, the special [CLS] token is added at the first position. The output vector for this special token does not represent the token itself, but the input as a whole.
    So it is usually sufficient, and faster, to only use this vector for classification tasks.

  • If you choose to return the full output of the BERT Encoder block, you will get more detailed information about every token of the input. This can help to get better accuracy, e.g., when comparing different inputs.
    However, this output can only be followed by a 1D Global average pooling or a 1D Global max pooling block.
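
The difference between the two options can be sketched with NumPy. The random array below is only a stand-in for the block's real output, and the pooling lines show what a 1D global pooling block is generally understood to compute, not the platform's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
sequence_length, hidden_size = 16, 768

# Stand-in for the Sequence output: one 768-long vector per input token.
sequence_output = rng.normal(size=(sequence_length, hidden_size))

# CLS option: keep only the vector at the first position.
cls_output = sequence_output[0]

# Sequence option followed by 1D global pooling over the token axis.
avg_pooled = sequence_output.mean(axis=0)
max_pooled = sequence_output.max(axis=0)

print(cls_output.shape, avg_pooled.shape, max_pooled.shape)
```

In all three cases the following blocks receive a single 768-long vector, whatever the Sequence length of the input feature.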

BERT Structure

The BERT Encoder block implements the base version of the BERT network. It is composed of 12 successive transformer layers, each having 12 attention heads.
The total number of parameters is 110 million.
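
The parameter count can be checked with some arithmetic. The hidden size, number of layers, and maximum input size come from this page. The vocabulary size, feed-forward size, segment embeddings, and final pooling layer are assumptions taken from the published BERT base configuration, so the exact figure for this block may differ slightly.

```python
# Values stated on this page.
hidden = 768
layers = 12
max_positions = 512

# Assumptions taken from the published BERT base configuration.
vocab_size = 30522     # English BERT uncased vocabulary
feed_forward = 3072    # inner size of the feed-forward layer
segment_types = 2      # segment (token type) embeddings


def dense(n_in, n_out):
    """Parameters of a fully connected layer: weights plus biases."""
    return n_in * n_out + n_out


layer_norm = 2 * hidden  # one scale and one shift value per dimension

embeddings = (vocab_size + max_positions + segment_types) * hidden + layer_norm
attention = 4 * dense(hidden, hidden)  # query, key, value, output projections
ffn = dense(hidden, feed_forward) + dense(feed_forward, hidden)
per_layer = attention + layer_norm + ffn + layer_norm
pooler = dense(hidden, hidden)         # assumed final layer on the [CLS] vector

total = embeddings + layers * per_layer + pooler
print(f"{total:,}")
```

Under these assumptions the total comes to a little under 110 million, and roughly three quarters of the parameters sit in the 12 transformer layers.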

Figure 1. Structure of BERT

Every token in the input of the block is first embedded into a learned 768-long embedding vector.

Each embedding vector is then transformed progressively every time it traverses one of the BERT Encoder layers:

  • Through linear projections, every embedding vector creates a triplet of 64-long vectors, called the key, query, and value vectors.

  • The key, query, and value vectors from all the embeddings pass through a self-attention head, which outputs one 64-long vector for each input triplet.
    Every output vector from the self-attention head is a function of the whole input sequence, which is what makes BERT context-aware.

  • A single embedding vector uses different linear projections to create 12 unique triplets of key, query, and value vectors, which all go through their own self-attention head.
    This allows each self-attention head to focus on different aspects of how the tokens interact with each other.

  • The outputs from all the self-attention heads are first concatenated, then go through another linear projection and a feed-forward layer, which adds non-linearity. Residual connections from previous states are also used to increase robustness.

The result is a sequence of transformed embedding vectors, which are sent through the same layer structure 11 more times.
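
The layer described above can be sketched in NumPy as follows. This is a simplified illustration under stated assumptions, not the block's implementation: the weights are random, biases and the learned layer-normalization parameters are left out, ReLU stands in for the GELU activation, and the feed-forward size of 3072 comes from the published BERT base configuration rather than from this page. All function names are hypothetical.

```python
import numpy as np

HIDDEN, HEADS, HEAD_DIM = 768, 12, 64
FFN = 3072  # assumption: feed-forward size of the published BERT base model


def layer_norm(x, eps=1e-12):
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)


def init_layer(rng):
    """Random weights for one layer (hypothetical helper, biases omitted)."""
    def w(*shape):
        return rng.normal(scale=0.02, size=shape)
    return {"q": w(HIDDEN, HIDDEN), "k": w(HIDDEN, HIDDEN),
            "v": w(HIDDEN, HIDDEN), "o": w(HIDDEN, HIDDEN),
            "ff1": w(HIDDEN, FFN), "ff2": w(FFN, HIDDEN)}


def split_heads(x):
    # (tokens, 768) -> (12 heads, tokens, 64)
    return x.reshape(x.shape[0], HEADS, HEAD_DIM).transpose(1, 0, 2)


def encoder_layer(x, p):
    # Linear projections give 12 query/key/value triplets of 64-long vectors.
    q, k, v = (split_heads(x @ p[name]) for name in ("q", "k", "v"))
    # Each head scores every token against every other token.
    scores = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(HEAD_DIM))
    heads = scores @ v  # (12, tokens, 64)
    # Concatenate the heads back into one 768-long vector per token.
    concat = heads.transpose(1, 0, 2).reshape(x.shape[0], HIDDEN)
    x = layer_norm(x + concat @ p["o"])       # residual connection
    hidden = np.maximum(0.0, x @ p["ff1"])    # ReLU stands in for GELU
    return layer_norm(x + hidden @ p["ff2"])  # residual connection


rng = np.random.default_rng(0)
tokens = rng.normal(size=(8, HIDDEN))  # 8 already-embedded tokens
out = tokens
for _ in range(12):  # the same layer structure, 12 times in total
    out = encoder_layer(out, init_layer(rng))
print(out.shape)
```

The output keeps the shape of the input, one 768-long vector per token, which is what allows the same layer structure to be stacked repeatedly.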

After the 12th encoding layer, each embedding vector reflects not only its own token but also the context in which the token appears. You can choose whether the BERT Encoder block returns all of the vectors or only the first one (corresponding to the [CLS] token), which is often sufficient for classification tasks.


In this video, Romain gives a step-by-step walkthrough of self-attention, the mechanism powering BERT and other state-of-the-art models within natural language processing (NLP).

He goes through concepts such as:

  • Context

  • Word embeddings

  • Multi-head attention



Vocabulary size: The largest input value that the BERT Encoder block can recognize.
Use the number of known tokens in the language model used to encode the input feature.

Output: CLS (default) returns only the transformed embedding of the first token, which is often the best way to do classification tasks.
Sequence returns the transformed embedding for all the input tokens, but you can only send this output to a 1D Global average pooling or a 1D Global max pooling block.
