BERT Encoder

The BERT Encoder block implements the BERT (Bidirectional Encoder Representations from Transformers) network in its base size, as published in the paper BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.

BERT pushes the state of the art in Natural Language Processing by combining two powerful technologies:

  • It is based on a deep Transformer encoder network, a type of network that can process long texts efficiently by using attention.

  • It is bidirectional, meaning that it uses the whole text passage to understand the meaning of each word.
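To make the two ideas above concrete, here is a minimal single-head self-attention sketch in NumPy. It is not the platform's implementation (a real Transformer uses learned query/key/value projections and multiple heads); it only shows why the encoder is bidirectional: the softmax runs over every position, so each output vector mixes information from the whole passage.

```python
import numpy as np

def self_attention(x):
    """Toy single-head self-attention over a sequence of token vectors.
    x: (sequence_length, model_dim) array; returns an array of the same shape."""
    d = x.shape[-1]
    # A real Transformer derives queries, keys and values from learned
    # projections of x; using x directly keeps the sketch short.
    scores = x @ x.T / np.sqrt(d)                   # (seq, seq) similarity scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over ALL positions
    return weights @ x                              # each output sees the whole passage

tokens = np.random.rand(5, 8)   # 5 tokens, toy dimension 8
out = self_attention(tokens)
print(out.shape)                # (5, 8): one context-aware vector per token
```

Because the attention weights span every position, the representation of each word is conditioned on both its left and right context, unlike a left-to-right language model.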

What’s more, the original authors have released pre-trained weights, so you can use the network with minimal training work.

Using the BERT Encoder

The BERT Encoder block is initialized with random weights, and training it from scratch takes a long time and requires a lot of data. Instead, build your model from the BERT English uncased snippet to benefit from a pre-trained BERT Encoder block, which can be fine-tuned much more easily.


The input must be a vector of integers, which is produced by encoding a textual feature as Text (beta). Pre-trained weights for the BERT Encoder block are only available for the Language model English BERT uncased, but other language models may be used.

Set the Vocabulary size block parameter to the largest possible input value, which is the number of known tokens for a given language model.
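The two points above can be illustrated with a toy encoder. The vocabulary below is hypothetical (the real English BERT uncased model knows about 30,000 wordpiece tokens, and its tokenizer splits words into subword pieces rather than whole words), but it shows the relationship between the integer encoding and the Vocabulary size parameter.

```python
# Hypothetical mini-vocabulary for illustration only; special markers
# follow the BERT convention ([CLS] starts a passage, [SEP] ends it).
vocab = {"[PAD]": 0, "[CLS]": 1, "[SEP]": 2, "[UNK]": 3,
         "the": 4, "movie": 5, "was": 6, "great": 7}

def encode(text):
    """Map each lowercased word to its integer id ([UNK] if unknown),
    wrapped in the [CLS]/[SEP] markers that BERT-style tokenizers add."""
    ids = [vocab.get(w, vocab["[UNK]"]) for w in text.lower().split()]
    return [vocab["[CLS]"]] + ids + [vocab["[SEP]"]]

print(encode("The movie was great"))   # [1, 4, 5, 6, 7, 2]

# Vocabulary size = number of known tokens = largest id + 1
print(len(vocab))                      # 8
```

Setting Vocabulary size to the number of known tokens guarantees that every integer the text encoding can produce falls within the range the block expects.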

The BERT Encoder block accepts any input size from 3 to 512. Smaller inputs compute faster than larger ones, although they can carry less information.
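When your encoded passages vary in length, a common preprocessing step is to truncate or pad them to a fixed size inside the supported range. The helper below is a sketch of that trade-off (the pad id 0 is a hypothetical choice, not a documented platform value):

```python
def fit_input(token_ids, size, pad_id=0):
    """Truncate or pad a list of token ids so it has exactly `size`
    elements; valid input sizes for the BERT Encoder block are 3..512."""
    if not 3 <= size <= 512:
        raise ValueError("size must be between 3 and 512")
    return (token_ids[:size] + [pad_id] * size)[:size]

print(fit_input([1, 4, 5, 6, 7, 2], 4))   # [1, 4, 5, 6] (truncated)
print(fit_input([1, 4, 2], 5))            # [1, 4, 2, 0, 0] (padded)
```

Choosing a smaller size makes every training step cheaper, at the cost of discarding the tail of longer passages.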

Currently, you can only give a single text passage to the BERT Encoder block. While this is sufficient for many natural language tasks, applications that require pairs of text passages, as illustrated in the original publication, are limited on the platform.


The BERT Encoder output is a single vector of size 768, which contains information about the input sequence as a whole.

Although the BERT Encoder calculates internal vectors for each value of the input sequence, these vectors are currently not returned by the BERT Encoder block.
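The shapes involved can be sketched as follows. The internal representation is one vector per token; the block's output summarizes them into a single vector of size 768. Keeping a designated first-token vector (as BERT does with its [CLS] token) and averaging over all tokens are two common pooling choices; neither is stated here to be the platform's exact method.

```python
import numpy as np

seq_len, hidden = 128, 768                       # 768 matches the block's output size
token_vectors = np.random.rand(seq_len, hidden)  # internal per-token vectors (not exposed)

pooled_first = token_vectors[0]            # keep one designated vector, [CLS]-style
pooled_mean = token_vectors.mean(axis=0)   # or average over all positions

print(pooled_first.shape, pooled_mean.shape)   # (768,) (768,)
```

Either way, the per-token detail is collapsed into one sequence-level vector, which is what downstream blocks receive.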

BERT Structure

The BERT block implements the base version of the BERT network. It is composed of 12 Transformer encoder layers, each with 12 attention heads.
The total number of parameters is about 110 million.
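The 110 million figure can be roughly reproduced from the published BERT base sizes. The back-of-the-envelope count below assumes a vocabulary of 30,522 wordpieces, hidden size 768, feed-forward size 3,072, and a maximum position of 512, and ignores small contributions such as layer normalization and pooler weights:

```python
# Approximate parameter count for BERT base (weights + biases).
V, H, L, F, P = 30522, 768, 12, 3072, 512   # vocab, hidden, layers, feed-forward, positions

embeddings = V * H + P * H + 2 * H          # token + position + segment embeddings
per_layer = (
    3 * (H * H + H)       # query, key, value projections (shared across 12 heads)
    + (H * H + H)         # attention output projection
    + (H * F + F)         # feed-forward up-projection
    + (F * H + H)         # feed-forward down-projection
)
total = embeddings + L * per_layer
print(f"{total / 1e6:.0f}M parameters")     # ~109M, i.e. roughly 110 million
```

Most of the budget sits in the 12 encoder layers; the token embedding table accounts for most of the rest.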


Vocabulary size: The largest input value that the BERT Encoder block can recognize.
Use the number of known tokens in the language model used to encode the input feature.