# English BERT

The English BERT block implements the BERT (Bidirectional Encoder Representations from Transformers) network in its base size, as published in BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.

BERT pushes the state of the art in Natural Language Processing by combining two powerful technologies:

• It is based on a deep Transformer encoder network, a type of network that can process long texts efficiently by using self-attention.

• It is bidirectional, meaning that it uses the whole text passage to understand the meaning of each word.

Note (Disclaimer): Please note that datasets, machine-learning models, weights, topologies, research papers and other content, including open source software, (collectively referred to as “Content”) provided and/or suggested by Peltarion for use in the Platform and otherwise, may be subject to separate third party terms of use or license terms. You are solely responsible for complying with the applicable terms. Peltarion makes no representations or warranties about Content. You expressly relieve us from any and all liability, loss or risk arising (directly or indirectly) from Your use of any third party content.

What’s more, the original authors have released pre-trained weights, so you can use BERT with minimal training work.

## Using the English BERT block

### Use cases

The English BERT block is initialized with weights pretrained on BookCorpus and English Wikipedia, which gives English BERT a good general understanding of English.

This means that you can easily fine-tune this block for a variety of tasks, like sentiment analysis, topic classification, or authorship verification.

### Input

The input of the English BERT block must come from an Input block that provides a text encoded feature.

### Output

The English BERT block returns the so-called CLS output. This output is a vector that can be passed to other blocks to perform regression or classification.

### Parameters

Sequence length: The maximum length of text that the model processes, in number of tokens. There are generally 1 to 3 tokens per word.
Aim for a Sequence length that is the typical size of your text feature, since larger values require more computation time but smaller values may cause the end of your text to be ignored.

The English BERT block supports sequence lengths between 3 and 512 tokens.
Note that 2 tokens are reserved for internal use, so a Sequence length of 3 processes a single token of text.
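
The arithmetic behind these limits can be sketched as follows. This is only an illustration: the function names are hypothetical, and the default of 2 tokens per word is an assumption picked from the middle of the "1 to 3 tokens per word" range quoted above.

```python
import math

MIN_SEQUENCE_LENGTH = 3
MAX_SEQUENCE_LENGTH = 512
RESERVED_TOKENS = 2  # reserved for internal use, as stated above


def usable_text_tokens(sequence_length: int) -> int:
    """Return how many tokens of actual text fit in a given Sequence length."""
    if not MIN_SEQUENCE_LENGTH <= sequence_length <= MAX_SEQUENCE_LENGTH:
        raise ValueError("Sequence length must be between 3 and 512")
    return sequence_length - RESERVED_TOKENS


def suggest_sequence_length(word_count: int, tokens_per_word: float = 2.0) -> int:
    """Hypothetical helper: estimate a Sequence length from a typical word count."""
    needed = math.ceil(word_count * tokens_per_word) + RESERVED_TOKENS
    # Clamp the estimate to the range supported by the block.
    return max(MIN_SEQUENCE_LENGTH, min(MAX_SEQUENCE_LENGTH, needed))


print(usable_text_tokens(3))        # 1
print(usable_text_tokens(512))      # 510
print(suggest_sequence_length(40))  # 82
```

The estimate is only a starting point. Checking the real token counts of your own text feature is more reliable than any fixed tokens-per-word ratio.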

Trainable: Whether the training algorithm is allowed to change the value of the weights during training. In some cases, you may want to keep parts of the network static.

## Training the English BERT block

### Fine-tuning a BERT model

BERT is a powerful model that can learn most fine-tuning datasets very easily. This means that it is prone to catastrophic forgetting and to overfitting the new dataset when trained with inappropriate settings.

To avoid these issues, train your model with a very low Learning rate, on the order of 10⁻⁵ to 10⁻⁶.
In addition, only train for a few Epochs, between 1 and 3.

### Memory consumption of BERT

The English BERT block is a very large model, which requires a large amount of memory to train.

The estimation of memory consumption displayed when using a BERT model is unfortunately not accurate at the moment.
As a rule of thumb, keep the product Batch size * Sequence length lower than 3000 to avoid memory issues.

If an experiment fails because the model requires too much memory, try reducing the Batch size in the experiment’s settings.
You can consider reducing the Sequence length of the block as well, as long as this doesn’t remove significant information.

Example

| Sequence length | Batch size |
|---|---|
| 512 | 6 |
| 200 | 10 |
| 100 | 25 |
| 50 | 55 |
| 10 | 250 |
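
As a rough illustration of the rule of thumb, the sketch below computes the largest Batch size that keeps the product strictly below 3000 for a given Sequence length. The function name is hypothetical and the threshold is treated as a hard limit, which is an assumption: the example pair 512 and 6 gives a product of 3072, so the limit is best read as approximate.

```python
MEMORY_BUDGET = 3000  # rule-of-thumb ceiling for Batch size * Sequence length


def largest_batch_size(sequence_length: int, budget: int = MEMORY_BUDGET) -> int:
    """Largest batch size whose product with sequence_length stays below budget."""
    if sequence_length < 1:
        raise ValueError("sequence_length must be positive")
    # (budget - 1) // sequence_length keeps the product strictly lower than budget.
    return (budget - 1) // sequence_length


for seq_len in (512, 200, 100, 50, 10):
    print(seq_len, largest_batch_size(seq_len))
```

These values are upper bounds under the rule of thumb only. Actual memory use depends on the rest of the model, so smaller batch sizes such as the example values above leave more headroom.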

## BERT Structure

The English BERT block implements the base version of the BERT network. It is composed of 12 successive transformer layers, each having 12 attention heads.
The total number of parameters is 110 million.

Figure 1. Structure of BERT

Every token in the input of the block is first embedded into a learned 768-long embedding vector.

Each embedding vector is then transformed progressively every time it traverses one of the BERT Encoder layers:

• Through linear projections, every embedding vector creates a triplet of 64-long vectors, called the key, query, and value vectors

• The key, query, and value vectors from all the embeddings pass through a self-attention head, which outputs one 64-long vector for each input triplet.
Every output vector from the self-attention head is a function of the whole input sequence, which is what makes BERT context-aware.

• A single embedding vector uses different linear projections to create 12 unique triplets of key, query, and value vectors, which all go through their own self-attention head.
This allows each self-attention head to focus on different aspects of how the tokens interact with each other.

• The outputs from all the self-attention heads are first concatenated, then passed through another linear projection and a feed-forward layer, which adds deep non-linearity. Residual connections from previous states are also used to increase robustness.

The result is a sequence of transformed embedding vectors, which are sent through the same layer structure 11 more times.
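
The self-attention part of one encoder layer can be sketched in plain Python. This is a simplified illustration, not the block's implementation: it uses toy sizes (8-long embeddings split across 2 heads) in place of the real 768-long embeddings and 12 heads of 64, it uses random weights, and it leaves out the biases, the output projection, the feed-forward layer, and the residual connections. The scaling by the square root of the head size follows the standard Transformer formulation.

```python
import math
import random


def matvec(matrix, vector):
    """Multiply a (rows x cols) matrix by a cols-long vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_head(embeddings, w_query, w_key, w_value):
    """One self-attention head: returns one head_dim-long vector per token."""
    queries = [matvec(w_query, e) for e in embeddings]
    keys = [matvec(w_key, e) for e in embeddings]
    values = [matvec(w_value, e) for e in embeddings]
    head_dim = len(queries[0])
    outputs = []
    for q in queries:
        # Score this token's query against every token's key (scaled dot product).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(head_dim) for k in keys]
        weights = softmax(scores)
        # Weighted sum of all value vectors: the output depends on the whole sequence.
        outputs.append([sum(w * v[i] for w, v in zip(weights, values))
                        for i in range(head_dim)])
    return outputs


def random_matrix(rows, cols, rng):
    """Random stand-in for a learned projection matrix."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


# Toy sizes (assumption, for readability). BERT base uses 768, 12, and 64.
EMBED_DIM, NUM_HEADS = 8, 2
HEAD_DIM = EMBED_DIM // NUM_HEADS
rng = random.Random(0)
tokens = [[rng.uniform(-1, 1) for _ in range(EMBED_DIM)] for _ in range(3)]

heads = []
for _ in range(NUM_HEADS):
    # Each head gets its own query, key, and value projections.
    w_q, w_k, w_v = (random_matrix(HEAD_DIM, EMBED_DIM, rng) for _ in range(3))
    heads.append(attention_head(tokens, w_q, w_k, w_v))

# Concatenate the per-head outputs back into one EMBED_DIM-long vector per token.
concatenated = [sum((head[t] for head in heads), []) for t in range(len(tokens))]
print(len(concatenated), len(concatenated[0]))  # 3 8
```

With the real sizes, 12 heads of 64 values each concatenate back to 768, which is why the output of a layer can be fed directly into the next one.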

After the 12th encoding layer, the embedding vectors have been transformed to contain more accurate information about each token. This block returns all of them or only the first one (corresponding to the [CLS] token), which is often sufficient for classification tasks.
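
The figure of roughly 110 million parameters can be checked with a back-of-the-envelope count. Several values below do not come from this page and are assumptions taken from the published BERT-Base Uncased configuration: a vocabulary of 30,522 tokens, 2 segment types, a 3072-wide feed-forward layer, layer normalization after each sub-layer, and a final dense layer applied to the [CLS] vector.

```python
VOCAB_SIZE = 30522    # assumption: BERT-Base Uncased WordPiece vocabulary size
MAX_POSITIONS = 512   # maximum Sequence length supported by the block
SEGMENT_TYPES = 2     # assumption: sentence A / sentence B embeddings
HIDDEN = 768          # embedding size
LAYERS = 12           # encoder layers
FEED_FORWARD = 3072   # assumption: inner feed-forward size, 4 * HIDDEN


def linear(n_in: int, n_out: int) -> int:
    """Parameter count of a dense layer: weight matrix plus bias."""
    return n_in * n_out + n_out


LAYER_NORM = 2 * HIDDEN  # one scale and one shift value per dimension

embeddings = (VOCAB_SIZE + MAX_POSITIONS + SEGMENT_TYPES) * HIDDEN + LAYER_NORM
# Query, key, value, and output projections. The 12 heads of 64 share these matrices.
attention = 4 * linear(HIDDEN, HIDDEN) + LAYER_NORM
feed_forward = linear(HIDDEN, FEED_FORWARD) + linear(FEED_FORWARD, HIDDEN) + LAYER_NORM
pooler = linear(HIDDEN, HIDDEN)  # assumption: dense layer on the [CLS] vector

total = embeddings + LAYERS * (attention + feed_forward) + pooler
print(f"{total:,}")  # 109,482,240
```

Under these assumptions, about 85 million of the parameters sit in the 12 encoder layers and about 24 million in the embeddings.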

## Available weights

The English BERT block uses the BERT-Base Uncased weights, pre-trained by the Google AI Language Team on BookCorpus and English Wikipedia.

### Terms

When using pretrained blocks, additional terms apply: BERT with weights licence.

## BERT Tokenizer (deprecated)

This block is deprecated and should not be used anymore. It exists only to support previously created experiments.

The tokenizer block converts plain text into a sequence of numerical values, which AI models love to handle. The same block can process text written in over 100 languages thanks to the WordPiece method.

## How does text tokenization work?

1. The tokenizer splits the input text into small pieces, called tokens.
There can be more tokens than words if parts of a word (like prefixes and suffixes) are more common than the word itself.

2. The Sequence length is enforced by truncating or padding the sequence of tokens.

3. Special tokens required by the Multilingual BERT encoder and English BERT encoder blocks are added, and every token is then replaced with an integer value.

4. The sequence of integers is ready to be processed by one of the language processing blocks.

The Tokenizer block uses WordPiece under the hood.
This means that it can process input text features written in over 100 languages, and be directly connected to a Multilingual BERT Encoder or English BERT Encoder block for advanced Natural Language Processing.
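
To illustrate how a word can become several tokens, here is a minimal sketch of the greedy longest-match splitting commonly used with WordPiece vocabularies. The vocabulary is a made-up toy, the whitespace splitting and lowercasing are simplifying assumptions, and the code is not the Tokenizer block's actual implementation. Pieces that continue a word carry the conventional "##" prefix.

```python
# Hypothetical toy vocabulary; a real WordPiece vocabulary has tens of thousands of entries.
TOY_VOCAB = {"[UNK]": 0, "un": 1, "##break": 2, "##able": 3, "break": 4, "the": 5, "##s": 6}


def wordpiece(word, vocab):
    """Split one word into the longest matching vocabulary pieces, left to right."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a continuation piece
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return ["[UNK]"]  # no piece fits: the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


def tokenize(text, vocab):
    """Lowercase, split on whitespace, and apply WordPiece to every word."""
    tokens = []
    for word in text.lower().split():
        tokens.extend(wordpiece(word, vocab))
    return tokens


tokens = tokenize("The unbreakable breaks", TOY_VOCAB)
print(tokens)                          # ['the', 'un', '##break', '##able', 'break', '##s']
print([TOY_VOCAB[t] for t in tokens])  # [5, 1, 2, 3, 4, 6]
```

Here three words produce six tokens, because "unbreakable" and "breaks" are not in the toy vocabulary as whole words but their parts are.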

## Parameters

Sequence length: The total number of tokens kept in the sequence. It’s necessary to fix the sequence length, since models require fixed size inputs.
Minimum: 3
Maximum: 512

If the text input is longer than the Sequence length, the end of the text will be ignored. If the text input is shorter, the sequence will be padded with PAD tokens.

Choose a length that matches your typical text size to utilize all the data while avoiding unnecessary calculations on the padding tokens.

Figure 2. A Sequence length of 5 causes long sentences to be truncated, and short sentences to be padded.
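
The truncation and padding behavior can be sketched as below. The function name is hypothetical, and the placement of the special tokens ([CLS] first, [SEP] after the text, then padding) follows the usual BERT convention, which is assumed here and not confirmed by this page for the deprecated block.

```python
def fit_to_length(tokens, sequence_length):
    """Truncate or pad a token list so the result has exactly sequence_length items."""
    if not 3 <= sequence_length <= 512:
        raise ValueError("Sequence length must be between 3 and 512")
    room = sequence_length - 2  # two slots are taken by the special tokens
    kept = tokens[:room]        # anything past this point is ignored
    sequence = ["[CLS]"] + kept + ["[SEP]"]
    # Fill any remaining slots with padding tokens.
    return sequence + ["[PAD]"] * (sequence_length - len(sequence))


print(fit_to_length(["a", "very", "long", "sentence", "indeed"], 5))
# ['[CLS]', 'a', 'very', 'long', '[SEP]']
print(fit_to_length(["hi"], 5))
# ['[CLS]', 'hi', '[SEP]', '[PAD]', '[PAD]']
```

In the first call the last two tokens are lost, and in the second two slots are spent on padding, which is the trade-off to weigh when choosing a Sequence length.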

Vocabulary: The known vocabulary used to tokenize the text and assign numerical values.
Use English uncased if you connect the tokenizer block to an English BERT encoder block. Letter case (capitalization) in the text is ignored.
Use Multilingual cased if you connect the tokenizer block to a Multilingual BERT encoder block. Letter casing (capitalization) is preserved to get additional linguistic information. For example, i and I get different token values.
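
The effect of the two Vocabulary options on letter case can be illustrated with a small sketch. The function is hypothetical and models only the case handling described above. A real tokenizer may apply further normalization that is not shown here.

```python
def normalize_case(text: str, vocabulary: str) -> str:
    """Apply the case handling implied by the chosen Vocabulary option."""
    if vocabulary == "English uncased":
        return text.lower()  # capitalization is ignored
    if vocabulary == "Multilingual cased":
        return text          # capitalization is preserved
    raise ValueError(f"Unknown vocabulary: {vocabulary}")


for option in ("English uncased", "Multilingual cased"):
    same = normalize_case("i", option) == normalize_case("I", option)
    print(option, "treats i and I as the same text:", same)
```

With English uncased, "i" and "I" reach the vocabulary lookup as the same string. With Multilingual cased they stay distinct and can map to different token values.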