Text encoding (beta)

Text encoding is the conversion of plain text into a sequence of numerical values, which is the form of input that AI models require. There is no single way to do this, but we provide encoding methods for text written in English, Swedish, or Finnish.

How does text encoding work?

  • The first step is to break the whole text string into smaller pieces, called tokens.

  • Every token then gets a number, and the sequence is ready to be processed!

Example:

Figure 1. The text input is tokenized into pieces that are assigned a number.
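
The sketch below shows these two steps with a hypothetical toy vocabulary; the token values are made up for illustration and do not come from the platform.

    import re

    # Hypothetical toy vocabulary: token -> numerical value.
    VOCAB = {"PAD": 0, "OOV": 1, "thank": 2, "you": 3, "for": 4, "the": 5, "music": 6}

    def encode(text):
        # Step 1: break the text string into smaller pieces (tokens).
        tokens = re.findall(r"\w+|[^\w\s]", text.lower())
        # Step 2: give every token a number; tokens outside the vocabulary map to OOV.
        return [VOCAB.get(token, VOCAB["OOV"]) for token in tokens]

    print(encode("Thank you for the music"))  # [2, 3, 4, 5, 6]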

Language models

There is no standard way of tokenizing text or of assigning numbers to tokens. Instead, we provide several language models that encode text in a way that is compatible with specific pre-trained blocks.

There are two types of language models, fastText and BERT, and they transform your text slightly differently.

  • fastText language models support English, Swedish, and Finnish input. They are compatible with the text embedding blocks that use the same language model.
    The vocabulary size of these language models is 50000.

  • The BERT uncased language model supports English and is compatible with the BERT English uncased snippet.
    The vocabulary size of this language model is 30522.

Select the language model that matches the pre-trained blocks you intend to use in your model.
If you want to train your entire model from scratch, select a language model with the same language as your text so that your tokens exist in its vocabulary.

Encoding with fastText language models

Tokenization uses the tokenizer from Europarl tools, which mostly splits the text into words.

These steps are performed during encoding:

  1. Split the text at every whitespace character (space, tab, newline, etc.), at punctuation marks, and at symbols.

    Figure: the text is split at whitespace, punctuation, and symbols (ABBA lyrics example).
  2. Regroup tokens based on a language-dependent list of exceptions (abbreviations, numbers, etc.).

    Figure: an exception such as an abbreviation is regrouped into a single token (ABBA lyrics example).
  3. Enforce the Sequence length parameter.
    Padding tokens (PAD) are added if the sequence is too short. Tokens are removed starting from the end if the sequence is too long.

    Figure: PAD tokens are added to reach the sequence length (ABBA lyrics example).
  4. Replace unknown tokens with Out Of Vocabulary tokens (OOV).

    Figure: unknown tokens are replaced with OOV tokens (ABBA lyrics example).
  5. Convert tokens into numerical values according to the selected language model.
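
The sketch below approximates steps 1 to 3. The real implementation relies on the Europarl tools tokenizer and its language-dependent exception lists; the regular expression and the tiny exception set used here are stand-ins, so the output is only illustrative. Steps 4 and 5, the dictionary lookup, are sketched in the next section.

    import re

    PAD = "PAD"
    # Stand-in for the language-dependent exception lists (abbreviations, numbers, etc.).
    EXCEPTIONS = {"U.S.A.", "10,000"}

    def tokenize(text, sequence_length):
        # Step 1: split at whitespace, punctuation marks, and symbols.
        pieces = re.findall(r"\w+|[^\w\s]", text)
        # Step 2: regroup consecutive pieces that form a known exception.
        tokens, i = [], 0
        while i < len(pieces):
            for j in range(len(pieces), i + 1, -1):
                if "".join(pieces[i:j]) in EXCEPTIONS:
                    tokens.append("".join(pieces[i:j]))
                    i = j
                    break
            else:
                tokens.append(pieces[i])
                i += 1
        # Step 3: enforce the sequence length: cut from the end, then pad with PAD.
        tokens = tokens[:sequence_length]
        return tokens + [PAD] * (sequence_length - len(tokens))

    print(tokenize("The winner takes it all in the U.S.A.", 12))
    # ['The', 'winner', 'takes', 'it', 'all', 'in', 'the', 'U.S.A.', 'PAD', 'PAD', 'PAD', 'PAD']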

Dictionary lookup for fastText

Encoding with a fastText language model differs from the original fastText implementation in two ways:

  • Whereas fastText provides a way to generate embeddings for unknown tokens (based on the character n-grams of a word), we provide only a simple lookup of known tokens. Unknown tokens are thus converted into Out Of Vocabulary (OOV) tokens.

  • Whereas fastText provides about 2 million known tokens, we only support the 50000 most common tokens.
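
A minimal sketch of that lookup, assuming a hypothetical vocab dictionary that holds the 50000 most common tokens (only a fragment is shown):

    OOV = "OOV"

    # Hypothetical fragment of the 50000-token vocabulary: token -> numerical value.
    vocab = {"PAD": 0, OOV: 1, "the": 2, "winner": 3, "takes": 4, "it": 5, "all": 6}

    def lookup(tokens):
        # Steps 4 and 5: unknown tokens become OOV, then every token is mapped to its value.
        # The original fastText library would instead build an embedding for an unknown
        # word from its character n-grams; that mechanism is not used here.
        return [vocab.get(token, vocab[OOV]) for token in tokens]

    print(lookup(["the", "winner", "takes", "it", "all", "abbamania"]))  # [2, 3, 4, 5, 6, 1]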

Encoding with the BERT uncased language model

The BERT network was pre-trained using tokens created by WordPiece, which may split words into smaller parts.

These steps are performed during encoding:

  1. Split the text at every whitespace character (space, tab, newline, etc.), at punctuation marks, and at symbols.

    Figure: the text is split at whitespace, punctuation, and symbols (ABBA lyrics example).
  2. Remove accents and lower-case the text.

    Figure: accents are removed and the text is lower-cased (ABBA lyrics example).
  3. Replace unknown tokens with Out Of Vocabulary tokens (OOV).
    Before falling back to OOV, an unknown token is split into smaller pieces whenever those pieces exist in the vocabulary, so unknown words tend to be spelled out as word pieces rather than replaced by OOV tokens.

    Figure: an unknown token is split into known word pieces (ABBA lyrics example).
  4. Add two special tokens, CLS and SEP, around the sequence. These tokens are required by the BERT Encoder block.

    Figure: the CLS and SEP tokens are added around the sequence (ABBA lyrics example).
  5. Enforce the Sequence length parameter. Because the two special tokens are always present, a value of 6 means that only 4 text tokens are kept.
    Padding tokens (PAD) are added if the sequence is too short. Tokens are removed starting from the end if the sequence is too long.

    Figure: the sequence is padded or cut to the sequence length (ABBA lyrics example).
  6. Convert tokens into numerical values defined by the English BERT uncased language model.
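
These steps can be reproduced with the public bert-base-uncased tokenizer from the Hugging Face transformers package, which uses the same WordPiece scheme. The package and model name are assumptions used for illustration; they are not part of the platform, and the exact numerical values on the platform may differ.

    from transformers import BertTokenizer

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

    # Steps 1-3: splitting, accent removal and lower-casing, and word-piece splitting.
    print(tokenizer.tokenize("Dancing Queen!"))  # ['dancing', 'queen', '!']

    # Steps 4-6: CLS and SEP are added, the sequence length (here 6) is enforced,
    # and the tokens are converted into numerical values.
    ids = tokenizer.encode(
        "Dancing Queen!",
        max_length=6,          # total length, including the 2 special tokens
        padding="max_length",  # add PAD tokens if the sequence is too short
        truncation=True,       # remove tokens from the end if it is too long
    )
    print(tokenizer.convert_ids_to_tokens(ids))  # ['[CLS]', 'dancing', 'queen', '!', '[SEP]', '[PAD]']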

Text settings

Sequence length: The total number of tokens in the final sequence.
Shorter values may cause the end of your text to be cut. Choose a length that matches your longest text to avoid cutting out data and to avoid unnecessary calculations.

Encoding with the BERT uncased language model adds 2 special tokens to your text, so use a minimum length of 3 in order to encode at least 1 token from your text.
The maximum sequence length accepted by a BERT Encoder block is 512.
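
As a quick check of these limits, the number of text tokens that fit in a given Sequence length can be computed with a small helper (a hypothetical function, not part of the platform):

    def bert_text_token_budget(sequence_length):
        # CLS and SEP always occupy 2 positions of the sequence.
        if not 3 <= sequence_length <= 512:
            raise ValueError("Sequence length must be between 3 and 512 for the BERT Encoder block")
        return sequence_length - 2

    print(bert_text_token_budget(6))    # 4 text tokens are kept
    print(bert_text_token_budget(512))  # 510 text tokens are kept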

Language model: The particular way to encode your text.
It determines how the text is tokenized, which tokens are known, and which numerical value is assigned to each token.