Feature encoding

Encoding      Datatype                     Applicable normalization
Numeric       Integer, Float, NumPy array  Standardization, Min-max normalization, None
Image         Image                        Standardization, Min-max normalization, None
Categorical   String, Integer              -
Text (beta)   String                       -

Standardization

Standardization converts a set of raw input data to have zero mean and unit standard deviation. Values above the feature’s mean will get positive scores, and values below the mean will get negative scores.

When you perform standardization, you assume that the input data are normally distributed. After the normalization, the data will be Gaussian with mean zero and standard deviation one. It is still possible to have values that are very far away from zero (e.g., -5, -10, 20), but if the distribution is unit Gaussian, those values will have very small probabilities.

The standard score is also called the z-score (see Wikipedia).

Why use standardization

You standardize a dataset to make it easier and faster to train a model. Standardization usually helps with getting the input data into a value range that works well with the default activation functions, weight initializations, and other platform parameters.

Standardization puts the input values on a more equal footing so that there is less risk of one input feature drowning out the others.

Formula for standardization

The standard score of a raw input data value x is calculated as:

z = (x - μ) / σ

where:
  • μ is the mean of the values of the feature in question.

  • σ is the standard deviation of the values of the feature in question.
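
As an illustration, the calculation can be sketched in a few lines of NumPy. The values below are made up and the platform performs this step for you; the snippet only shows the arithmetic behind the standard score.

```python
import numpy as np

# Hypothetical raw values for one numeric feature.
x = np.array([12.0, 15.0, 9.0, 21.0, 18.0])

mu = x.mean()     # μ: mean of the feature
sigma = x.std()   # σ: standard deviation of the feature

z = (x - mu) / sigma   # standard scores (z-scores)

print(z.mean())   # approximately 0
print(z.std())    # approximately 1
```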

Min-max normalization

Min-max normalization transforms input data to lie in a range between 0 and 1. After normalization, the lowest value of the input feature will be converted to 0, and the highest will get the value 1.

Why use min-max normalization

You normalize a dataset to make it easier and faster to train a model.

Min-max normalization puts values for different input features on a more equal footing. This will in many cases decrease the training time.

Min-max scaling helps with getting the input data into a value range that works well with the default activation functions, weight initializations, and other platform parameters.

If you have large outliers, you should be careful about applying min-max normalization, as this could put the non-outlier values in an overly narrow range.

Formula for min-max normalization

The normalized score of a raw input data value x is calculated as:

x_norm = (x - min(x)) / (max(x) - min(x))

where:
min(x) is the lowest value for input feature x.
max(x) is the highest value for input feature x.
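
The same calculation, sketched in NumPy with made-up values (the platform applies this for you; the snippet only illustrates the formula):

```python
import numpy as np

# Hypothetical raw values for one numeric feature.
x = np.array([12.0, 15.0, 9.0, 21.0, 18.0])

x_norm = (x - x.min()) / (x.max() - x.min())

print(x_norm.min())  # 0.0, the lowest raw value maps to 0
print(x_norm.max())  # 1.0, the highest raw value maps to 1
```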

None

Selecting None means that your data will not be modified in any way.

Why use None

Use None when you have numeric data that is on an appropriate scale.

If you have categorical or text data, you cannot use None; categorical and text data always have to be encoded as numeric data. If you have numeric data, you will often want to transform it anyway, for example with standardization, to facilitate model training.

Example: You are trying to forecast the price of a stock, and the input features consist of daily relative changes in the stock’s price. In that case, you can train the model on the data without any preprocessing.

Categorical

Categorical is used when you don’t want to impose a specific ordering on your data.

Categorical can be used both on input and target features.

How does categorical encoding work

Categorical encoding is the same thing as one-hot encoding: it takes the categorical features in a dataset and converts them into new features. These features are binary vectors where exactly one entry is 1 and the rest are 0 (hence one-hot).


Why use categorical encoding

In deep learning, you need to convert your data to a numeric format to be able to train your models. Categorical encoding is used when you don’t want to impose a specific ordering on the categorical data.

Example: You have a dataset with five categories of clothes: "T-shirt", "Trouser", "Bag", "Hat", and "Ankle boot". If you select categorical encoding for this dataset, you will not impose a specific ordering on the categories. If you instead code them as integers 1 to 5, you will treat "Ankle boot" (5) as more similar to "Hat" (4) than to "T-shirt" (1).
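
To make the example concrete, here is a minimal sketch of one-hot encoding the five clothing categories. The code is only an illustration of the idea, not the platform’s implementation:

```python
# Hypothetical illustration: one-hot encoding of five clothing categories.
categories = ["T-shirt", "Trouser", "Bag", "Hat", "Ankle boot"]
index = {name: i for i, name in enumerate(categories)}

def one_hot(name):
    # Binary vector with a single 1 at the category's position.
    vector = [0] * len(categories)
    vector[index[name]] = 1
    return vector

print(one_hot("Hat"))         # [0, 0, 0, 1, 0]
print(one_hot("Ankle boot"))  # [0, 0, 0, 0, 1]
```

Note that no category’s vector is closer to any other: every pair of one-hot vectors differs in exactly two positions, so no ordering is imposed.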

The drawback of categorical encoding is that it can generate a very large number of new features when an input feature has many unique values. In these cases, it may be better to use an embedding layer to decrease the number of dimensions.

Text (beta)

Text encoding is the conversion of plain text into a sequence of numerical values, which AI models love to handle. There are a few different ways to do this, and we provide methods to handle your text if it’s written in English, Swedish, or Finnish.

How does text encoding work?

  • The first step is to break the whole text string into smaller pieces, called tokens.

  • Every token then gets a number, and the sequence is ready to be processed!

Example:

Figure 1. The text input is tokenized into pieces that are assigned a number.
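
In code, the two steps can be sketched as follows. The tokenizer here is a plain whitespace split and the token-to-number table is made up; the platform’s language models use their own tokenizers and vocabularies.

```python
# Simplified sketch: whitespace tokenization and a made-up vocabulary.
vocabulary = {"the": 1, "winner": 2, "takes": 3, "it": 4, "all": 5}
OOV = 0  # number used for tokens that are not in the vocabulary

def encode(text):
    tokens = text.lower().split()                    # step 1: break the text into tokens
    return [vocabulary.get(t, OOV) for t in tokens]  # step 2: give every token a number

print(encode("The winner takes it all"))  # [1, 2, 3, 4, 5]
```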

Language models

There is no standard way of tokenizing text or assigning numbers to tokens, but we provide various language models that encode text in a way that is compatible with the pre-trained blocks.

There are two types of language models, fastText and BERT, which will transform your text slightly differently.

  • fastText language models support English, Swedish, and Finnish input. They are compatible with the text embedding blocks that use the same language model.
    The vocabulary size of these language models is 50000.

  • The BERT uncased language model supports English and is compatible with the BERT English uncased snippet.
    The vocabulary size of this language model is 30522.
    The vocabulary size of this language models is 30522.

Select the language model that matches the pre-trained blocks you intend to use in your model.
If you want to train your entire model from scratch, select a preset with the same language as your text so that tokens exist in the preset’s vocabulary.

Encoding with fastText language models

Tokenization uses the tokenizer from Europarl tools, which mostly splits around words.

These steps are performed during encoding:

  1. Split the text at every whitespace (space, tabulation, new line, etc.), at punctuation signs, and at symbols.

    ABBA symbols
  2. Regroup tokens based on a list of exceptions (abbreviations, numbers, etc.), which are language dependent.

    ABBA abbreviation
  3. Enforce the Sequence length parameter.
    Padding tokens (PAD) are added if the sequence is too short. Tokens are removed starting from the end if the sequence is too long.

    ABBA padding
  4. Replace unknown tokens with Out Of Vocabulary tokens (OOV).

    ABBA OOV
  5. Convert tokens into numerical values according to the particular language model.
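
The whole pipeline can be approximated with a short sketch. The vocabulary below is hypothetical and step 2 (regrouping exceptions) is omitted; the real encoder uses the Europarl tokenizer, language-dependent exception lists, and the 50000 most common tokens of the selected language.

```python
import re

# Hypothetical vocabulary; a real fastText language model contains
# the 50000 most common tokens of the selected language.
vocabulary = {"<PAD>": 0, "<OOV>": 1,
              "The": 2, "winner": 3, "takes": 4, "it": 5, "all": 6}
SEQUENCE_LENGTH = 8

def encode_fasttext_like(text):
    # 1. Split at whitespace, punctuation signs, and symbols.
    tokens = re.findall(r"\w+|[^\w\s]", text)
    # 3. Enforce the sequence length: cut from the end, then pad if too short.
    tokens = tokens[:SEQUENCE_LENGTH]
    tokens += ["<PAD>"] * (SEQUENCE_LENGTH - len(tokens))
    # 4-5. Replace unknown tokens with OOV and look up the numerical values.
    return [vocabulary.get(t, vocabulary["<OOV>"]) for t in tokens]

print(encode_fasttext_like("The winner takes it all!"))
# [2, 3, 4, 5, 6, 1, 0, 0]  ("!" is not in this toy vocabulary, so it becomes OOV)
```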

Dictionary lookup for fastText

There are two particularities that we use when encoding text with a fastText language model:

  • Whereas fastText provides a way to generate embeddings for unknown tokens (based on the n-grams of a word), we provide only a simple lookup of known tokens. Unknown tokens are thus converted into Out Of Vocabulary (OOV) tokens.

  • Whereas fastText provides about 2 million known tokens, we only support the 50000 most common tokens.

Encoding with the BERT uncased language model

The BERT network was pre-trained using tokens created by WordPiece, which may split words into smaller parts.

These steps are performed during encoding:

  1. Split the text at every whitespace (space, tabulation, new line, etc.), at punctuation signs, and at symbols.

    ABBA symbols
  2. Accents are removed and the text is lower-cased.

    ABBA normalization
  3. Replace unknown tokens with Out Of Vocabulary tokens (OOV).
    If a token is unknown, it may be split into smaller pieces if that creates a match. Unknown tokens therefore tend to be spelled out as subword pieces rather than replaced with OOV tokens.

    ABBA OOV BERT
  4. Two special tokens, CLS and SEP, are added around the sequence. These tokens are required by the BERT Encoder block.

    ABBA CLSSEP
  5. Enforce the Sequence length parameter. Because the two special tokens are always present, a value of 6 means that only 4 text tokens are kept.
    Padding tokens (PAD) are added if the sequence is too short. Tokens are removed starting from the end if the sequence is too long.

    ABBA cutting BERT
  6. Convert tokens into numerical values defined by the English BERT uncased language model.
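
The steps can be approximated with a sketch like the one below. The vocabulary fragment and its numbers are made up (the real tokens and ids are defined by the English BERT uncased language model), and the WordPiece splitting of unknown tokens is omitted.

```python
import re
import unicodedata

# Hypothetical fragment of a WordPiece-style vocabulary; the real English
# BERT uncased vocabulary contains 30522 tokens with fixed ids.
vocabulary = {"[PAD]": 0, "[OOV]": 1, "[CLS]": 2, "[SEP]": 3,
              "money": 4, ",": 5, "must": 6, "be": 7, "funny": 8}
SEQUENCE_LENGTH = 6  # total length, including the two special tokens

def encode_bert_like(text):
    # 2. Remove accents and lower-case the text.
    text = unicodedata.normalize("NFD", text)
    text = "".join(c for c in text if unicodedata.category(c) != "Mn").lower()
    # 1. Split at whitespace, punctuation signs, and symbols.
    tokens = re.findall(r"\w+|[^\w\s]", text)
    # 5. Enforce the sequence length; CLS and SEP take two positions,
    #    so a length of 6 keeps at most 4 text tokens.
    tokens = tokens[:SEQUENCE_LENGTH - 2]
    # 4. Add the two special tokens around the sequence, then pad if needed.
    tokens = ["[CLS]"] + tokens + ["[SEP]"]
    tokens += ["[PAD]"] * (SEQUENCE_LENGTH - len(tokens))
    # 3 + 6. Replace unknown tokens with OOV and convert to numerical values.
    return [vocabulary.get(t, vocabulary["[OOV]"]) for t in tokens]

print(encode_bert_like("Money, money must be funny"))
# [2, 4, 5, 4, 6, 3]  (only 4 text tokens fit when the sequence length is 6)
```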

Parameters

Sequence length: The total number of tokens in the final sequence.
Shorter values may cause the end of your text to be cut. Choose a length that matches your longest text to avoid cutting out data and to avoid unnecessary calculations.

Encoding with the BERT uncased language model adds 2 special tokens to your text, so use a minimum length of 3 in order to encode at least 1 token from your text.
The maximum sequence length accepted by a BERT Encoder block is 512.

Language model: The particular way to encode your text.
It determines how the text is tokenized, which tokens are known, and which value is assigned to them.