Feature encoding

Encoding       Datatype                       Applicable normalization
Numeric        Integer, Float, NumPy array    Standardization, Min-max normalization, None
Image          Image                          Standardization, Min-max normalization, None
Categorical    String, Integer                -
Text (beta)    String                         -

Standardization

Standardization converts a set of raw input data to have zero mean and unit standard deviation. Values above the feature’s mean get positive scores, and values below the mean get negative scores.

When you perform standardization, you assume that the input data are normally distributed. After standardization, the data will be Gaussian with mean zero and standard deviation one. It is possible to have values that are very far away from zero (e.g., -5, -10, 20), but if the distribution is unit Gaussian, those values will have very small probabilities.

The standard score is also called the z-score; see Wikipedia.

Why use standardization

You standardize a dataset to make it easier and faster to train a model. Standardization usually helps to get the input data into a value range that works well with the default activation functions, weight initializations, and other platform parameters.

Standardization puts the input values on a more equal footing so that there is less risk of one input feature drowning out the others.

Formula for standardization

The standard score z of a raw input data value x is calculated as:

z = (x - μ) / σ

where:

  • μ is the mean of the values of the feature in question.

  • σ is the standard deviation of the values of the feature in question.
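For illustration, here is a minimal NumPy sketch of the formula above. It only demonstrates the calculation, not the Platform's internal implementation, and the array values are made up.

    import numpy as np

    # Hypothetical feature column with an arbitrary scale
    x = np.array([12.0, 15.0, 9.0, 30.0, 21.0])

    mu = x.mean()         # μ: mean of the feature values
    sigma = x.std()       # σ: standard deviation of the feature values

    z = (x - mu) / sigma  # standard scores (z-scores)

    print(z.mean())       # ~0 (up to floating-point error)
    print(z.std())        # ~1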

Min-max normalization

Min-max normalization transforms input data to lie in a range between 0 and 1. After normalization, the lowest value of the input feature will be converted to 0, and the highest will get the value 1.

Why use min-max normalization

You normalize a dataset to make it easier and faster to train a model.

Min-max normalization puts values for different input features on a more equal footing. This will in many cases decrease the training time.

Min-max scaling helps with getting the input data into a value range that works well with the default activation functions, weight initializations, and other platform parameters.

If you have large outliers, you should be careful about applying min-max normalization, as this could put the non-outlier values in an overly narrow range.

Formula for min-max normalization

The normalized score of a raw input data value x is calculated as:

x_normalized = (x - min(x)) / (max(x) - min(x))

where:
min(x) is the lowest value for input feature x.
max(x) is the highest value for input feature x.
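As an illustration, here is a minimal NumPy sketch of min-max normalization. It only demonstrates the calculation, not the Platform's implementation, and the values are made up.

    import numpy as np

    # Hypothetical feature column
    x = np.array([12.0, 15.0, 9.0, 30.0, 21.0])

    # Min-max normalization: the lowest value maps to 0, the highest to 1
    x_norm = (x - x.min()) / (x.max() - x.min())

    print(x_norm.min(), x_norm.max())  # 0.0 1.0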

None

Selecting None means that your data will not be modified in any way.

Why use no preprocessing

Use None when you have numeric data that is on an appropriate scale.

If you have categorical or text data, you cannot use None; categorical and text data always have to be encoded as numeric data. Even if you have numeric data, you will often want to transform it anyway, for example with standardization, to facilitate model training.

Example: You are trying to forecast the price of a stock, and the input data features consist of daily relative changes in the stock’s price. In that case, you can train a model without preprocessing the data.

Categorical

Categorical is used when you don’t want to impose a specific ordering on your data.

Categorical can be used both on input and target features.

How does categorical encoding work

Categorical encoding is the same thing as one-hot encoding: it takes the categorical features in a dataset and converts them into new features. These features are binary vectors where only one entry is 1 while the rest are 0 (hence one-hot).


Why use categorical encoding

In deep learning, you need to convert your data to a numeric format to be able to train your models. Categorical encoding is used when you don’t want to impose a specific ordering on the categorical data.

Example: You have a dataset with five categories of clothes: "T-shirt", "Trouser", "Bag", "Hat", and "Ankle boot". If you select categorical encoding for this dataset, you will not impose a specific ordering on the categories. If you instead code them as integers 1 to 5, you will treat "Ankle boot" (5) as more similar to "Hat" (4) than to "T-shirt" (1).

The drawback of categorical encoding is that it can generate a very large number of new features for input features that have a large number of possible values (many unique values). In these cases, it may be better to use an embedding layer to decrease the number of dimensions.
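As an illustration of the clothes example above, here is a minimal sketch of one-hot encoding in NumPy. It is only an example; it is not how the Platform implements the encoding internally.

    import numpy as np

    # The five clothing categories from the example above
    categories = ["T-shirt", "Trouser", "Bag", "Hat", "Ankle boot"]

    def one_hot(label, categories):
        # Binary vector with a single 1 at the label's position
        vec = np.zeros(len(categories), dtype=int)
        vec[categories.index(label)] = 1
        return vec

    print(one_hot("Hat", categories))         # [0 0 0 1 0]
    print(one_hot("Ankle boot", categories))  # [0 0 0 0 1]

Because every unique category adds one dimension to the vector, a feature with thousands of unique values would produce vectors with thousands of entries, which is why an embedding layer can be preferable in that situation.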

Text (beta)

In a nutshell, text encoding turns text into numbers. This transformation is necessary because the deep learning models you build on the Peltarion Platform require the input to be vectors of continuous values; they just won’t work on strings of plain text.

Why use text encoding

Text encoding can be used to do sentiment analysis. We define sentiment analysis as the same thing as text classification.

Example: Find out if a tweet is positive or negative, or predict what rating a review corresponds to.

How does text encoding work

Text encoding transforms text into tokens. Tokenization is the process of breaking a stream of text up into words using the selected Language model.

Example: The sentence "I like jazz" is transformed to the tokens "I", "like", and "jazz".

When you select that a feature should be text encoded, this feature will be tokenized based on the selected Language model. This means that the language model translates the raw text string into a sequence of tokens, where each token represents a specific word.

The Sequence length decides how many words will be tokenized. If the incoming text is too long, it will be cut to the set length, and if it is too short, it will be padded.

Text tokenization

The tokenizer used by the Peltarion Platform simulates the europarl tokenizer. This means that:

  • Punctuation is tokenized. E.g., ‘.’ and ‘;’ are individual tokens that have their own embedding vectors.

  • Text is treated as case sensitive. E.g., ‘The’ and ‘the’ are not the same token.

  • When the sample text has fewer words than the set sequence length, the token vector is padded with index 0. Padding is done at the end of the token vector.

  • Out-of-vocabulary (OOV) words get the index 1.

  • Numbers and special characters (like €, &, etc.) are also tokenized.

Example:

Figure 1. Each word gets a token. Note that ‘The’ and ‘the’ have different tokens and that ‘från’ doesn’t exist in the English dictionary and therefore gets token 1. The end is padded with 0s since the sequence length is set to 25.
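To make the tokenization rules concrete, here is a toy sketch in Python. The vocabulary, indices, and helper function are made up for illustration; on the Platform, the vocabulary and token indices come from the selected Language model.

    PAD_INDEX = 0  # padding index, as described above
    OOV_INDEX = 1  # out-of-vocabulary index, as described above

    # Hypothetical case-sensitive vocabulary
    vocab = {"The": 2, "the": 3, "cat": 4, "sat": 5, ".": 6}

    def tokenize(text, sequence_length):
        tokens = text.replace(".", " .").split()  # punctuation gets its own token
        indices = [vocab.get(tok, OOV_INDEX) for tok in tokens]
        indices = indices[:sequence_length]       # cut if the text is too long
        indices += [PAD_INDEX] * (sequence_length - len(indices))  # pad at the end
        return indices

    print(tokenize("The cat sat från the cat.", 10))
    # [2, 4, 5, 1, 3, 4, 6, 0, 0, 0]  ('från' is out of vocabulary, the tail is padding)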

Word embedding

The Encoding type Text in the Datasets view is closely connected with the Text embedding block. They are part of the same workflow when you want to do sentiment analysis.

In the Text embedding block, the tokens are transformed into vectors that numerically represent each word’s semantic meaning. That is, the vector inherits the meaning of the word. These vectors preserve word similarities, so words that regularly occur nearby in text will also be in close proximity in vector space. If two words or documents have similar word embeddings, they are semantically similar.

Example: king and queen are royal words. This means that their vectors will be similar.

Figure 2. The vector for King shows that it means a royal male. The vector for Boy shows that it means a young male that probably isn’t royal.

Also, in a language one topic can be expressed in many ways. These synonyms will have similar vectors.

Example: boat and ship will have similar vectors.
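As a made-up illustration of vector similarity, this sketch compares hypothetical embedding vectors with cosine similarity. Real fastText vectors have 300 dimensions; the 4-dimensional vectors and their values here are invented purely to show the idea.

    import numpy as np

    def cosine_similarity(a, b):
        # Close to 1 for vectors pointing in similar directions
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

    # Hypothetical embeddings
    boat = np.array([0.8, 0.1, 0.3, 0.0])
    ship = np.array([0.7, 0.2, 0.4, 0.1])
    jazz = np.array([0.0, 0.9, 0.0, 0.6])

    print(cosine_similarity(boat, ship))  # high: similar meanings
    print(cosine_similarity(boat, jazz))  # low: unrelated meanings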

Parameters

Sequence length: The number of words that you want to tokenize. A longer sequence means extra computational complexity, hence longer run-times, but also allows more subtle representations and potentially better models. We only support a fixed sequence length. Select a sequence length that matches the average length of your samples.

Language model: The available languages. The pretrained word vectors are trained on Common Crawl and Wikipedia using fastText.
Select a Language model that matches the language of your input data.
