Feature encoding

Encoding      Datatype                      Applicable normalization
------------  ----------------------------  --------------------------------------------
Numeric       Integer, Float, NumPy array   Standardization, Min-max normalization, None
Image         Image                         Standardization, Min-max normalization, None
Categorical   String, Integer               -

Standardization

Standardization converts a set of raw input data to have zero mean and unit standard deviation. Values above the feature's mean get positive scores, and values below the mean get negative scores.

When you perform standardization, you assume that the input data are normally distributed. If that assumption holds, the standardized data will follow a Gaussian distribution with mean zero and standard deviation one. Values far from zero (e.g., -5, -10, 20) are still possible, but under a unit Gaussian they have very small probabilities.

The standard score is also called the z-score (see Wikipedia).

Why use standardization

You standardize a dataset to make it easier and faster to train a model. Standardization usually helps with getting the input data into a value range that works well with the default activation functions, weight initializations, and other platform parameters.

Standardization puts the input values on a more equal footing so that there is less risk of one input feature drowning out the others.

Formula for standardization

The standard score z of a raw input data value x is calculated as:

z = (x - μ) / σ

where:
  • μ is the mean of the values of the feature in question.

  • σ is the standard deviation of the values of the feature in question.
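As a minimal sketch (the feature values here are made up for illustration), the z-score formula translates directly to NumPy:

```python
import numpy as np

# Hypothetical feature values, just for illustration.
x = np.array([42.0, 55.0, 61.0, 48.0, 90.0])

# Standardization: subtract the feature mean, divide by its standard deviation.
z = (x - x.mean()) / x.std()

print(z)         # values above the mean are positive, below are negative
print(z.mean())  # ~0.0
print(z.std())   # ~1.0
```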

Min-max normalization

Min-max normalization transforms input data to lie in a range between 0 and 1. After normalization, the lowest value of the input feature will be converted to 0, and the highest will get the value 1.

Why use min-max normalization

You normalize a dataset to make it easier and faster to train a model.

Min-max normalization puts values for different input features on a more equal footing. This will in many cases decrease the training time.

Min-max scaling helps with getting the input data into a value range that works well with the default activation functions, weight initializations, and other platform parameters.

If you have large outliers, you should be careful about applying min-max normalization, as this could put the non-outlier values in an overly narrow range.
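To make this caveat concrete, here is a small sketch (with made-up values) applying the min-max formula defined below to a feature with one large outlier:

```python
import numpy as np

# Five ordinary values plus one large outlier.
x = np.array([42.0, 55.0, 61.0, 48.0, 90.0, 1000.0])

x_norm = (x - x.min()) / (x.max() - x.min())
print(x_norm)
# [0.     0.0136 0.0198 0.0063 0.0501 1.    ]
# The non-outlier values end up squeezed into roughly [0, 0.05].
```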

Formula for min-max normalization

The normalized score of a raw input data value x is calculated as:

x_norm = (x - min(x)) / (max(x) - min(x))

where:
  • min(x) is the lowest value of input feature x.

  • max(x) is the highest value of input feature x.
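As a minimal sketch (using the same made-up feature values as in the standardization example), the min-max formula in NumPy:

```python
import numpy as np

x = np.array([42.0, 55.0, 61.0, 48.0, 90.0])

# Min-max normalization: the lowest value maps to 0, the highest to 1.
x_norm = (x - x.min()) / (x.max() - x.min())

print(x_norm)        # all values lie between 0 and 1
print(x_norm.min())  # 0.0
print(x_norm.max())  # 1.0
```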

None

Selecting None means that your data will not be modified in any way.

Why use None

Use None when you have numeric data that is on an appropriate scale.

If you have categorical or text data, you cannot use None; such data must always be encoded in a numeric format. Even if you have numeric data, you will often want to transform it anyway, for example with standardization, to facilitate model training.

Example: You are trying to forecast the price of a stock, and the input features consist of daily relative changes in the stock's price. Since these values are already on an appropriate scale, you can train the model without preprocessing the data.
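For intuition, here is a hedged sketch (with made-up prices) of what such a feature looks like; daily relative changes are already small values centered near zero, so no further scaling is needed:

```python
import numpy as np

# Hypothetical daily closing prices of a stock.
prices = np.array([100.0, 102.0, 101.0, 104.0, 103.0])

# Daily relative change: today's price relative to yesterday's.
relative_changes = (prices[1:] - prices[:-1]) / prices[:-1]
print(relative_changes)  # [ 0.02   -0.0098  0.0297 -0.0096]
```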

Categorical

Categorical is used when you don’t want to impose a specific ordering on your data.

Categorical can be used both on input and target features.

How does categorical encoding work

Categorical encoding is the same thing as one-hot encoding: it takes the categorical features in a dataset and converts them into new features. These new features are binary vectors in which exactly one entry is 1 and the rest are 0 (hence one-hot).

[Figure: Categorical encoding]

Why use categorical encoding

In deep learning, you need to convert your data to a numeric format to be able to train your models. Categorical encoding is used when you don’t want to impose a specific ordering on the categorical data.

Example: You have a dataset with five categories of clothes: "T-shirt", "Trouser", "Bag", "Hat", and "Ankle boot". If you select categorical encoding for this dataset, you will not impose a specific ordering on the categories. If you instead code them as integers 1 to 5, you will treat "Ankle boot (5)" as more similar to "Hat (4)" than to "T-shirt (1)".
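A minimal sketch of what this encoding produces (the helper function here is hypothetical, not part of the platform):

```python
import numpy as np

categories = ["T-shirt", "Trouser", "Bag", "Hat", "Ankle boot"]

# Map each category name to an index, then to a binary vector with a
# single 1 at that index.
index = {name: i for i, name in enumerate(categories)}

def one_hot(name):
    vec = np.zeros(len(categories))
    vec[index[name]] = 1.0
    return vec

print(one_hot("Hat"))         # [0. 0. 0. 1. 0.]
print(one_hot("Ankle boot"))  # [0. 0. 0. 0. 1.]
```

Every one-hot vector is at the same distance from every other, so no ordering between the categories is implied.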

The drawback of categorical encoding is that it can generate a very large number of new features for input features with many possible values (that is, many unique values). In these cases, it may be better to use an embedding layer to decrease the number of dimensions.
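For intuition, an embedding layer can be pictured as a lookup table with one dense row per category. This sketch uses made-up sizes and random values; a real embedding table is learned during training:

```python
import numpy as np

rng = np.random.default_rng(0)

n_categories = 10_000   # e.g., an ID feature with many unique values
embedding_dim = 16      # far smaller than a 10,000-wide one-hot vector

# A lookup table with one dense row per category. Random here; a real
# embedding layer learns these values during training.
embedding_table = rng.normal(size=(n_categories, embedding_dim))

dense_vector = embedding_table[4217]  # look up one category by index
print(dense_vector.shape)             # (16,)
```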
