Categorical encoding

Categorical encoding (one-hot or index) can be used to encode features when:

  • Only a finite number of unique values exist, called classes

  • The values are not ordered, or the order is not relevant to solve the task

  • A feature has up to 2000 classes

Example: The Fashion MNIST tutorial has a dataset with five categories of clothes: T-shirt, Trouser, Bag, Dress, and Ankle boot.
If you treat these categories as numbers from 0 to 4, you implicitly impose constraints on your model. For instance, the model will treat Ankle boot (4) as more similar to Dress (3) than to T-shirt (0). It will also treat Trouser (1) as somehow larger than T-shirt (0).
Such distinctions are meaningless, but will affect the numerical solutions that the model can provide.
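The effect can be seen with a few lines of Python. This is only an illustrative sketch: the class list comes from the example above, while the names as_int and int_distance are hypothetical helpers, not part of the Platform.

```python
# Classes from the example above, numbered 0 to 4 by their position.
classes = ["T-shirt", "Trouser", "Bag", "Dress", "Ankle boot"]
as_int = {name: i for i, name in enumerate(classes)}

def int_distance(a, b):
    """Distance between two classes when they are treated as plain numbers."""
    return abs(as_int[a] - as_int[b])

# The numbering makes some classes look closer than others.
print(int_distance("Ankle boot", "Dress"))    # 1
print(int_distance("Ankle boot", "T-shirt"))  # 4

# It also introduces an ordering that has no meaning for clothes.
print(as_int["Trouser"] > as_int["T-shirt"])  # True
```

None of these numbers says anything about the clothes themselves; they only reflect the order in which the classes were listed.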

Categorical encoding can be used for both input and target features.

Two types of categorical encoding are available: One-hot and Index.
You don’t have to interact with the encoded representation yourself, since the Platform uses the actual class names to reference model inputs and outputs. However, the type chosen may affect model training.

One-hot categorical encoding

One-hot categorical encoding turns each possible value of a categorical feature into a vector. The vector size is the number of classes for that feature in the dataset. All of its components are 0, except for a single 1 at the component representing that particular class.


Figure 1. One-hot encoding of a categorical feature. Each possible class is given a vector of 0s with a single 1, which the model uses in calculations.
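As a minimal sketch of the idea, the following Python function builds one-hot vectors for a list of values. The function name one_hot_encode is hypothetical, and sorting the classes alphabetically is an assumption made here for reproducibility; this page does not say how the Platform orders the components.

```python
def one_hot_encode(values):
    """Return the class list and one one-hot vector per input value."""
    # Assumption: classes are ordered alphabetically for this illustration.
    classes = sorted(set(values))
    position = {c: i for i, c in enumerate(classes)}

    vectors = []
    for v in values:
        vec = [0] * len(classes)   # vector size = number of classes
        vec[position[v]] = 1       # single 1 at the component for this class
        vectors.append(vec)
    return classes, vectors

classes, vectors = one_hot_encode(["Dress", "Bag", "Dress", "T-shirt"])
print(classes)  # ['Bag', 'Dress', 'T-shirt']
print(vectors)  # [[0, 1, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1]]
```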

When to use one-hot encoding

One-hot encoding is very useful for categorical input.
Since classes are perfectly separated from each other, the model can directly apply different operations to different input examples. It doesn’t need to spend computation just to "recognize" a class based on its value. If an ordering of values turns out to be useful to the model, this ordering can be learned easily from a single dense block.

One-hot encoding is also very common for the target feature in classification problems, since the one-hot vector can be thought of as a probability distribution that can be used with the categorical crossentropy loss function.
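To illustrate why a one-hot target pairs naturally with categorical crossentropy, here is a small sketch of the standard formula, the negative sum of target times log(prediction). The function and the example probabilities are made up for illustration and are not the Platform's implementation.

```python
import math

def categorical_crossentropy(target, predicted):
    """Crossentropy between a one-hot target and predicted probabilities."""
    # With a one-hot target, only the true class contributes to the sum.
    return -sum(t * math.log(p) for t, p in zip(target, predicted) if t > 0)

target = [0, 0, 1]                 # one-hot target: the third class is correct
confident = [0.05, 0.05, 0.90]     # hypothetical prediction, mostly right
unsure = [0.40, 0.40, 0.20]        # hypothetical prediction, mostly wrong

loss_confident = categorical_crossentropy(target, confident)
loss_unsure = categorical_crossentropy(target, unsure)

print(round(loss_confident, 4))  # 0.1054
print(round(loss_unsure, 4))     # 1.6094
```

The loss is small when the predicted probability of the true class is close to 1, and grows as that probability drops.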

If an example can belong to several classes at once, as in multi-label classification problems, or if a category should have a value between 0 and 1, Categorical One-hot encoding is not suitable.
In those cases, use either a single Numeric feature defined by a Numpy array, or a feature set of several Numeric features.
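The sketch below shows what such numeric data might look like. The label names and the helper multi_hot are hypothetical, and plain Python lists stand in for the Numpy array mentioned above.

```python
# Hypothetical label set for a multi-label problem.
labels = ["Bag", "Dress", "T-shirt"]

def multi_hot(present):
    """One numeric value per label: 1.0 if the label applies, else 0.0."""
    return [1.0 if label in present else 0.0 for label in labels]

# An example that belongs to two classes at once has two 1.0 entries,
# which a one-hot vector cannot express.
both = multi_hot({"Bag", "Dress"})
print(both)  # [1.0, 1.0, 0.0]

# Values between 0 and 1 are also plain numeric data.
soft = [0.7, 0.3, 0.0]
print(all(0.0 <= x <= 1.0 for x in soft))  # True
```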

Index categorical encoding

Index categorical encoding turns each possible value of a categorical feature into a single integer value, starting at 0 for the first class encountered in the dataset.


Figure 2. Index encoding of a categorical feature. Each possible class is given an integer value that the model can use and that is compatible, in particular, with the embedding block.
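A minimal sketch of this scheme, assigning indices in order of first appearance as described above, could look as follows. The function name index_encode is hypothetical.

```python
def index_encode(values):
    """Map each class to an integer, starting at 0 for the first class seen."""
    index = {}
    encoded = []
    for v in values:
        if v not in index:
            index[v] = len(index)  # next unused integer
        encoded.append(index[v])
    return index, encoded

index, encoded = index_encode(["Dress", "Bag", "Dress", "T-shirt"])
print(index)    # {'Dress': 0, 'Bag': 1, 'T-shirt': 2}
print(encoded)  # [0, 1, 0, 2]
```

Each example is now a single integer instead of a vector with one component per class.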

When to use index encoding

Index encoding gives a much more compact representation of classes, which may reduce the model input size considerably.
Although indices are inherently ordered, they can be passed to an embedding block to create a vector of continuous values, whose size you can specify.
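Conceptually, an embedding can be pictured as a lookup table with one row of continuous values per class index. The sketch below uses random initial values and hypothetical names (make_embedding, embed); in a real model the rows are learned during training, and the Platform's embedding block may work differently in its details.

```python
import random

def make_embedding(num_classes, size, seed=0):
    """Create a table with one vector of `size` continuous values per class."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.05, 0.05) for _ in range(size)]
            for _ in range(num_classes)]

# Three classes, each mapped to a vector of 4 values (the size you choose).
table = make_embedding(num_classes=3, size=4)

def embed(class_index):
    """Look up the vector for a class index; the index is only a row number."""
    return table[class_index]

vec = embed(2)
print(len(table), len(vec))  # 3 4
```

Because the index is used only to select a row, its numeric order carries no meaning for the model.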

An embedding is sometimes better suited to a problem than a one-hot vector. It is commonly used, for instance, in Natural Language Processing (see the movie review sentiment analysis tutorial).

Note, however, that Categorical encoding treats each feature value as a single object, not as a sequence.
