Categorical encoding
Categorical encoding (onehot or index) can be used to encode features when:

Only discrete values exist, called classes

The values are not ordered, or the order is not relevant to solve the task

A feature has fewer than 2000 classes
Example: The Fashion MNIST tutorial has a dataset with five categories of clothes: Tshirt, Trouser, Bag, Dress, and Ankle boot.
If you treat these categories as numbers from 0 to 4, you will implicitly impose constraints on your model.
For instance, the model will consider that Ankle boot (4) must be more similar to Dress (3) than to Tshirt (0).
It will also consider that Trouser (1) must be a larger thing than Tshirt (0).
Such distinctions are meaningless, but will affect the numerical solutions that the model can provide.
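The effect can be checked with a few lines of plain Python. This is only an illustrative sketch: the helper names `onehot` and `squared_distance` are made up for this example and are not part of the Platform.

```python
# Class names from the example above, in index order 0..4.
classes = ["Tshirt", "Trouser", "Bag", "Dress", "Ankle boot"]

def onehot(index, size):
    """Return a list of zeros with a single 1 at position `index` (hypothetical helper)."""
    return [1 if i == index else 0 for i in range(size)]

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

# Treated as plain numbers, the classes sit at different distances from each other.
int_boot_dress = (4 - 3) ** 2    # Ankle boot vs Dress
int_boot_tshirt = (4 - 0) ** 2   # Ankle boot vs Tshirt

# As onehot vectors, every pair of distinct classes is equally far apart.
oh_boot_dress = squared_distance(onehot(4, 5), onehot(3, 5))
oh_boot_tshirt = squared_distance(onehot(4, 5), onehot(0, 5))

print(int_boot_dress, int_boot_tshirt)  # distances differ
print(oh_boot_dress, oh_boot_tshirt)    # distances are identical
```

With numeric codes, Ankle boot ends up closer to Dress than to Tshirt purely because of the numbering. With onehot vectors, no such artificial similarity exists.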
Two types of categorical encoding are available: Onehot and Index.
You don’t have to interact with the encoded representation yourself, since the Platform uses the actual class names to reference model inputs and outputs.
However, the type chosen may affect model training.
Onehot categorical encoding
Onehot categorical encoding turns each possible value of a categorical feature into a vector.
The vector size is the number of classes, for that feature, in the dataset.
Its components are all 0, except for a single 1 at the component representing a particular class.
Example:
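A minimal sketch of the idea in plain Python, not the Platform's implementation. The function name `onehot_encode` and the choice to order classes by first appearance are assumptions made for illustration.

```python
def onehot_encode(values):
    """Map each value to a onehot vector (hypothetical helper).

    Assumption: classes are ordered by first appearance in `values`.
    """
    classes = list(dict.fromkeys(values))  # unique values, order preserved
    position = {name: i for i, name in enumerate(classes)}
    vectors = [
        [1 if i == position[value] else 0 for i in range(len(classes))]
        for value in values
    ]
    return classes, vectors

column = ["Dress", "Bag", "Dress", "Tshirt"]
classes, vectors = onehot_encode(column)

print(classes)  # ['Dress', 'Bag', 'Tshirt']
print(vectors)  # [[1, 0, 0], [0, 1, 0], [1, 0, 0], [0, 0, 1]]
```

Each vector has as many components as there are classes in the column, and both occurrences of Dress map to the same vector.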
When to use onehot encoding
Onehot encoding is very useful for categorical input.
Since classes are perfectly separated from each other, the model can directly apply different operations to different input examples.
It doesn’t need to spend computation just to "recognize" a class based on its value.
If an ordering of values turns out to be useful to the model, this ordering can be learned easily from a single dense block.
Onehot encoding is also very common for the target feature in classification problems, since the onehot vector can be thought of as a probability distribution that can be used with the categorical crossentropy loss function.
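To make the link with the loss function concrete, here is a small sketch of categorical crossentropy computed by hand. The function below is written for illustration only. The predicted distributions are invented numbers, and the small `eps` guard against `log(0)` is an implementation choice, not something the text above prescribes.

```python
import math

def categorical_crossentropy(target, predicted, eps=1e-12):
    """Cross-entropy between a onehot target and a predicted distribution."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(target, predicted))

target = [0, 0, 1, 0, 0]  # onehot vector: the true class is at position 2

# Two made-up model outputs, each summing to 1.
confident = [0.05, 0.05, 0.80, 0.05, 0.05]
mistaken = [0.40, 0.30, 0.10, 0.10, 0.10]

loss_confident = categorical_crossentropy(target, confident)  # equals -log(0.80)
loss_mistaken = categorical_crossentropy(target, mistaken)    # equals -log(0.10)

print(loss_confident < loss_mistaken)  # True
```

Because the target is onehot, only the predicted probability of the true class contributes to the loss, so the loss shrinks as that probability approaches 1.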
If an example can belong to several classes at once, as in multilabel classification problems, or if a category should have a value between 0 and 1, the Categorical Onehot encoding is not suitable.
In those cases, use either a single Numeric feature defined by a NumPy array, or a feature set of several Numeric features.
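As a rough illustration of what such a numeric vector could look like, the sketch below builds a "multi-hot" vector in plain Python. The helper name `multi_hot` and the label list are assumptions for this example, and the resulting list would still need to be stored in whatever array format your dataset uses.

```python
labels = ["Tshirt", "Trouser", "Bag", "Dress", "Ankle boot"]

def multi_hot(present, labels):
    """Return 1.0 for every label in `present`, 0.0 elsewhere (hypothetical helper)."""
    present = set(present)
    unknown = present - set(labels)
    if unknown:
        raise ValueError(f"unknown labels: {sorted(unknown)}")
    return [1.0 if name in present else 0.0 for name in labels]

# An example that belongs to two classes at once.
both = multi_hot(["Bag", "Dress"], labels)
print(both)  # [0.0, 0.0, 1.0, 1.0, 0.0]

# Values between 0 and 1 are also plain numbers, so they fit the same layout.
soft = [0.0, 0.0, 0.7, 0.3, 0.0]
```

Unlike a onehot vector, `both` contains more than one non-zero component, which is why it has to be handled as numeric data rather than as a single category.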
Index categorical encoding
Index categorical encoding turns each possible value of a categorical feature into a single integer value, starting at 0 for the first class encountered in the dataset.
Example:
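A minimal sketch of the idea in plain Python, not the Platform's implementation. It follows the description above, assigning 0 to the first class encountered. The function name `index_encode` is an assumption made for illustration.

```python
def index_encode(values):
    """Assign integers to classes in order of first appearance (hypothetical helper)."""
    mapping = {}
    encoded = []
    for value in values:
        if value not in mapping:
            mapping[value] = len(mapping)  # next unused index
        encoded.append(mapping[value])
    return mapping, encoded

column = ["Dress", "Bag", "Dress", "Tshirt"]
mapping, encoded = index_encode(column)

print(mapping)  # {'Dress': 0, 'Bag': 1, 'Tshirt': 2}
print(encoded)  # [0, 1, 0, 2]
```

Each example is represented by a single integer, regardless of how many classes the feature has.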
When to use index encoding
Index encoding gives a much more compact representation of classes, which may reduce the model input size considerably.
Although indices are inherently ordered, they can be passed to an embedding block to create a vector of continuous values, whose size you can specify.
An embedding is sometimes better suited to a problem than a onehot vector. It is commonly used, for instance, in Natural Language Processing (see the movie review sentiment analysis tutorial).
However, the Categorical encoding treats feature values as single objects and not sequences.
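Conceptually, an embedding is a lookup table with one row per class, where a single index selects a single vector. The sketch below shows the mechanics in plain Python. It is not the Platform's embedding block: the function names are hypothetical, and the random values stand in for weights that would normally be learned during training.

```python
import random

def make_embedding_table(num_classes, dim, seed=0):
    """Create a table of small random vectors, one row per class (stand-in for learned weights)."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.05, 0.05) for _ in range(dim)] for _ in range(num_classes)]

def embed(index, table):
    """Look up the vector for one class index."""
    return table[index]

num_classes, dim = 5, 3  # 5 classes, embedding size 3 (both chosen arbitrarily)
table = make_embedding_table(num_classes, dim)

vector = embed(4, table)  # one index in, one vector of size `dim` out

# The same result as multiplying the onehot vector by the table,
# without ever building the onehot vector as model input.
onehot = [1 if i == 4 else 0 for i in range(num_classes)]
product = [sum(onehot[i] * table[i][j] for i in range(num_classes)) for j in range(dim)]

print(len(vector))  # 3
```

The embedding size is independent of the number of classes, which is what keeps the representation compact when a feature has many classes.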