Categorical encoding can be used to encode features when:
Only discrete values exist, called classes
The values are not ordered, or the order is not relevant to solve the task
The feature has fewer than 2000 classes
Example: The Fashion MNIST tutorial has a dataset with five categories of clothes.
If you treat these categories as numbers from 0 to 4, you implicitly impose constraints on your model.
For instance, the model will consider that Ankle boot (4) must be more similar to Dress (3) than to a class with a lower number, such as Trouser (1).
It will also consider that Trouser (1) must be a larger thing than the class numbered 0.
Such distinctions are meaningless, but they will affect the numerical solutions that the model can provide.
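The effect of treating classes as numbers can be sketched with a few lines of NumPy. This is only an illustration, not how the Platform computes anything: the helper names are hypothetical, and only the three class codes quoted in the example are used. With plain numbers, some classes end up closer to each other than others; with one-hot vectors, every pair of distinct classes is equally far apart.

```python
import numpy as np

# Class codes quoted in the example above (the remaining classes are omitted).
codes = {"Trouser": 1, "Dress": 3, "Ankle boot": 4}
N_CLASSES = 5  # the example dataset has five categories


def integer_distance(a, b):
    """Distance between two classes when they are treated as plain numbers."""
    return abs(codes[a] - codes[b])


def one_hot(index, n_classes=N_CLASSES):
    """Vector of zeros with a single 1 at the given class index."""
    vec = np.zeros(n_classes)
    vec[index] = 1.0
    return vec


def one_hot_distance(a, b):
    """Euclidean distance between the one-hot vectors of two classes."""
    return float(np.linalg.norm(one_hot(codes[a]) - one_hot(codes[b])))


# As numbers, Ankle boot looks closer to Dress than to Trouser.
print(integer_distance("Ankle boot", "Dress"))    # 1
print(integer_distance("Ankle boot", "Trouser"))  # 3

# As one-hot vectors, both pairs are at the same distance (the square root of 2).
print(one_hot_distance("Ankle boot", "Dress"))
print(one_hot_distance("Ankle boot", "Trouser"))
```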
Two types of categorical encoding are available: One-hot and Index.
You don’t have to interact with the encoded representation yourself, since the Platform uses the actual class names to reference model inputs and outputs. However, the type chosen may affect model training.
One-hot categorical encoding
One-hot categorical encoding turns each possible value of a categorical feature into a vector.
The vector size is the number of classes for that feature in the dataset.
Its components are all 0, except for a single 1 at the component representing a particular class.
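A minimal sketch of this encoding in NumPy follows. The function name `one_hot_encode` is hypothetical, and the choice to order classes by first appearance is an assumption made for the sketch; the Platform handles the encoding for you and may order classes differently.

```python
import numpy as np


def one_hot_encode(values):
    """Return (classes, matrix) where each row of matrix is a one-hot vector.

    Assumption: classes are ordered by first appearance in `values`.
    """
    classes = list(dict.fromkeys(values))           # unique values, first-seen order
    index = {c: i for i, c in enumerate(classes)}   # class name -> vector position
    columns = np.array([index[v] for v in values], dtype=int)

    encoded = np.zeros((len(values), len(classes)), dtype=np.float32)
    encoded[np.arange(len(values)), columns] = 1.0  # a single 1 per row
    return classes, encoded


classes, encoded = one_hot_encode(["Dress", "Trouser", "Dress", "Ankle boot"])
print(classes)   # ['Dress', 'Trouser', 'Ankle boot']
print(encoded)
```

Each row has as many components as there are classes, and exactly one of them is 1.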
When to use one-hot encoding
One-hot encoding is very useful for categorical input.
Since classes are perfectly separated from each other, the model can directly apply different operations to different input examples. It doesn’t need to spend computation just to "recognize" a class based on its value. If an ordering of values turns out to be useful to the model, this ordering can be learned easily from a single dense block.
One-hot encoding is also very common for the target feature in classification problems, since the one-hot vector can be thought of as a probability distribution that can be used with the categorical crossentropy loss function.
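To illustrate why a one-hot target pairs naturally with categorical crossentropy, here is a small NumPy sketch of the standard formula, the negative sum of target times log(prediction). The function name, the `eps` clipping value, and the example predictions are assumptions for illustration, not the Platform's implementation.

```python
import numpy as np


def categorical_crossentropy(target_one_hot, predicted_probs, eps=1e-12):
    """Crossentropy between a one-hot target and a predicted distribution."""
    predicted_probs = np.clip(predicted_probs, eps, 1.0)  # avoid log(0)
    return float(-np.sum(target_one_hot * np.log(predicted_probs)))


# One-hot target for the class at index 3 out of five classes.
target = np.array([0.0, 0.0, 0.0, 1.0, 0.0])

# Two hypothetical model outputs, each summing to 1.
confident = np.array([0.02, 0.02, 0.02, 0.92, 0.02])
unsure = np.array([0.2, 0.2, 0.2, 0.2, 0.2])

loss_confident = categorical_crossentropy(target, confident)
loss_unsure = categorical_crossentropy(target, unsure)
print(loss_confident, loss_unsure)  # the confident, correct prediction has the lower loss
```

Because the target has a single 1, only the probability predicted for the true class contributes to the loss.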
If an example can belong to several classes at once, as in multi-label classification problems, or if a category should have a value between 0 and 1, the Categorical One-hot encoding is not suitable.
In those cases, use either a single Numeric feature defined by a NumPy array, or a feature set of several Numeric features.
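As a rough sketch of what such a numeric array might contain, the snippet below builds a "multi-hot" vector for an example that belongs to several classes, and a vector of fractional values. The class names and the helper `multi_hot` are hypothetical, and how the array is then uploaded to the Platform is not covered here.

```python
import numpy as np

# Hypothetical label set for a multi-label problem.
classes = ["red", "green", "blue"]


def multi_hot(labels, classes):
    """Numeric vector with a 1 for every class the example belongs to."""
    vec = np.zeros(len(classes), dtype=np.float32)
    for label in labels:
        vec[classes.index(label)] = 1.0
    return vec


# An example that is both red and blue: more than one component is 1,
# so this is no longer a valid one-hot vector.
both = multi_hot(["red", "blue"], classes)
print(both)  # [1. 0. 1.]

# Values between 0 and 1 can be stored directly in the same kind of array.
soft = np.array([0.7, 0.3, 0.0], dtype=np.float32)
```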
Index categorical encoding
Index categorical encoding turns each possible value of a categorical feature into a single integer value, starting at 0 for the first class encountered in the dataset.
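The rule described above (0 for the first class encountered, then increasing integers) can be sketched in plain Python. The function name `index_encode` is hypothetical, and the sketch only mirrors the description; it is not the Platform's code.

```python
def index_encode(values):
    """Map each class to an integer, numbering classes in order of first appearance."""
    index = {}      # class name -> integer
    encoded = []
    for value in values:
        if value not in index:
            index[value] = len(index)  # next unused integer, starting at 0
        encoded.append(index[value])
    return index, encoded


index, encoded = index_encode(["Dress", "Trouser", "Dress", "Ankle boot"])
print(index)    # {'Dress': 0, 'Trouser': 1, 'Ankle boot': 2}
print(encoded)  # [0, 1, 0, 2]
```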
When to use index encoding
Index encoding gives a much more compact representation of classes, which may reduce the model input size considerably.
Although indices are inherently ordered, they can be passed to an embedding block to create a vector of continuous values, whose size you can specify.
An embedding is sometimes better suited to a problem than a one-hot vector. Embeddings are commonly used, for instance, in Natural Language Processing (see the movie review sentiment analysis tutorial).
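Conceptually, an embedding is a lookup table with one row of continuous values per class index. The NumPy sketch below uses a randomly initialized table as a stand-in for weights that a model would learn during training; the sizes and names are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

n_classes = 5        # number of distinct class indices
embedding_size = 3   # vector size, chosen by the user

# Stand-in for learned weights: one row per class index.
embedding_table = rng.normal(size=(n_classes, embedding_size))


def embed(indices):
    """Replace each class index with its row of the embedding table."""
    return embedding_table[np.asarray(indices, dtype=int)]


indices = [4, 3, 4]
vectors = embed(indices)
print(vectors.shape)  # (3, 3)

# The lookup gives the same result as multiplying one-hot vectors by the table,
# without ever building the larger one-hot input.
one_hot = np.eye(n_classes)[indices]
same = np.allclose(vectors, one_hot @ embedding_table)
```

The same index always maps to the same vector, and the order of the indices themselves plays no role in the lookup.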