Categorical crossentropy

Categorical crossentropy is a loss function that is used in multi-class classification tasks. These are tasks where an example can only belong to one out of many possible categories, and the model must decide which one.
Formally, it is designed to quantify the difference between two probability distributions.


Categorical crossentropy math

The categorical crossentropy loss function calculates the loss of an example by computing the following sum:

\[\mathrm{Loss} = -\sum_{i=1}^{\text{output size}} y_i \cdot \log \hat{y}_i\]

where \(\hat{y}_i\) is the \(i\)-th scalar value in the model output, \(y_i\) is the corresponding target value, and output size is the number of scalar values in the model output.

This loss is a very good measure of how distinguishable two discrete probability distributions are from each other. In this context, \(y_i\) is the probability that event \(i\) occurs and the sum of all \(y_i\) is 1, meaning that exactly one event may occur.
The minus sign ensures that the loss gets smaller when the distributions get closer to each other.
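
To make the formula concrete, here is a minimal NumPy sketch that evaluates the sum for a one-hot target. The function name and the small clipping constant used to avoid log(0) are illustrative choices, not part of any particular library.

```python
import numpy as np

def categorical_crossentropy(y_true, y_pred, eps=1e-12):
    """Loss = -sum_i y_i * log(y_hat_i); clip to keep log(0) out of the sum."""
    y_pred = np.clip(y_pred, eps, 1.0)
    return -np.sum(y_true * np.log(y_pred))

# One-hot target: the example belongs to class 2 (out of 4).
y_true = np.array([0.0, 0.0, 1.0, 0.0])

# A prediction close to the target gives a small loss ...
print(categorical_crossentropy(y_true, np.array([0.05, 0.05, 0.85, 0.05])))  # ~0.16
# ... while a prediction that favours the wrong class gives a large one.
print(categorical_crossentropy(y_true, np.array([0.70, 0.10, 0.10, 0.10])))  # ~2.30
```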

How to use categorical crossentropy

The categorical crossentropy is well suited to classification tasks, since an example can be considered to belong to a specific category with probability 1 and to all other categories with probability 0.

Example: The MNIST number recognition tutorial, where you have images of the digits 0, 1, 2, 3, 4, 5, 6, 7, 8, and 9.
The model uses the categorical crossentropy loss to learn to assign a high probability to the correct digit and low probabilities to the other digits.
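
As an illustration of how this looks in code (using Keras here as one possible framework, not the only option), a one-hot target for the digit 7 is compared against a softmax output that concentrates most of its probability mass on that digit:

```python
import numpy as np
import tensorflow as tf

# One-hot target for the digit 7 among the ten classes 0-9.
y_true = np.zeros((1, 10), dtype="float32")
y_true[0, 7] = 1.0

# A softmax-style output that puts most of the probability mass on 7.
y_pred = np.full((1, 10), 0.02, dtype="float32")
y_pred[0, 7] = 0.82

loss = tf.keras.losses.CategoricalCrossentropy()(y_true, y_pred)
print(float(loss))  # approximately -log(0.82), i.e. about 0.20
```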

Activation functions

Softmax is the only activation function recommended for use with the categorical crossentropy loss function.

Strictly speaking, the output of the model only needs to be positive so that the logarithm of every output value \(\hat{y}_i\) exists. However, the main appeal of this loss function is for comparing two probability distributions. The softmax activation rescales the model output so that every value is positive and all values sum to 1, which is exactly what a probability distribution requires.
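
A minimal sketch of softmax (assuming raw, unnormalized model outputs, often called logits) shows why it has the right properties: every rescaled value is strictly positive and the values sum to 1.

```python
import numpy as np

def softmax(z):
    """Rescale raw model outputs into a valid probability distribution."""
    z = z - np.max(z)        # subtract the max for numerical stability
    e = np.exp(z)
    return e / np.sum(e)

logits = np.array([2.0, 1.0, 0.1])
probs = softmax(logits)
print(probs)           # every value is strictly positive ...
print(probs.sum())     # ... and they sum to 1, so log(y_hat_i) is always defined
```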

Target feature

Use a single Categorical feature as target.
This will automatically create a one-hot vector from all the categories identified in the dataset. Each one-hot vector can be thought of as a probability distribution, so by learning to predict it, the model learns to output the probability that an example belongs to each of the categories.

Figure 1. One-hot encoding of categorical features. Categorical features are one-hot encoded under the hood, which makes them directly appropriate to use with the categorical crossentropy loss function.
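
For illustration, a one-hot vector can also be built by hand, as in the sketch below; the category names are made up, and on the platform this encoding happens automatically.

```python
import numpy as np

categories = ["cat", "dog", "horse"]                  # categories identified in the dataset
index = {name: i for i, name in enumerate(categories)}

def one_hot(label):
    """Turn a category label into a distribution with all probability mass on one class."""
    vec = np.zeros(len(categories))
    vec[index[label]] = 1.0
    return vec

print(one_hot("dog"))   # [0. 1. 0.]
```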

Alternatively, you can use a Numeric feature containing a NumPy array to specify any probability distribution as the target.
This can be useful if you want your model to predict an arbitrary probability distribution, or if you want to implement label smoothing.
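
For example, label smoothing can be implemented by moving a small fraction of the probability mass from the true class to all classes; the function name and the smoothing factor below are illustrative.

```python
import numpy as np

def smooth_labels(one_hot, alpha=0.1):
    """Move a fraction alpha of the probability mass from the true class
    to a uniform distribution over all classes."""
    n_classes = one_hot.shape[-1]
    return one_hot * (1.0 - alpha) + alpha / n_classes

y = np.array([0.0, 0.0, 1.0, 0.0])
print(smooth_labels(y))   # [0.025 0.025 0.925 0.025] - still sums to 1
```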
