Categorical crossentropy is a loss function used for single-label classification. This is when exactly one category applies to each data point. In other words, an example can belong to one class only.
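The loss for a single example is

$$L(y, \hat{y}) = -\sum_{i=1}^{C} y_i \log(\hat{y}_i)$$

where ŷ is the predicted value, y is the true (target) value, and C is the number of classes.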
Categorical crossentropy will compare the distribution of the predictions (the activations in the output layer, one for each class) with the true distribution, where the probability of the true class is set to 1 and 0 for the other classes. To put it in a different way, the true class is represented as a one-hot encoded vector, and the closer the model’s outputs are to that vector, the lower the loss.
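The comparison described above can be sketched in a few lines of plain Python. This is a minimal illustration, not a library implementation: it assumes the usual definition of the loss (the negative sum over classes of the true value times the log of the predicted value), and the function name, the `eps` clamp, and the example vectors are all made up for this sketch.

```python
import math

def categorical_crossentropy(y_true, y_pred, eps=1e-12):
    """Loss for one example: -sum_i y_true[i] * log(y_pred[i])."""
    # Clamp predictions away from 0 so log() never receives 0
    # (eps is an assumption of this sketch, not part of the definition).
    return -sum(t * math.log(max(p, eps)) for t, p in zip(y_true, y_pred))

# One-hot target: the true class is index 2.
y_true = [0.0, 0.0, 1.0]

# Two hypothetical model outputs (each sums to 1).
confident = [0.05, 0.05, 0.90]   # most probability on the true class
wrong = [0.70, 0.20, 0.10]       # most probability on another class

loss_confident = categorical_crossentropy(y_true, confident)  # -ln(0.9), about 0.105
loss_wrong = categorical_crossentropy(y_true, wrong)          # -ln(0.1), about 2.303
print(loss_confident, loss_wrong)
```

The prediction that sits closer to the one-hot vector gets the lower loss, which is the behaviour the paragraph describes.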
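Use categorical crossentropy together with the Softmax activation function in the output layer. Softmax turns the raw outputs into a probability distribution over the classes (non-negative values that sum to 1), which is what the loss expects.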
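To illustrate why Softmax and categorical crossentropy are paired, here is a small sketch in plain Python. It assumes the standard Softmax definition (exponentiate, then normalize); the `logits` values are hypothetical raw outputs of a final layer, and subtracting the maximum is a common numerical-stability trick rather than something the text above prescribes.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that are positive and sum to 1."""
    # Subtracting the maximum leaves the result unchanged
    # but avoids overflow in exp() for large scores.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]      # hypothetical raw outputs, one per class
probs = softmax(logits)       # valid probability distribution

y_true = [1.0, 0.0, 0.0]      # one-hot target: class 0 is correct
loss = -sum(t * math.log(p) for t, p in zip(y_true, probs))
print(probs, sum(probs), loss)
```

Because every Softmax output is strictly positive, the logarithm in the loss is always defined here.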
Use categorical crossentropy in classification problems where only one result can be correct.
Example: In the MNIST problem, you have images of the digits 0, 1, 2, 3, 4, 5, 6, 7, 8 and 9. The Softmax output layer gives the probability that an image shows, for example, a 4 or a 9, and categorical crossentropy measures how far those probabilities are from the true label.
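The digit example can be made concrete with a short sketch. The probabilities below are invented for illustration (a model that hesitates between 4 and 9), not the output of a real MNIST model, and `one_hot` is a hypothetical helper. The sketch also shows that with a one-hot target the sum collapses to the negative log of the probability assigned to the true digit.

```python
import math

NUM_CLASSES = 10  # digits 0 through 9

def one_hot(label, num_classes=NUM_CLASSES):
    """Return a vector with 1.0 at the true class and 0.0 elsewhere."""
    return [1.0 if i == label else 0.0 for i in range(num_classes)]

true_digit = 4
y_true = one_hot(true_digit)

# Hypothetical Softmax output: 0.6 on "4", 0.3 on "9",
# and the remaining 0.1 spread evenly over the other eight digits.
y_pred = [0.0125] * NUM_CLASSES
y_pred[4] = 0.6
y_pred[9] = 0.3

# Full sum over all ten classes ...
full_sum = -sum(t * math.log(p) for t, p in zip(y_true, y_pred))
# ... equals the shortcut, because only the true class has y_true = 1.
shortcut = -math.log(y_pred[true_digit])   # -ln(0.6), about 0.511

print(full_sum, shortcut)
```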