Binary crossentropy

Binary crossentropy is a loss function used in binary classification tasks. These are tasks that answer a question with only two choices (yes or no, A or B, 0 or 1, left or right). Several such questions can be answered independently at the same time, as in multi-label classification or binary image segmentation.
Formally, this loss is equal to the average of the categorical crossentropy loss on many two-category tasks.

Binary crossentropy math

The binary crossentropy loss function calculates the loss of an example by computing the following average:

\[\mathrm{Loss} = - \frac{1}{\mathrm{output \atop size}} \sum_{i=1}^{\mathrm{output \atop size}} \left[ y_i \cdot \log \hat{y}_i + (1-y_i) \cdot \log (1-\hat{y}_i) \right]\]

where \(\hat{y}_i\) is the \(i\)-th scalar value in the model output, \(y_i\) is the corresponding target value, and output size is the number of scalar values in the model output.
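
As a rough illustration of this formula, the following NumPy sketch computes the loss for a single example. The function name `binary_crossentropy` and the clipping constant `eps` are assumptions made for this example and are not taken from the platform; the clipping only keeps the logarithms finite when a prediction is exactly 0 or 1.

```python
import numpy as np

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean binary crossentropy over all scalar outputs of one example."""
    y_true = np.asarray(y_true, dtype=float)
    # Clip predictions away from 0 and 1 so both logarithms stay finite
    # (eps is an assumed value chosen for this sketch).
    y_pred = np.clip(np.asarray(y_pred, dtype=float), eps, 1.0 - eps)
    per_output = y_true * np.log(y_pred) + (1.0 - y_true) * np.log(1.0 - y_pred)
    # Average over the output size and flip the sign.
    return -per_output.mean()

# Three binary targets and three predicted probabilities.
loss = binary_crossentropy([1, 0, 1], [0.9, 0.2, 0.6])
print(round(loss, 4))
```

Here the output size is 3, and the loss is small because every prediction falls on the correct side of 0.5.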

This is equivalent to the average result of the categorical crossentropy loss function applied to many independent classification problems, each problem having only two possible classes with target probabilities \(y_i\) and \((1-y_i)\).
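
The equivalence can be checked numerically. The sketch below is only an illustration: the helper `categorical_crossentropy` is a hypothetical name, and it treats each output as its own two-class problem with target probabilities \(y_i\) and \((1-y_i)\).

```python
import numpy as np

def categorical_crossentropy(target_probs, predicted_probs):
    """Crossentropy between a target and a predicted class distribution."""
    return -np.sum(target_probs * np.log(predicted_probs))

y_true = np.array([1.0, 0.0, 1.0])
y_pred = np.array([0.9, 0.2, 0.6])

# Treat each output as an independent two-class problem: ("yes", "no").
two_class_losses = [
    categorical_crossentropy(np.array([y, 1.0 - y]), np.array([p, 1.0 - p]))
    for y, p in zip(y_true, y_pred)
]
average_categorical = np.mean(two_class_losses)

# Binary crossentropy computed directly as the average over all outputs.
binary = -np.mean(y_true * np.log(y_pred) + (1.0 - y_true) * np.log(1.0 - y_pred))

print(np.isclose(average_categorical, binary))
```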

How to use binary crossentropy

Binary crossentropy is convenient for training a model to solve many classification problems at the same time, provided each classification can be reduced to a binary choice (e.g., yes or no, A or B, 0 or 1).

Example: The "Build your own music critic" tutorial contains music data and 46 labels such as Happy, Hopeful, Laid back, and Relaxing.
The model uses the binary crossentropy to learn to tag songs with every applicable label.
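
To make the tagging step concrete, here is a minimal sketch of how per-label outputs could be turned into tags. The predicted values are made up for illustration, and the 0.5 threshold is an assumption rather than something the tutorial prescribes.

```python
import numpy as np

# Four of the 46 labels mentioned above.
labels = ["Happy", "Hopeful", "Laid back", "Relaxing"]

# Hypothetical model outputs, one independent probability per label.
predicted = np.array([0.91, 0.40, 0.75, 0.08])

# Assumed cut-off: a label applies when its probability is at least 0.5.
threshold = 0.5

tags = [label for label, p in zip(labels, predicted) if p >= threshold]
print(tags)
```

Because each label is an independent binary question, a song can receive several tags, one tag, or none at all.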

Activation functions

Sigmoid is the only activation function compatible with the binary crossentropy loss function. You must use it on the last block before the target block.

The binary crossentropy needs to compute the logarithms of \(\hat{y}_i\) and \((1-\hat{y}_i)\), which are only defined if \(\hat{y}_i\) lies strictly between 0 and 1. The sigmoid activation function is the only one that guarantees that independent outputs lie within this range.
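
The range property can be illustrated with a short NumPy sketch. The `sigmoid` function below is written out by hand for this example, and the raw output values are arbitrary.

```python
import numpy as np

def sigmoid(x):
    """Logistic sigmoid, applied element-wise."""
    return 1.0 / (1.0 + np.exp(-x))

# Arbitrary raw (pre-activation) outputs, one per binary question.
raw_outputs = np.array([-6.0, -1.0, 0.0, 2.5, 6.0])
probabilities = sigmoid(raw_outputs)

# Every value lies strictly between 0 and 1, so both logarithms exist.
print(probabilities.min() > 0.0, probabilities.max() < 1.0)

# The outputs are independent: they are not forced to sum to 1.
print(probabilities.sum())
```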

Target feature

For the Feature of the target block, use a feature set grouping all the Numeric features that you want your model to predict simultaneously.

Alternatively, you can use a single Numeric feature that holds a NumPy array specifying the value of every label, without having to create a feature set.
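
As a sketch of what such an array might look like, the snippet below encodes the labels that apply to one example as a vector of 0s and 1s. The helper `encode_labels` and the label order are assumptions for illustration; how the array is stored and imported depends on your dataset.

```python
import numpy as np

# Assumed fixed label order; every example must use the same order.
labels = ["Happy", "Hopeful", "Laid back", "Relaxing"]

def encode_labels(applicable, all_labels):
    """Return a float array with 1.0 where a label applies and 0.0 elsewhere."""
    return np.array([1.0 if name in applicable else 0.0 for name in all_labels])

# A song tagged as both Happy and Relaxing.
target = encode_labels({"Happy", "Relaxing"}, labels)
print(target)
```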

Feature set for multi-label classification
Figure 1. Group several numeric features with a feature set, to be used as the model’s target feature

Even though each feature is generally given the value 0 or 1 to mark whether it applies to an example, remember that the values are used as probabilities, so any value between 0 and 1 is allowed.
For instance, the exact probability for Schrödinger’s cat to have the feature "Alive?" is 0.5!
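
A small numerical check shows why non-binary targets are meaningful. In this sketch (the helper name `bce_single` is hypothetical), a target of 0.5 is compared against a grid of candidate predictions, and the loss turns out to be lowest when the prediction matches the target probability.

```python
import numpy as np

def bce_single(y, p):
    """Binary crossentropy for one target y and prediction(s) p."""
    return -(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))

y = 0.5  # soft target: the label applies with probability 0.5

# Candidate predictions 0.05, 0.10, ..., 0.95.
candidates = np.linspace(0.05, 0.95, 19)
losses = bce_single(y, candidates)

# The prediction with the lowest loss is the one equal to the target.
best = candidates[np.argmin(losses)]
print(best)
```

Note that the minimum loss here is \(\log 2\), not 0: with a soft target, even a perfect prediction carries some irreducible loss.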

Rather watch?

In this video Calle explains how to use binary crossentropy.

Binary crossentropy explained
