Focal loss

Use the focal loss function in single-label classification tasks as an alternative to the more commonly used categorical crossentropy.

The focal loss extends categorical crossentropy with an added weighting factor \((1-\hat{y}_i)^\gamma\). With \(\gamma = 0\) the factor equals 1 and the focal loss reduces to categorical crossentropy.

The weighting factor lets the focal loss focus on actual mistakes. When the predicted probability of the ground-truth (correct) class is close to 1, the factor is close to 0 and the sample contributes almost nothing to the loss. When that probability is low, the factor is close to 1 and the loss stays close to categorical crossentropy.
You can see this in the graph: the blue line (categorical crossentropy) lies above the other colored lines (focal loss with different \(\gamma\) values), and the focal loss curves drop toward zero much faster as the probability of the ground-truth class increases.

Focal loss function vs Categorical crossentropy
Figure 1. Illustration from the paper Focal Loss for Dense Object Detection (reference below)
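
To make the effect of the weighting factor concrete, the following minimal sketch compares the two losses for a single sample at a few probabilities of the ground-truth class. The function names `crossentropy_term` and `focal_term` are hypothetical, and \(\gamma = 2\) is only an illustrative choice.

```python
import math

def crossentropy_term(p):
    """Categorical crossentropy for one sample, given the predicted
    probability p of the ground-truth class."""
    return -math.log(p)

def focal_term(p, gamma=2.0):
    """Focal loss for one sample: crossentropy scaled by (1 - p)**gamma.
    gamma=2.0 is an illustrative default, not a recommendation."""
    return (1.0 - p) ** gamma * crossentropy_term(p)

# Hard (p=0.1), uncertain (p=0.5), and easy (p=0.9) samples.
for p in (0.1, 0.5, 0.9):
    ce = crossentropy_term(p)
    fl = focal_term(p, gamma=2.0)
    print(f"p={p:.1f}  crossentropy={ce:.4f}  focal={fl:.4f}  ratio={fl / ce:.2f}")
```

The ratio between the two losses is the weighting factor itself, \((1-p)^\gamma\): about 0.81 for the hard sample, 0.25 for the uncertain one, and 0.01 for the easy one. The easy sample is therefore down-weighted far more strongly than the hard one.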

When to use the focal loss

The focal loss is a good alternative to categorical crossentropy for single-label classification tasks, particularly for problems where:

  • You have an unbalanced dataset.

  • The distinction between classes is not clear in the first place.

For example, it is hard to define the genre of a song distinctly: genre is quite subjective, and some songs include elements from multiple genres. For a genre classification task like this, the focal loss function is a good choice.

Paired activation

Softmax is the only activation recommended for use with the focal loss function. You must use softmax on the last block before the Target block.
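
The reason softmax fits is that the focal loss takes the logarithm of each predicted value and uses it in the factor \((1-\hat{y}_i)\), so the predictions should be probabilities between 0 and 1 that sum to 1. The sketch below shows what softmax does to a vector of raw scores. It is a plain-Python illustration with a hypothetical `softmax` helper, not the platform's implementation.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that are positive and sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Example raw scores for three classes (made-up values).
probs = softmax([2.0, 1.0, 0.1])
print(probs, sum(probs))
```

The largest raw score receives the largest probability, and every output lies strictly between 0 and 1, which keeps both \(\log \hat{y}_i\) and \((1-\hat{y}_i)^\gamma\) well defined.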

Focal loss math

\[\textrm{Loss} = -\sum_{i=1}^{\substack{\textrm{output}\\\textrm{size}}} (1-\hat{y}_i)^\gamma \cdot y_i \cdot \log \hat{y}_i\]


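  • \((1-\hat{y}_i)^\gamma\) is the weighting factor.

  • \(y_i\) is the target value for class \(i\) (1 for the ground-truth class, 0 otherwise).

  • \(\hat{y}_i\) is the predicted value for class \(i\).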

The focusing parameter \(\gamma\) smoothly adjusts the rate at which easy examples are down-weighted.
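
As a rough illustration of how \(\gamma\) changes the result, here is a minimal plain-Python sketch of the focal loss for one sample, written as a positive quantity (the negative of the weighted log-probability sum). The function name `focal_loss`, the clipping constant `eps`, and the example vectors are assumptions made for this sketch.

```python
import math

def focal_loss(y_true, y_pred, gamma=2.0, eps=1e-7):
    """Focal loss for one sample.

    y_true: one-hot target vector.
    y_pred: predicted probabilities (for example, softmax outputs).
    eps:    assumed clipping constant that keeps log() away from 0.
    """
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += (1.0 - p) ** gamma * y * math.log(p)
    return -total  # negate so the loss is non-negative

y_true = [0.0, 1.0, 0.0]
easy = [0.05, 0.90, 0.05]  # confident and correct
hard = [0.60, 0.30, 0.10]  # wrong class gets the highest probability

for gamma in (0.0, 1.0, 2.0, 5.0):
    print(gamma, focal_loss(y_true, easy, gamma), focal_loss(y_true, hard, gamma))
```

With \(\gamma = 0\) the result equals categorical crossentropy. As \(\gamma\) grows, the loss of the easy sample shrinks much faster than the loss of the hard sample, so hard samples make up a larger share of the total loss.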

