Measure performance when working with imbalanced data

When working with imbalanced data:

  • Evaluate on:

    • Macro-precision

    • Macro-recall

  • Macro-F1

  • Evaluate with the confusion matrix.

  • Do not use loss or accuracy to compare experiments.

Evaluate on macro-precision, -recall, and -F1

The Evaluation view shows in several ways how the training of the model has progressed. When working with imbalanced datasets, you should look at the macro scores for precision, recall, and F1-score.

What’s a good score? It depends.

So, what is a good score? Well, of course it depends. Ideally, you should associate a cost with each type of mistake and optimize for that. If a false positive costs you 5 SEK and a false negative costs you 500 SEK, you'll want a model with very few false negatives.

Note that this is different from minimizing the false negative percentage: with imbalanced classes, an unbiased model will produce many more false positives than false negatives, simply because there are many more real negatives than real positives.
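The cost-based comparison above can be sketched in a few lines. The false positive and false negative counts below are hypothetical, chosen only to illustrate the idea; the per-mistake costs are the 5 SEK and 500 SEK figures from the text.

```python
# Compare two hypothetical models by the total cost of their mistakes.
COST_FP = 5    # SEK per false positive (from the example above)
COST_FN = 500  # SEK per false negative

def total_cost(false_positives, false_negatives):
    """Total cost of a model's mistakes on an evaluation set."""
    return false_positives * COST_FP + false_negatives * COST_FN

# Hypothetical counts: model A makes more false positives,
# model B makes more false negatives.
cost_a = total_cost(false_positives=200, false_negatives=3)   # 200*5 + 3*500
cost_b = total_cost(false_positives=20, false_negatives=30)   # 20*5 + 30*500

print(cost_a)  # 2500
print(cost_b)  # 15100
```

With these costs, model A is far cheaper despite making ten times as many false positives, because false negatives dominate the bill.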

Macro vs. micro

In this context, it is useful to distinguish between micro and macro averaging. These concepts are key to measuring performance correctly.

  • A micro-average of some measure will aggregate the contributions of all classes to compute the average measure.

  • A macro-average of some measure will instead compute the measure independently for each class and then take the average.

Hence, the macro-average gives every class the same importance and therefore better reflects how well the model performs when you want it to perform well on ALL classes, including the minority classes.

The micro-average, instead, gives the same importance to each sample. This means that the more samples a class has, the more impact it has on the final score, which favors majority classes.
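The difference between the two averages can be shown on a small toy dataset. This is a minimal sketch using plain Python (class labels and the helper function below are made up for illustration): per-class recall is computed independently, then macro-averaged, while the micro-average pools all samples.

```python
def per_class_recall(y_true, y_pred):
    """Recall for each class: correct predictions / actual samples of that class."""
    classes = sorted(set(y_true))
    recalls = {}
    for c in classes:
        preds_for_class = [p for t, p in zip(y_true, y_pred) if t == c]
        recalls[c] = sum(p == c for p in preds_for_class) / len(preds_for_class)
    return recalls

# Imbalanced toy data: 8 negatives (0), 2 positives (1).
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 1, 0]  # the model misses one positive

recalls = per_class_recall(y_true, y_pred)           # {0: 1.0, 1: 0.5}
macro_recall = sum(recalls.values()) / len(recalls)  # (1.0 + 0.5) / 2 = 0.75
micro_recall = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)  # 9/10

print(macro_recall)  # 0.75 -- the missed positive drags the score down
print(micro_recall)  # 0.9  -- dominated by the majority class
```

One missed positive halves the minority-class recall and clearly lowers the macro score, while the micro score barely moves: exactly the behavior described above.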

Evaluate with the confusion matrix

The goal must be to catch all frauds and let nothing slip through, i.e., you want as many true positives and as few false negatives as possible.

You can see this in the confusion matrix (located in the Predictions inspection tab). You want a high number in the bottom right corner (true positives) and a low one in the bottom left corner (false negatives). A large number in the top right corner (false positives) is also bad.

Figure 1. Confusion matrix for the fraud detection problem. You want a high number in the bottom right corner and a low in the bottom left corner.
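To make the corner positions concrete, here is a minimal sketch of a 2×2 confusion matrix built by hand (the labels and helper function are made up for illustration, with class 1 meaning fraud): rows are the actual class, columns the predicted class, following the layout described above.

```python
def confusion_matrix_2x2(y_true, y_pred):
    """Rows = actual class, columns = predicted class (0 = legit, 1 = fraud)."""
    m = [[0, 0], [0, 0]]
    for t, p in zip(y_true, y_pred):
        m[t][p] += 1
    return m

y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 1, 0]

m = confusion_matrix_2x2(y_true, y_pred)
# m[0][0] = true negatives  (top left)      m[0][1] = false positives (top right)
# m[1][0] = false negatives (bottom left)   m[1][1] = true positives  (bottom right)
print(m)  # [[5, 1], [1, 3]]
```

Here 3 of the 4 frauds are caught (bottom right), one slips through (bottom left), and one legitimate transaction is flagged by mistake (top right).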

Don’t evaluate on loss or accuracy

Don't use loss or accuracy to compare experiments. It is not unusual to observe a high evaluation accuracy when testing a classification model trained on very imbalanced data. In such cases, the accuracy merely reflects the underlying class distribution. You want to avoid that!

For example, if only 5% of all houses are affected by water damage, we can construct a model that guesses that no house ever gets water damage and still obtain an accuracy of 95%. While 95% is a pleasantly high proportion, this model will not do what it is intended to do, i.e., distinguish houses that get water damage from those that don't.
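The water-damage example above can be verified in a few lines. This sketch uses the same made-up 5% class distribution and an "always predict no damage" model:

```python
# 5% of houses have water damage (label 1); the model always predicts "no damage".
y_true = [1] * 5 + [0] * 95
y_pred = [0] * 100

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
# Recall on the class we actually care about: damaged houses found / damaged houses.
recall_damage = sum(p == 1 for t, p in zip(y_true, y_pred) if t == 1) / 5

print(accuracy)       # 0.95 -- looks great
print(recall_damage)  # 0.0  -- the model never finds a single damaged house
```

The 95% accuracy is exactly the majority-class proportion, while recall on the minority class is zero, which is why the macro scores discussed above are the safer comparison.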

Test it on the platform

In the Use AI to detect fraud tutorial, we’ll show you how to use class weights.
