Macro F1-score

Macro F1-score (short for macro-averaged F1 score) is used to assess the quality of models on problems with multiple binary labels or multiple classes.

If you are looking to select a model based on a balance between precision and recall, don’t miss out on assessing your F1-scores!

The best possible Macro F1-score is 1, and the worst is 0.

All classes treated equally
Macro F1-score will give the same importance to each label/class. It will be low for models that only perform well on the common classes while performing poorly on the rare classes.
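
As a small worked illustration (the numbers are made up, not taken from a real dataset), assume 100 examples of which 95 belong to a common class and 5 to a rare class, and a model that always predicts the common class. Its accuracy is 0.95, but with the standard per-class definition \(\text{F1-score} = \frac{2\,TP}{2\,TP + FP + FN}\) the Macro F1-score is much lower:

\[\begin{array}{rcl} \text{F1-score}_{\text{common}} & = & \frac{2 \cdot 95}{2 \cdot 95 + 5 + 0} \approx 0.974 \\ \text{F1-score}_{\text{rare}} & = & 0 \quad \text{(no true positives)} \\ \text{Macro F1-score} & = & \frac{0.974 + 0}{2} \approx 0.487 \\ \end{array}\]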


The Macro F1-score is defined as the mean of class-wise/label-wise F1-scores:

\[\begin{array}{rcl} \text{Macro F1-score} & = & \frac{1}{N} \sum_{i=1}^{N} {\text{F1-score}_i} \\ \end{array}\] where i is the class/label index, running from 1 to N, and N is the number of classes/labels.


Macro-averaging is used for models with multiple classes/labels, for example, in our tutorial Self sorting wardrobe. When macro-averaging, all classes contribute equally regardless of how often they appear in the dataset.

The Macro F1-score is computed by first calculating the F1-score for each class/label separately and then taking the unweighted mean of those scores.
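
The two steps can be sketched in plain Python. This is a minimal illustration, not the platform's implementation: the helper names `f1_per_class` and `macro_f1`, the labels `"a"`, `"b"`, `"c"`, and the example data are all hypothetical, and a class with no true positives, false positives, or false negatives is assumed to score 0.

```python
def f1_per_class(y_true, y_pred, labels):
    """Return a dict mapping each label to its one-vs-rest F1-score."""
    scores = {}
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # Assumption: a class with an undefined F1 (denom == 0) counts as 0.
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of the per-class F1-scores."""
    scores = f1_per_class(y_true, y_pred, labels)
    return sum(scores.values()) / len(scores)


# Hypothetical example: class "c" is rare and is never predicted.
labels = ["a", "b", "c"]
y_true = ["a", "a", "a", "a", "b", "b", "c"]
y_pred = ["a", "a", "a", "b", "b", "b", "a"]

print(f1_per_class(y_true, y_pred, labels))  # {'a': 0.75, 'b': 0.8, 'c': 0.0}
print(round(macro_f1(y_true, y_pred, labels), 3))  # 0.517
```

The rare class "c" contributes an F1-score of 0 with the same weight as the other two classes, which pulls the average down to about 0.517 even though five of the seven predictions are correct.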

Macro-averaging is the default aggregation method for F1-score for single-label multi-class problems.
