# Classification loss metrics

## Loss curve

The loss curve is shown just after a model has started training and has completed at least one epoch.

The curve marked Training shows the loss calculated on the training data for each epoch.

The curve marked Validation shows the loss calculated on the validation data for each epoch.

Exactly what the loss curve means depends on which of the loss functions you selected in the Modeling view.

### Suggestions on how to improve

You want the loss to be as low as possible, with the training error slightly lower than the test error.

This table gives you a hint on what you can do if the model under- or overfits.

| | Underfitting | Just right | Overfitting |
|---|---|---|---|
| Symptoms | High train error; training error close to test error; high bias | Training error slightly lower than test error | Very low training error; training error much lower than test error; high variance |
| Loss curve | (figure) | (figure) | (figure) |
| Possible remedies | Create a more complex model; train longer | | Perform regularization; get more data |

Example: If your model has a large discrepancy between training and validation losses, you can try introducing Dropout and/or Batch normalization blocks to improve generalization, that is, the ability to correctly predict new, previously unseen data points.
For image classification, image augmentation may help in some cases.
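
The symptoms in the table can be turned into a rough check on the final training and validation losses. The sketch below is only an illustration: the function name `diagnose_fit` and the thresholds `high_loss` and `max_gap` are assumptions made for this example, not values used by the platform, and sensible thresholds depend on the loss function and the data.

```python
def diagnose_fit(train_loss, val_loss, high_loss=0.5, max_gap=0.1):
    """Rough label for the fit, based on final-epoch losses.

    high_loss and max_gap are made-up thresholds for illustration only.
    """
    gap = val_loss - train_loss
    if train_loss >= high_loss and gap <= max_gap:
        # High training loss, with validation loss close to it.
        return "underfitting"
    if gap > max_gap:
        # Training loss much lower than validation loss.
        return "overfitting"
    return "just right"


print(diagnose_fit(0.90, 0.95))  # underfitting
print(diagnose_fit(0.05, 0.60))  # overfitting
print(diagnose_fit(0.20, 0.25))  # just right
```

In practice it is better to look at the whole curve than at a single epoch, since a gap that keeps growing over the epochs is a stronger overfitting signal than one snapshot.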

The blog post Bias-Variance Tradeoff in Machine Learning is a nice summary of different things to try depending on your problem.

## Accuracy

Accuracy is the ratio of correctly classified examples to the total number of classified examples. Higher accuracy is better.

$\begin{array}{rcl} \text{Accuracy} & = & \dfrac{\text{Number of correct predictions}}{\text{Total number of examples}} \\ &&\\ & = & \dfrac{\text{True positive + True negative}}{\text{True positive + True negative + False positive + False negative}} \end{array}$
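
As a minimal sketch of the formula above, accuracy can be computed directly from two lists of labels. The function name and the example labels are made up for illustration.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("need at least one example")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
print(accuracy(y_true, y_pred))  # 6 of 8 correct -> 0.75
```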

The curve marked Training shows the accuracy calculated on the training data set.

The curve marked Validation shows the accuracy calculated on the validation data set.

A large discrepancy between training and validation accuracy (where the training accuracy is much higher than the validation accuracy) can indicate overfitting, or that the validation data are too different from the training data.

### Suggestions on how to improve

If there is a large discrepancy between training and validation accuracy, try to introduce dropout and/or batch normalization blocks to improve generalization.

If the training accuracy is low, the model is not learning well enough. Try to build a new model or collect more training data.

## AUC — Area under curve

The area under curve (AUC) metric computes the area under a discretized Receiver Operating Characteristic (ROC) curve, which plots the true positive rate against the false positive rate.

The AUC is used to see how well your classifier can separate positive and negative examples.

AUC around 0.5 is the same thing as a random guess. The further away the AUC is from 0.5, the better. If AUC is below 0.5, then you may need to invert the decision your model is making.
AUC = 1 means the prediction model is perfect.
AUC = 0 is good news because you just need to invert your model’s output to obtain a perfect model.
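
One way to build intuition for these values: the area under the ROC curve equals the fraction of (positive, negative) pairs in which the positive example gets the higher score. The sketch below uses that pairwise definition on made-up scores. It is an exact, unoptimized illustration and not necessarily how the platform computes its discretized AUC.

```python
def roc_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Tied scores count as half a correctly ranked pair.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(roc_auc(y_true, scores))                   # 0.75
print(roc_auc(y_true, [1 - s for s in scores]))  # 0.25
```

Inverting the scores turns an AUC of 0.75 into 0.25, which is why an AUC well below 0.5 can be rescued by inverting the model's output.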

This StackExchange post contains a nice explanation of the AUC.

## Gradient norm

Gradient norm indicates how much your model’s weights are adjusted during training. If the gradient norm is high, the weights are being adjusted a lot. If it’s low, the weights are barely changing, which can indicate that the model has reached a local optimum.
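
As a rough illustration, a gradient norm can be computed as the L2 norm over all gradient values in the model. The choice of the L2 norm and the per-layer list layout are assumptions made for this sketch; the source does not state which norm the platform reports.

```python
import math


def global_gradient_norm(gradients):
    """L2 norm over all gradient values, given one list of values per layer."""
    return math.sqrt(sum(g * g for layer in gradients for g in layer))


early_training = [[3.0, -4.0], [12.0]]    # large gradients, large weight updates
late_training = [[0.03, -0.04], [0.12]]   # small gradients, small weight updates

print(global_gradient_norm(early_training))  # 13.0
print(global_gradient_norm(late_training))   # roughly 0.13
```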

## Precision

What proportion of positive predictions was actually correct?

The precision of the model ranges from 0 to 100%, where 100% means that all the samples that the model classifies as the positive class are truly positive.

Example: If a medical test that has high Precision shows that a patient has a disease, there is a high likelihood that the patient does, in fact, have the disease. However, the test could still fail to identify the presence of the disease in many patients! This is because the Precision is only concerned with positive predictions.

Precision is defined as:

$\text{Precision} = \frac{\text{True positive}}{\text{True positive} + \text{False positive}}$

Where
True positive is when actual positive is predicted positive, and
False positive is when actual negative is predicted positive.
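
The medical-test example can be sketched with made-up data: a cautious test that flags only two of five sick patients still reaches perfect precision, because both flagged patients are truly sick. Returning 0.0 when there are no positive predictions is a convention chosen for this sketch.

```python
def precision(y_true, y_pred, positive=1):
    """True positives divided by all positive predictions."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if p == positive and t == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if p == positive and t != positive)
    if tp + fp == 0:
        return 0.0  # no positive predictions; convention chosen for this sketch
    return tp / (tp + fp)


# 10 patients, the first 5 have the disease; the test flags only 2 of them.
y_true = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
print(precision(y_true, y_pred))  # 1.0, although 3 sick patients were missed
```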

## Recall

What proportion of actual positives was predicted correctly?

The recall of the model ranges from 0 to 100%, where 100% means that the model correctly classifies all truly positive samples as the positive class.

Example: A medical test with high Recall will identify a large proportion of the true disease cases. However, the same test might be over-predicting the positive class and give many false positive predictions!

Recall is defined as:

$\text{Recall} = \frac{\text{True positive}}{\text{True positive} + \text{False negative}}$

Where
True positive is when actual positive is predicted positive, and
False negative is when actual positive is predicted negative.
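
A minimal sketch of the opposite situation, again with made-up data: a test that flags almost everyone finds every sick patient, so recall is perfect, even though three of its eight positive predictions are false positives. Returning 0.0 when there are no actual positives is a convention chosen for this sketch.

```python
def recall(y_true, y_pred, positive=1):
    """True positives divided by all actual positives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    if tp + fn == 0:
        return 0.0  # no actual positives; convention chosen for this sketch
    return tp / (tp + fn)


# 10 patients, the first 5 have the disease; the test flags 8 patients.
y_true = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
print(recall(y_true, y_pred))  # 1.0, despite 3 false positives
```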

## Averaging of metrics for multi-class classification models

For models with more than two target classes, the true positives, false positives, and false negatives are computed for each class independently. This is done by considering the class in question as the positive class and all the other classes as the negative class.

In order to compute an overall precision and recall metric over all the classes, we need to average the per-class values. This can be done either by micro-averaging or macro-averaging.

Micro-averaging

Micro-averaging is performed by first calculating the sum of all true positives, false positives, and false negatives, over all the classes. Then we compute the ratio for precision and recall for the sums.

Micro-averaged precision and recall values can be high even if the model is performing very poorly on a rare class since it gives more weight to the common classes.

Macro-averaging

Macro-averaging is performed by first computing precision and recall independently for each class, based on the true positives, false positives, and false negatives per class. Then the overall precision and recall are calculated by averaging the per class precision and recall values.

Macro-averaged precision and recall values will be low if the model is performing poorly on some of the classes (usually the rare classes), since it weighs each class equally regardless of how common the class is.
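
The difference between the two averaging methods is easiest to see on an imbalanced toy data set. The sketch below uses made-up labels where a model always predicts the common class; the helper names are chosen for this example, and a per-class ratio with a zero denominator is treated as 0.0 by convention.

```python
def per_class_counts(y_true, y_pred, classes):
    """Return {class: (tp, fp, fn)}, treating each class as the positive class."""
    counts = {}
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        counts[c] = (tp, fp, fn)
    return counts


def safe_div(numerator, denominator):
    return numerator / denominator if denominator else 0.0


def micro_macro(y_true, y_pred, classes):
    counts = per_class_counts(y_true, y_pred, classes)
    tp_sum = sum(tp for tp, _, _ in counts.values())
    fp_sum = sum(fp for _, fp, _ in counts.values())
    fn_sum = sum(fn for _, _, fn in counts.values())
    return {
        # Micro: sum the counts over all classes first, then take the ratio.
        "micro_precision": safe_div(tp_sum, tp_sum + fp_sum),
        "micro_recall": safe_div(tp_sum, tp_sum + fn_sum),
        # Macro: take the ratio per class first, then average over classes.
        "macro_precision": sum(safe_div(tp, tp + fp) for tp, fp, _ in counts.values()) / len(classes),
        "macro_recall": sum(safe_div(tp, tp + fn) for tp, _, fn in counts.values()) / len(classes),
    }


# 9 "cat" examples and 1 "fox"; the model always predicts "cat".
y_true = ["cat"] * 9 + ["fox"]
y_pred = ["cat"] * 10
print(micro_macro(y_true, y_pred, ["cat", "fox"]))
```

Here the micro-averaged precision and recall are both 0.9, while the macro-averaged precision is 0.45 and the macro-averaged recall is 0.5, because the completely missed "fox" class counts as much as the "cat" class.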

### Single-label multi-class problems

Macro-averaging is the default aggregation method for precision and recall for single-label multi-class problems.

For single-label multi-class problems, micro-averaging would result in both precision and recall being exactly the same as accuracy. That does not provide any additional information about the model’s performance. If the target classes are imbalanced, accuracy is not a very useful metric, since it can be high for models that only focus on correctly predicting the common classes while performing very poorly on the rare classes.
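
The reason micro-averaging collapses to accuracy is that, with exactly one label per example, every wrong prediction adds one false positive (for the predicted class) and one false negative (for the true class). Both denominators therefore equal the total number of examples. A small check on made-up labels, assuming non-empty input:

```python
def micro_precision_recall(y_true, y_pred):
    """Micro-averaged precision and recall for single-label data."""
    classes = set(y_true) | set(y_pred)
    tp = fp = fn = 0
    for c in classes:
        for t, p in zip(y_true, y_pred):
            if p == c and t == c:
                tp += 1
            elif p == c:
                fp += 1  # predicted c, but the true class is another one
            elif t == c:
                fn += 1  # true class is c, but another class was predicted
    return tp / (tp + fp), tp / (tp + fn)


y_true = ["a", "a", "a", "b", "b", "c"]
y_pred = ["a", "a", "b", "b", "c", "c"]

accuracy = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)
micro_p, micro_r = micro_precision_recall(y_true, y_pred)
print(accuracy == micro_p == micro_r)  # True
```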

Macro-averaged precision and recall will be low for models that only perform well on the common classes while performing poorly on the rare classes, and are therefore a useful complement to the overall accuracy. Note that you can always check the precision and recall for each individual class in the Confusion matrix on the Evaluation view.

### Multi-label multi-class problems

Micro-averaging is the default aggregation method for precision and recall for multi-label multi-class problems.

For multi-label multi-class models, micro-averaging of precision and recall already provides additional information compared to the overall accuracy. Micro-averaging will put more emphasis on the common classes in the data set, and for multi-label classification, this is usually the preferred behavior. Labels that are very rare in the data set, e.g., a genre that only represents 0.01% of the data examples, shouldn’t heavily influence the overall precision and recall metric if the model is performing well on the other, more common genres.
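
A minimal sketch of micro-averaging in the multi-label case, where each example carries a set of labels. The genre data is made up for illustration, and returning 0.0 for a zero denominator is a convention chosen here.

```python
def micro_precision_recall_multilabel(true_sets, pred_sets):
    """Micro-averaged precision and recall over sets of labels per example."""
    tp = fp = fn = 0
    for truth, pred in zip(true_sets, pred_sets):
        tp += len(truth & pred)   # labels predicted and actually present
        fp += len(pred - truth)   # labels predicted but not present
        fn += len(truth - pred)   # labels present but not predicted
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# The rare genre "noir" is missed, the common genres are all predicted correctly.
true_sets = [{"drama", "comedy"}, {"drama"}, {"comedy"}, {"drama", "noir"}]
pred_sets = [{"drama", "comedy"}, {"drama"}, {"comedy"}, {"drama"}]

precision, recall = micro_precision_recall_multilabel(true_sets, pred_sets)
print(precision)  # 1.0
print(recall)     # 5 of 6 true labels found
```

Missing the rare genre costs one false negative out of six true labels, so the overall recall drops only modestly; the more common the other genres are, the smaller that effect becomes.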