Classification loss metrics
Loss curve
The loss curve is shown once a model has started training and has completed at least one epoch.
The curve marked Training shows the loss calculated on the training data for each epoch.
The curve marked Validation shows the loss calculated on the validation data for each epoch.
Exactly what the loss curve means depends on which of the loss functions you selected in the Modeling view.
Suggestions on how to improve
You want the loss to be as low as possible, and you want the training error to be slightly lower than the test error.
This table gives you a hint of what you can do if the model underfits or overfits.
Symptoms
- Underfitting: the training error is high, and the validation error is close to the training error.
- Just right: the training error is slightly lower than the test error.
- Overfitting: the training error is much lower than the validation error.

Loss curve
- (Illustrative loss curves for the underfitting, just right, and overfitting cases are shown here.)

Possible remedies
- Underfitting: increase the capacity of the model, for example by adding more blocks or nodes, and train for more epochs.
- Overfitting: add regularization, for example Dropout or Batch normalization blocks, collect more training data, or use data augmentation.
Example: If your model has a large discrepancy between training and validation losses, you can try to introduce Dropout and/or Batch normalization blocks to improve generalization, that is, the ability to correctly predict new, previously unseen data points.
For image classification, image augmentation may help in some cases.
The blog post Bias-Variance Tradeoff in Machine Learning is a nice summary of different things to try depending on your problem.
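Inside the platform, Dropout and Batch normalization are added as blocks in the Modeling view. Purely as an illustration of the same idea in code, here is a minimal sketch, assuming TensorFlow/Keras; the input size, layer sizes, number of classes, and dropout rate are made-up assumptions, not recommendations.

```python
# Minimal sketch (assuming TensorFlow/Keras): adding Dropout and
# BatchNormalization to a small classifier to improve generalization.
import tensorflow as tf
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(64,)),               # 64 input features (illustrative)
    layers.Dense(128, activation="relu"),
    layers.BatchNormalization(),             # normalizes activations between layers
    layers.Dropout(0.3),                     # randomly drops 30% of units during training
    layers.Dense(10, activation="softmax"),  # 10 target classes (illustrative)
])
model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])
```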
Accuracy
Accuracy is the ratio of correctly classified examples to the total number of classified examples. Higher accuracy is better.
The curve marked Training shows the accuracy calculated on the training data set.
The curve marked Validation shows the accuracy calculated on the validation data set.
A large discrepancy between training and validation accuracy (where the training accuracy is much higher than the validation accuracy) can indicate overfitting, or that the validation data are too different from the training data.
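To make the ratio concrete, here is a minimal sketch, assuming scikit-learn, with made-up labels and predictions.

```python
# Small sketch (assuming scikit-learn): accuracy = correct predictions / all predictions.
from sklearn.metrics import accuracy_score

y_true = [0, 1, 1, 0, 1, 1]   # actual classes (made-up)
y_pred = [0, 1, 0, 0, 1, 1]   # model predictions (made-up)

print(accuracy_score(y_true, y_pred))  # 5 correct out of 6 -> ~0.833
```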
Suggestions on how to improve
If there is a large discrepancy between training and validation accuracy, try to introduce dropout and/or batch normalization blocks to improve generalization.
If the training accuracy is low, the model is not learning well enough. Try to build a new model or collect more training data.
AUC — Area under curve
The area under curve (AUC) metric computes the area under a discretized Receiver Operating Characteristic (ROC) curve, i.e., the true positive rate plotted against the false positive rate.
The AUC is used to see how well your classifier can separate positive and negative examples.
AUC around 0.5 is the same thing as a random guess. The further away the AUC is from 0.5, the better. If AUC is below 0.5, then you may need to invert the decision your model is making.
AUC = 1 means the prediction model is perfect.
AUC = 0 is good news because you just need to invert your model’s output to obtain a perfect model.
This StackExchange post contains a nice explanation of the AUC.
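As a hedged illustration, assuming scikit-learn, the AUC can be computed from the true labels and the predicted scores of the positive class; the numbers below are made up.

```python
# Sketch (assuming scikit-learn): AUC from true labels and predicted scores.
from sklearn.metrics import roc_auc_score

y_true  = [0, 0, 1, 1, 1, 0]               # actual classes (made-up)
y_score = [0.1, 0.4, 0.8, 0.35, 0.9, 0.2]  # predicted probability of the positive class

print(roc_auc_score(y_true, y_score))  # 1.0 means perfect separation, 0.5 is random
```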
Gradient norm
The gradient norm indicates how much your model's weights are adjusted during training. If the gradient norm is high, the weights are being adjusted a lot. If it is low, the model might have reached a local optimum.
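The platform tracks this metric for you; purely as an illustration of what is being measured, here is a minimal sketch, assuming PyTorch, that computes the global L2 norm of all parameter gradients after a backward pass.

```python
# Sketch (assuming PyTorch): global L2 norm of all parameter gradients,
# computed after loss.backward() has been called on some model.
import torch

def global_grad_norm(model: torch.nn.Module) -> float:
    total = 0.0
    for p in model.parameters():
        if p.grad is not None:
            total += p.grad.detach().norm(2).item() ** 2  # squared L2 norm per tensor
    return total ** 0.5  # square root of the sum -> overall gradient norm
```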
Precision
Precision answers the following question:
What proportion of positive predictions was actually correct?
The precision of the model ranges from 0 to 100%, where 100% means that all the samples that the model classifies as the positive class are truly positive.
Example: If a medical test that has high Precision shows that a patient has a disease, there is a high likelihood that the patient does, in fact, have the disease. However, the test could still fail to identify the presence of the disease in many patients! This is because the Precision is only concerned with positive predictions.
Precision is defined as:

Precision = True positives / (True positives + False positives)

where
- True positive is when an actual positive is predicted positive, and
- False positive is when an actual negative is predicted positive.
Read more about this in the Confusion matrix entry in the glossary.
Recall
Recall answers the following question:
What proportion of actual positives was predicted correctly?
The recall of the model ranges from 0 to 100%, where 100% means that the model correctly classifies all truly positive samples as the positive class.
Example: A medical test with high Recall will identify a large proportion of the true disease cases. However, the same test might be overpredicting the positive class, giving many false positive predictions!
Recall is defined as:

Recall = True positives / (True positives + False negatives)

where
- True positive is when an actual positive is predicted positive, and
- False negative is when an actual positive is predicted negative.
Read more about this in the Confusion matrix entry in the glossary.
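As a hedged sketch, assuming scikit-learn and made-up labels, the two definitions above can be computed like this for a binary classifier.

```python
# Sketch (assuming scikit-learn): precision and recall for a binary classifier.
from sklearn.metrics import precision_score, recall_score

y_true = [1, 1, 1, 0, 0, 0, 1]   # actual classes (made-up)
y_pred = [1, 0, 1, 0, 1, 0, 1]   # model predictions (made-up)

# Precision: of the 4 positive predictions, 3 are truly positive -> 0.75
print(precision_score(y_true, y_pred))
# Recall: of the 4 actual positives, 3 were found -> 0.75
print(recall_score(y_true, y_pred))
```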
Averaging of metrics for multiclass classification models
For models with more than 2 target classes, the true positives, false positives, and false negatives are computed for each class independently. This is done by considering the class in question as the positive class and all the other classes as the negative class.
In order to compute overall precision and recall metrics over all the classes, we need to average the per-class values. This can be done either by micro-averaging or macro-averaging.
Micro-averaging
Micro-averaging is performed by first summing the true positives, false positives, and false negatives over all the classes. Then precision and recall are computed as ratios of these sums.
Micro-averaged precision and recall values can be high even if the model is performing very poorly on a rare class, since micro-averaging gives more weight to the common classes.
Macro-averaging
Macro-averaging is performed by first computing precision and recall independently for each class, based on the true positives, false positives, and false negatives per class. Then the overall precision and recall are calculated by averaging the per-class precision and recall values. Macro-averaged precision and recall values will be low if the model is performing poorly on some of the classes (usually the rare classes), since macro-averaging weighs each class equally regardless of how common the class is.
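To illustrate the difference, here is a hedged sketch, assuming scikit-learn, on a made-up, imbalanced single-label problem; the micro-averaged values stay as high as the accuracy, while the macro-averaged values are pulled down by the rare class.

```python
# Sketch (assuming scikit-learn): micro- vs macro-averaged precision and recall on an
# imbalanced, single-label multiclass problem. Classes 0 and 1 are common, class 2 is
# rare, and this made-up model never predicts the rare class.
from sklearn.metrics import accuracy_score, precision_score, recall_score

y_true = [0, 0, 0, 0, 1, 1, 1, 1, 2, 2]
y_pred = [0, 0, 0, 0, 1, 1, 1, 1, 0, 1]   # the two class-2 examples are misclassified

print(accuracy_score(y_true, y_pred))                    # 0.8
print(precision_score(y_true, y_pred, average="micro"))  # 0.8 -- equals accuracy
print(recall_score(y_true, y_pred, average="micro"))     # 0.8 -- equals accuracy
# zero_division=0 treats the never-predicted class 2 as having precision 0.
print(precision_score(y_true, y_pred, average="macro", zero_division=0))  # ~0.53
print(recall_score(y_true, y_pred, average="macro"))     # ~0.67 -- class 2 recall is 0
```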
Single-label multiclass problems
Macro-averaging is the default aggregation method for precision and recall for single-label multiclass problems.
For single-label multiclass problems, micro-averaging would result in both precision and recall being exactly the same as accuracy, which does not provide any additional information about the model's performance. If the target classes are imbalanced, the accuracy is not a very useful metric, since it can be high for models that only focus on correctly predicting the common classes while performing very poorly on the rare classes.
Macro-averaged precision and recall will be low for models that only perform well on the common classes while performing poorly on the rare classes, and are therefore a complementary metric to the overall accuracy. Note that you can always check the precision and recall for each individual class in the Confusion matrix on the Evaluation view.
Multi-label multiclass problems
Micro-averaging is the default aggregation method for precision and recall for multi-label multiclass problems.
For multi-label multiclass models, micro-averaging of precision and recall already provides additional information compared to the overall accuracy. Micro-averaging puts more emphasis on the common classes in the data set, and for multi-label classification, this is usually the preferred behavior. Labels that are very rare in the data set, e.g., a genre that only represents 0.01% of the data examples, shouldn't heavily influence the overall precision and recall metrics if the model is performing well on the other, more common genres.
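As a hedged sketch, assuming scikit-learn and made-up data, micro-averaged precision and recall for a multi-label model can be computed from binary indicator matrices, where each row is an example and each column is a label.

```python
# Sketch (assuming scikit-learn): micro-averaged precision and recall for a
# multi-label problem. Rows are examples, columns are labels (1 = label applies).
import numpy as np
from sklearn.metrics import precision_score, recall_score

y_true = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 1, 0],
                   [0, 0, 1]])
y_pred = np.array([[1, 0, 0],
                   [0, 1, 0],
                   [1, 0, 0],
                   [0, 0, 1]])

# Micro-averaging sums true/false positives and false negatives over all labels.
print(precision_score(y_true, y_pred, average="micro"))  # 4 TP / 4 predicted -> 1.0
print(recall_score(y_true, y_pred, average="micro"))     # 4 TP / 6 actual   -> ~0.67
```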