A complete guide to classification metrics in machine learning

For data scientists, ML engineers, product managers, and other practitioners.

How do you evaluate the quality of a classification model? In this guide, we break down the main machine learning metrics for binary and multi-class problems.

What you will learn in this guide:

How to calculate the key classification metrics, including accuracy, precision, recall, F1 score, and ROC AUC.
The pros and cons of each metric, how they behave in corner cases, and when some metrics are more suitable than others.
Practical tips for using classification metrics in production settings and ML monitoring.

Here is what makes this guide different:

Explaining the intuition behind the metrics. We link to the formulas when needed but focus on simple explanations anyone can understand.
Illustrated guide. We added a lot of images, making it easy to follow along and visualize how each metric works.
Real-world examples. Rather than abstract scenarios, we use relatable business cases that you might encounter in your work.

There is no need to read the guide cover-to-cover: each chapter is self-contained, so you can read any of them on its own.

Explore topics

How to explain the ROC AUC score and ROC curve?

ROC AUC score

The ROC curve shows the performance of a binary classifier across varying decision thresholds. It plots the true positive rate against the false positive rate. The resulting area under the curve (ROC AUC score) is a common metric to evaluate the classifier’s quality. This chapter explains how to compute and interpret ROC AUC.
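To make the idea concrete, here is a minimal sketch in plain Python that sweeps the decision threshold, collects the (false positive rate, true positive rate) points, and integrates them with the trapezoidal rule. The function names `roc_curve_points` and `roc_auc` and the toy data are illustrative assumptions, not part of the guide, and the sketch assumes both classes are present in the labels.

```python
def roc_curve_points(y_true, scores):
    """Return (FPR, TPR) points obtained by sweeping the decision threshold."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    points = [(0.0, 0.0)]  # threshold above every score: nothing is predicted positive
    # Use each distinct score as a threshold, from strictest to loosest.
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def roc_auc(y_true, scores):
    """Area under the ROC curve, computed with the trapezoidal rule."""
    points = roc_curve_points(y_true, scores)
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Toy data (hypothetical): 1 = positive class, 0 = negative class.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

auc = roc_auc(y_true, scores)
print(auc)  # 0.75
```

A score of 0.75 here means that a randomly chosen positive example is ranked above a randomly chosen negative one in three out of four possible pairs.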

Accuracy, precision, and recall in multi-class classification

Multi-class Precision and Recall

There are different ways to calculate accuracy, precision, and recall for multi-class classification. You can calculate metrics for each class separately or use macro- or micro-averaging. This chapter explains the difference between the options and how they behave in important corner cases.
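The difference between the two averaging options can be sketched in a few lines of plain Python. Macro-averaging computes the metric for each class and takes the unweighted mean, while micro-averaging pools the true positives, false positives, and false negatives across classes first. The helper names and the toy labels below are hypothetical and chosen only for illustration.

```python
def class_counts(y_true, y_pred, label):
    """Count TP, FP, and FN for one class, treating it as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    return tp, fp, fn


def macro_and_micro(y_true, y_pred, labels):
    """Return macro- and micro-averaged precision and recall."""
    counts = [class_counts(y_true, y_pred, label) for label in labels]
    # Macro: average the per-class metrics, so every class weighs the same.
    precisions = [tp / (tp + fp) if tp + fp else 0.0 for tp, fp, fn in counts]
    recalls = [tp / (tp + fn) if tp + fn else 0.0 for tp, fp, fn in counts]
    # Micro: pool the counts, so every prediction weighs the same.
    total_tp = sum(tp for tp, fp, fn in counts)
    total_fp = sum(fp for tp, fp, fn in counts)
    total_fn = sum(fn for tp, fp, fn in counts)
    return {
        "macro_precision": sum(precisions) / len(precisions),
        "macro_recall": sum(recalls) / len(recalls),
        "micro_precision": total_tp / (total_tp + total_fp),
        "micro_recall": total_tp / (total_tp + total_fn),
    }


# Toy imbalanced data (hypothetical): "cat" dominates the dataset.
y_true = ["cat"] * 6 + ["dog", "dog", "bird", "bird"]
y_pred = ["cat"] * 6 + ["cat", "dog", "cat", "bird"]

averages = macro_and_micro(y_true, y_pred, labels=["cat", "dog", "bird"])
for name, value in averages.items():
    print(f"{name}: {value:.3f}")
```

On this data, macro recall (about 0.667) is noticeably lower than micro recall (0.8) because the two small classes are each missed half of the time, which the macro average does not let the large class hide.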

Accuracy vs. precision vs. recall in machine learning: what's the difference?

Accuracy, Precision, Recall

Accuracy reflects the overall correctness of the model. Precision shows how often the model is correct when it predicts the positive class. Recall shows the share of all actual positive cases that the model detects. This chapter explains how to choose an appropriate metric considering the use case and the costs of errors.
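As a rough illustration, the three metrics can be computed from the same set of binary predictions. This is a minimal sketch in plain Python; the function name `binary_metrics` and the toy labels are assumptions made for the example.

```python
def binary_metrics(y_true, y_pred):
    """Return (accuracy, precision, recall) for binary labels, where 1 is positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    accuracy = (tp + tn) / len(pairs)                # share of all predictions that are correct
    precision = tp / (tp + fp) if tp + fp else 0.0   # share of predicted positives that are correct
    recall = tp / (tp + fn) if tp + fn else 0.0      # share of actual positives that are found
    return accuracy, precision, recall


# Toy imbalanced data (hypothetical): 3 positives out of 10 examples.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]

accuracy, precision, recall = binary_metrics(y_true, y_pred)
print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
# accuracy=0.70 precision=0.50 recall=0.33
```

Accuracy looks acceptable here only because negatives dominate; the model finds just one of the three positive cases, which is what recall exposes.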

How to interpret a confusion matrix for a machine learning model

Confusion Matrix

A confusion matrix is a table that summarizes a classification model’s performance. It shows the number of correct predictions (true positives and negatives) and model errors (false positives and negatives). This chapter explains how to create a confusion matrix for binary and multi-class models and which metrics you can derive from it.
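A confusion matrix can be built by counting (actual, predicted) pairs. The sketch below uses plain Python and assumes one common layout, with rows for actual classes and columns for predicted classes; other tools may transpose it. The function name and toy labels are hypothetical.

```python
def confusion_matrix(y_true, y_pred, labels):
    """Build a confusion matrix: rows are actual classes, columns are predicted classes."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for actual, predicted in zip(y_true, y_pred):
        matrix[index[actual]][index[predicted]] += 1
    return matrix


# Toy binary data (hypothetical): 1 = positive class, 0 = negative class.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]

matrix = confusion_matrix(y_true, y_pred, labels=[0, 1])
# With labels ordered [0, 1], the cells read as:
# [[TN, FP],
#  [FN, TP]]
(tn, fp), (fn, tp) = matrix
print(matrix)  # [[6, 1], [2, 1]]
```

The same function works for multi-class problems: pass a longer `labels` list and read each row as "how the examples of this actual class were predicted".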

How to use classification threshold to balance precision and recall

Classification Threshold

In probabilistic machine learning problems, the model output is not a label but a score. You must then set a decision threshold to assign a specific label to a prediction. This chapter explains how to choose an optimal classification threshold to balance precision and recall.
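The trade-off can be seen by applying several thresholds to the same scores. This is a minimal sketch in plain Python; the function name `precision_recall_at`, the candidate thresholds, and the toy scores are assumptions for illustration only.

```python
def precision_recall_at(y_true, scores, threshold):
    """Label a score as positive when it is >= threshold, then return (precision, recall)."""
    y_pred = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy data (hypothetical): predicted scores and the actual labels.
y_true = [0, 0, 1, 0, 1, 1]
scores = [0.1, 0.3, 0.45, 0.6, 0.7, 0.9]

results = {t: precision_recall_at(y_true, scores, t) for t in (0.3, 0.5, 0.8)}
for threshold, (precision, recall) in results.items():
    print(f"threshold={threshold}: precision={precision:.2f} recall={recall:.2f}")
# threshold=0.3: precision=0.60 recall=1.00
# threshold=0.5: precision=0.67 recall=0.67
# threshold=0.8: precision=1.00 recall=0.33
```

On this data, raising the threshold makes the model more selective, so precision goes up while recall goes down. Which point to pick depends on the relative cost of false positives and false negatives in the use case.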
