Classification metrics guide

How to explain the ROC curve and ROC AUC score?

Last updated: January 9, 2025

The ROC AUC score is a popular metric to evaluate the performance of binary classifiers. To compute it, you must measure the area under the ROC curve, which shows the classifier's performance at varying decision thresholds.

This chapter covers how to plot the ROC curve, compute the ROC AUC and interpret it. We will also showcase it using the open-source Evidently Python library.

TL;DR

- The ROC curve shows the performance of a binary classifier with different decision thresholds. It plots the True Positive rate (TPR) against the False Positive rate (FPR).
- The ROC AUC score is the area under the ROC curve. It sums up how well a model can produce relative scores to discriminate between positive or negative instances across all classification thresholds.
- The ROC AUC score ranges from 0 to 1, where 0.5 indicates random guessing, and 1 indicates perfect performance.

Evidently Classification Performance Report

What is a ROC curve?

The ROC curve stands for the Receiver Operating Characteristic curve. It is a graphical representation of the performance of a binary classifier at different classification thresholds.

The curve plots the possible True Positive rates (TPR) against the False Positive rates (FPR).

Here is how the curve can look:

Each point on the curve represents a specific decision threshold with a corresponding True Positive rate and False Positive rate.

What is a ROC AUC score?

ROC AUC stands for Receiver Operating Characteristic Area Under the Curve.

ROC AUC score is a single number that summarizes the classifier's performance across all possible classification thresholds. To get the score, you must measure the area under the ROC curve.

ROC AUC score shows how well the classifier distinguishes positive and negative classes. It can take values from 0 to 1.

A higher ROC AUC indicates better performance. A perfect model would have an AUC of 1, while a random model would have an AUC of 0.5.

To understand the ROC AUC metric, it helps to understand the ROC curve first.

How does the ROC curve work?

Let’s explain it step by step! We will cover:

- What TPR and FPR are, and how to calculate them
- What a classification threshold is
- How to plot a ROC curve

True vs. False Positive rates

The ROC curve plots the True Positive rate (TPR) against the False Positive rate (FPR) at various classification thresholds. You can derive TPR and FPR from a confusion matrix.

A confusion matrix summarizes all correct and false predictions generated for a specific dataset. Here is an example of a matrix generated for a spam prediction use case:

Confusion matrix example. Source: example matrix from the Confusion Matrix chapter.

You can calculate the True Positive and False Positive rates directly from the matrix.

True positive rate and false positive rate

TPR (True Positive rate, also known as recall) shows the share of detected true positives. For example, the share of emails correctly labeled as spam out of all spam emails in the dataset.

To compute the TPR, you must divide the number of True Positives by the total number of objects of the target class – both identified (True Positives) and missed (False Negatives).

In the example confusion matrix above, TPR = 600 / (600 + 300) = 0.67. The model successfully detected 67% of all spam emails.

FPR (False Positive rate) shows the share of objects falsely assigned a positive class out of all objects of the negative class. For example, the proportion of legitimate emails falsely labeled as spam.

You can calculate the FPR by dividing the number of False Positives by the total number of objects of the negative class in the dataset.

You can think of the FPR as a "false alarm rate."

In our example, FPR = 100 / (100 + 9000) = 0.01. The model falsely flagged 1% of legitimate emails as spam.
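The two calculations above can be sketched in a few lines of Python. This is a minimal illustration: the function names are our own, not part of any library, and the counts are the ones from the spam example.

```python
def true_positive_rate(tp: int, fn: int) -> float:
    """Share of actual positives that the model detected (recall)."""
    return tp / (tp + fn)


def false_positive_rate(fp: int, tn: int) -> float:
    """Share of actual negatives that the model wrongly flagged."""
    return fp / (fp + tn)


# Counts from the spam example in the text.
tp, fn = 600, 300   # spam caught, spam missed
fp, tn = 100, 9000  # legitimate flagged, legitimate passed

tpr = true_positive_rate(tp, fn)
fpr = false_positive_rate(fp, tn)
print(f"TPR = {tpr:.2f}, FPR = {fpr:.2f}")  # TPR = 0.67, FPR = 0.01
```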

To create the ROC curve, you need to plot the TPR values against the FPR values at different decision thresholds.

Classification threshold

You might ask, what do "different" TPR and FPR values mean? Did we not just calculate them once and for all?

In fact, we calculated the values for a given confusion matrix at a given decision threshold. But for a probabilistic classification model, these TPR and FPR values are not set in stone.

You can vary the decision threshold that defines how to convert the model predictions into labels. This, in turn, can change the number of errors the model makes.

A probabilistic classification model returns a number from 0 to 1 for each object. For example, for each email, it predicts how likely this email is spam. For a given email, it can be 0.1, 0.55, 0.99, or any other number.

You then have to decide at which probability you convert this prediction to a label. For instance, you can label all emails with a predicted probability of over 0.5 as spam. Or, you can only apply this decision when the score is 0.8 or higher.

This choice is what sets the classification threshold.
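Here is a small sketch of how the same predicted probabilities turn into different labels depending on the threshold. The helper `to_labels` is hypothetical, and we assume that a score equal to the threshold counts as positive.

```python
def to_labels(probabilities, threshold):
    """Label an item positive (1) when its score reaches the threshold."""
    return [1 if p >= threshold else 0 for p in probabilities]


# Predicted spam probabilities for three emails (values from the text).
spam_probabilities = [0.1, 0.55, 0.99]

labels_at_05 = to_labels(spam_probabilities, threshold=0.5)
labels_at_08 = to_labels(spam_probabilities, threshold=0.8)

print(labels_at_05)  # [0, 1, 1]
print(labels_at_08)  # [0, 0, 1]
```

The email scored 0.55 is spam at the 0.5 threshold but legitimate at 0.8: the model output did not change, only the decision rule did.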

To better understand the impact of the decision threshold, explore the Classification Threshold chapter in the guide.

As you change the threshold, you will usually get new combinations of errors of different types (and new confusion matrices)!

Confusion matrices with different classification thresholds

When you set the threshold higher, you make the model "more conservative." It assigns the True label when it is "more confident." But as a consequence, you typically lower recall: you detect fewer examples of the target class overall.

When you set the threshold lower, you make the model "less strict." It assigns the True label more often, even when "less confident." Consequently, you increase recall: you will detect more examples of the target class. However, this may also lead to lower precision, as the model may make more False Positive predictions.

TPR and FPR change in the same direction. The higher the recall (TPR), the higher the rate of false positive errors (FPR). The lower the recall, the fewer false alarms the model gives.

In the example above, the recall (TPR) decreases as we set the decision threshold higher:

- 0.5 threshold: 800/(800+100)=0.89
- 0.8 threshold: 600/(600+300)=0.67
- 0.95 threshold: 200/(200+700)=0.22

The FPR also goes down:

- 0.5 threshold: 500/(500+8600)=0.05
- 0.8 threshold: 100/(100+9000)=0.01
- 0.95 threshold: 10/(10+9090)=0.001
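To see where such numbers come from, you can rebuild the confusion matrix from raw scores at each threshold. The sketch below uses a tiny made-up dataset (not the one behind the matrices above) and a hypothetical helper `confusion_counts`.

```python
def confusion_counts(y_true, y_score, threshold):
    """Return (tp, fn, fp, tn) when scores >= threshold are labeled positive."""
    tp = fn = fp = tn = 0
    for actual, score in zip(y_true, y_score):
        predicted = score >= threshold
        if actual == 1 and predicted:
            tp += 1
        elif actual == 1:
            fn += 1
        elif predicted:
            fp += 1
        else:
            tn += 1
    return tp, fn, fp, tn


# Hypothetical toy data: 1 = spam, 0 = legitimate.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_score = [0.97, 0.85, 0.70, 0.40, 0.90, 0.60, 0.30, 0.20, 0.10, 0.05]

rates = []
for threshold in (0.5, 0.8, 0.95):
    tp, fn, fp, tn = confusion_counts(y_true, y_score, threshold)
    tpr = tp / (tp + fn)
    fpr = fp / (fp + tn)
    rates.append((threshold, tpr, fpr))
    print(f"threshold={threshold}: TPR={tpr:.2f}, FPR={fpr:.2f}")
```

On this toy data, both rates fall as the threshold rises, the same pattern as in the example matrices.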

Plotting the ROC curve

Now, let’s get back to the curve!

The ROC curve illustrates this trade-off between the TPR and FPR we just explored. Unless your model is near-perfect, you have to balance the two. As you try to increase the TPR (i.e., correctly identify more positive cases), the FPR may also increase (i.e., you get more false alarms).

For example, the more spam you want to detect, the more legitimate emails you falsely flag as suspicious.

The ROC curve is a visual representation of this choice. Each point on the curve corresponds to a combination of TPR and FPR values at a specific decision threshold.

To create the curve, you should plot the FPR values as the x-axis and the TPR values as the y-axis.

If we continue with the example above, here is how it can look.

Since our imaginary model does fairly well, most values are "crowded" to the left.

The left side of the curve corresponds to the more "confident" thresholds: a higher threshold leads to lower recall and fewer false positive errors. The extreme point is when both recall and FPR are 0. In this case, there are no correct detections but also no false ones.

The right side of the curve represents the "less strict" scenarios when the threshold is low. Both recall and False Positive rates are higher, ultimately reaching 100%. If you put the threshold at 0, the model will always predict a positive class: both recall and the FPR will be 1.

When you increase the threshold, you move left on the curve. If you decrease the threshold, you move to the right.
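As a rough sketch, here is how the three example confusion matrices from this chapter turn into points on the curve. We add the two extreme points by hand, which is an assumption about thresholds we did not actually evaluate.

```python
# (threshold, TP, FN, FP, TN) for the three confusion matrices in the text.
matrices = [
    (0.50, 800, 100, 500, 8600),
    (0.80, 600, 300, 100, 9000),
    (0.95, 200, 700, 10, 9090),
]

# Each threshold gives one (FPR, TPR) point: FPR on the x-axis, TPR on the y-axis.
points = [(fp / (fp + tn), tp / (tp + fn)) for _, tp, fn, fp, tn in matrices]

# Add the two extremes: a threshold above every score gives (0, 0),
# a threshold of 0 gives (1, 1). Then order the points from left to right.
roc_points = sorted([(0.0, 0.0)] + points + [(1.0, 1.0)])

for fpr, tpr in roc_points:
    print(f"FPR={fpr:.3f}  TPR={tpr:.2f}")
```

The point for the 0.95 threshold ends up furthest to the left and the point for the 0.5 threshold furthest to the right among the three. To draw the curve, you could pass the x and y values to a plotting library such as matplotlib.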

Now, let’s take a look at the perfect scenario.

If our model is correct in all the predictions, all the time, there is a threshold at which the TPR is 1.0 and the FPR is 0. It finds all the cases and never gives false alarms.

Here is how the ROC curve would look.

Now, let’s look at the worst-case scenario.

Let’s say our model is random. In other words, it cannot distinguish between the two classes, and its predictions are no better than chance.

A genuinely random model assigns scores that carry no information about the true class: a positive instance is as likely to get a high score as a negative one.

The ROC curve, in this case, will look like a diagonal line connecting points (0,0) and (1,1). For a random classifier, the TPR is equal to the FPR because, at any threshold value, it flags the same share of positive instances as of negative ones. As the classification threshold changes, the TPR goes up or down in the same proportion as the FPR.
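You can check the diagonal with a quick simulation. This is only a sketch: we assume a "random model" means scores drawn uniformly at random, independent of the true class, and the sample sizes and thresholds are arbitrary.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable

# Hypothetical setup: 10,000 positives and 10,000 negatives, with scores
# drawn uniformly at random, i.e. unrelated to the true class.
n_per_class = 10_000
positive_scores = [random.random() for _ in range(n_per_class)]
negative_scores = [random.random() for _ in range(n_per_class)]

gaps = []
for threshold in (0.2, 0.5, 0.8):
    tpr = sum(s >= threshold for s in positive_scores) / n_per_class
    fpr = sum(s >= threshold for s in negative_scores) / n_per_class
    gaps.append(abs(tpr - fpr))
    print(f"threshold={threshold}: TPR={tpr:.3f}, FPR={fpr:.3f}")
```

At every threshold, the two rates come out approximately equal, so each point lands close to the diagonal. On a finite sample, they will rarely match exactly.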

Most real-world models will fall somewhere between the two extremes. The better the model can distinguish between positive and negative classes, the closer the curve is to the top left corner of the graph.

A ROC curve is a two-dimensional reflection of classifier performance across different thresholds. It is convenient to get a single metric to summarize it.

This is what the ROC AUC score does.

How to get the ROC AUC score?

A ROC AUC score is a single metric to summarize the performance of a classifier across different thresholds. To compute the score, you must measure the area under the ROC curve.

There are different methods to calculate the ROC AUC score, but a common one is the trapezoidal rule. You approximate the area under the ROC curve by splitting it into vertical strips between consecutive FPR values. Each strip is a trapezoid whose two parallel sides are the TPR values at those points. Then, you compute the area by summing the areas of the trapezoids.

You can compute ROC AUC in Python using sklearn.
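For instance, with scikit-learn installed, a minimal call could look like this. The labels and scores are toy values made up for illustration.

```python
from sklearn.metrics import roc_auc_score

# Hypothetical toy data: true labels and predicted probabilities of class 1.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

auc = roc_auc_score(y_true, y_score)
print(f"ROC AUC = {auc:.2f}")  # ROC AUC = 0.75
```

Note that the function takes the predicted scores, not the thresholded labels: the metric needs the ranking of predictions.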

If we return to our extreme "perfect" and "random" example, computing the ROC AUC score is easy. In the perfect scenario, we measure the square area: ROC AUC is 1. In the random scenario, it is precisely half: ROC AUC is 0.5.
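A hand-rolled version of the trapezoidal rule makes these two results easy to verify. The function `trapezoid_auc` is our own sketch, not a library call, and it assumes the points already include both ends of the curve.

```python
def trapezoid_auc(points):
    """Area under a piecewise-linear curve given (fpr, tpr) points."""
    points = sorted(points)
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        # Width of the strip times the average height of its two sides.
        area += (x1 - x0) * (y0 + y1) / 2
    return area


perfect = [(0.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
random_guess = [(0.0, 0.0), (1.0, 1.0)]

print(trapezoid_auc(perfect))       # 1.0
print(trapezoid_auc(random_guess))  # 0.5
```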

What is a good ROC AUC?

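The ROC AUC score can range from 0 to 1. A score of 0.5 indicates random guessing, and a score of 1 indicates perfect performance. A score below 0.5 means the ranking is worse than random: the model tends to score negative instances above positive ones.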

A score slightly above 0.5 shows that a model has at least "some" (albeit small) predictive power. This is generally inadequate for any real applications.

As a rule of thumb, a ROC AUC score above 0.8 is considered good, while a score above 0.9 is considered great.

However, the usefulness of the model depends on the specific problem and use case. There is no standard. You should interpret the ROC AUC score in context, together with other classification quality metrics, such as accuracy, precision, or recall.

How to explain ROC AUC?

The intuition behind ROC AUC is that it measures how well a binary classifier can distinguish or separate between the positive and negative classes.

It reflects the probability that the model will correctly rank a randomly chosen positive instance higher than a random negative one.

For example, this is how the model predictions might look, arranged by the predicted output scores.

ROC AUC reflects the likelihood that a random positive (red) instance will be located to the right of a random negative (gray) instance.

It shows how well a model can produce good relative scores and generally assign higher probabilities to positive instances over negative ones.

In the above picture, the classifier is not perfect but "directionally correct." It ranks most negative instances lower than positive ones.

The ideal situation is to have all positive instances ranked higher than all negative instances, resulting in an AUC of 1.0.
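The ranking interpretation can be computed directly by comparing every positive instance with every negative one. The sketch below uses made-up scores and a hypothetical helper `pairwise_auc`. Counting a tie as half a win is a common convention that we assume here.

```python
def pairwise_auc(y_true, y_score):
    """Share of (positive, negative) pairs where the positive scores higher.

    Ties count as half a win.
    """
    positives = [s for y, s in zip(y_true, y_score) if y == 1]
    negatives = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical scores: mostly, but not perfectly, ordered.
y_true = [0, 0, 0, 1, 0, 1, 1, 1]
y_score = [0.05, 0.20, 0.35, 0.40, 0.55, 0.60, 0.80, 0.95]

print(pairwise_auc(y_true, y_score))  # 0.9375
```

Out of 16 positive-negative pairs, only one is in the wrong order (the positive at 0.40 sits below the negative at 0.55), so the score is 15/16.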

It’s worth noting that even a perfect ROC AUC does not mean the predictions are well-calibrated. A well-calibrated classifier produces predicted probabilities that reflect the actual probabilities of the events. Say, if it predicts that an event has a 70% chance of occurring, it should be correct about 70% of the time. ROC AUC is not a calibration measure.

ROC AUC score, instead, shows how well a model can produce relative scores that help discriminate between positive or negative instances.

ROC AUC pros and cons

Let’s sum up the important properties of the metric.

Here are some advantages of the ROC AUC score.

- A single number. ROC AUC reflects the model quality in one number. It is convenient to use a single metric, especially when comparing multiple models.
- Does not change with the classification threshold. Unlike precision and recall, ROC AUC stays the same whichever threshold you pick. In fact, it sums up the performance across the different classification thresholds. It is a valuable "overall" quality measure, whereas precision and recall provide a quality "snapshot" at a given decision threshold.
- It is a suitable evaluation metric for imbalanced data. ROC AUC measures the model's ability to discriminate between the positive and negative classes, regardless of their relative frequencies in the dataset.
- More tolerant to the drift in class balance. The ROC AUC generally remains more stable if the distribution of classes changes. This often happens in production use, for example, when fraud rates vary month-by-month. If they change significantly, the earlier chosen decision threshold might become inadequate. For example, if fraud becomes rarer, the precision of the fraud detection model might drop, because the same False Positive rate applied to a larger pool of legitimate transactions produces more false alarms per true detection. However, ROC AUC might remain stable, indicating that the model can still differentiate between the two classes despite the changes in their relative frequencies.
- Scale-invariant. ROC AUC measures how well predictions are ranked rather than their absolute values. This way, it helps compare the quality of models that might output "different ranges" of predicted probabilities. It is typically relevant when you experiment with different models during the model training stage.
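To illustrate the scale-invariance point, the sketch below rescales the same toy scores with two order-preserving transforms and recomputes the score. The data and the helper `pairwise_auc` are hypothetical. The helper computes ROC AUC as the share of correctly ordered positive-negative pairs.

```python
def pairwise_auc(y_true, y_score):
    """ROC AUC as the share of positive/negative pairs ranked correctly."""
    positives = [s for y, s in zip(y_true, y_score) if y == 1]
    negatives = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


y_true = [0, 0, 1, 0, 1, 1]
y_score = [0.10, 0.30, 0.35, 0.50, 0.70, 0.90]

# A strictly increasing transform changes the values but not their order.
squashed = [0.4 + 0.2 * s for s in y_score]  # all scores now lie in [0.4, 0.6]
squared = [s ** 2 for s in y_score]

auc_original = pairwise_auc(y_true, y_score)
auc_squashed = pairwise_auc(y_true, squashed)
auc_squared = pairwise_auc(y_true, squared)
print(auc_original, auc_squashed, auc_squared)
```

All three values are identical (8/9 here), even though the transformed scores would behave very differently at a fixed 0.5 threshold.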

The metric also has a few downsides. As usual, a lot depends on the context!

- ROC AUC is not intuitive. This metric can be hard to explain to business stakeholders and does not have an immediately interpretable meaning.
- It does not consider the cost of errors. ROC AUC does not account for different types of errors and their consequences. In many scenarios, false negatives can be more costly than false positives, or vice versa. In this case, working to balance precision and recall and setting the appropriate classification threshold to minimize a certain type of error is often a more suitable approach. ROC AUC is not useful for this type of optimization.
- It can be misleading if the class imbalance is severe. When the positive class is very small, ROC AUC can give a false impression of high quality. The FPR is computed over the large negative class, so even a large absolute number of false positives translates into an FPR close to 0. As a result, the ROC curve can appear to "hug" the top left corner of the plot, giving the impression that the classifier is performing well and definitely better than random. However, though it correctly classifies most of the negative instances, most of its positive predictions may still be wrong, which is likely more important for the model performance. In this case, it may be more appropriate to look at the precision-recall curve and rely on metrics like precision, recall, or F1-score to evaluate ML model quality.
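A back-of-the-envelope example shows how a ROC-based view can look optimistic under severe imbalance. The counts below are invented for illustration.

```python
# Hypothetical, heavily imbalanced dataset: 100 positives, 100,000 negatives.
tp, fn = 80, 20
fp, tn = 1_000, 99_000

tpr = tp / (tp + fn)        # 0.80
fpr = fp / (fp + tn)        # 0.01: looks tiny because negatives are plentiful
precision = tp / (tp + fp)  # about 0.07: most positive predictions are wrong

print(f"TPR={tpr:.2f}, FPR={fpr:.2f}, precision={precision:.2f}")
```

The point (FPR 0.01, TPR 0.80) sits close to the top left corner of the ROC plot, yet only about 7% of the flagged instances are true positives.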

Want to see an example of using ROC AUC? We prepared a tutorial on the employee churn prediction problem "What is your model hiding?". You will train two classification models with similar ROC AUC and explore how to compare them.

When to use ROC AUC

Considering all the above, ROC AUC is useful, but as usual, not a perfect metric.

- During model training, it helps compare multiple ML models against each other.
- ROC AUC is particularly useful when the goal is to rank predictions in order of their confidence level rather than produce well-calibrated probability estimates.
- Both in training and production evaluation, ROC AUC helps provide a more complete picture of the model performance, giving a single metric that sums up the quality across different thresholds.

However, there are limitations:

- ROC AUC is less useful when errors have different costs and you want to find the threshold that minimizes the cost of a specific type of error.
- It can be misleading when the data is heavily imbalanced (which, coincidentally, is often the case when you ultimately care about different costs of errors).

ROC AUC in ML monitoring

You can use ROC AUC during production model monitoring as long as you have the true labels to compute it.

However, a high ROC AUC score does not communicate all relevant aspects of the model quality. The score evaluates the degree of separability and does not consider the asymmetric costs of false positives and negatives. It captures, in one number, the quality of the model across all possible thresholds.

In many real-world scenarios, this overall performance is not relevant: you need to consider the costs of error and define a specific threshold to make automated decisions. Therefore, the ROC AUC score should be used with other metrics, such as precision and recall. You might also want to monitor precision and recall for specific important segments in your data (such as users in specific locations, premium users, etc.) to capture differences in performance.

However, having ROC AUC as an additional metric might still be informative. For example, in cases where the shifting balance of classes might negatively impact precision, tracking ROC AUC might communicate whether the model itself remains reasonable.

ROC curve in Python

To quickly calculate and visualize the ROC curve and ROC AUC score, as well as other metrics and plots to evaluate the quality of a classification model, you can use Evidently, an open-source Python library to evaluate, test and monitor ML models in production.

You will need to prepare your dataset that includes predicted values for each class and true labels and pass it to the tool. You will instantly get an interactive report that includes ROC AUC, accuracy, precision, recall, F1-score metrics as well as other visualizations. You can also integrate these model quality checks into your production pipelines.
