This article is a part of the Classification Metrics guide.
In probabilistic machine learning problems, the model output is not a label but a score. You must then set a decision threshold to decide when to assign a specific label to each prediction.
A 0.5 threshold is a frequent choice, but it often makes sense to define a custom threshold instead. This chapter explains the trade-off between precision and recall and how to set an optimal classification threshold to balance them.
We will also introduce how to visualize the classification threshold using the open-source Evidently Python library.
Before diving in, make sure you are familiar with how accuracy, precision, and recall work in machine learning. If you need a refresher, there is a separate chapter in the guide.
Want to keep tabs on your classification models? Automate the quality checks with Evidently Cloud. Powered by the leading open-source Evidently library with 20m+ downloads.
The classification threshold in machine learning is a boundary or a cut-off point used to assign a specific predicted class for each object.
You need to set this threshold when working with probabilistic machine learning models. These models do not assign the label directly. Instead, they predict the probability of a particular class or outcome. The model output is a score between 0 and 1.
To turn this probability score into a classification decision, you need to set a threshold above which a data point is classified as belonging to a specific category.
Here is an example:
Say you have an email spam classification model. The model output is a probability score for each email, representing its likelihood of being spam. Instead of saying something like "This email is spam," it might say, "The probability of this email being spam is 0.55."
When the model predicts probabilities, you must decide how to turn the probability into a concrete decision. Should a 0.55 chance of spam send the email to a spam folder, or should you only do it when the probability is over 0.90?
This is the choice of the classification threshold: the cut-off point that determines whether an instance is assigned to a particular class.
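To make this concrete, here is a minimal sketch of applying a threshold to predicted probabilities. The `classify` helper and the scores are made up for illustration; they do not come from a real model.

```python
def classify(scores, threshold=0.5):
    """Return 1 (positive class) for scores strictly above the threshold, else 0."""
    return [1 if score > threshold else 0 for score in scores]


# Hypothetical spam probabilities for five emails (made-up values).
spam_scores = [0.05, 0.55, 0.72, 0.91, 0.98]

labels_default = classify(spam_scores, threshold=0.5)
labels_strict = classify(spam_scores, threshold=0.9)

print(labels_default)  # [0, 1, 1, 1, 1]
print(labels_strict)   # [0, 0, 0, 1, 1]
```

The same scores lead to different decisions: the email with a 0.55 score goes to the spam folder under the 0.5 threshold but stays in the inbox under the 0.9 threshold.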
Do all classification models return probabilities? Nope. It depends on the algorithm and the choices made by the modeler. Some algorithms (like k-nearest neighbors or SVMs in their classic implementation) directly assign a class or label to a new data point, for example, based on its proximity to the training examples or its position relative to a decision boundary. Others (like decision trees) can predict the probability of a data point belonging to each possible class based on the provided features. Some methods, such as neural networks, can do both, depending on the specific architecture.
The classification threshold is an important parameter when building and evaluating classification models. It can significantly impact the model's performance and the decisions made based on its predictions.
A typical default choice is to use a threshold of 0.5.
In the spam example, that would mean that any email with a predicted probability greater than 0.5 is classified as spam and put in a spam folder. Any email with a predicted probability of less than or equal to 0.5 is classified as legitimate.
A real-world example. Gousto, a meal kit retailer, has a churn prediction model that estimates how likely a user is to stop using the service. Currently, the threshold for a customer churning is just below 50%, so anyone with a probability of 0.5 or higher will be given a binary prediction of 1 and consequently be labeled a churner.
However, depending on the specific problem and the desired trade-off between precision and recall, you may want to choose a different threshold.
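Here is a quick refresher on precision and recall. Precision shows how often the model is correct when it predicts the positive class. Recall shows the share of all actual positive cases that the model manages to find.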
Each metric has its limitations. Precision prioritizes “correctness” but may not account for the cost of missing positive cases. Recall emphasizes “completeness” but may result in falsely flagging some instances. Both types of errors can be expensive, depending on the specific use case.
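In terms of confusion-matrix counts, precision is TP / (TP + FP), and recall is TP / (TP + FN). The sketch below computes both from true and predicted labels. The `precision_recall` function name and the toy labels are assumptions made for this example.

```python
def precision_recall(y_true, y_pred):
    """Compute precision and recall for binary labels (1 = positive class)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    # Fall back to 0.0 when a metric is undefined (no predicted or no actual positives).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Made-up labels: 2 true positives, 1 false positive, 2 false negatives.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]

precision, recall = precision_recall(y_true, y_pred)
print(round(precision, 3), recall)  # 0.667 0.5
```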
Want to see a specific example with all the code? We made a complete tutorial using an employee attrition prediction problem as an example. It shows how to evaluate and compare two classification models and think through designing a decision threshold.
Of course, you would ideally want both precision and recall to be perfect, but in practice they rarely are, and the two are often in conflict: improving one typically lowers the other.
Since precision and recall measure different aspects of the model quality, this leads to the precision-recall trade-off. You must balance their importance and account for it when training and evaluating ML models.
Let’s go back to the spam example.
Let’s say you care more about avoiding false positives (i.e., classifying a legitimate email as spam) than missing some spam emails. In this case, you’d optimize for precision. To achieve higher precision, you might set a higher threshold. For example, you’d assign the email to spam only when the predicted probability is over 90%.
Conversely, you might want to set a lower threshold if you care more about catching all spam emails. In this case, you’d optimize for recall. You will detect more spam, but be less precise. You can set your threshold at the usual 50%, for example.
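Here is a small sketch of this trade-off on made-up data. The labels, the scores, and the `precision_recall_at` helper are all hypothetical; the point is only to show how the two metrics move when the threshold changes from 0.5 to 0.9.

```python
def precision_recall_at(y_true, scores, threshold):
    """Compute precision and recall after applying a threshold to the scores."""
    y_pred = [1 if s > threshold else 0 for s in scores]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Made-up data: 1 = spam, 0 = legitimate email.
y_true = [0, 0, 1, 0, 1, 1, 0, 1, 1, 1]
scores = [0.12, 0.31, 0.45, 0.55, 0.62, 0.71, 0.83, 0.92, 0.95, 0.99]

p_50, r_50 = precision_recall_at(y_true, scores, 0.5)
p_90, r_90 = precision_recall_at(y_true, scores, 0.9)

print(f"threshold 0.5: precision={p_50:.2f}, recall={r_50:.2f}")  # 0.71, 0.83
print(f"threshold 0.9: precision={p_90:.2f}, recall={r_90:.2f}")  # 1.00, 0.50
```

On this toy dataset, the higher threshold removes all false positives but misses half of the spam.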
To balance precision and recall, you should consider the costs of false positive and false negative errors. This is highly custom and depends on the business context. You might make different choices when solving the same problem in different companies.
Let’s walk through a few examples to illustrate it.
Use case: predicting propensity to buy.
Say your task is to score the customers likely to buy a particular product. You then pass this list of high-potential customers to a call center team to contact them. You might have thousands of customers registering on your website every week, and the call center cannot reach all of them. But they can easily reach a couple of hundred.
Every customer who buys the product makes the effort well worth it. In this scenario, the cost of false positives is low (just a quick call that does not result in a purchase), but the value of true positives is high (immediate revenue).
In this case, you'd likely optimize for recall. You want to make sure you reach all potential buyers. Your only limit is the number of people your call center can contact weekly. In this case, you can set a lower decision threshold. Your model might have low precision, but this is not a big deal as long as you reach your business goals and make a certain number of sales.
Optimizing for recall typically means setting a lower decision threshold.
You can also set the number of people you can contact weekly. For example, you can reach out to the “top-200” customers scored by the model. In this case, the exact decision threshold would fluctuate every week.
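A minimal sketch of this "top-K" approach is below. The customer IDs, scores, and the `top_k_customers` helper are hypothetical. It shows that the implied threshold (the score of the last selected customer) changes from week to week even though K stays the same.

```python
def top_k_customers(scores_by_customer, k):
    """Pick the k highest-scored customers and report the implied threshold."""
    ranked = sorted(scores_by_customer.items(), key=lambda item: item[1], reverse=True)
    selected = ranked[:k]
    # The implied threshold is the lowest score that still made the cut.
    implied_threshold = selected[-1][1] if selected else None
    return [customer_id for customer_id, _ in selected], implied_threshold


# Made-up weekly scores; in practice there would be thousands of customers.
week_1 = {"c1": 0.91, "c2": 0.40, "c3": 0.75, "c4": 0.62, "c5": 0.18}
week_2 = {"c6": 0.55, "c7": 0.35, "c8": 0.48, "c9": 0.20, "c10": 0.81}

ids_1, thr_1 = top_k_customers(week_1, k=3)
ids_2, thr_2 = top_k_customers(week_2, k=3)

print(ids_1, thr_1)  # ['c1', 'c3', 'c4'] 0.62
print(ids_2, thr_2)  # ['c10', 'c6', 'c8'] 0.48
```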
Now, let’s take a look at an opposite example.
Use case: predicting fast delivery.
Let's say you are working for a food delivery company. Your team is developing a machine learning model to predict which orders might be delivered in under 20 minutes based on factors such as order size, restaurant location, time of day, and delivery distance.
You will use this prediction to display a "fast delivery" label next to a potential order.
In this case, optimizing for precision makes sense. False positives (orders predicted to be completed fast but actually delayed) can result in a loss of customer trust and ultimately lead to decreased sales. On the other hand, false negatives (orders predicted to take longer but completed in under 20 minutes) will likely have no consequences at all, as the customer would simply be pleasantly surprised by the fast delivery.
Optimizing for precision typically means setting a higher classification threshold.
To optimize for high precision, you may set a higher threshold for the model's prediction, only flagging orders as likely to be completed fast if the model is highly confident. This approach may result in fewer orders being flagged as likely to be completed quickly (low recall). But the flagged ones will be more likely to be completed in under 20 min.
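One way to pick such a threshold is to scan a few candidates and take the lowest one that still reaches a target precision, which keeps recall as high as possible under that constraint. The sketch below assumes a made-up validation set, a hypothetical 95% precision target, and a hypothetical helper name.

```python
def lowest_threshold_for_precision(y_true, scores, target_precision, candidates):
    """Return (threshold, precision, recall) for the lowest candidate threshold
    that reaches the target precision, or None if no candidate does."""
    for threshold in sorted(candidates):
        y_pred = [1 if s > threshold else 0 for s in scores]
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
        if tp + fp == 0:
            continue  # nothing is flagged at this threshold, precision is undefined
        precision = tp / (tp + fp)
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        if precision >= target_precision:
            return threshold, precision, recall
    return None


# Made-up data: 1 = order delivered in under 20 minutes.
y_true = [0, 0, 0, 1, 0, 1, 1, 0, 1, 1]
scores = [0.20, 0.35, 0.52, 0.58, 0.66, 0.74, 0.83, 0.88, 0.93, 0.97]

result = lowest_threshold_for_precision(
    y_true, scores, target_precision=0.95, candidates=[0.5, 0.6, 0.7, 0.8, 0.9]
)
print(result)  # (0.9, 1.0, 0.4)
```

On this toy data, only the 0.9 threshold reaches the target: every flagged order is indeed fast, but the label is shown for just 40% of the fast orders.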
Since customer satisfaction is the top priority, precision is a better metric.
Real-world example. GitHub has a feature that automatically tags some issues in open-source repositories as “good first issues” to contribute to. For this ML model, they aim for high precision at the cost of recall. Only a small minority of issues are accessible contribution opportunities. At the same time, there is no need to discover all good issues (recall). However, it is essential that the issues that are labeled as “good first issues” are, in fact, appropriate (precision). In this case, optimizing for precision makes sense.
Use case: payment fraud prediction.
Now consider a scenario where you are developing a model to detect fraudulent transactions in a banking system.
In this case, the cost of false positives is high, as it can lead to blocking legitimate transactions and causing inconvenience to the customers. On the other hand, the cost of false negatives is also significant, as it can result in financial loss and lost trust due to fraudulent transactions.
In this case, you need to strike a balance between precision and recall. While you need to detect as many fraudulent transactions as possible (high recall), you must also ensure that you don't flag legitimate transactions as fraudulent too often (high precision). The threshold for detecting a fraudulent transaction must be set carefully, considering the costs associated with both types of errors.
To create a balanced threshold, you can, for example, define the acceptable intervention rate ("How often are we OK to stop legitimate transactions?"). Since the fraud cost can differ, you can also set different thresholds based on the transaction amounts. For example, you can set a lower decision threshold for high-value transactions since they come with a higher potential financial loss. For smaller amounts (which are also more frequent), you can set the threshold higher to ensure you do not inconvenience customers too much.
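As an illustration, an amount-dependent threshold could look like the sketch below. The cut-off amount and both threshold values are arbitrary placeholders, not recommendations; in practice, you would derive them from the error costs and the acceptable intervention rate.

```python
def fraud_threshold(amount, large_amount_cutoff=1000.0,
                    threshold_large=0.3, threshold_small=0.7):
    """Return a lower threshold for large transactions and a higher one for small ones.
    All default values are hypothetical placeholders."""
    return threshold_large if amount >= large_amount_cutoff else threshold_small


def flag_transaction(score, amount):
    """Flag the transaction when the fraud score exceeds the amount-specific threshold."""
    return score > fraud_threshold(amount)


# The same fraud score leads to different decisions depending on the amount.
print(flag_transaction(score=0.5, amount=5000.0))  # True
print(flag_transaction(score=0.5, amount=20.0))    # False
```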
In other applications where precision and recall are equally important, you can also use the F1-score, which combines both metrics into a single number.
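The F1-score is the harmonic mean of precision and recall: F1 = 2 * precision * recall / (precision + recall). As a rough sketch, you could compute it for several candidate thresholds and pick the one with the highest value. The data, the candidate list, and the `f1_at` helper are made up for this example.

```python
def f1_at(y_true, scores, threshold):
    """Compute the F1-score after applying a threshold to the scores."""
    y_pred = [1 if s > threshold else 0 for s in scores]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Made-up labels and scores.
y_true = [0, 0, 1, 0, 1, 1, 0, 1, 1, 1]
scores = [0.12, 0.31, 0.45, 0.55, 0.62, 0.71, 0.83, 0.92, 0.95, 0.99]

thresholds = [0.2, 0.4, 0.5, 0.6, 0.8, 0.9]
f1_by_threshold = {t: f1_at(y_true, scores, t) for t in thresholds}
best_threshold = max(f1_by_threshold, key=f1_by_threshold.get)

print(best_threshold, round(f1_by_threshold[best_threshold], 3))  # 0.4 0.857
```

Keep in mind that the threshold that maximizes F1 on one dataset is only optimal if both error types really matter equally.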
How to translate it to your use case?
You can work with business stakeholders to assign a monetary value to different errors. For example, you can set a ballpark figure for the “cost of falsely contacting the wrong customer” or the “cost of making an incentive offer to the customer.” Then you can refer to these numbers when evaluating the model, discussing potential improvements, and setting the optimal decision threshold.
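Once you have such estimates, one simple option is to compute the total cost of errors at each candidate threshold and choose the cheapest one. The sketch below uses made-up data and made-up unit costs; the helper names are hypothetical.

```python
def total_cost(y_true, scores, threshold, cost_fp, cost_fn):
    """Total cost of errors at a given threshold: FP * cost_fp + FN * cost_fn."""
    y_pred = [1 if s > threshold else 0 for s in scores]
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return fp * cost_fp + fn * cost_fn


def cheapest_threshold(y_true, scores, candidates, cost_fp, cost_fn):
    """Return the candidate threshold with the lowest total error cost."""
    return min(candidates, key=lambda t: total_cost(y_true, scores, t, cost_fp, cost_fn))


# Made-up labels and scores.
y_true = [0, 0, 1, 0, 1, 1, 0, 1, 1, 1]
scores = [0.12, 0.31, 0.45, 0.55, 0.62, 0.71, 0.83, 0.92, 0.95, 0.99]
candidates = [0.2, 0.4, 0.5, 0.6, 0.8, 0.9]

# Missing a positive is 10x more expensive than a false alarm: a low threshold wins.
print(cheapest_threshold(y_true, scores, candidates, cost_fp=1, cost_fn=10))  # 0.4
# A false alarm is 10x more expensive than a miss: a high threshold wins.
print(cheapest_threshold(y_true, scores, candidates, cost_fp=10, cost_fn=1))  # 0.9
```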
Different techniques can help visualize the trade-offs between classification thresholds and think through the choice.
Let’s review some of them! We’ll generate the plots using Evidently, an open-source Python library to evaluate and monitor ML models.
One approach is the precision-recall curve. It shows the value pairs between precision and recall at different thresholds.
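To show what goes into such a curve, here is a minimal sketch that computes the precision-recall pairs by hand, using each distinct predicted score as a threshold. This is not how Evidently builds the plot internally; the function name and data are assumptions for illustration. Here, a prediction counts as positive when the score is greater than or equal to the threshold.

```python
def precision_recall_points(y_true, scores):
    """Return (threshold, precision, recall) for every distinct score used as a threshold."""
    total_positives = sum(y_true)
    points = []
    for threshold in sorted(set(scores), reverse=True):
        # True labels of all objects flagged as positive at this threshold.
        flagged = [t for t, s in zip(y_true, scores) if s >= threshold]
        tp = sum(flagged)
        precision = tp / len(flagged)
        recall = tp / total_positives if total_positives else 0.0
        points.append((threshold, precision, recall))
    return points


# Made-up labels and scores.
y_true = [1, 0, 1, 1, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.2]

pr_points = precision_recall_points(y_true, scores)
for threshold, precision, recall in pr_points:
    print(f"threshold={threshold:.1f} precision={precision:.2f} recall={recall:.2f}")
```

Plotting recall on the x-axis against precision on the y-axis gives the curve.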
To make it easier to digest, you can create a precision-recall table. It would show the specific values of precision and recall and the number of true and false positives at a particular step.
For example, you can take a 5% step every time. This means you first look at the top-5% of model predictions with the highest scores. Then, you look at the top-10%, and so on.
In this case, you do not set the threshold directly but rather look at the top most confident predictions and how these translate into the model quality.
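A rough sketch of such a table is below. It uses a 20% step instead of 5% only because the made-up dataset has just ten rows; the function name and the row layout are assumptions, not the exact format of the Evidently report.

```python
def precision_recall_table(y_true, scores, step_percent=20):
    """Build table rows for the top-N% most confident predictions."""
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    total_positives = sum(y_true)
    rows = []
    for percent in range(step_percent, 101, step_percent):
        k = len(ranked) * percent // 100  # number of top predictions to include
        top_labels = [label for _, label in ranked[:k]]
        tp = sum(top_labels)
        fp = k - tp
        rows.append({
            "top_percent": percent,
            "count": k,
            "tp": tp,
            "fp": fp,
            "precision": tp / k if k else 0.0,
            "recall": tp / total_positives if total_positives else 0.0,
        })
    return rows


# Made-up labels and scores.
y_true = [0, 0, 1, 0, 1, 1, 0, 1, 1, 1]
scores = [0.12, 0.31, 0.45, 0.55, 0.62, 0.71, 0.83, 0.92, 0.95, 0.99]

table = precision_recall_table(y_true, scores)
for row in table:
    print(row)
```

On this toy data, the top-20% of predictions are all correct (precision 1.0) but cover only a third of the positives, while the top-80% capture all positives at a precision of 0.75.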
As another example, you can visualize the class separation quality. It shows how well the model separates two classes by mapping all predicted probabilities on a single plot. You can visualize the potential threshold as a horizontal line: everything above it would be assigned to the positive class.
You can also use the ROC curve that plots the true positive rate (TPR) against the false positive rate (FPR) at various threshold settings.
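As a minimal sketch, the points of the ROC curve can be computed by hand: TPR = TP / (TP + FN) and FPR = FP / (FP + TN) at each threshold. The function name and data below are made up; here, a prediction counts as positive when the score is greater than or equal to the threshold.

```python
def roc_points(y_true, scores):
    """Return (threshold, FPR, TPR) for every distinct score used as a threshold."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    points = []
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for t, s in zip(y_true, scores) if s >= threshold and t == 1)
        fp = sum(1 for t, s in zip(y_true, scores) if s >= threshold and t == 0)
        tpr = tp / positives if positives else 0.0
        fpr = fp / negatives if negatives else 0.0
        points.append((threshold, fpr, tpr))
    return points


# Made-up labels and scores.
y_true = [1, 0, 1, 1, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.2]

roc = roc_points(y_true, scores)
for threshold, fpr, tpr in roc:
    print(f"threshold={threshold:.1f} FPR={fpr:.2f} TPR={tpr:.2f}")
```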
If you want to understand it better, there is a separate guide about the confusion matrix and its elements, such as TPR and FPR, and ROC curve and ROC AUC.
In binary classification problems, there are two classes or labels. The classification threshold determines the point at which an object is considered to belong to the target class.
In multi-class classification problems, there are three or more classes. You need to design an approach that determines how the predicted probabilities for each category convert into the final class label for a given object.
One common approach is simply choosing the class with the highest probability. Another is using a threshold or using a combination of thresholds and probabilities.
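The sketch below combines the two ideas: it picks the class with the highest probability and returns no label when that probability falls below a minimum confidence. The category names, the 0.5 minimum, and the `predict_label` helper are hypothetical.

```python
def predict_label(class_probs, min_confidence=0.0):
    """Return the most probable class, or None if it is below the minimum confidence."""
    label, prob = max(class_probs.items(), key=lambda item: item[1])
    return label if prob >= min_confidence else None


# Made-up class probabilities for two products.
confident = {"shoes": 0.6, "bags": 0.3, "hats": 0.1}
uncertain = {"shoes": 0.4, "bags": 0.35, "hats": 0.25}

print(predict_label(confident, min_confidence=0.5))  # shoes
print(predict_label(uncertain, min_confidence=0.5))  # None
print(predict_label(uncertain))                      # shoes
```

Returning `None` lets you route low-confidence cases elsewhere, for example, to a manual review.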
Real-world example: Shopify has an ML-powered feature to automatically categorize products on the platform. There is a large number of categories, and they choose the category that has the highest confidence score. There is also a minimum confidence requirement. They also manually tuned the confidence thresholds to ensure high performance in sensitive areas.
To quickly calculate and visualize the impact of different classification thresholds for your machine learning models, you can use Evidently, an open-source Python library to evaluate, test and monitor ML models in production.
You need to prepare a dataset that includes the predicted values for each class and the true labels, and pass it to the tool. You will instantly get an interactive report that includes accuracy, precision, recall, F1-score, ROC AUC score and other metrics and visualizations. You can also integrate these model quality checks into your production pipelines.
Evidently allows calculating various additional Reports and Test Suites for model and data quality. Check out Evidently on GitHub and go through the Getting Started Tutorial.
Try our open-source library with over 20 million downloads, or sign up to Evidently Cloud to run no-code checks and bring all the team to a single workspace to collaborate on AI quality.