Precision and Recall at K are common metrics that help evaluate the performance of ranking algorithms. If you are familiar with classification models, the approach is very similar.
In this guide, we’ll explain them in detail.
We also introduce Evidently, an open-source Python library for ML model evaluation and monitoring.
Imagine you're doing your usual weekly online grocery shop. As you are about to check out, you see a list of items conveniently presented with a suggestion to add to your cart. You might see fresh produce like avocados, spinach, and tomatoes. Another user might get a recommendation to buy ice cream, chocolate, and cheese.
Which set of recommendations is a better one? And, if you are developing the system to produce these suggestions for each and every user – how do you know if you are doing a good job?
Precision and Recall are the two classic metrics to evaluate the quality of recommendations. They behave the same as Precision and Recall in classification.
Let’s unpack them step by step.
Precision at K is the share of relevant items among the top K recommended items. Simply put, it shows how many of the recommended or retrieved items are genuinely relevant.
Here is the formula:

Precision at K = (Number of relevant items in the top K) / K
You calculate Precision at K by taking the number of relevant items within the top-k recommendations and dividing it by K.
The K is the arbitrary cut-off rank you can choose to limit your evaluation. This is often practical when you expect the user to engage only with a limited number of items shown to them. Suppose you only display top-6 recommendations in the recommendation block next to the basket checkout on the e-commerce website. In that case, you can evaluate the recall and precision for the first six recommendations and set the "K" to 6.
The relevance is a binary label that is specific to your use case. For example, you might consider the items the user clicked on, added to the cart, or purchased to be relevant.
In our example above, we might see that the user bought two of the suggested items. The Precision at six would be 2/6 = 0.33.
Precision answers the question: Out of the top-k items suggested, how many are actually relevant to the user?
Once you compute Precision for the individual user lists or queries, you can average it across all users or queries to get the overall system performance assessment.
Let's take another simple example. Suppose you're dealing with a top-10 recommendation scenario. Out of 10 recommended items, the user interacted with 6 items – we will treat this as a binary signal of relevance. In this case, the Precision at 10 is 60%. Out of 10 recommendations, 6 were good ones. The system is doing a decent job.
What if we zoom in on the top 5 recommendations only? In this subset, the user interacted with 4 out of 5 items. The Precision at 5 is then 80%.
As you can notice, the specific choice of K affects the calculation result – sometimes significantly.
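The two calculations above can be sketched in a few lines of Python. The `precision_at_k` helper and the sample interaction labels are illustrative, not part of any library:

```python
def precision_at_k(relevant, k):
    """Fraction of the top-k recommendations that are relevant."""
    return sum(relevant[:k]) / k

# Binary relevance labels for 10 ranked recommendations, matching the
# example above: the user interacted with 6 of the 10 items, 4 of them
# within the top 5.
interactions = [1, 1, 0, 1, 1, 0, 1, 0, 1, 0]

print(precision_at_k(interactions, 10))  # 0.6
print(precision_at_k(interactions, 5))   # 0.8
```

Changing K from 10 to 5 moves the result from 60% to 80% on the same list, which is exactly the sensitivity described above.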
What is K, and what counts as relevance? Check out the introduction to evaluating recommendation systems.
Choosing Precision at K as an evaluation metric has its pros and cons.
Interpretability. It’s easy to understand Precision. This makes it a suitable metric for communicating the system quality to stakeholders and users.
Focus on accuracy. Precision effectively conveys the correctness of recommendations. Choose this metric to track how many recommended items inside K are relevant, regardless of the order.
No rank awareness. On the downside, Precision does not help much if you care about ranking quality inside the list. It does not consider which exact positions relevant items occupy. Precision will yield the same result as long as the total number of relevant items in K is the same.
Let’s take two lists with the same number of relevant results (5 out of 10). In the first list, the relevant items are at the very top. In the second, they are at the very bottom of the list. The Precision will be 50% in both cases.
This behavior is often not optimal. You might expect the system to be able to place more relevant items on top. In this case, you can choose a different evaluation metric that rewards such behavior: for example, rank-aware evaluation metrics like NDCG or MAP.
Sensitivity to the number of relevant items. The Precision value at K may vary across lists since it depends on the total number of relevant items in each. You might have only a few relevant items for some users and many for others. Because of this, averaging Precision across lists can be unpredictable and may not accurately reflect the system's ability to make suitable recommendations.
In addition, it is impossible to reach perfect Precision when the K is larger than the total number of relevant items in the dataset. Say you look at the top 10 recommendations while the total number of relevant items in the dataset is 3. Even if the system can find them all and correctly place them at the top of the list, the Precision at ten will be only 30%.
At the same time, this is the maximum achievable Precision for this list – and you can consider it a great result! Metrics like MAP and NDCG can address this limitation: in the case of perfect ranking, the MAP and NDCG at K will be 1.
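The ceiling effect from the example above can be sketched as follows (the labels are hypothetical: only 3 relevant items exist in the whole dataset, and the system ranks all of them first):

```python
def precision_at_k(relevant, k):
    """Fraction of the top-k recommendations that are relevant."""
    return sum(relevant[:k]) / k

# A perfect ranking over a dataset with only 3 relevant items:
# Precision at 10 still cannot exceed 3/10.
best_possible = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]

print(precision_at_k(best_possible, 10))  # 0.3
```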
Summing up. Precision at K is a good measure when you want to check how accurate the model recommendations are without getting into the details of how well the items are ranked.
Precision is usually suitable when the expected number of relevant items is large, but the user's attention is limited. For instance, if you have thousands of different products but plan to show only the top 5 recommendations to the user, Precision at K helps evaluate how good your shortlist is.
Recall at K measures the proportion of correctly identified relevant items in the top K recommendations out of the total number of relevant items in the dataset. In simpler terms, it indicates how many of the relevant items you could successfully find.
Here is the formula:

Recall at K = (Number of relevant items in the top K) / (Total number of relevant items in the dataset)
You can calculate the Recall at K by dividing the number of relevant items within the top-k recommendations by the total number of relevant items in the entire dataset. This way, it measures the system's ability to retrieve all relevant items in the dataset.
As with Precision, the K represents a chosen cut-off point of the top ranks you consider for evaluation. Relevance is a binary label specific to the use case, such as click, purchase, etc.
Recall at K answers the question: Out of all the relevant items in the dataset, how many could you successfully include in the top-K recommendations?
Let's use a simple example to understand Recall at K. Imagine you have a list of top 10 recommendations, and there are a total of 8 items in the dataset that are actually relevant.
If the system includes 5 relevant items in the top 10, the Recall at 10 is 62.5% (5 out of 8).
Now, let's zoom in on the top 5 recommendations. In this shorter list, we have only 3 relevant suggestions. The Recall at 5 is 37.5% (3 out of 8). This means the system captured less than half of the relevant items within the top 5 recommendations.
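The same walkthrough in code (the `recall_at_k` helper and the relevance labels are illustrative, not part of any library):

```python
def recall_at_k(relevant, k, total_relevant):
    """Fraction of all relevant items captured in the top-k recommendations."""
    return sum(relevant[:k]) / total_relevant

# 10 ranked recommendations; the dataset contains 8 relevant items overall.
# 3 relevant items land in the top 5 and 5 in the top 10.
hits = [1, 0, 1, 0, 1, 1, 0, 1, 0, 0]

print(recall_at_k(hits, 10, total_relevant=8))  # 0.625
print(recall_at_k(hits, 5, total_relevant=8))   # 0.375
```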
Here are some pros and cons of the Recall at K metric.
Interpretability. Recall at K is easy to understand. This makes it accessible to non-technical users and stakeholders.
Focus on coverage. If you want to know how many relevant items (out of their total possible number) the system captures in the top K results, Recall effectively communicates this.
There are scenarios where you prefer Recall over Precision. For example, if you deal with search engines where you expect many possible good results, Precision is a great metric to focus on. But if you deal with narrower information retrieval, like legal search or finding a document on your laptop, Recall (the ability to detect all relevant documents that match a given query) could be more critical, even at the cost of lower Precision.
No rank awareness. Like precision, Recall is indifferent to the order of relevant items in the ranking. It cares about capturing as many relevant items as possible within the top K, regardless of their specific order.
Requires knowing the number of all relevant items. To compute Recall, you need to know the total number of relevant items in the dataset. You cannot always do this; for example, you do not know the true relevance labels for items the user never saw.
Sensitivity to the total number of relevant items. Similar to Precision, the achievable Recall at K varies based on the total number of relevant items in the dataset. If the total number of relevant items is larger than K, you cannot reach the perfect Recall.
Say you look at the top 5 recommendations for a dataset with 10 relevant items. Even if all of the recommendations in the top 5 are relevant, the Recall will be at most 50%. In such cases, choosing a metric like precision could be more appropriate.
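In code (again with an illustrative helper and made-up labels), a perfect top 5 still caps out at 50% Recall when the dataset holds 10 relevant items:

```python
def recall_at_k(relevant, k, total_relevant):
    """Fraction of all relevant items captured in the top-k recommendations."""
    return sum(relevant[:k]) / total_relevant

# Every one of the top 5 recommendations is relevant, but the dataset
# holds 10 relevant items in total, so Recall at 5 cannot exceed 0.5.
perfect_top_5 = [1, 1, 1, 1, 1]

print(recall_at_k(perfect_top_5, 5, total_relevant=10))  # 0.5
```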
Additionally, this variability of Recall can also cause issues when averaging Recall across users and lists with varying numbers of relevant items.
Summing up. Recall at K is a valuable measure to help assess how many relevant items the system successfully captured in the top K. It does not consider their rankings. Recall is well-suited for situations with a finite number of relevant items you expect to capture in the top results. For example, in information retrieval, when there is a small number of documents that match a specific user query, you might be interested in tracking that all of them appear in the top system responses.
As explained above, Precision and Recall at K help capture different aspects of the ranking or recommender system performance. Precision focuses on how accurate the system is when it recommends items, while recall emphasizes capturing all relevant items, even if it means recommending some irrelevant ones.
Sometimes, you want to account for both at the same time.
The F Beta score at K combines these two metrics into a single value to provide a balanced assessment. The Beta parameter allows you to adjust the importance given to recall relative to precision.
Here is the formula:

F Beta at K = (1 + Beta²) × Precision at K × Recall at K / (Beta² × Precision at K + Recall at K)
A Beta greater than 1 prioritizes recall, whereas a Beta less than 1 favors precision. When Beta is 1, it becomes a traditional F1 score, a harmonic mean of precision and recall.
The resulting score ranges from 0 to 1. A higher F Beta at K means better overall performance, accounting for both false positive and false negative errors.
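Here is a minimal sketch of the combination. The `f_beta_at_k` helper is illustrative; the Precision (0.6) and Recall (0.625) values are borrowed from the earlier worked examples:

```python
def f_beta_at_k(precision, recall, beta=1.0):
    """Weighted harmonic mean of Precision at K and Recall at K."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

p, r = 0.6, 0.625

print(f_beta_at_k(p, r, beta=1.0))  # classic F1 score
print(f_beta_at_k(p, r, beta=2.0))  # weights Recall more heavily
print(f_beta_at_k(p, r, beta=0.5))  # weights Precision more heavily
```

Since Recall (0.625) is slightly higher than Precision (0.6) here, the Beta=2 variant scores higher than the Beta=0.5 variant.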
Want to understand other metrics? Start with this overview guide.
Evidently is an open-source Python library that helps evaluate, test and monitor machine learning models, including ranking and recommendations. Evidently computes and visualizes 15+ different ranking metrics, from Precision and Recall to behavioral metrics like serendipity and diversity.
By passing your dataset, you can quickly generate a comprehensive report with multiple metrics and interactive visualizations out of the box.
You can also use Evidently to run CI/CD tests, for example, to evaluate the model quality after retraining and deploy a live monitoring dashboard to keep track of the model metrics and test results over time.
Would you like to learn more? Check out the open-source Getting Started tutorials.
Don’t want to deal with deploying and maintaining the ML monitoring service? Sign up for the Evidently Cloud, a SaaS ML observability platform built on top of Evidently open-source.