on this page
Mean Average Precision (MAP) at K is one of the metrics that helps evaluate the quality of ranking and recommender systems. It measures both the relevance of suggested items and how good the system is at placing more relevant items at the top.
In this article, we explain it step by step.
We also introduce Evidently, an open-source Python library for ML model evaluation and monitoring.
Say you typed a query in Google and hit search. Chances are, this is how you landed on this article. What happens behind the scenes is a ranking system in action that processes your query (such as “What is MAP?”) and returns a sorted list of likely relevant results (links to articles on the topic).
If you are to build a ranking, recommendation, or search system like this – how do you evaluate if the results they return are any good? You can pick one of several ranking quality metrics to assess the ranking quality. One of them is MAP.
Mean Average Precision (MAP) at K is a quality metric that helps evaluate the ability of the recommender or ranking system to return relevant items in the top-K results while placing more relevant items at the top. We can express it as the following:
In the case of a search engine, you could look at K = 10 since this is how many results fit on the first page. If you are to evaluate how the system performs on a particular group of queries, U can include all searches on a specific topic. You would then aggregate the performance across related keywords for an overall score.
In the case of a recommender system, you can define K based on the expected number of recommendations an average user will see. For example, you can base it on the size of the recommendation block. U would then include all users in the dataset.
What’s left in the formula is the Average Precision (AP). What is it exactly? It is not the same as the “usual” Precision at K, reflecting a share of relevant items in the list. Instead, AP evaluates the ranking quality: to compute AP, you must average the Precision at all relevant ranks within a given list.
Read on for a step-by-step explanation.
Confusion alert! In computer vision, mean average precision (MAP) is often used to evaluate the accuracy of object detection algorithms. While the idea is similar, this article focuses on explaining MAP for ranking and recommendations.
Let’s refresh the idea of Precision. Precision evaluates the share of relevant results in all retrieved or recommended items. Simply put, it shows how many predictions are “correct” or high quality – based on your definition of relevance. Say your model returned 100 recommendations. Twenty of them were relevant. Precision is 20%.
What is the relevance? Check out the introduction to evaluating recommendation systems.
Precision at K is a common variation. In this scenario, you can look at the fraction of relevant items only in the top-K recommendations provided by the system. Applying such a cut-off is useful since users typically only interact with a limited number of items: you want to make sure that these are the ones the ranking system gets right. The value of the K parameter is entirely your choice.
Suppose you look only at the top 10 items. Inside this subset, five recommendations are relevant. Precision at 10 is 50%.
You can compute Precision for every user list (if you deal with recommendations) or each query (if you deal with information retrieval) in your dataset. You can also aggregate the Precision values across all lists to get a picture of the “overall correctness” of the model output. This is an easy-to-interpret metric with some caveats: the Precision values can vary across lists if the total number of relevant items is variable.
Want a deeper dive? Check out the guide to Precision and Recall in recommendations.
Precision at K works for many evaluation scenarios. Say you deal with e-commerce recommendations: the more relevant items get into each recommendation block, the more likely users will find something they’d like to buy. Optimizing for Precision in top-K would make sense.
However, Precision has a downside. This metric only considers the presence of the relevant items but does not take into account their order. Regardless of whether the 5 relevant items take positions 1 through 5 or 6 through 10, the Precision will be the same.
This is not always ideal. You might also care about the ranking order and expect the system to arrange the recommendations correctly, putting more relevant items ahead of less relevant ones. Average Precision is the metric that helps address this.
Average Precision (AP) at K is computed as an average of Precision values at all the relevant positions within K. We can express it as the following:
Let’s unpack it!
Let’s take an example to illustrate this further.
Say we have top-6 recommended items where three items are relevant. The usual Precision@6 would be 50%. However, the Average Precision value will vary based on the ranking order.
Suppose the relevant items are in positions 1, 4, and 5. In this case, the Average Precision will be 70%. Positions 2, 3, and 6 are irrelevant and do not contribute to the AP calculation.
What if all the relevant recommendations were at the top of the list instead? We get the ideal ranking in this case, and the Average Precision is 1.
Let’s come up with a few more combinations to illustrate the behavior of the metric. We’ll keep the total number of relevant items the same: 3 out of 6, but consider different ranking orders.
As you can notice, this metric favors getting the top recommendations right and penalizes the system for errors in the early positions. When we put all three relevant items into the second half of the list, the Average Precision is only 0.38.
The reason for this behavior lies in the Precision calculation at each relevant rank: errors in the early positions propagate downstream. You repeatedly factor them at each following computation.
Finally, we can get back to the initial MAP formula. We must aggregate the Average Precision (AP) values to get the Mean Average Precision. After computing the AP scores for each user lost or query, we can average them across all users or queries.
Here is our MAP:
If you have 100 users, you sum AP for each one and divide by 100. That’s it. Now we know something about the overall quality of a ranking!
The MAP values can range from 0 to 1. The higher the MAP, the better the system can place relevant items high in the list.
MAP metric rewards the system’s ability to place relevant items at the top.
Say you are looking at the top-10 search results. In an ideal scenario, all the documents on the page should be relevant. But what if only a couple of them are? In this case, they should appear at the top of the page rather than at the bottom. This is precisely the behavior the MAP metric encourages.
This intuitively matches the “good” behavior of a system like search. However, marginal changes in the MAP value between 0 and 1 can be less intuitive. Unlike simple Precision or Recall, MAP does not have immediate real-world interpretation.
Let’s try to add some more intuition to this metric.
In essence, Average Precision provides a single value aggregating the model Precision across different Recall levels. To remind, Recall reflects the share of correctly retrieved relevant items out of the total number of relevant items in the dataset.
Let’s walk through this computation:
Sound familiar? We did the same when introducing the AP formula at the beginning of the article. We just referred to computing Precision at “every relevant rank” instead of “when the value of Recall changes.” But after all, it is the same thing. This offers a fresh perspective on AP, treating it as a form of weighted Precision. You consider Precision at the points of increasing Recall and disregard it otherwise.
You can visualize this using the Precision-Recall curve. It plots the Precision values against different Recall values at each K. This helps you visualize the step changes as you move down the ranked list. Then, you can think of Average Precision as the interpolated area under the Precision-Recall curve.
For example, let’s map the values for a scenario with 6 ranks, and 3 total relevant items. Out of 3 relevant items, all three are at the top. This situation represents a perfect ranking. Our Precision equals 1 at each relevant position.
Since all relevant items are at the top, the curve remains flat at the top of the graph until you reach the point where you have found all 3 relevant items and reached the maximum Recall. It feels pretty intuitive that the Average Precision (the area of the resulting square) is also one.
How will it look for our second scenario?
In this case, we have a sharp fall after the second item – until we encounter the last relevant item at the bottom of the list. You can notice a typical zigzag pattern, where the Precision jumps a bit after we find the following relevant item: this is often how this Precision-Recall curve looks in practice. Still, in this case, the total surface of the area remains quite large (AP = 0.83): we got the first two items right, and their contribution is significant.
Finally, what if we look at our worst possible rank, where all 3 relevant items were at the bottom?
We started pretty low since the first predictions were not a match. At the first relevant position (rank 4), our Precision is only 0.25. Even though it recovers moving along, the total area under the curve remains smaller, noticeable in the AP value of 0.38.
Importantly, this illustration shows AP for the complete ranked list, not the AP at K. In practice, you might have relevant items outside K. Thus, you won't achieve the Recall of 1. However, illustrating this behavior gives an intuition of how the metric penalizes early errors. AP at K is a partial summary of the Precision-Recall curve. While you may not calculate AP at K as the area under the curve, you can think of it conceptually as the average height of the precision curve up to rank K.
Area under the PR curve (PR AUC). There are different ways to calculate the PR AUC, for example, the trapezoidal rule. Precision-Recall interpolation is another approach where you consider the maximum Precision at a given or higher Recall level. (Check out this visual explanation or a video for some more details). Average Precision (AP) is equivalent to the interpolated PR AUC, as the interpolation step in PR AUC captures the same concept of averaging Precision values at different recall levels.
Let’s sum up what we’ve just learned about the MAP behavior:
No single metric is perfect. Here are some pros and cons of MAP.
What are other ranking metrics? If you care about the ranking quality, there are other metrics to consider. Check out the deep dives into Normalized Discounted Cumulative Gain (NDCG) and Mean Reciprocal Rank (MRR).
Ultimately, in many scenarios, you might use several metrics simultaneously to get a well-rounded evaluation of your system.
Evidently is an open-source Python library that helps evaluate, test and monitor machine learning models, including ranking and recommendations. Evidently helps compute 15+ different ranking metrics, from MAP to behavioral metrics like serendipity and diversity.
By passing your dataset, you can quickly generate a comprehensive report with multiple metrics and interactive visualizations out of the box.
You can also use Evidently to run CI/CD tests, for example, to evaluate the model quality after retraining. You can also deploy a live monitoring dashboard to keep track of the model metrics and test results over time.
Would you like to learn more? Check out the open-source Getting Started tutorials.
Don’t want to deal with deploying and maintaining the ML monitoring service? Sign up for the Evidently Cloud, a SaaS ML observability platform built on top of Evidently open-source.
Get early access ⟶