Mean Average Precision (MAP) in ranking and recommendations

Last updated: January 9, 2025

Mean Average Precision (MAP) at K is one of the metrics that helps evaluate the quality of ranking and recommender systems. It measures both the relevance of suggested items and how good the system is at placing more relevant items at the top.

In this article, we explain it step by step.

We also introduce Evidently, an open-source Python library for ML model evaluation and monitoring.

TL;DR

Mean Average Precision (MAP) is a ranking quality metric. It considers the number of relevant recommendations and their position in the list.
MAP at K is calculated as an arithmetic mean of the Average Precision (AP) at K across all users or queries.
To compute the Average Precision (AP) at K, you must average the precision at each relevant position in the K-long ranked list.
MAP can take values from 0 to 1, where 1 corresponds to an ideal ranking with all relevant items at the top. Higher values of MAP mean better performance.

Say you typed a query in Google and hit search. Chances are, this is how you landed on this article. What happens behind the scenes is a ranking system in action that processes your query (such as “What is MAP?”) and returns a sorted list of likely relevant results (links to articles on the topic).

If you are building a ranking, recommendation, or search system like this, how do you evaluate whether the results it returns are any good? You can pick one of several ranking quality metrics. One of them is MAP.

How to calculate MAP

Mean Average Precision (MAP) at K is a quality metric that helps evaluate the ability of the recommender or ranking system to return relevant items in the top-K results while placing more relevant items at the top. We can express it as the following:

$$\text{MAP@K} = \frac{1}{U} \sum_{u=1}^{U} \text{AP@K}_u$$

Where:

K is a chosen cutoff point.
U is the total number of users (in case of recommendations) or queries (in case of information retrieval) in the evaluated dataset.
AP is the average Precision for a given ranking list.

In the case of a search engine, you could look at K = 10 since this is how many results fit on the first page. If you are to evaluate how the system performs on a particular group of queries, U can include all searches on a specific topic. You would then aggregate the performance across related keywords for an overall score.

In the case of a recommender system, you can define K based on the expected number of recommendations an average user will see. For example, you can base it on the size of the recommendation block. U would then include all users in the dataset.

What’s left in the formula is the Average Precision (AP). What is it exactly? It is not the same as the “usual” Precision at K, which reflects the share of relevant items in the list. Instead, AP evaluates the ranking quality: to compute AP, you must average the Precision at all relevant ranks within a given list.

Read on for a step-by-step explanation.

Confusion alert! In computer vision, mean average precision (MAP) is often used to evaluate the accuracy of object detection algorithms. While the idea is similar, this article focuses on explaining MAP for ranking and recommendations.

What is Precision?

Let’s refresh the idea of Precision. Precision evaluates the share of relevant results in all retrieved or recommended items. Simply put, it shows how many predictions are “correct” or high quality – based on your definition of relevance. Say your model returned 100 recommendations. Twenty of them were relevant. Precision is 20%.

What is relevance? Check out the introduction to evaluating recommendation systems.

Precision at K is a common variation. In this scenario, you can look at the fraction of relevant items only in the top-K recommendations provided by the system. Applying such a cut-off is useful since users typically only interact with a limited number of items: you want to make sure that these are the ones the ranking system gets right. The value of the K parameter is entirely your choice.

Suppose you look only at the top 10 items. Inside this subset, five recommendations are relevant. Precision at 10 is 50%.
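To make the arithmetic concrete, here is a minimal Python sketch of Precision at K. The function name `precision_at_k`, the 0/1 relevance encoding, and the choice to divide by K even for shorter lists are illustrative assumptions, not a reference implementation.

```python
def precision_at_k(relevance, k):
    """Share of relevant items among the first k ranked positions.

    `relevance` is a list of 0/1 flags in ranked order (1 = relevant).
    Assumption: we always divide by k, even if the list is shorter than k.
    """
    if k <= 0:
        raise ValueError("k must be a positive integer")
    return sum(relevance[:k]) / k


# Ten recommendations, five of them relevant (positions chosen for illustration).
ranked = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
print(precision_at_k(ranked, 10))  # 0.5
```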

You can compute Precision for every user list (if you deal with recommendations) or each query (if you deal with information retrieval) in your dataset. You can also aggregate the Precision values across all lists to get a picture of the “overall correctness” of the model output. This is an easy-to-interpret metric with some caveats: the Precision values can vary across lists if the total number of relevant items is variable.

Want a deeper dive? Check out the guide to Precision and Recall in recommendations.

Precision at K works for many evaluation scenarios. Say you deal with e-commerce recommendations: the more relevant items get into each recommendation block, the more likely users will find something they’d like to buy. Optimizing for Precision in top-K would make sense.

However, Precision has a downside. This metric only considers the presence of the relevant items but does not take into account their order. Regardless of whether the 5 relevant items take positions 1 through 5 or 6 through 10, the Precision will be the same.
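A quick sketch of this blind spot, using a hypothetical `precision_at_k` helper and 0/1 relevance flags: two lists with opposite orderings get the same score.

```python
def precision_at_k(relevance, k):
    """Share of relevant items (flag 1) among the first k positions."""
    return sum(relevance[:k]) / k


relevant_first = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]  # relevant items at positions 1-5
relevant_last = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]   # relevant items at positions 6-10

print(precision_at_k(relevant_first, 10))  # 0.5
print(precision_at_k(relevant_last, 10))   # 0.5
```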

This is not always ideal. You might also care about the ranking order and expect the system to arrange the recommendations correctly, putting more relevant items ahead of less relevant ones. Average Precision is the metric that helps address this.

Average Precision (AP)

Average Precision (AP) at K is computed as an average of Precision values at all the relevant positions within K. We can express it as the following:

$$\text{AP@K} = \frac{1}{N} \sum_{k=1}^{K} \text{Precision}(k) \cdot \text{rel}(k)$$

Where:

N is the total number of relevant items for a particular user within the top-K results.
Precision(k) is the precision calculated at each position.
rel(k) equals 1 if the item at position k is relevant and 0 otherwise.

Let’s unpack it!

Step 1. First, you must know the total number of relevant items in the top-K results. This will be the N.
Step 2. Then, you must identify the position of each relevant item within K and calculate Precision at each of those positions. The Precision computation goes as usual: you divide the number of relevant items found up to and including this position by the total number of items up to this position. There is one caveat: we only look at the Precision at ranks where the items are relevant. We skip all the positions where the items are not relevant.
Step 3. Finally, we sum up the Precision values at each relevant position and divide the sum by the total number of relevant items.
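The three steps above can be sketched in a few lines of Python. The function name `average_precision_at_k`, the 0/1 relevance flags, and the convention of returning 0 when the top-K contains no relevant items are assumptions made for illustration.

```python
def average_precision_at_k(relevance, k):
    """Average Precision at K for one ranked list of 0/1 relevance flags."""
    top_k = relevance[:k]

    # Step 1: N is the number of relevant items within the top-K.
    n_relevant = sum(top_k)
    if n_relevant == 0:
        return 0.0  # assumed convention: no relevant items means AP = 0

    # Step 2: compute Precision only at positions that hold a relevant item.
    hits = 0
    precision_sum = 0.0
    for position, is_relevant in enumerate(top_k, start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / position

    # Step 3: divide the summed Precision values by N.
    return precision_sum / n_relevant


# Relevant items at positions 1 and 3: (1/1 + 2/3) / 2
print(round(average_precision_at_k([1, 0, 1], 3), 2))  # 0.83
```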

Let’s take an example to illustrate this further.

Say we have top-6 recommended items where three items are relevant. The usual Precision@6 would be 50%. However, the Average Precision value will vary based on the ranking order.

Suppose the relevant items are in positions 1, 4, and 5. In this case, the Average Precision will be 70%. Positions 2, 3, and 6 are irrelevant and do not contribute to the AP calculation.
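Written out, the computation for this ranking looks as follows (AP@6 is shorthand for Average Precision at K = 6):

```latex
\text{AP@6} = \frac{1}{3}\left(\frac{1}{1} + \frac{2}{4} + \frac{3}{5}\right)
            = \frac{1}{3}\,(1 + 0.5 + 0.6)
            = 0.7
```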

What if all the relevant recommendations were at the top of the list instead? We get the ideal ranking in this case, and the Average Precision is 1.

Let’s come up with a few more combinations to illustrate the behavior of the metric. We’ll keep the total number of relevant items the same: 3 out of 6, but consider different ranking orders.

Comparing such orderings shows that this metric favors getting the top recommendations right and penalizes the system for errors in the early positions. When we put all three relevant items into the second half of the list, the Average Precision is only 0.38.

The reason for this behavior lies in the Precision calculation at each relevant rank: errors in the early positions propagate downstream. You repeatedly factor them at each following computation.
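Here is a small sketch that scores a few orderings of 3 relevant items out of 6. The helper `average_precision_at_k` and the chosen orderings are illustrative assumptions; the values follow directly from the AP definition above.

```python
def average_precision_at_k(relevance, k):
    """Average Precision at K for one ranked list of 0/1 relevance flags."""
    top_k = relevance[:k]
    n_relevant = sum(top_k)
    if n_relevant == 0:
        return 0.0  # assumed convention for lists with no relevant items
    hits, precision_sum = 0, 0.0
    for position, is_relevant in enumerate(top_k, start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / position
    return precision_sum / n_relevant


# Keys name the positions of the three relevant items.
rankings = {
    "1, 2, 3": [1, 1, 1, 0, 0, 0],
    "1, 2, 6": [1, 1, 0, 0, 0, 1],
    "1, 4, 5": [1, 0, 0, 1, 1, 0],
    "2, 4, 6": [0, 1, 0, 1, 0, 1],
    "4, 5, 6": [0, 0, 0, 1, 1, 1],
}
scores = {name: average_precision_at_k(flags, 6) for name, flags in rankings.items()}
for name, score in scores.items():
    print(f"{name}: {score:.2f}")
# 1, 2, 3: 1.00
# 1, 2, 6: 0.83
# 1, 4, 5: 0.70
# 2, 4, 6: 0.50
# 4, 5, 6: 0.38
```

Precision@6 is 0.5 for every one of these lists, so the spread in the scores comes entirely from the ordering.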

Mean Average Precision (MAP)

Finally, we can get back to the initial MAP formula. We must aggregate the Average Precision (AP) values to get the Mean Average Precision. After computing the AP scores for each user list or query, we can average them across all users or queries.

Here is our MAP:

$$\text{MAP@K} = \frac{1}{U} \sum_{u=1}^{U} \text{AP@K}_u$$

If you have 100 users, you sum AP for each one and divide by 100. That’s it. Now we know something about the overall quality of a ranking!
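As a sketch, the aggregation is a plain arithmetic mean over per-user AP values. The function names and the three toy users below are hypothetical, and an empty dataset is treated as an error by assumption.

```python
def average_precision_at_k(relevance, k):
    """Average Precision at K for one ranked list of 0/1 relevance flags."""
    top_k = relevance[:k]
    n_relevant = sum(top_k)
    if n_relevant == 0:
        return 0.0  # assumed convention for lists with no relevant items
    hits, precision_sum = 0, 0.0
    for position, is_relevant in enumerate(top_k, start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / position
    return precision_sum / n_relevant


def mean_average_precision_at_k(relevance_lists, k):
    """Arithmetic mean of AP at K across all users or queries."""
    if not relevance_lists:
        raise ValueError("need at least one ranked list")
    ap_values = [average_precision_at_k(flags, k) for flags in relevance_lists]
    return sum(ap_values) / len(ap_values)


# Toy dataset: one ranked list of relevance flags per user.
users = {
    "user_a": [1, 1, 1, 0, 0, 0],  # AP = 1.0
    "user_b": [1, 0, 0, 1, 1, 0],  # AP = 0.7
    "user_c": [0, 0, 0, 1, 1, 1],  # AP is about 0.38
}
map_score = mean_average_precision_at_k(list(users.values()), 6)
print(round(map_score, 2))  # 0.69
```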

Intuition behind MAP

The MAP values can range from 0 to 1. The higher the MAP, the better the system can place relevant items high in the list.

MAP equals 1 in the case of perfect ranking when all relevant documents or items are at the top of the list.
MAP equals 0 when no relevant objects are retrieved.
MAP can be between 0 and 1 in all other cases. The closer the MAP score is to 1, the better the ranking performance.

The MAP metric rewards the system’s ability to place relevant items at the top.

Say you are looking at the top-10 search results. In an ideal scenario, all the documents on the page should be relevant. But what if only a couple of them are? In this case, they should appear at the top of the page rather than at the bottom. This is precisely the behavior the MAP metric encourages.

This intuitively matches the “good” behavior of a system like search. However, marginal changes in the MAP value between 0 and 1 can be less intuitive. Unlike simple Precision or Recall, MAP does not have an immediate real-world interpretation.

Let’s try to add some more intuition to this metric.

In essence, Average Precision provides a single value aggregating the model Precision across different Recall levels. As a reminder, Recall reflects the share of correctly retrieved relevant items out of the total number of relevant items in the dataset.

Let’s walk through this computation:

You move down the ranked list, looking for the relevant items.
Each time you find a new relevant item, your Recall increases, since you add one more True Positive to the numerator in the Recall formula. Bingo! At each such rank, you compute the corresponding Precision.
If you don’t find a relevant item, Recall stays the same. These are the items you don’t consider. You skip the Precision computation since the Recall does not change.
Then, you aggregate the Precision computed at all positions where the item was relevant.
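The walk down the list can be sketched as follows. The function name `precision_recall_points` is hypothetical, and Recall here is computed against the relevant items in this one list, which is an assumption that matches the toy example.

```python
def precision_recall_points(relevance):
    """Return (recall, precision) pairs at each rank where Recall increases."""
    total_relevant = sum(relevance)
    points = []
    hits = 0
    for position, is_relevant in enumerate(relevance, start=1):
        if not is_relevant:
            continue  # Recall is unchanged, so this rank is skipped
        hits += 1
        points.append((hits / total_relevant, hits / position))
    return points


# Relevant items at positions 1, 2, and 6.
points = precision_recall_points([1, 1, 0, 0, 0, 1])
for recall, precision in points:
    print(f"recall={recall:.2f} precision={precision:.2f}")
# recall=0.33 precision=1.00
# recall=0.67 precision=1.00
# recall=1.00 precision=0.50

# Averaging the Precision values at these points gives the AP.
ap = sum(precision for _, precision in points) / len(points)
print(round(ap, 2))  # 0.83
```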

Sound familiar? We did the same when introducing the AP formula at the beginning of the article. We just referred to computing Precision at “every relevant rank” instead of “when the value of Recall changes.” But after all, it is the same thing. This offers a fresh perspective on AP, treating it as a form of weighted Precision. You consider Precision at the points of increasing Recall and disregard it otherwise.

You can visualize this using the Precision-Recall curve. It plots the Precision values against different Recall values at each K. This helps you visualize the step changes as you move down the ranked list. Then, you can think of Average Precision as the area under this step-wise Precision-Recall curve.

For example, let’s map the values for a scenario with 6 ranks, and 3 total relevant items. Out of 3 relevant items, all three are at the top. This situation represents a perfect ranking. Our Precision equals 1 at each relevant position.

Since all relevant items are at the top, the curve remains flat at the top of the graph until you reach the point where you have found all 3 relevant items and reached the maximum Recall. It feels pretty intuitive that the Average Precision (the area of the resulting square) is also one.

How will it look for our second scenario?

In this case, with relevant items at positions 1, 2, and 6, we have a sharp fall after the second item until we encounter the last relevant item at the bottom of the list. You can see a typical zigzag pattern, where the Precision jumps a bit after we find the following relevant item: this is often how this Precision-Recall curve looks in practice. Still, in this case, the total surface of the area remains quite large (AP = 0.83): we got the first two items right, and their contribution is significant.

Finally, what if we look at our worst possible ranking, where all 3 relevant items were at the bottom?

We started pretty low since the first predictions were not a match. At the first relevant position (rank 4), our Precision is only 0.25. Even though it recovers moving along, the total area under the curve remains smaller, as reflected in the AP value of 0.38.

Importantly, this illustration shows AP for the complete ranked list, not the AP at K. In practice, you might have relevant items outside K. Thus, you won't achieve the Recall of 1. However, illustrating this behavior gives an intuition of how the metric penalizes early errors. AP at K is a partial summary of the Precision-Recall curve. While you may not calculate AP at K as the area under the curve, you can think of it conceptually as the average height of the precision curve up to rank K.
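A short sketch of why Recall can stay below 1 at a given cutoff. The helper `recall_at_k` and the 10-item ranking are made up for illustration, and Recall is measured against the relevant items in the full list.

```python
def recall_at_k(relevance, k):
    """Share of all relevant items in the full list that appear in the top-k."""
    total_relevant = sum(relevance)
    if total_relevant == 0:
        return 0.0  # assumed convention when nothing is relevant
    return sum(relevance[:k]) / total_relevant


# Full ranked list with relevant items at positions 1, 4, and 9.
full_ranking = [1, 0, 0, 1, 0, 0, 0, 0, 1, 0]
for k in (3, 6, 10):
    print(k, round(recall_at_k(full_ranking, k), 2))
# 3 0.33
# 6 0.67
# 10 1.0
```

With K = 6, the item at position 9 is never reached, so the Precision-Recall curve for the top-6 stops at a Recall of about 0.67.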

Area under the PR curve (PR AUC). There are different ways to calculate the PR AUC, for example, the trapezoidal rule. Precision-Recall interpolation is another approach where you consider the maximum Precision at a given or higher Recall level. (Check out this visual explanation or a video for some more details.) Average Precision (AP), as computed in this article, corresponds to a step-wise sum under the PR curve: each relevant item adds an equal Recall step, weighted by the Precision at that rank. The interpolated PR AUC follows the same idea of averaging Precision across Recall levels, but it can come out higher when Precision recovers further down the list.
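To see how these summaries relate on toy data, the sketch below computes a step-wise sum under the PR curve and an interpolated variant that takes the maximum Precision at the same or higher Recall. All function names are hypothetical, and real libraries may define interpolation differently, so treat this as an illustration rather than a specification.

```python
def precision_recall_points(relevance):
    """(recall, precision) pairs at each rank that holds a relevant item."""
    total_relevant = sum(relevance)
    points, hits = [], 0
    for position, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            points.append((hits / total_relevant, hits / position))
    return points


def step_area(points):
    """Sum of (Recall step) x (Precision at that rank)."""
    area, previous_recall = 0.0, 0.0
    for recall, precision in points:
        area += (recall - previous_recall) * precision
        previous_recall = recall
    return area


def interpolated_area(points):
    """Same sum, but using the best Precision at this or any higher Recall."""
    area, previous_recall = 0.0, 0.0
    for index, (recall, _) in enumerate(points):
        best_precision = max(precision for _, precision in points[index:])
        area += (recall - previous_recall) * best_precision
        previous_recall = recall
    return area


early = precision_recall_points([1, 1, 0, 0, 0, 1])  # relevant at 1, 2, 6
late = precision_recall_points([0, 0, 0, 1, 1, 1])   # relevant at 4, 5, 6

print(round(step_area(early), 2), round(interpolated_area(early), 2))  # 0.83 0.83
print(round(step_area(late), 2), round(interpolated_area(late), 2))    # 0.38 0.5
```

On this toy data, the step-wise sum reproduces the AP values from the examples, while the interpolated variant matches it for the first ranking and is higher for the second. It is worth checking which variant a given tool reports.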

Let’s sum up what we’ve just learned about the MAP behavior:

It evaluates the Precision at different Recall levels and averages these values. The average in the metric name refers to this aggregation of the Precision values inside a single list.
The mean in the metric name comes from aggregating it across different users or queries to get an overall score for the dataset.
Due to underlying mathematics, MAP heavily favors getting the top results right.
Ultimately, this metric helps reflect a combination of valuable properties: how many relevant results the model can capture and how well they are usually ranked.

Pros and cons

No single metric is perfect. Here are some pros and cons of MAP.

The strongest “pro” of MAP is its sensitivity to rank. Remember the usual Precision and Recall we keep bringing up? They return a single number at a specific K and only consider whether the relevant items are present. In contrast, MAP evaluates how well the system ranks the items, placing the relevant ones on top.
Focus on getting the top results right. MAP heavily penalizes the system for early errors. If getting the top predictions right is essential, MAP is a valuable metric to consider. This makes it a popular choice for information retrieval tasks, where you want users to be able to find the relevant document quickly.

What are other ranking metrics? If you care about the ranking quality, there are other metrics to consider. Check out the deep dives into Normalized Discounted Cumulative Gain (NDCG) and Mean Reciprocal Rank (MRR).

Limited interpretability. On the positive side, MAP exists on a scale of 0 to 1 and clearly defines an “ideal ranking” corresponding to 1. However, the exact values of MAP might still be hard to interpret, especially when communicating to business stakeholders, compared to more straightforward Precision and Recall, or even MRR.

Ultimately, in many scenarios, you might use several metrics simultaneously to get a well-rounded evaluation of your system.

Evaluating ranking with Evidently

Evidently is an open-source Python library that helps evaluate, test and monitor machine learning models, including ranking and recommendations. Evidently helps compute 15+ different ranking metrics, from MAP to behavioral metrics like serendipity and diversity.

By passing your dataset, you can quickly generate a comprehensive report with multiple metrics and interactive visualizations out of the box.

You can also use Evidently to run CI/CD tests, for example, to evaluate the model quality after retraining. You can also deploy a live monitoring dashboard to keep track of the model metrics and test results over time.

Would you like to learn more? Check out the open-source Getting Started tutorials.
