Mean Reciprocal Rank (MRR) explained

Last updated: January 9, 2025

Mean Reciprocal Rank (MRR) is one of the metrics that help evaluate the quality of recommendation and information retrieval systems. Simply put, it helps understand the average position of the first relevant item across all user lists.

In this article, we explain it step by step.

We also introduce Evidently, an open-source Python library for ML model evaluation and monitoring.

TL;DR

Mean Reciprocal Rank (MRR) is a ranking quality metric. It considers the position of the first relevant item in the ranked list.
You can calculate MRR as the mean of Reciprocal Ranks across all users or queries.
A Reciprocal Rank is the inverse of the position of the first relevant item. If the first relevant item is in position 2, the reciprocal rank is 1/2.
MRR values range from 0 to 1, where "1" indicates that the first relevant item is always at the top.
Higher MRR means better system performance.

Start with AI observability

Want to keep tabs on your ranking and recommendation models? Automate the quality checks with Evidently Cloud. Powered by the leading open-source Evidently library with 20m+ downloads.

Start free ⟶Or try open source ⟶

Picture opening a music streaming app and checking out the "songs you might like." In the background is a ranking system at work, curating a list of songs you might enjoy.

You want the recommender to suggest songs you will like right at the top, so you don't have to click "next" too many times until you stumble on a good one. The ability to quickly find a great match is a critical feature in many recommendation and ranking systems.

MRR is the metric that helps assess this.

How to compute MRR

Mean Reciprocal Rank (MRR) at K evaluates how quickly a ranking system can show the first relevant item in the top-K results. Here is the formula that defines MRR:

$$\mathrm{MRR} = \frac{1}{U} \sum_{u=1}^{U} \frac{1}{\mathrm{rank}_u}$$

Where:

U is the total number of users (in case of recommendations) or queries (in case of information retrieval) in the evaluated dataset.
rank_u is the position of the first relevant item for user u in the top-K results.

In our music streaming example, you could calculate MRR across all users to assess how quickly, on average, the recommender system suggests the first song the users enjoy.
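The formula can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used by any particular library: the function names and the sample data are hypothetical, and each ranked list is assumed to be given as binary relevance labels in ranked order (1 = relevant, 0 = not relevant).

```python
from typing import Sequence


def reciprocal_rank(relevance: Sequence[int], k: int) -> float:
    """Return 1/rank of the first relevant item in the top-k, or 0.0 if there is none."""
    for position, is_relevant in enumerate(relevance[:k], start=1):
        if is_relevant:
            return 1.0 / position
    return 0.0


def mean_reciprocal_rank(relevance_lists: Sequence[Sequence[int]], k: int) -> float:
    """Average the reciprocal ranks over all users or queries."""
    if not relevance_lists:
        raise ValueError("need at least one ranked list")
    return sum(reciprocal_rank(r, k) for r in relevance_lists) / len(relevance_lists)


# Hypothetical data: three users, five recommendations each (1 = relevant).
user_lists = [
    [0, 1, 1, 0, 0],  # first relevant item at rank 2 -> RR = 1/2
    [1, 0, 0, 0, 0],  # first relevant item at rank 1 -> RR = 1
    [0, 0, 0, 0, 0],  # nothing relevant in the top 5 -> RR = 0
]
mrr = mean_reciprocal_rank(user_lists, k=5)
print(mrr)  # 0.5
```

Here the three reciprocal ranks are 0.5, 1 and 0, so their mean is 0.5.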

Do you need more details? Let’s go through the computation step by step.

Choose the K

To compute MRR, you need to prepare the dataset and decide on the K parameter, which is the number of top-ranked items you will use in your evaluation.

What is the K parameter? Check out the introduction to evaluating recommendation systems.

Let’s continue with our music streaming example. A recommender system can score every song in the catalog, predicting how likely a given user is to enjoy each one. It then returns a list of items sorted by that score, which can be very long. The same applies to other cases, like e-commerce recommendations or internet searches.

In practice, you often care most about the top of the list since these are the items the users will immediately see. For example, you might only show top-5 songs in the recommendation block or top-10 search results on the first page.
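As a rough sketch of what cutting the list at K looks like, the snippet below sorts one user's predicted scores and keeps the top 5. The song ids, the scores and the helper name `top_k_items` are made up for illustration; a real system would typically produce the scores with its own model.

```python
def top_k_items(scores: dict[str, float], k: int) -> list[str]:
    """Return the k item ids with the highest predicted scores, best first."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k]


# Hypothetical predicted scores for a single user.
predicted_scores = {
    "song_a": 0.91,
    "song_b": 0.15,
    "song_c": 0.78,
    "song_d": 0.42,
    "song_e": 0.66,
    "song_f": 0.05,
    "song_g": 0.30,
}

top_5 = top_k_items(predicted_scores, k=5)
print(top_5)  # ['song_a', 'song_c', 'song_e', 'song_d', 'song_g']
```

Everything below position K is simply dropped from the evaluation, no matter how relevant it might be.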

Identify relevant items

Like other ranking metrics, MRR needs the ground truth. You must identify which items within the top-K recommendations were relevant to the user. For example, if you display multiple products to users, you can capture which ones they clicked on and treat these click events as a reflection of relevance.

In the music streaming example, you can have a more complex definition of relevance. For example, you might consider a song relevant if a user listened to it long enough or if they explicitly put a “thumbs up.” In e-commerce, the measure of relevance might be adding the product to the cart. Ultimately, all you need is a binary “yes” or “no” for every recommended item.

Let’s consider an example. Say you have a list of top music recommendations displayed to a user inside a music app. The items in the list are ordered by relevance as predicted by the recommender system. You consider users clicking and listening to the song for more than 60 seconds as a reflection of relevance. You then contrast the predictions against the actual user actions, where you know that the user listened to songs 2 and 3.
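One way to turn such interaction logs into binary relevance labels is sketched below. The 60-second threshold comes from the example above; the song ids and listening times are hypothetical and chosen so that the user listened to songs 2 and 3.

```python
MIN_LISTEN_SECONDS = 60  # relevance threshold from the example above

# Top-5 recommendations for one user, in ranked order (hypothetical ids).
recommended = ["song_1", "song_2", "song_3", "song_4", "song_5"]

# Hypothetical log: seconds the user listened to each song they clicked on.
listen_seconds = {"song_2": 185, "song_3": 142, "song_4": 12}

# A song is relevant if it was played for more than the threshold.
relevance = [
    int(listen_seconds.get(song, 0) > MIN_LISTEN_SECONDS) for song in recommended
]
print(relevance)  # [0, 1, 1, 0, 0]
```

Note that song 4 was clicked but skipped after 12 seconds, so it does not count as relevant under this definition.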

If you were after a simple accuracy-style measure (the share of recommendations that were relevant), you could quickly count that it equals 40%: 2 out of 5 recommendations were correct. But we are here for MRR, and this is a different calculation.

Identify the first relevant rank

In computing MRR, we only care about the position at which the first relevant item appears. In our example, although we have two relevant songs in the top 5, we will focus only on the song in position 2.
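Finding that position is a simple scan from the top of the list. The helper below is a hypothetical sketch that assumes binary relevance labels in ranked order and returns `None` when nothing relevant appears in the top K.

```python
from typing import Optional, Sequence


def first_relevant_rank(relevance: Sequence[int], k: int) -> Optional[int]:
    """Return the 1-based position of the first relevant item in the top-k, if any."""
    for position, is_relevant in enumerate(relevance[:k], start=1):
        if is_relevant:
            return position
    return None


# Songs 2 and 3 were relevant; only the first of them matters for MRR.
rank = first_relevant_rank([0, 1, 1, 0, 0], k=5)
print(rank)  # 2
```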

Calculate the Reciprocal Rank

Now, you need to compute the so-called reciprocal rank. This is a straightforward concept: you must take the reciprocal (multiplicative inverse) of the position of the first relevant item.

Basically, this is one divided by the rank of the first relevant item:

$$\mathrm{RR}_u = \frac{1}{\mathrm{rank}_u}$$

You get 1 for first place, 1⁄2 for second place, 1⁄3 for third place and so on. If there is no relevant item or correct answer in top-K, the reciprocal rank is 0.

In our example above, the first relevant item is in 2nd place. Thus, the reciprocal rank of a given user list is 0.5.
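A small sketch of this step, using Python's standard `fractions` module so the values print as exact fractions. The function name is hypothetical, and `None` stands for "no relevant item in the top K".

```python
from fractions import Fraction
from typing import Optional


def reciprocal_rank_from_position(rank: Optional[int]) -> Fraction:
    """Convert the position of the first relevant item into a reciprocal rank."""
    if rank is None:
        return Fraction(0)  # no relevant item in the top-K
    if rank < 1:
        raise ValueError("ranks are 1-based")
    return Fraction(1, rank)


for rank in (1, 2, 3, None):
    print(rank, reciprocal_rank_from_position(rank))
# 1 1
# 2 1/2
# 3 1/3
# None 0
```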

Compute the MRR

Now, rinse and repeat. In a production scenario, you have multiple lists, so you must perform the same computation for each user or query. Once you know the reciprocal rank of each list, you can then find the average of all reciprocal ranks.

This is the Mean Reciprocal Rank we were after. Let’s review the complete formula once again:

$$\mathrm{MRR} = \frac{1}{U} \sum_{u=1}^{U} \frac{1}{\mathrm{rank}_u}$$

To illustrate it, let’s continue with a simple example. Say we have 4 different recommendation sets, one per user, with the first relevant result appearing at a different position in each.

Here is how we compute the Reciprocal Rank (RR) for each user:

For user 1, the first relevant item is in position 1, so the RR is 1.
For user 2, the first relevant item is in position 3, so the RR is 1/3 ≈ 0.33.
For user 3, the first relevant item is in position 6, so the RR is 1/6 ≈ 0.17.
For user 4, the first relevant item is in position 2, so the RR is 1/2 = 0.5.

Then, the resulting Mean Reciprocal Rank is: (1+0.33+0.17+0.5)/4 = 2/4 = 0.5
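The same arithmetic can be checked in Python. The positions are the ones from the four-user example; the variable names are illustrative, and K is assumed to be at least 6 so that user 3's relevant item still counts.

```python
# Position of the first relevant item for each user, as in the example.
first_relevant_positions = {"user_1": 1, "user_2": 3, "user_3": 6, "user_4": 2}

# Reciprocal rank per user: 1, 1/3, 1/6 and 1/2.
reciprocal_ranks = {
    user: 1 / position for user, position in first_relevant_positions.items()
}

# MRR is the plain average of the reciprocal ranks.
mrr = sum(reciprocal_ranks.values()) / len(reciprocal_ranks)
print(round(mrr, 4))  # 0.5
```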

How to interpret the MRR

MRR can take values from 0 to 1.

MRR equals 1 when the first recommendation is always relevant.
MRR equals 0 when there are no relevant recommendations in top-K.
MRR can be between 0 and 1 in all other cases. The higher the MRR, the better.

This metric is quite intuitive: MRR reflects how soon, on average, we can find a relevant item for each user within the first K positions. It emphasizes the importance of being correct early in the list.

Setting the K parameter allows you to customize the evaluation to prioritize a certain list depth. For example, if you set K to 3, you only look at the top 3 ranks. If the first relevant item appears beyond that, you will ignore it, and the reciprocal rank for that list will be 0.
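To see how the choice of K changes the result, here is a hedged sketch that reuses the first relevant positions from the four-user example (1, 3, 6 and 2). The helper `mrr_at_k` is hypothetical and takes those positions directly, treating anything beyond K as a miss.

```python
from typing import Optional, Sequence


def mrr_at_k(first_positions: Sequence[Optional[int]], k: int) -> float:
    """MRR where a first relevant item beyond position k counts as a reciprocal rank of 0."""
    reciprocal_ranks = [
        1 / p if p is not None and p <= k else 0.0 for p in first_positions
    ]
    return sum(reciprocal_ranks) / len(reciprocal_ranks)


positions = [1, 3, 6, 2]
for k in (1, 3, 6):
    print(k, round(mrr_at_k(positions, k), 3))
# 1 0.25
# 3 0.458
# 6 0.5
```

With K set to 3, the user whose first relevant item sits at position 6 contributes 0, so the metric drops from 0.5 to about 0.458.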

What is a good MRR? This depends on the use case. For example, if you have a recommender system that suggests a set of five items out of many thousands of possibilities, an MRR of 0.2 might be acceptable. You would get this value if every user found their first relevant item at position 5, or if one in five users found it at position 1 and the rest found nothing relevant in the top K.

However, if you build a question-answering system, you might have different quality criteria. For example, you would expect the system to return the correct answer immediately for a specific group of questions (“What is the capital of France?”). In this case, you might expect MRR to be close to 1 for a highly effective system.

To sum up, here are some pros and cons of the MRR.

MRR is excellent for scenarios with a single correct answer. MRR is often relevant for information retrieval tasks. By looking at MRR, you can quickly grasp how close the correct answer is to the top of the average list. This metric might be less beneficial for applications with many relevant items, such as e-commerce recommendations where users might be interested in a wide range of products.
MRR is easily interpretable. MRR is easy to explain and communicate to product and business stakeholders. It tells us how fast a typical user can find a relevant item.
MRR disregards the relevance of items beyond the first one. MRR cares about the top relevant item, and this item only. It ignores all other ranks. The Reciprocal Rank will be the same whether or not any other relevant items appear in the top K after the first one. If you want to assess the rest of the list, you need other metrics beyond MRR.

Other ranking and recommendation metrics. You might want to supplement MRR with other metrics, such as Precision at K (to evaluate the share of relevant items), Mean Average Precision at K (to evaluate both relevance and ranking quality), or NDCG (to assess the quality of ranking).
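The sketch below shows why a supplementary metric can help. Two hypothetical lists have the same reciprocal rank but a very different share of relevant items, which a simple Precision at K computation picks up. Both helper functions are illustrative and assume binary relevance labels in ranked order.

```python
from typing import Sequence


def reciprocal_rank(relevance: Sequence[int], k: int) -> float:
    """Return 1/rank of the first relevant item in the top-k, or 0.0 if there is none."""
    for position, is_relevant in enumerate(relevance[:k], start=1):
        if is_relevant:
            return 1.0 / position
    return 0.0


def precision_at_k(relevance: Sequence[int], k: int) -> float:
    """Share of the top-k positions that hold a relevant item."""
    return sum(relevance[:k]) / k


sparse_list = [0, 1, 0, 0, 0]  # one relevant item, at rank 2
rich_list = [0, 1, 1, 1, 1]    # four relevant items, the first at rank 2

for name, relevance in (("sparse", sparse_list), ("rich", rich_list)):
    print(name, reciprocal_rank(relevance, k=5), precision_at_k(relevance, k=5))
# sparse 0.5 0.2
# rich 0.5 0.8
```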

Evaluating ranking with Evidently

Evidently is an open-source Python library that helps evaluate, test and monitor machine learning models, including ranking and recommendations. Evidently computes and visualizes 15+ different ranking metrics, from MRR to behavioral metrics like serendipity and diversity.

By passing your dataset, you can quickly generate a comprehensive report with multiple metrics and interactive visualizations out of the box.

You can also use Evidently to run CI/CD tests, for example, to evaluate the model quality after retraining, and deploy a live monitoring dashboard to keep track of the model metrics and test results over time.

Would you like to learn more? Check out the open-source Getting Started tutorials.
