Mean Reciprocal Rank (MRR) is one of the metrics that help evaluate the quality of recommendation and information retrieval systems. Simply put, it reflects how high up, on average, the first relevant item appears across all user lists.
In this article, we explain it step by step.
We also introduce Evidently, an open-source Python library for ML model evaluation and monitoring.
Picture opening a music streaming app and checking out the "songs you might like." In the background is a ranking system at work, curating a list of songs you might enjoy.
You want the recommender to suggest songs you will like right at the top, so you don't have to click "next" too many times until you stumble on a good one. The ability to quickly find a great match is a critical feature in many recommendation and ranking systems.
MRR is the metric that helps assess this.
Mean Reciprocal Rank (MRR) at K evaluates how quickly a ranking system can show the first relevant item in the top-K results. Here is the formula that defines MRR:

MRR = (1 / N) * Σ (1 / rankᵢ)

where N is the total number of user lists, and rankᵢ is the position of the first relevant item in list i. If a list has no relevant item in the top K, its term is 0.
In our music streaming example, you could calculate MRR across all users to assess how quickly, on average, the recommender system suggests the first song the users enjoy.
Do you need more details? Let’s go through the computation step by step.
To compute MRR, you need to prepare the dataset and decide on the K parameter, which is the number of top-ranked items you will use in your evaluation.
What is the K parameter? Check out the introduction to evaluating recommendation systems.
Let’s continue with our music streaming example. A recommender system can score every song in the catalog, predicting how likely a given user is to enjoy each one. The recommender system then returns a sorted list of items – which can be very long. The same applies to other cases, like e-commerce recommendations or internet searches.
In practice, you often care most about the top of the list since these are the items the users will immediately see. For example, you might only show top-5 songs in the recommendation block or top-10 search results on the first page.
Like other ranking metrics, MRR needs the ground truth. You must identify which items within the top-K recommendations were relevant to the user. For example, if you display multiple products to users, you can capture which ones they clicked on and treat these click events as a reflection of relevance.
In the music streaming example, you can have a more complex definition of relevance. For example, you might consider a song relevant if a user listened to it long enough or if they explicitly put a “thumbs up.” In e-commerce, the measure of relevance might be adding the product to the cart. Ultimately, all you need is a binary “yes” or “no” for every recommended item.
Let’s consider an example. Say you have a list of top music recommendations displayed to a user inside a music app. The items in the list are ordered by relevance as predicted by the recommender system. You consider users clicking and listening to the song for more than 60 seconds as a reflection of relevance. You then contrast the predictions against the actual user actions, where you know that the user listened to songs 2 and 3.
If you measured traditional accuracy, you could quickly count that it equals 40%: 2 out of 5 recommendations were correct. But MRR is a different calculation.
In computing MRR, we only care about the position at which the first relevant item appears. In our example, although we have two relevant songs in the top 5, we will focus only on the song in position 2.
Now, you need to compute the so-called reciprocal rank. This is a straightforward concept: you must take the reciprocal (multiplicative inverse) of the position of the first relevant item.
Basically, this is one divided by the rank of the first relevant item:
You get 1 for first place, 1⁄2 for second place, 1⁄3 for third place and so on. If there is no relevant item or correct answer in top-K, the reciprocal rank is 0.
In our example above, the first relevant item is in 2nd place. Thus, the reciprocal rank of a given user list is 0.5.
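The step above can be sketched in a few lines of plain Python. This is a minimal illustration, not library code; the `relevance` list encodes the example, where only the songs at positions 2 and 3 were relevant.

```python
def reciprocal_rank(relevance):
    """Return 1 / position of the first relevant item, or 0.0 if there is none.

    `relevance` is a list of binary labels ordered by predicted rank:
    relevance[0] corresponds to the top-ranked item.
    """
    for position, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            return 1.0 / position
    return 0.0


# Top-5 recommendations: only the songs at positions 2 and 3 were relevant.
print(reciprocal_rank([0, 1, 1, 0, 0]))  # → 0.5
```

The function stops at the first relevant item, so the second relevant song at position 3 does not affect the result.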
Now, rinse and repeat. In a production scenario, you have multiple lists, so you must perform the same computation for each user or query. Once you know the reciprocal rank of each list, you can then find the average of all reciprocal ranks.
This is the Mean Reciprocal Rank we were after. Let’s review the complete formula once again:

MRR = (1 / N) * Σ (1 / rankᵢ)

where N is the number of user lists, and rankᵢ is the position of the first relevant item in list i (a list with no relevant item in the top K contributes 0).
To illustrate it, let’s continue with a simple example. Say we have 4 different recommendation sets, where the first relevant result appears at positions 1, 3, 6, and 2, respectively.
Here is how we compute the Reciprocal Rank (RR) for each user: 1/1 = 1, 1/3 ≈ 0.33, 1/6 ≈ 0.17, and 1/2 = 0.5.
Then, the resulting Mean Reciprocal Rank is: (1+0.33+0.17+0.5)/4 = 2/4 = 0.5
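The whole computation fits in one small function. Below is a self-contained sketch (not library code) that reproduces the four-list example, with the first relevant item at positions 1, 3, 6, and 2.

```python
def mean_reciprocal_rank(all_relevance):
    """Average the reciprocal rank of the first relevant item over all lists."""
    def rr(relevance):
        # Position of the first relevant item, as a reciprocal; 0.0 if none.
        for position, is_relevant in enumerate(relevance, start=1):
            if is_relevant:
                return 1.0 / position
        return 0.0
    return sum(rr(r) for r in all_relevance) / len(all_relevance)


# Four user lists; the first relevant item sits at positions 1, 3, 6, and 2.
lists = [
    [1, 0, 0, 0, 0, 0],
    [0, 0, 1, 0, 0, 0],
    [0, 0, 0, 0, 0, 1],
    [0, 1, 0, 0, 0, 0],
]
# Mathematically, (1 + 1/3 + 1/6 + 1/2) / 4 = 0.5
print(mean_reciprocal_rank(lists))
```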
MRR can take values from 0 to 1.
This metric is quite intuitive: MRR reflects how soon, on average, we can find a relevant item for each user within the first K positions. It emphasizes the importance of being correct early in the list.
Setting the K parameter allows you to customize the evaluation to prioritize a certain list depth. For example, if you put the K to 3, you only look at the top 3 ranks. If the first relevant item appears beyond that, you will ignore it.
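The effect of the K cutoff can be shown with a small variation of the same sketch (again an illustration, not library code): truncate each list to its top-K items before looking for the first relevant one.

```python
def reciprocal_rank_at_k(relevance, k):
    """Reciprocal rank of the first relevant item within the top-k positions."""
    for position, is_relevant in enumerate(relevance[:k], start=1):
        if is_relevant:
            return 1.0 / position
    return 0.0  # no relevant item within the top-k


# The first relevant item is at position 4.
relevance = [0, 0, 0, 1, 0]
print(reciprocal_rank_at_k(relevance, k=10))  # → 0.25 (found at position 4)
print(reciprocal_rank_at_k(relevance, k=3))   # → 0.0 (beyond the top-3 cutoff)
```

With K=3, the relevant item at position 4 is simply ignored, which matches the behavior described above.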
What is a good MRR? This depends on the use case. For example, if you have a recommender system that suggests a set of five items out of many thousand possibilities, an MRR of 0.2 might be acceptable. Roughly speaking, it indicates that users find the first relevant item around position 5 on average.
However, if you build a question-answering system, you might have different quality criteria. For example, you’d expect that you must immediately get a correct answer for a specific group of questions (“What is the capital of France?”). In this case, you might expect MRR to be close to 1 for a highly effective system.
To sum up, here are some pros and cons of MRR.
Other ranking and recommendation metrics. You might want to supplement MRR with other metrics, such as Precision at K (to evaluate the share of relevant items), Mean Average Precision at K (to evaluate both relevance and ranking quality), or NDCG (to assess the quality of ranking).
Evidently is an open-source Python library that helps evaluate, test and monitor machine learning models, including ranking and recommendations. Evidently computes and visualizes 15+ different ranking metrics, from MRR to behavioral metrics like serendipity and diversity.
By passing your dataset, you can quickly generate a comprehensive report with multiple metrics and interactive visualizations out of the box.
You can also use Evidently to run CI/CD tests, for example, to evaluate the model quality after retraining. You can also deploy a live monitoring dashboard to keep track of the model metrics and test results over time.
Would you like to learn more? Check out the open-source Getting Started tutorials.
Don’t want to deal with deploying and maintaining the ML monitoring service? Sign up for the Evidently Cloud, a SaaS ML observability platform built on top of Evidently open-source.