How do you judge the quality of ranking and recommender systems?
Ranking and recommendation systems focus on the relevance and ordering of items, not just the correctness of individual predictions as in classification or regression. In this guide, we look at the key metrics and explain them step by step.
This guide is for data scientists, ML engineers, product managers, and anyone who deals with operating recommender systems in production.
What you will find in this guide:
There are different ways to evaluate the quality of ranking and recommendation systems. In this article, we explain the core evaluation principles and introduce different groups of metrics: predictive, ranking, behavioral, and business quality metrics.
Precision at K measures the share of relevant items within the top K positions. Recall at K evaluates how well the system captures all relevant items within the top K ranks. You can also use the F-score to get a balanced measure of Precision and Recall at K.
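These three metrics can be computed directly from a ranked list of recommendations and the set of known relevant items. Below is a minimal sketch; the function names are illustrative, not from any particular library.

```python
def precision_at_k(recommended, relevant, k):
    # Share of relevant items among the top-K recommendations.
    top_k = recommended[:k]
    hits = sum(1 for item in top_k if item in relevant)
    return hits / k

def recall_at_k(recommended, relevant, k):
    # Share of all relevant items that appear in the top K.
    top_k = recommended[:k]
    hits = sum(1 for item in top_k if item in relevant)
    return hits / len(relevant) if relevant else 0.0

def f_score_at_k(recommended, relevant, k):
    # Harmonic mean of Precision@K and Recall@K.
    p = precision_at_k(recommended, relevant, k)
    r = recall_at_k(recommended, relevant, k)
    return 2 * p * r / (p + r) if (p + r) else 0.0

recommended = ["a", "b", "c", "d", "e"]  # system output, best first
relevant = {"a", "c", "f"}               # ground-truth relevant items

print(precision_at_k(recommended, relevant, 3))  # 2 hits in top 3 -> 0.667
print(recall_at_k(recommended, relevant, 3))     # 2 of 3 relevant -> 0.667
```

Note that Precision@K divides by K even when fewer than K relevant items exist, so a perfect system can still score below 1.0 at large K; some implementations divide by min(K, number of relevant items) instead.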
Mean Average Precision (MAP) at K reflects both the share of relevant recommendations and how good the system is at placing more relevant items at the top of the list. You can compute it as the mean of Average Precision (AP) across all users or queries.
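A sketch of how AP and MAP at K are commonly computed: AP averages the precision at each rank where a relevant item appears, and MAP averages AP across users. The normalization by min(number of relevant items, K) is one common convention; implementations vary.

```python
def average_precision_at_k(recommended, relevant, k):
    # Average of Precision@i over the ranks i where a relevant item appears.
    hits, score = 0, 0.0
    for i, item in enumerate(recommended[:k], start=1):
        if item in relevant:
            hits += 1
            score += hits / i  # precision at this rank
    # Normalize by the best achievable number of hits within K.
    return score / min(len(relevant), k) if relevant else 0.0

def map_at_k(all_recommended, all_relevant, k):
    # Mean of AP@K across all users (or queries).
    aps = [average_precision_at_k(rec, rel, k)
           for rec, rel in zip(all_recommended, all_relevant)]
    return sum(aps) / len(aps)

# Relevant items at ranks 1 and 3: AP = (1/1 + 2/3) / 2 = 0.833
print(average_precision_at_k(["a", "b", "c", "d"], {"a", "c"}, 4))
```

Because each precision term is weighted by where the hit occurs, moving a relevant item higher in the list always increases AP, which is what makes MAP rank-aware where plain Precision@K is not.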
Normalized discounted cumulative gain (NDCG) at K reflects the ranking quality by comparing it to an ideal order where the most relevant items are at the top. Unlike many other ranking metrics, NDCG works with both binary and graded relevance scores.
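A minimal sketch of NDCG@K using the linear gain formulation (gain equals the relevance score itself; another common variant uses 2^relevance - 1). The input is the list of relevance scores in the order the system ranked the items.

```python
import math

def dcg_at_k(relevances, k):
    # Discounted cumulative gain: each gain is discounted
    # logarithmically by its rank position.
    return sum(rel / math.log2(i + 1)
               for i, rel in enumerate(relevances[:k], start=1))

def ndcg_at_k(relevances, k):
    # Normalize by the DCG of the ideal ordering (scores sorted
    # from highest to lowest), so a perfect ranking scores 1.0.
    ideal = sorted(relevances, reverse=True)
    idcg = dcg_at_k(ideal, k)
    return dcg_at_k(relevances, k) / idcg if idcg else 0.0

# Graded relevance of the items, in ranked order (3 = highly relevant).
print(ndcg_at_k([3, 2, 1, 0], 4))  # ideal order -> 1.0
print(ndcg_at_k([3, 2, 0, 1], 4))  # slight misordering -> just under 1.0
```

The logarithmic discount means mistakes near the top of the list cost far more than the same mistakes further down, which matches how users actually scan ranked results.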