How do you judge the quality of ranking and recommender systems?
Ranking and recommendation systems focus on the relevance and ordering of items, not just the correctness of individual predictions as in classification or regression. In this guide, we look at the key metrics and explain them step by step.
This guide is for data scientists, ML engineers, product managers, and anyone who deals with operating recommender systems in production.
What you will find in this guide:
There are different ways to evaluate the quality of ranking and recommendation systems. In this article, we explain the core evaluation principles and introduce different groups of metrics: predictive, ranking, behavioral, and business quality metrics.
Precision at K measures the share of relevant items within the top K positions. Recall at K evaluates how well the system captures all relevant items within the top K ranks. You can also use the F-score to get a balanced measure of Precision and Recall at K.
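These three metrics can be computed directly from a ranked list of recommendations and the set of known relevant items. Below is a minimal sketch; the function names are illustrative, not from any particular library.

```python
def precision_at_k(recommended, relevant, k):
    # Share of relevant items among the top-K recommendations.
    top_k = recommended[:k]
    hits = sum(1 for item in top_k if item in relevant)
    return hits / k

def recall_at_k(recommended, relevant, k):
    # Share of all relevant items that appear in the top K.
    top_k = recommended[:k]
    hits = sum(1 for item in top_k if item in relevant)
    return hits / len(relevant) if relevant else 0.0

def f_score_at_k(recommended, relevant, k):
    # Harmonic mean of Precision@K and Recall@K.
    p = precision_at_k(recommended, relevant, k)
    r = recall_at_k(recommended, relevant, k)
    return 2 * p * r / (p + r) if (p + r) else 0.0

recommended = ["a", "b", "c", "d", "e"]  # system output, best first
relevant = {"a", "c", "f"}               # ground-truth relevant items

print(precision_at_k(recommended, relevant, 3))  # 2 hits in top 3 -> 0.667
print(recall_at_k(recommended, relevant, 3))     # 2 of 3 relevant -> 0.667
```

Note that Precision@K divides by K even when fewer than K relevant items exist, so a perfect system can still score below 1.0 at large K; some implementations divide by min(K, number of relevant items) instead.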
Mean Average Precision (MAP) at K reflects both the share of relevant recommendations and how good the system is at placing more relevant items at the top of the list. You can compute it as the mean of Average Precision (AP) across all users or queries.
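A sketch of how AP and MAP at K are commonly computed: AP averages the precision at each rank where a relevant item appears, and MAP averages AP across users. The normalization by min(number of relevant items, K) is one common convention; implementations vary.

```python
def average_precision_at_k(recommended, relevant, k):
    # Average of Precision@i over the ranks i where a relevant item appears.
    hits, score = 0, 0.0
    for i, item in enumerate(recommended[:k], start=1):
        if item in relevant:
            hits += 1
            score += hits / i  # precision at this rank
    # Normalize by the best achievable number of hits within K.
    return score / min(len(relevant), k) if relevant else 0.0

def map_at_k(all_recommended, all_relevant, k):
    # Mean of AP@K across all users (or queries).
    aps = [average_precision_at_k(rec, rel, k)
           for rec, rel in zip(all_recommended, all_relevant)]
    return sum(aps) / len(aps)

# Relevant items at ranks 1 and 3: AP = (1/1 + 2/3) / 2 = 0.833
print(average_precision_at_k(["a", "b", "c", "d"], {"a", "c"}, 4))
```

Because each precision term is weighted by where the hit occurs, moving a relevant item higher in the list always increases AP, which is what makes MAP rank-aware where plain Precision@K is not.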
Normalized discounted cumulative gain (NDCG) at K reflects the ranking quality by comparing it to an ideal order where the most relevant items are at the top. Unlike many other ranking metrics, NDCG works with both binary and graded relevance scores.
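A minimal sketch of NDCG@K using the linear gain formulation (gain equals the relevance score itself; another common variant uses 2^relevance - 1). The input is the list of relevance scores in the order the system ranked the items.

```python
import math

def dcg_at_k(relevances, k):
    # Discounted cumulative gain: each gain is discounted
    # logarithmically by its rank position.
    return sum(rel / math.log2(i + 1)
               for i, rel in enumerate(relevances[:k], start=1))

def ndcg_at_k(relevances, k):
    # Normalize by the DCG of the ideal ordering (scores sorted
    # from highest to lowest), so a perfect ranking scores 1.0.
    ideal = sorted(relevances, reverse=True)
    idcg = dcg_at_k(ideal, k)
    return dcg_at_k(relevances, k) / idcg if idcg else 0.0

# Graded relevance of the items, in ranked order (3 = highly relevant).
print(ndcg_at_k([3, 2, 1, 0], 4))  # ideal order -> 1.0
print(ndcg_at_k([3, 2, 0, 1], 4))  # slight misordering -> just under 1.0
```

The logarithmic discount means mistakes near the top of the list cost far more than the same mistakes further down, which matches how users actually scan ranked results.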