on this page
Normalized Discounted Cumulative Gain (NDCG) is a metric that evaluates the quality of recommendation and information retrieval systems. NDCG helps measure a machine learning algorithm's ability to sort items based on relevance.
In this article, we explain it step by step.
We also introduce Evidently, an open-source Python library for ML model evaluation and monitoring.
Consider an example: you are looking for a movie to watch, and there are many options! Some are a perfect match for your taste. Some are less so, but still are interesting and entertaining. Preferences, relevance, and tastes are not always as simple as binary “yes” or “no.” If you watch all the movies under consideration, you can probably assign a score to each and make an orderly list, sorting them from the best to worst.
Now, let’s turn the table and imagine that you are developing a recommender system for movie recommendations. Knowing what you know – your ground truth about movies you liked – how can the ideal output of such a system look? Intuitively, it should look the same: if you create a long list of movies to watch, you should sort it based on relevance: the best movies on the top, the fine choices after it, and so on.
NDCG is a metric that helps evaluate the ranking quality with a similar principle. We assume that there is some ideal ranking with all the items sorted in decreasing order of relevance. NDCG helps measure how close the algorithm output comes to this perfect order.
NDCG stands for Normalized Discounted Cumulative Gain. As usual with ranking, you often compute NDCG at K, defining a cutoff point for the top-ranked items you consider.
To calculate NDCG, you must divide a given list's discounted cumulative gain (DCG) by the ideal DCG representing a perfect ranking order.
The formula might not be easy to grasp initially, so let’s unpack it. We will explain this metric step by step by covering:
TL;DR. A relevance score is a ground truth or target value necessary for evaluating ranking system quality. It can be a binary label (relevant or not) or a numerical score (e.g., 1 to 5).
A ranking or recommender system typically returns a sorted list of the relevant items. If you deal with recommendations, this might be a list of items (products, articles, songs, etc.) for a specific user, ranked by their likely match. If you deal with search and retrieval, this can be a list of items or answers most relevant to a specific query.
To evaluate the ranking quality, you need some ground truth to this. The ground truth helps assess whether the recommended items matched the user’s preferences or query. When a recommender system is in production, you can capture the relevant user actions and interactions. Say, you can track if the user clicks on the item shown in the recommendation block.
Often, this relevance is binary: say, the user clicked on, watched, or purchased an item (or not). Sometimes, this can be a numerical score, such as a rating where different weights are assigned to specific user actions, reflecting a relative degree of match.
The way you get the relevance score depends on the case. For example, if you collect data about user actions in the app, you can develop a custom formula to define the relevance based on a particular user action. Say, a click is good (score 1), but adding the item to favorites is better (score 2). Putting a product in the basket is even more valuable (score 3), and finally, completing the purchase (score 5) best reflects the item's relevance.
However, this granularity can be pretty complex to implement in production, so having a binary “relevant or not” label based on actions like clicks is a popular approach.
Once you have the ground truth, you can define how good your system is by comparing the ranked list (what the recommender system predicted to be relevant) and the actual relevance score. You can use various metrics in this context, with NDCG being one of them. Unlike some other metrics, NDCG can handle binary and numerical relevance scores.
Want to understand which other metrics exist? Check out this introduction to recommendations metrics.
TL;DR. K is a user-assigned parameter that defines the cutoff point (list length) while evaluating ranking quality. For example, you can only look at top-10 items.
When evaluating the quality of recommendations or ranking, you must also define the depth of the ranked list you will look at.
The K parameter represents the cutoff point in the list where you’d look for relevant items.
For example, you might consider top-10 search results or top-3 recommendations since this is the app’s recommendation block size. Or, you might look at the top 30 recommendations if you expect the user behavior to be explorative and know this to be the average number of items to view.
This K is a use-case-specific parameter that you can set. You might also evaluate metrics at different levels of K, treating each as an additional quality measure: for example, look at top-3 and top-30 results in parallel.
TL;DR. Cumulative gain is a measure of the total relevance of the ranked list. It sums up the relevance scores of the individual items in the list.
Before explaining the discount, let’s introduce the idea of cumulative gain.
A gain of a single recommended item is its relevance score, be it binary or graded. The cumulative gain is a sum of scores of all relevant items among the top-K results in the list.
For example, if you have a binary notion of relevance, and out of 5 recommended items, 3 got a click, the cumulative gain of a given list will be 1+1+1=3. If you have graded scores from 1 to 5, and all your top 5 recommendations got a 5-star, the total gain will be 25.
However, this way of computing gain has a limitation – it does not take into account the ordering. If you change the position of the relevant items, the outcome will be the same.
Let’s consider two situations with binary relevance:
As we swap positions, the cumulative gain stays the same.
However, intuitively, the second result seems like a much better outcome. The ranking often matters: having the most relevant results at the top might be just what we expect. For example, the best search result should be the first, not the last, on the page.
We should reflect it in our evaluation: give credit to the system when it can place more relevant items higher and penalize it for placing relevant results lower on the list. To do that, we can introduce a discount.
TL;DR: A logarithmic discount helps assign lower gain when relevant items appear further down in the ranked list.
To compute the DCG - discounted cumulative gain – we introduce the discount that gives less weight to relevant items that appear lower. This way, instead of simply summing up the item’s relevance, we adjust the contribution of each item based on its position.
But how exactly do you come up with the discount?
A common way to introduce this discount in the industry is a logarithmic penalty. As you move down the list, you divide each item’s gain by a growing number, computed as an inverse logarithm of the position number. This discount helps factor in the diminishing value of relevant items further down. This is quite intuitive: a good match at rank 10 is less valuable than at rank 1.
DCG formula. For each item in the top K positions, we calculate the discounted gain using the formula:
Where rel (i) is the relevance score of the item at position i.
Note: there is also an alternative DCG formula that gives an even higher penalty.
Using the logarithmic discount, you can ensure that the difference between positions 1 and 2 is more substantial than between positions 2 and 3, and so on. This mirrors the intuition that highest-ranked items have a more significant impact on the user experience, and there is often a substantial drop-off as one moves down the list. Consider search engine results where users are less likely to explore the results beyond the first page.
In essence, each position in the fixed list has an assigned value factor. The second place is divided by 1.58, the 3rd place is divided by 2, and so on.
Let’s take an example of DCG computation. Suppose we have a ranked list with binary scores [1, 0, 1, 1, 0], and we want to calculate DCG@3.
We can sum the relevance scores of all items up to position K = 3, weighted by the corresponding logarithmic discount.
The resulting sum is 1 + 0 + 0.5 = 1.5. This is the discounted gain of the list at K = 3.
The cumulative gain of the same list would have equaled 1+0+1 = 2 without the discount.
What if we swap the order and put the 3rd relevant item in the second position? Our ranked list looks like a ranked list: [1, 1, 0, 1, 0].
The resulting sum equals 1 + 0.63 + 0 = 1.63. This is the new discounted gain of the list at K = 3. We improved the DCG @3 by moving the relevant item higher up (from position 3 to position 2).
The Discounted Cumulative Gain (DCG) helps address the ranking order. However, there is still a limitation. The absolute values of DCG depend on the number of items in the list and the relevance scores assigned. This makes it hard to compare the DCG scores across use cases and lists.
Longer lists have higher potential DCG values since more items can contribute to the score. DCG at 10 can be higher than DCG at 3 simply due to the length of the list rather than the inherently better ranking. Using binary or graded relevance will also lead to different gain values.
Finally, even an “ideal” recommendation algorithm won’t always be able to get DCG@k = 1 due to the applied discount.
To address this, you can normalize the DCG.
TLDR: To normalize DCG, you can divide it by the ideal ranking (IDCG@k), where every item appears in the order of relevance.
To get the normalized DCG (NDCG), you must divide the computed DCG by the ideal DCG (IDCG) for the given list. IDCG represents the maximum achievable DCG with the same set of relevance scores but in the perfect ranking order.
What exactly is ideal? In the case of binary relevance scores, all relevant items should be at the start of the list. In the case of graded relevance, you should place all items in a descending order of relevance.
Here is the final NDCG formula:
Normalized DCG enables fair comparisons between different lists, regardless of their length and the scale of relevance scores. As the final step, you typically compute an average NDCG for all users or queries to get a single score for the overall dataset.
NDCG can take values from 0 to 1.
In essence, NDCG@k is a ranking metric that helps consider both the relevance of items and their positions in the list. A higher NDCG@k value indicates that the system is doing a better job of presenting relevant items at the top of the list.
Like any metric, NDCG has its pros and cons.
NDCG might appear similar to Mean Average Precision (MAP), which is another rank-aware metric that considers the positions of relevant items up to a K. Both NDCG and MAP reflect the quality of ranking, but with some distinctions.
How to compute MAP? You must calculate and average the Precision (the share of relevant items up to a given position) at each relevant position in top K. Read the complete guide to MAP for a step-by-step explanation.
First, MAP works for binary relevance. MAP is designed to evaluate a ranking or recommender system with a binary relevance function. NDCG can handle both binary and graded relevance. This is useful when you deal with use cases like search: the documents are usually not wholly relevant or irrelevant but exist on a particular scale.
Second, NDCG and MAP account for diminishing ranks differently. Both NDCG and MAP at K consider the order of items and thus are suitable when you care about ranking quality. However, they treat decreasing ranks differently.
MAP gives more weight to the relevant items at the top of the list. Since the metric is based on Precision values at each relevant position in the ranked list, it is more sensitive to changes in the early positions. Any non-relevant items at the beginning of the list influence the aggregation at each subsequent Precision calculation and contribute to the overall MAP score. MAP drops more rapidly if there are non-relevant items at the top.
DCG assigns diminishing weights to items as you move down the list, but they are logarithmic. The contribution of items decreases, but not extremely rapidly. Check out this video for a walkthrough explanation: DCG has a sharper elbow and a heavier tail compared to the more rapid descent of MAP as you move down the K.
Ultimately, the choice between MAP and NDCG depends on the nature of the data and the importance of considering accounting for graded relevance, where NDCG excels. Due to MAP bias towards the top ranks, it is often suitable when the emphasis is on Precision at higher ranks.
MRR (Mean Reciprocal Rank) is another ranking metric used in information retrieval and recommendation systems. It focuses on the position of the first relevant item in the ranked list.
How to compute MRR? MRR is calculated by taking the reciprocal of the rank of the first relevant item and averaging these reciprocal ranks across multiple queries or instances. Read the complete guide to MRR for a step-by-step explanation.
Let’s compare MRR to NDCG.
MRR is very simple to explain. It provides a straightforward measure of how quickly you find a relevant item in the list.
MRR disregards the ranks after the first relevant item. However, MRR considers only the position of the first relevant item in the list. It does not care what happens after it. This is different from NDCG, which focuses on the entire ranking.
MRR handles only binary relevance. MRR can only handle binary relevance scores and treats the items as either relevant or not. NDCG can work with both binary and numerical relevance scores.
Depending on the scenario, you might prefer NDCG to MRR or vice versa. MRR aligns well with scenarios where the primary goal is to present the most relevant item as early as possible or when there is a single most relevant and correct answer. This is common in information retrieval. On the other hand, NDCG fits well with scenarios where you care about the order of many relevant items, such as recommendation systems.
You can also look at both metrics to evaluate different aspects of the system simultaneously: how quickly you hit the first correct answer (MRR) and how well-ranked the rest of the list is (NDCG).
Evidently is an open-source Python library that helps evaluate, test and monitor machine learning models, including ranking and recommendations. Evidently computes and visualizes 15+ different ranking metrics, from NDCG, MAP and MRR to behavioral metrics like serendipity and diversity.
By passing your dataset, you can quickly generate a comprehensive report with multiple metrics and interactive visualizations out of the box.
You can also use Evidently to run CI/CD tests, for example, to evaluate the model quality after retraining. You can also deploy a live monitoring dashboard to keep track of the model metrics and test results over time.
Would you like to learn more? Check out the open-source Getting Started tutorials.
Don’t want to deal with deploying and maintaining the ML monitoring service? Sign up for the Evidently Cloud, a SaaS ML observability platform built on top of Evidently open-source.
Get early access ⟶