nDCG
nDCG (Normalized Discounted Cumulative Gain) is a graded-relevance ranking metric that scores how well a result list places the most relevant documents at the top, with a logarithmic position discount and normalization to the [0, 1] range.
What It Is
When you compare rerankers, retrievers, or any system that returns a ranked list, you need one number that answers a simple question: did the right answers end up at the top? nDCG is that number. It is the metric every reranker leaderboard reports, and the one product teams reach for when comparing models like Cohere Rerank, Voyage Rerank, or Jina Reranker.
The intuition is closer to a playlist than to a search engine. Imagine a music app that picks ten songs for you. Putting your favorite first earns full credit. Burying it at position ten earns only partial credit, even if it still made the list. Skipping it entirely earns nothing. nDCG encodes exactly this: relevance still counts, but where you place the result counts too.
Three ingredients combine into the score:
- Gain. Each result has a relevance label, often graded on a scale — irrelevant, somewhat relevant, very relevant. The gain converts that label into a numeric reward.
- Discount. A logarithmic position penalty. According to the Springer Encyclopedia of Database Systems, the standard form uses 1 / log2(i + 1), so each later position contributes less to the total.
- Normalization. The raw score (Discounted Cumulative Gain) is divided by the best possible score for that query, the Ideal DCG. According to scikit-learn docs, this places nDCG in the [0, 1] range, where 1 means the result list matches the perfect ordering and 0 means none of the relevant documents made it in. The short sketch after this list walks through all three steps on a toy example.
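To make those three ingredients concrete, here is a minimal sketch in plain Python. It assumes the linear-gain form (the graded label itself is the reward; some implementations use 2^rel - 1 instead), and the labels and ranking are invented for illustration:

```python
import math

def dcg(relevances):
    # Gain: the graded label itself (linear gain).
    # Discount: 1 / log2(position + 1), with positions counted from 1.
    return sum(rel / math.log2(pos + 1) for pos, rel in enumerate(relevances, start=1))

def ndcg(relevances):
    # Normalization: divide by the Ideal DCG (same labels, best-first order).
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

# Graded labels, in the order a system returned the documents:
# 0 = irrelevant, 1 = somewhat relevant, 2 = very relevant.
returned = [1, 2, 0, 2]          # a "very relevant" doc was not placed first
print(round(ndcg(returned), 3))  # ~0.83: good, but the top slot went to a weaker doc
print(ndcg([2, 2, 1, 0]))        # 1.0: the ideal ordering scores perfectly
```

Swapping the gain for 2**rel - 1 gives the exponential variant some toolkits use; the discount and the normalization step stay the same.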
The metric was introduced by Järvelin and Kekäläinen in their 2002 ACM TOIS paper, and it stuck because it does something simpler scores cannot: it handles graded relevance and rewards getting the top of the list right. Precision and recall treat every relevant document as equal. nDCG knows that the right document at position one is worth more than the right document at position twenty.
For RAG pipelines, that is exactly the property that matters. The LLM only reads the top few passages a reranker returns. If a reranker bumps a marginally relevant doc up over a directly relevant one, the model gets worse context — and nDCG penalizes that exact failure mode.
How It’s Used in Practice
The mainstream encounter is a leaderboard. When a team evaluates rerankers for a RAG pipeline, they pull up the BEIR or MTEB benchmark, sort by the nDCG@10 column, and shortlist the top few candidates. The numbers reported there come from running each reranker on a fixed query set, scoring its outputs against human-labeled relevance, and averaging across queries. According to scikit-learn docs, the same ndcg_score formula sits behind almost every reported figure.
Inside a production team, nDCG shows up the moment someone asks “is the new reranker actually better?” You hold the embedding model and the retriever fixed, swap in the new cross-encoder, score the same eval set, and compare nDCG before and after. A reranker that adds latency without lifting nDCG is not earning its place in the pipeline.
Pro Tip: Do not trust leaderboard nDCG@10 as your only number. Production RAG usually feeds only the top three or top five passages to the LLM, so nDCG@3 or nDCG@5 on your own labeled queries is what predicts real-world quality. Treat the leaderboard as a starting filter, not the final answer.
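In code, that before-and-after comparison is a few lines with sklearn.metrics.ndcg_score, the same function referenced above. The query set, relevance labels, and score arrays below are invented, and old_reranker_scores / new_reranker_scores stand in for whatever your eval harness actually produces:

```python
import numpy as np
from sklearn.metrics import ndcg_score

# Hypothetical eval set: 2 labeled queries, 6 candidate passages each.
# Rows are queries; values are human relevance labels
# (0 = irrelevant, 1 = somewhat relevant, 2 = very relevant).
true_relevance = np.array([
    [2, 1, 0, 0, 1, 0],
    [0, 2, 2, 1, 0, 0],
])

# Scores each reranker assigned to the same candidates (made-up numbers).
old_reranker_scores = np.array([
    [0.3, 0.9, 0.5, 0.1, 0.8, 0.2],
    [0.7, 0.4, 0.6, 0.9, 0.1, 0.3],
])
new_reranker_scores = np.array([
    [0.9, 0.7, 0.2, 0.1, 0.6, 0.3],
    [0.2, 0.9, 0.8, 0.5, 0.1, 0.3],
])

# Score at the cut-offs your pipeline actually consumes, not just the leaderboard depth.
for k in (3, 5, 10):
    old = ndcg_score(true_relevance, old_reranker_scores, k=k)
    new = ndcg_score(true_relevance, new_reranker_scores, k=k)
    print(f"nDCG@{k}: old={old:.3f}  new={new:.3f}")
```

ndcg_score averages over the rows, so each row is one query; running k=3 and k=5 alongside the leaderboard's k=10 is what keeps the comparison aligned with the handful of passages the LLM actually reads.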
When to Use / When Not
| Scenario | Use | Avoid |
|---|---|---|
| Comparing rerankers head-to-head on a benchmark | ✅ | |
| Picking a leaderboard winner without testing your own data | | ❌ |
| Evaluating ranking quality with graded relevance labels | ✅ | |
| Measuring user satisfaction or downstream answer accuracy | | ❌ |
| Tuning the cut-off K to match your production top-K | ✅ | |
| Off-policy evaluation for top-N recommender systems | | ❌ |
Common Misconception
Myth: A higher nDCG always means a better RAG system. Reality: nDCG measures ranking quality against fixed labels at a fixed cut-off. If your relevance labels do not match what your users actually need, or if production feeds only the first few passages while the leaderboard reports a deeper cut-off, the headline number can mask real-world regressions. Recent work — notably Jeunen et al. (2024) at KDD — also flags reliability issues when nDCG is repurposed for off-policy recommender evaluation.
One Sentence to Remember
nDCG is the field’s shared yardstick for ranking quality — but always pick a cut-off K that matches what your pipeline actually consumes, then validate on your own labeled queries before trusting any leaderboard number.
FAQ
Q: What does nDCG@10 mean? A: nDCG@10 is the nDCG score calculated only over the top ten ranked results. It is the standard reporting depth on retrieval and reranking benchmarks, including BEIR, MTEB, and MIRACL.
Q: What is a good nDCG score? A: There is no universal threshold. On modern reranker leaderboards like BEIR, top systems cluster in roughly the 0.55 to 0.65 range on average, but absolute scores depend heavily on the dataset, label quality, and chosen cut-off K.
Q: Why is nDCG used for rerankers? A: Cross-encoder rerankers are designed to put the most relevant passage at position one and rank lower-quality passages further down. According to the Springer Encyclopedia of Database Systems, nDCG rewards exactly that behavior through its position discount and graded relevance.
Sources
- Järvelin & Kekäläinen (2002): Cumulated Gain-Based Evaluation of IR Techniques - Original ACM TOIS paper introducing the metric.
- scikit-learn docs: sklearn.metrics.ndcg_score reference - Practical reference for the formula and [0, 1] score range.
Expert Takes
Not magic. Math. nDCG combines a relevance gain at each position, a logarithmic discount that downweights items as rank grows, and a normalization step that divides by the ideal possible ordering. The result lives in a fixed range, so you can compare across queries and across models. It captures graded relevance — not just hit-or-miss — which is exactly what reranking is asked to deliver.
Treat nDCG as a contract between your retrieval and reranking layers. The diagnosis: most teams report the leaderboard depth but only feed the first handful of passages to the LLM in production — so the metric they tracked has nothing to do with the quality the model actually consumed. The fix: codify the cut-off in your eval spec to match production reality and rerun the comparison whenever a reranker swap is proposed.
Every reranker leaderboard now leads with an nDCG figure. That is not editorial. That is the market’s way of forcing apples-to-apples comparison across cross-encoders. Pick a vendor without checking their nDCG and you are flying blind to the ranking quality you will inherit downstream. Either you read the leaderboard or you take the vendor’s word for it. One of those positions ends in regret.
The relevance labels driving nDCG come from human annotators. Whose definition of relevant got encoded? Whose use case shaped the eval set? When a leaderboard reigns, models optimize for label-judgment alignment — which can quietly drift away from what your users actually needed answered. A clean number on a public benchmark says little about whether your specific readers were served.