Reranking
- Reranking is a second pass over retrieved documents that uses a more accurate model, usually a cross-encoder transformer, to reorder the top candidates by true relevance to the query.
- It is the second stage of a two-stage retrieval pipeline: a fast retriever returns a candidate set for recall, then the slower, more precise reranker rescores those candidates and returns the top results for precision.
What It Is
If you have ever used a RAG-powered chatbot — a retrieval-augmented generation system that grounds a language model in your own documents — or an enterprise search tool, and wondered why the most relevant document showed up third instead of first, reranking is the missing piece. Vector search and keyword search are good at fishing the right answer out of millions of documents, but they are mediocre at deciding which of the top fifty is actually best. Reranking exists to close that gap before the answer reaches the user — or the language model writing the answer.
The mechanics are simple to picture. A retrieval pipeline runs in two stages. Stage one is fast and cheap: a vector search, a BM25 keyword search, or a hybrid of both pulls back a candidate set — usually the top fifty to two hundred documents. This stage is tuned for recall, meaning it tries to make sure the right answer is somewhere in the pile. Stage two is the reranker. It looks at the query and each candidate document together as a pair, scores how well they match, and reorders them. The top five to twenty go forward to the language model or the user.
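A minimal sketch of that two-stage shape, assuming the sentence-transformers library; `vector_index` and its `search` method are placeholders for whatever store stage one uses, and the checkpoint named is one public MS MARCO cross-encoder, not a required choice:

```python
from sentence_transformers import CrossEncoder

# One public cross-encoder checkpoint trained on MS MARCO; any reranker fits here.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def retrieve_then_rerank(query, vector_index, top_k=100, top_n=10):
    # Stage one: fast, recall-oriented retrieval over the full corpus.
    # `vector_index.search` stands in for your vector / BM25 / hybrid search.
    candidates = vector_index.search(query, k=top_k)  # list of text chunks

    # Stage two: score each (query, document) pair jointly, then reorder.
    scores = reranker.predict([(query, doc) for doc in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [doc for doc, _ in ranked[:top_n]]
```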
The reason reranking works comes down to how the model reads the data. Standard vector search uses a bi-encoder: it compresses the query into one vector, compresses each document into another, and compares them with cosine similarity. Fast, but the query and document never meet inside the model. A cross-encoder — the architecture used by almost every reranker — feeds the query and document into a transformer together, letting self-attention compare every query word against every document word at every layer. As the Sentence Transformers documentation explains, that joint encoding is what gives cross-encoders their precision edge: the model sees term interactions a bi-encoder cannot.
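The contrast is easy to see in code. Here is a hedged sketch of both scoring paths with sentence-transformers; the checkpoints are common public ones and the query and document are invented for illustration:

```python
from sentence_transformers import SentenceTransformer, CrossEncoder, util

query = "how do I rotate an expired API key?"
doc = "To regenerate your API credentials, open Settings, then Security, and click Rotate."

# Bi-encoder: query and document are compressed into separate vectors,
# which only meet at the final cosine-similarity comparison.
bi_encoder = SentenceTransformer("all-MiniLM-L6-v2")
similarity = util.cos_sim(bi_encoder.encode(query), bi_encoder.encode(doc))

# Cross-encoder: the pair is read together in one forward pass, so
# self-attention can match query terms against document terms directly.
cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
relevance = cross_encoder.predict([(query, doc)])

print(similarity, relevance)
```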
This pattern is not new. The foundational paper Passage Re-ranking with BERT by Nogueira and Cho, published on arXiv in January 2019, showed that a BERT cross-encoder reranker delivered a sizeable jump on MS MARCO, a standard passage-ranking benchmark, over the previous state of the art. That result is the reason reranking is now a default layer in production retrieval systems.
How It’s Used in Practice
Most product managers and developers encounter reranking inside a RAG pipeline. The chatbot is grounded in a knowledge base — support tickets, product docs, internal wikis. A user asks a question. Vector search returns the fifty most semantically similar chunks. The reranker rescores those fifty using a cross-encoder hosted by Cohere, Voyage, Jina, or run locally with an open-weight model like BGE or MixedBread. The top five chunks land in the prompt, and the model writes an answer using only the highest-precision context.
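A hedged sketch of that last step, using an open-weight BGE reranker through sentence-transformers; `user_question` and `retrieved_chunks` are placeholders for the output of stage one, and the prompt wording is illustrative:

```python
from sentence_transformers import CrossEncoder

# Open-weight reranker; a hosted API (Cohere, Voyage, Jina) slots in at the same point.
reranker = CrossEncoder("BAAI/bge-reranker-base")

def build_context(question, chunks, top_n=5):
    # Score every (question, chunk) pair and keep only the highest-precision context.
    scores = reranker.predict([(question, chunk) for chunk in chunks])
    best = sorted(zip(chunks, scores), key=lambda pair: pair[1], reverse=True)[:top_n]
    return "\n\n".join(chunk for chunk, _ in best)

# `user_question` and `retrieved_chunks` come from the retrieval stage of the pipeline.
prompt = (
    "Answer using only the context below.\n\n"
    f"Context:\n{build_context(user_question, retrieved_chunks)}\n\n"
    f"Question: {user_question}"
)
```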
The same pattern shows up in code search inside tools like Cursor or Claude Code, in legal e-discovery, in customer support deflection, and in recommendation systems. Anywhere precision at the top of the list matters more than raw retrieval speed, a reranker is doing work between the index and the user.
Pro Tip: Start with a hosted reranker — Cohere Rerank or Voyage's rerank models — before you self-host. The hosted versions are tuned, well documented, and let you measure whether reranking actually helps your data before you invest in serving infrastructure. If it moves your offline NDCG, then optimize for cost and latency.
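One way to run that offline check, sketched with scikit-learn's ndcg_score; the graded relevance labels below are made up for illustration, not a benchmark result:

```python
import numpy as np
from sklearn.metrics import ndcg_score

# Hypothetical relevance labels (graded 0-3) for one query's top-10 results,
# listed in the order each system returned them.
labels_by_retriever_rank = [0, 2, 0, 1, 3, 0, 0, 2, 0, 1]
labels_by_reranker_rank  = [3, 2, 2, 1, 1, 0, 0, 0, 0, 0]

def ndcg_at_10(labels_in_rank_order):
    true_relevance = np.asarray([labels_in_rank_order])
    # Descending scores reproduce the given ordering (higher score = ranked earlier).
    predicted = np.asarray([list(range(len(labels_in_rank_order), 0, -1))])
    return ndcg_score(true_relevance, predicted, k=10)

print("retriever only:", ndcg_at_10(labels_by_retriever_rank))
print("with reranker: ", ndcg_at_10(labels_by_reranker_rank))
```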
When to Use / When Not
| Scenario | Use | Avoid |
|---|---|---|
| RAG chatbot grounded in a knowledge base where wrong context produces wrong answers | ✅ | |
| Sub-100ms latency budget on every query and recall is already excellent | | ✅ |
| Heterogeneous corpus (docs, tickets, code) where vector similarity alone misranks results | ✅ | |
| Tiny corpus (fewer than a few hundred documents) where the retriever already returns near-perfect order | | ✅ |
| Long-document search where queries hit different sections than retrieval keys | ✅ | |
| Cost-sensitive workload with millions of low-stakes queries per day | | ✅ |
Common Misconception
Myth: A reranker replaces vector search — pick one or the other. Reality: A reranker is a complement, not a substitute. Stage one (retrieval) optimizes for recall across a large corpus. Stage two (reranking) optimizes for precision over a small candidate set. Removing the retriever and reranking everything is computationally infeasible; removing the reranker leaves precision on the table.
One Sentence to Remember
Reranking is the precision pass that turns “the right answer is somewhere in the top fifty” into “the right answer is in the top three” — and in a RAG pipeline, that difference is the difference between a useful assistant and a hallucinating one.
FAQ
Q: What is reranking in RAG? A: Reranking is a second-stage model that rescores the top candidates returned by vector search or keyword search, then passes the most relevant chunks into the language model’s prompt — improving answer quality without changing the retriever.
Q: What’s the difference between a retriever and a reranker? A: A retriever scans the full corpus quickly using bi-encoders or keyword indexes for recall. A reranker rescans only the retriever’s top results using a cross-encoder for precision. Retrievers are fast and broad; rerankers are slow and accurate.
Q: Do I need a reranker if I already have good vector search? A: Often yes. Vector search optimizes recall, not precision-at-top. A reranker typically lifts NDCG@10 measurably on the same retriever output, especially on heterogeneous corpora where semantic similarity and true relevance diverge.
Sources
- arXiv: Passage Re-ranking with BERT (Nogueira & Cho, 2019) - Foundational paper that established BERT cross-encoders as the default reranker architecture.
- Sentence Transformers Docs: Cross-Encoders documentation - Reference implementation and explanation of cross-encoder vs bi-encoder tradeoffs.
Expert Takes
Reranking is not magic. It is what happens when you stop compressing a query and a document into separate vectors and instead let a transformer read them together. Self-attention spans the pair, so the model can compare every query token against every document token at every layer. A bi-encoder collapses that signal into one number per side and loses the term-level interaction. The cross-encoder keeps it. That is the whole asymmetry.
Reranking is a second specification layer in your retrieval contract. Stage one tells the system what to fetch. Stage two tells it what to keep. If your retriever spec is “give me the broadest plausible candidate set” and your reranker spec is “return only the documents that actually answer this query”, you have a clean separation of recall and precision. Skip stage two and you are asking one model to optimize two competing objectives.
Reranking moved from research curiosity to default infrastructure layer in roughly eighteen months. Every serious RAG vendor now ships a hosted reranker. Open-weight rerankers from BAAI and MixedBread match the hosted ones on most benchmarks. If you are building a retrieval product without one, you are competing with one hand tied. The window for “we’ll add reranking later” has closed.
A reranker decides which documents reach the user and which do not. That is editorial power, not just engineering. The model encodes whatever bias was in its training pairs — query-document relevance judgments collected by humans with their own assumptions. We rarely audit those judgments. We rarely log which documents were silently demoted. When the reranker decides what counts as relevant, the user never sees what was filtered out.