Cross-Encoder

A cross-encoder is a transformer that processes a query and a candidate document jointly through a single network and outputs a relevance score (typically 0–1), capturing fine-grained interactions between every query and document token. It is the standard architecture for reranking shortlisted results from a faster first-stage retriever.

What It Is

Search systems that use vector similarity have a quiet failure mode: two documents can have nearly identical embeddings yet very different relevance to a specific query. That happens because a bi-encoder — the model that built those embeddings — encodes the query and the document separately, then compares the resulting vectors. Subtle term-level cues (a negation, a specific date, a rare entity) get averaged out before the comparison ever happens. Cross-encoders close that gap by scoring the pair together. They are the precision tool sitting on top of fast first-stage retrieval, and they are what every modern reranker — the rescore step in any retrieve-then-rerank pipeline — is doing under the hood.

Architecturally, a cross-encoder takes the query and a candidate document, joins them into a single sequence with a separator token, and runs the result through one transformer with shared self-attention. According to the Sentence Transformers docs, the output is a single similarity score — typically between 0 and 1 — and not a sentence embedding. Every query token can attend to every document token at every layer, so the model sees term overlap, paraphrase, contradiction, and word order in ways two independent vectors cannot reproduce.
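
As a minimal sketch, here is what scoring one pair looks like with the open-source sentence-transformers library; the checkpoint is one of its public MS MARCO rerankers, chosen purely for illustration:

```python
from sentence_transformers import CrossEncoder

# Public MS MARCO reranker checkpoint, chosen for illustration.
model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

# The query and document go in as one pair; internally they are joined
# with a separator token and scored in a single forward pass.
scores = model.predict([
    ("how do I cancel my subscription",
     "To cancel your plan, open Settings > Billing and choose Cancel."),
])
print(scores[0])  # one relevance score per pair; higher means more relevant
```

Note that predict returns a score, not a vector: there is nothing here to store in an index.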

The cost of that fidelity is linear in candidate count. According to the Sentence Transformers docs, scoring N candidates requires N full transformer forward passes — clustering ten thousand sentences pairwise would need around fifty million comparisons. That is why cross-encoders are not used as the retriever: you cannot precompute and index a cross-encoder representation the way you can a bi-encoder embedding. Instead, a cheap retriever (BM25, a bi-encoder, or hybrid) fetches the top fifty to two hundred candidates, and the cross-encoder rescores only that shortlist. This two-stage pattern, introduced by Nogueira & Cho (2019), delivered the first major leap in passage ranking quality and remains the foundation of every modern reranker shipped today.
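
A compact sketch of that two-stage pattern, assuming a small in-memory corpus; in production the first stage would be a vector database or BM25 index, and both model names are illustrative:

```python
# Retrieve-then-rerank sketch: bi-encoder shortlist, cross-encoder rescore.
from sentence_transformers import SentenceTransformer, CrossEncoder, util

corpus = [
    "To cancel your plan, open Settings > Billing and choose Cancel.",
    "Subscriptions renew automatically at the end of each billing cycle.",
    # ... thousands more documents in a real index
]

bi_encoder = SentenceTransformer("all-MiniLM-L6-v2")
cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

# Embeddings are precomputed once; this is what makes stage one cheap.
corpus_embeddings = bi_encoder.encode(corpus, convert_to_tensor=True)

def search(query: str, shortlist: int = 50, final_k: int = 5) -> list[str]:
    # Stage 1: cheap vector search over the precomputed embeddings.
    query_embedding = bi_encoder.encode(query, convert_to_tensor=True)
    hits = util.semantic_search(query_embedding, corpus_embeddings,
                                top_k=min(shortlist, len(corpus)))[0]
    # Stage 2: the cross-encoder rescores only the shortlist.
    pairs = [(query, corpus[hit["corpus_id"]]) for hit in hits]
    scores = cross_encoder.predict(pairs)
    ranked = sorted(zip(scores, pairs), key=lambda x: x[0], reverse=True)
    return [doc for _, (_, doc) in ranked[:final_k]]
```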

How It’s Used in Practice

The mainstream encounter is inside RAG pipelines that already use a vector database. Teams discover quickly that vector similarity returns documents that “look right” by embedding distance but miss the actual answer. Adding a cross-encoder rerank stage between the vector retriever and the LLM closes that gap. The flow is straightforward: the bi-encoder fetches the top fifty to one hundred candidates in tens of milliseconds, the cross-encoder rescores them, and only the top five to ten are passed to the model as context. Hosted APIs from Cohere, Jina, Voyage, and Mixedbread make this a single HTTP call.
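
For the hosted route, the call typically looks like the sketch below using Cohere's Python SDK; the model name and response fields match the SDK at the time of writing, so verify against the current docs before relying on exact signatures:

```python
import cohere

co = cohere.Client("YOUR_API_KEY")  # placeholder; use your real key

# Shortlist from the first-stage retriever (illustrative stand-ins).
docs = [
    "Subscriptions renew automatically at the end of each billing cycle.",
    "To cancel your plan, open Settings > Billing and choose Cancel.",
    # ... the rest of the top 50-100 candidates
]

response = co.rerank(
    model="rerank-english-v3.0",  # model name current as of writing
    query="how do I cancel my subscription",
    documents=docs,
    top_n=5,  # keep only what the LLM will actually see
)
top_docs = [docs[result.index] for result in response.results]
```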

The second common scenario is enterprise search and FAQ matching, where intent-specific phrasing matters more than topical overlap. A cross-encoder catches the difference between “how do I cancel my subscription” and “why was my subscription cancelled” — a distinction that bi-encoder embeddings often blur.
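
To see the effect directly, the sketch below compares how close those two queries sit in bi-encoder embedding space with how a cross-encoder scores each one against the same how-to document; checkpoints are illustrative and exact numbers will vary by model:

```python
from sentence_transformers import SentenceTransformer, CrossEncoder, util

q_how = "how do I cancel my subscription"
q_why = "why was my subscription cancelled"
doc = "To cancel your subscription, open Settings > Billing and choose Cancel."

# Bi-encoder view: the two queries share most of their tokens, so their
# independent embeddings tend to land close together.
bi_encoder = SentenceTransformer("all-MiniLM-L6-v2")
similarity = util.cos_sim(bi_encoder.encode(q_how), bi_encoder.encode(q_why))
print(f"query-query cosine similarity: {similarity.item():.2f}")

# Cross-encoder view: joint attention reads each query's intent against
# the document, so the two pairs can receive very different scores.
cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
scores = cross_encoder.predict([(q_how, doc), (q_why, doc)])
print(f"how-to intent: {scores[0]:.2f}  why intent: {scores[1]:.2f}")
```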

Pro Tip: Set the rerank top-k by your LLM’s context budget, not by how many candidates the retriever returns. The point of the rerank stage is to be aggressive with first-stage recall (retrieve a hundred or more) and ruthless with second-stage precision (keep five to ten). If you are still passing fifty documents to the LLM after reranking, the rerank stage is doing nothing.
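
One way to encode that rule is a sizing fragment like this hypothetical sketch; the token figures are assumptions to replace with your own stack's numbers:

```python
# Hypothetical sizing fragment: derive the rerank cutoff from the LLM's
# context budget instead of the retriever's output size.
SHORTLIST_K = 100        # stage one: aggressive recall
CONTEXT_BUDGET = 8_000   # tokens reserved for retrieved context (assumption)
TOKENS_PER_CHUNK = 800   # rough average chunk size (assumption)

# Stage two: ruthless precision, capped by what the context can hold.
FINAL_K = min(10, CONTEXT_BUDGET // TOKENS_PER_CHUNK)
print(FINAL_K)  # 10 with these numbers; shrinks if chunks grow or budget tightens
```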

When to Use / When Not

| Scenario | Use | Avoid |
| --- | --- | --- |
| RAG pipeline where retrieval recall is fine but answer quality is patchy | ✓ | |
| Pure first-stage retrieval over millions of documents with no shortlist | | ✓ |
| Search where users complain that the right answer is on page two | ✓ | |
| Hard sub-100ms latency budget on every single query | | ✓ |
| FAQ matching where a slight paraphrase changes the correct answer | ✓ | |
| Bulk batch reranking of every result for every user across the entire index | | ✓ |

Common Misconception

Myth: A cross-encoder is just a bigger or more accurate version of an embedding model — pick the better one and use it everywhere. Reality: They produce different things. A bi-encoder outputs a vector you can store and search; a cross-encoder outputs a score for one specific pair and stores nothing. You cannot index a cross-encoder, and a bi-encoder cannot capture the same fine-grained interactions. The architectures are complementary, not interchangeable.

One Sentence to Remember

A cross-encoder is the second pass of a two-pass retrieval system — slow per item, sharp on small lists, and useless as a primary retriever.

FAQ

Q: What is the difference between a cross-encoder and a bi-encoder? A: A bi-encoder embeds query and document independently, then compares vectors — fast and indexable. A cross-encoder scores the pair jointly in one transformer pass — slower but far more accurate.

Q: Why are cross-encoders only used for reranking and not retrieval? A: As the Sentence Transformers docs note, a cross-encoder needs one full transformer pass per candidate, so scoring millions of documents per query is computationally prohibitive. Use a fast retriever first, then rerank the shortlist.

Q: Do I need a cross-encoder if my RAG already uses hybrid search? A: Often yes. Hybrid search improves recall, but a cross-encoder reorders the shortlist using full query–document attention, which usually lifts answer quality even when retrieval already returns the right documents in the top fifty.

Sources

Sentence Transformers documentation, Cross-Encoders: https://www.sbert.net
Nogueira, R., & Cho, K. (2019). Passage Re-ranking with BERT. arXiv:1901.04085. https://arxiv.org/abs/1901.04085

Expert Takes

Cross-encoders and bi-encoders are not two flavours of the same model — they are different functions. A bi-encoder learns a metric space where independent embeddings can be compared by cosine similarity. A cross-encoder learns a relevance function that requires both inputs at inference time and produces nothing reusable. Conflating them costs precision exactly where you most need it: at the boundary between “topically related” and “actually answers the question.”

Treat the rerank stage as a contract. The retriever’s job is recall, the cross-encoder’s job is precision, and the LLM’s job is synthesis. Spec each stage’s input and output explicitly. If your retriever returns a hundred candidates and your prompt still passes them all to the model, the rerank stage is decorative. Define the top-k cutoff in your config, log the score distribution, and let the boundaries do the work.

The reranker market just consolidated around cross-encoder architectures because the alternatives could not match the precision lift. Every serious vendor — Cohere, Voyage, Jina, Mixedbread, BGE — ships a cross-encoder under the hood, and the differentiation has moved to backbone choice, listwise scoring, and context length. If you are designing a retrieval stack today and skipping rerank entirely, you are competing against pipelines that retire ambiguity by default.

A cross-encoder reorders what the user sees, which means it shapes which documents an LLM ever reasons over. The retriever still casts the wide net, but the rerank step quietly decides whose evidence makes it into the final answer. Who audits that score? When a critical document is dropped from the top-k because the cross-encoder weighted some lexical pattern higher than relevance, the failure is invisible to everyone downstream. The model never sees what it never sees.