Bi-Encoder

Also known as: Dual encoder, Two-tower model, Siamese encoder

Bi-Encoder: A bi-encoder is a transformer architecture that encodes a query and a document independently into fixed-size vectors, enabling fast similarity search via precomputed embeddings. It is the standard first-stage retriever in vector search and RAG pipelines.

A bi-encoder encodes queries and documents separately into fixed-size vectors, so you can precompute document embeddings once and search them in milliseconds. It is the architecture behind dense retrieval and most vector search systems.

What It Is

The core idea is independence: one transformer encodes the query, another (or the same one with shared weights) encodes the document, and the two never see each other during encoding. Each input becomes a fixed-length vector, typically 384 to 1024 dimensions, and relevance is measured by cosine similarity or dot product between the two vectors. According to Sentence Transformers Docs, this separation is what makes bi-encoders fast: document embeddings can be precomputed, indexed with a library like FAISS or a vector database like Qdrant, and searched in milliseconds over millions of entries using approximate nearest-neighbor search.
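The encode-once, search-many flow described above can be sketched in a few lines. This is a minimal illustration, not a real model: `toy_encode` is a hypothetical stand-in (a hashed bag-of-words) for a transformer encoder, and `DIM`, `documents`, and `search` are names invented for the example. The search is a brute-force scan rather than an approximate nearest-neighbor index.

```python
import math
import zlib

DIM = 64  # illustrative only; real models typically emit 384 to 1024 dimensions


def toy_encode(text):
    """Hypothetical stand-in for a transformer encoder: hashed bag-of-words, L2-normalized."""
    vec = [0.0] * DIM
    for token in text.lower().split():
        bucket = zlib.crc32(token.encode("utf-8")) % DIM
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def cosine(a, b):
    # Both vectors are unit length, so the dot product is the cosine similarity.
    return sum(x * y for x, y in zip(a, b))


documents = [
    "python generator protocol and the yield keyword",
    "resetting a forgotten account password",
    "tuning approximate nearest neighbor indexes",
]

# Ingestion time: encode every document once and keep only the vectors.
index = [toy_encode(doc) for doc in documents]


def search(query, k=2):
    # Query time: encode only the query, then compare it with the stored vectors.
    q = toy_encode(query)
    scored = sorted(((cosine(q, vec), i) for i, vec in enumerate(index)), reverse=True)
    return [(documents[i], score) for score, i in scored[:k]]


results = search("resetting a forgotten account password")
```

The query and the documents pass through the same function but never in the same call, which is the independence property: nothing about the query is needed to build `index`.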

The architecture traces back to Siamese networks, but the modern version was established by Sentence-BERT (Reimers & Gurevych, 2019). The key insight was that running BERT as a cross-encoder, with full attention across both texts, is powerful for pair classification but computationally prohibitive for retrieval at scale, because every candidate pair needs its own forward pass. Sentence-BERT instead trains weight-sharing BERT towers with siamese and triplet objectives that pull matching pairs closer and push non-matching pairs apart in embedding space. Comparing all pairs among n sentences then takes n encoder passes plus cheap vector comparisons, rather than n(n-1)/2 cross-encoder passes, so encoding cost grows linearly with corpus size instead of quadratically.
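The pull-together, push-apart idea can be made concrete with a small loss function. The sketch below assumes the common in-batch-negatives formulation (each query's paired document is the positive, every other document in the batch is a negative), which is one widely used contrastive objective and not necessarily the exact loss from the original paper. The function name and the `scale` value are assumptions chosen for illustration.

```python
import math


def in_batch_contrastive_loss(query_vecs, doc_vecs, scale=20.0):
    """Mean cross-entropy where doc_vecs[i] is the positive for query_vecs[i]
    and all other documents in the batch act as negatives."""
    losses = []
    for i, q in enumerate(query_vecs):
        # Scaled similarity between this query and every document in the batch.
        logits = [scale * sum(a * b for a, b in zip(q, d)) for d in doc_vecs]
        # Numerically stable log-sum-exp over the batch.
        peak = max(logits)
        log_total = peak + math.log(sum(math.exp(x - peak) for x in logits))
        # Negative log-probability assigned to the true pair.
        losses.append(log_total - logits[i])
    return sum(losses) / len(losses)


queries = [[1.0, 0.0], [0.0, 1.0]]
aligned_docs = [[1.0, 0.0], [0.0, 1.0]]     # each query sits next to its own document
misaligned_docs = [[0.0, 1.0], [1.0, 0.0]]  # each query sits next to the wrong document

low = in_batch_contrastive_loss(queries, aligned_docs)
high = in_batch_contrastive_loss(queries, misaligned_docs)
```

Training lowers this loss by moving each query vector toward its paired document and away from the rest of the batch, so `low` is close to zero and `high` is large.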

The tradeoff is precision. Because the query and document never interact during encoding, a bi-encoder cannot capture fine-grained token-level dependencies. It sees that two texts are “about the same topic” but can miss that one answers a question the other merely mentions. A query like “does Python support async generators” and a document about “Python’s generator protocol” will land close in embedding space, even though the document never discusses async. This is exactly the gap that cross-encoder rerankers fill in a two-stage retrieve-then-rerank pipeline.

How It’s Used in Practice

Nearly every text vector search system runs a bi-encoder under the hood. When you call an embedding API (OpenAI’s text-embedding-3-small, Cohere’s embed-v4, or a local Sentence Transformers model), you are using a model in the bi-encoder role: each text is embedded on its own. The document side is encoded at ingestion time and stored in the index. At query time, only the query needs encoding, and the nearest-neighbor search over precomputed vectors typically returns candidates in single-digit milliseconds.

In RAG pipelines, the bi-encoder is the first stage: it retrieves the top 50 to 200 candidates by vector similarity. A cross-encoder reranker then rescores that shortlist with full query–document attention, and only the top 5 to 10 documents are passed to the LLM as context. The bi-encoder handles recall (find everything plausibly relevant), the cross-encoder handles precision (keep only what actually answers the question).
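The two-stage flow can be sketched as a single function. Everything here is illustrative: `retrieve_then_rerank`, `encode_query`, and `pair_score` are hypothetical names, the document vectors are hand-written rather than produced by a model, and the token-overlap scorer only stands in for a cross-encoder.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def retrieve_then_rerank(query, documents, doc_vectors, encode_query, pair_score,
                         recall_k=100, final_k=5):
    q = encode_query(query)
    # Stage 1 (bi-encoder): cheap vector similarity against precomputed vectors.
    by_similarity = sorted(range(len(documents)),
                           key=lambda i: dot(q, doc_vectors[i]), reverse=True)
    shortlist = by_similarity[:recall_k]
    # Stage 2 (cross-encoder stand-in): score each (query, document) pair jointly.
    reranked = sorted(shortlist,
                      key=lambda i: pair_score(query, documents[i]), reverse=True)
    return [documents[i] for i in reranked[:final_k]]


documents = [
    "python generator protocol",
    "python async generators explained",
    "cooking pasta at home",
]
doc_vectors = [[0.9, 0.1], [0.8, 0.2], [0.0, 1.0]]  # hand-written toy embeddings


def encode_query(text):
    return [1.0, 0.0]  # hypothetical fixed query embedding for the demo


def pair_score(query, document):
    # Toy joint scorer: number of shared tokens between query and document.
    return len(set(query.split()) & set(document.split()))


query = "python async generators"
best = retrieve_then_rerank(query, documents, doc_vectors, encode_query, pair_score,
                            recall_k=2, final_k=1)
too_narrow = retrieve_then_rerank(query, documents, doc_vectors, encode_query, pair_score,
                                  recall_k=1, final_k=1)
```

With `recall_k=2` the second stage promotes the document that actually mentions async generators. With `recall_k=1` that document never reaches the reranker, which is why first-stage recall bounds what the rest of the pipeline can return.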

Beyond search, bi-encoders power semantic deduplication (find near-duplicate documents by embedding similarity), clustering (group documents by topic without labels), and recommendation systems (embed users and items into the same space).
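As one example of these uses, semantic deduplication can be sketched as a greedy pass over embeddings. The 0.95 threshold and the function names are assumptions for illustration. A suitable threshold depends on the model and the data, and the quadratic scan shown here would be replaced by an index for large collections.

```python
import math


def cosine(a, b):
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def deduplicate(vectors, threshold=0.95):
    """Return indices of vectors to keep, dropping any vector whose cosine
    similarity to an already kept vector reaches the threshold."""
    kept = []
    for i, vec in enumerate(vectors):
        if all(cosine(vec, vectors[j]) < threshold for j in kept):
            kept.append(i)
    return kept


embeddings = [
    [1.0, 0.0],    # document A
    [0.99, 0.05],  # near-duplicate of A
    [0.0, 1.0],    # unrelated document
]
kept_indices = deduplicate(embeddings)
```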

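Pro Tip: Pooling strategy has a real effect on retrieval quality. Mean pooling over all token embeddings usually outperforms CLS-token pooling for Sentence-BERT-style models, but always use the pooling a model was trained with. Models trained with Matryoshka representation learning let you truncate vectors to smaller dimensions at search time without retraining; renormalize the truncated vectors if you score with dot product.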
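The two pooling strategies and the truncation step can be sketched on raw lists of token vectors. The function names and the tiny 4-dimensional inputs are illustrative assumptions. Truncation is only meaningful for models trained to support it.

```python
import math


def mean_pool(token_embeddings, attention_mask):
    """Average the token vectors, skipping padding positions (mask value 0)."""
    dim = len(token_embeddings[0])
    total = [0.0] * dim
    count = 0
    for vec, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for j in range(dim):
                total[j] += vec[j]
    return [t / max(count, 1) for t in total]


def cls_pool(token_embeddings):
    """Use only the first token's vector as the sentence embedding."""
    return list(token_embeddings[0])


def truncate_and_normalize(vector, dims):
    """Keep the first `dims` values and rescale to unit length.
    Assumes the model was trained with a Matryoshka-style objective."""
    head = vector[:dims]
    norm = math.sqrt(sum(x * x for x in head)) or 1.0
    return [x / norm for x in head]


tokens = [
    [1.0, 0.0, 2.0, 0.0],  # first ([CLS]-position) token
    [3.0, 4.0, 0.0, 0.0],  # a content token
    [9.0, 9.0, 9.0, 9.0],  # padding, masked out below
]
mask = [1, 1, 0]

mean_vec = mean_pool(tokens, mask)
cls_vec = cls_pool(tokens)
short_vec = truncate_and_normalize(mean_vec, 2)
```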

When to Use / When Not

Scenario	Use	Avoid
First-stage retrieval over millions of documents	✅
Real-time semantic search with sub-100ms latency	✅
Precision-critical reranking of a small shortlist		❌ Use cross-encoder
Sentence similarity for deduplication or clustering	✅
Pairwise relevance scoring where token-level interaction matters		❌ Use cross-encoder
Building a vector index that can be searched offline	✅
Tiny corpus (< 1000 documents) where brute-force cross-encoding is feasible		❌ Cross-encoder is fast enough

Common Misconception

Myth: A bi-encoder and a cross-encoder are interchangeable — just pick the more accurate one. Reality: They produce fundamentally different outputs. A bi-encoder outputs a reusable vector that can be stored, indexed, and searched without the original text. A cross-encoder outputs a score for one specific query–document pair and stores nothing. You cannot index a cross-encoder, and a bi-encoder cannot match cross-encoder precision on short lists. Production systems use both: bi-encoder for recall, cross-encoder for precision.

One Sentence to Remember

A bi-encoder turns text into a vector once and searches it forever — fast and scalable, but blind to the fine-grained match that only a cross-encoder can see.

FAQ

Q: What is the difference between a bi-encoder and a cross-encoder? A: A bi-encoder encodes query and document independently into vectors that can be precomputed and indexed — fast but coarse. A cross-encoder processes the pair jointly through shared attention — slow but precise. Most retrieval systems use a bi-encoder first, then a cross-encoder to rerank the shortlist.

Q: How do I choose which bi-encoder model to use? A: Match the model to your domain and dimension budget. According to the MTEB Leaderboard, multilingual models like gte-multilingual-base work well across languages, while English-only models like all-MiniLM-L6-v2 are smaller and faster. Check MTEB retrieval benchmarks for your language and task, and test on your own data — benchmark rankings do not always transfer.

Q: Can I fine-tune a bi-encoder on my own data? A: Yes, and it usually helps significantly. According to Sentence Transformers Docs, contrastive fine-tuning with domain-specific query–document pairs is the single highest-impact improvement for retrieval quality. Even a few thousand labeled pairs can close the gap between a general-purpose model and a domain-specific one.

Sources

Sentence Transformers Docs: Sentence Transformers documentation. Canonical library and documentation for bi-encoder training and inference.
arXiv (Reimers & Gurevych, 2019): Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. Foundational paper establishing bi-encoder architecture for sentence similarity.
MTEB Leaderboard: Massive Text Embedding Benchmark. Benchmark for comparing bi-encoder model quality across retrieval, clustering, and classification tasks.

Expert Takes

MONA

The bi-encoder’s independence assumption is both its power and its limit. Encoding query and document separately means you lose the cross-attention that captures token-level interactions — negation, coreference, temporal qualifiers all get averaged into a single vector. The contrastive training objective partially compensates by learning to map semantically similar inputs to nearby regions of the embedding space, but it cannot recover information that was never computed. Understanding this boundary is what separates informed retrieval design from embedding-and-pray.

MAX

In production, your bi-encoder choice determines three things: index size, query latency, and recall quality. Pick a model, run it on your actual queries against your actual corpus, and measure recall@50 before you touch anything else. If recall is below 85%, the problem is the embedding model or your chunking, not the reranker. If recall is above 95% but answer quality is still poor, add a cross-encoder. Do not skip the measurement step — the MTEB leaderboard measures academic benchmarks, not your data.
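The measurement step can be as small as the sketch below. It assumes you have, for each test query, a ranked list of retrieved document ids and a hand-labeled set of relevant ids. The function names and the sample ids are hypothetical.

```python
def recall_at_k(retrieved_ids, relevant_ids, k=50):
    """Fraction of the relevant documents that appear in the top k results."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0  # convention chosen here for queries with no labeled relevant docs
    return len(set(retrieved_ids[:k]) & relevant) / len(relevant)


def mean_recall_at_k(runs, k=50):
    """Average recall@k over (retrieved_ids, relevant_ids) pairs, one per query."""
    return sum(recall_at_k(retrieved, relevant, k) for retrieved, relevant in runs) / len(runs)


runs = [
    (["d1", "d7", "d3"], {"d1", "d9"}),  # one of two relevant docs in the top 2
    (["d4", "d2", "d8"], {"d2"}),        # the only relevant doc is in the top 2
]
score = mean_recall_at_k(runs, k=2)
```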

DAN

The embedding model market is consolidating fast. OpenAI, Cohere, Google, and Alibaba all ship bi-encoders as commodity APIs, and the differentiation has moved to multimodal support, Matryoshka dimensions, and late-interaction hybrids like ColBERT. The bet for 2026 is that bi-encoders become infrastructure — invisible, interchangeable, and priced per million tokens — while the value moves to the orchestration layer above them.

ALAN

A bi-encoder decides what your system can find. Everything downstream — the reranker, the LLM, the final answer — operates only on what the bi-encoder retrieved. If the embedding model systematically underrepresents certain languages, domains, or perspectives, those blind spots propagate silently through the entire pipeline. The retriever is the least audited and most consequential component in any RAG system, and treating embedding quality as a solved problem is how invisible bias enters production.
