ALAN opinion · 9 min read

Approximate by Design: What Gets Lost When Vector Indexing Decides Which Results You See

Conceptual illustration of approximate search results with missing documents representing recall gaps in vector indexing

The Hard Truth

Imagine a hiring system that never rejects your resume — it simply never finds it. Not because you were unqualified, but because the search algorithm decided, before any human reviewed anything, that your record was close enough to skip. How would you know? How would anyone?

Every major retrieval system in production — the ones powering AI-assisted hiring, medical question-answering, legal discovery — runs on a principle most users never encounter: controlled imprecision. Vector indexing does not find the best results. It finds results that are probably close to the best, fast enough to matter. The distance between “best” and “probably close” is where this essay lives.

The Silence Between Results

The question that rarely surfaces in engineering discussions about similarity search algorithms is not whether approximate nearest neighbor search works — it demonstrably does — but what it means when a system designed to skip results operates in a domain where the skipped result was the one that mattered.

When an embedding places your medical symptom description in a high-dimensional space and the retrieval layer returns the twenty nearest documents, the system has already made a decision about what you will not see. That decision was not made by a clinician, a policy, or even a rule anyone wrote down. It was made by the geometry of an index — by the structure of a graph that connects vectors through navigable small-world properties, or by the boundaries of an IVF (Inverted File Index) partition that slices the space into clusters a query may or may not visit.
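The partition mechanism is easy to see in miniature. The sketch below is a toy IVF-style index in plain numpy, not any production library: the corpus is split into partitions by nearest centroid, and a query searches only the `nprobe` partitions whose centroids lie closest. All names and the setup are hypothetical, chosen purely for illustration. A vector sitting near a partition boundary can have its true nearest neighbor in a cluster the query never visits.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy corpus of 2-D vectors, grouped into partitions by nearest centroid,
# mimicking the coarse quantizer of an IVF index (hypothetical setup).
corpus = rng.normal(size=(1000, 2))
centroids = rng.normal(size=(8, 2))  # 8 partitions
assignment = np.argmin(
    np.linalg.norm(corpus[:, None, :] - centroids[None, :, :], axis=2), axis=1
)

def exact_nn(query):
    """Brute-force nearest neighbor over the whole corpus."""
    return int(np.argmin(np.linalg.norm(corpus - query, axis=1)))

def ivf_nn(query, nprobe=1):
    """Search only the `nprobe` partitions whose centroids are closest."""
    order = np.argsort(np.linalg.norm(centroids - query, axis=1))
    probe = order[:nprobe]
    candidates = np.where(np.isin(assignment, probe))[0]
    dists = np.linalg.norm(corpus[candidates] - query, axis=1)
    return int(candidates[np.argmin(dists)])

# Queries near partition boundaries can miss their true nearest neighbor
# because it lives in a partition the search never visits.
queries = rng.normal(size=(200, 2))
misses = sum(exact_nn(q) != ivf_nn(q, nprobe=1) for q in queries)
print(f"{misses}/200 queries missed their true nearest neighbor at nprobe=1")
```

Raising `nprobe` to cover all partitions recovers exact search — which is exactly the recall-versus-latency dial the next section describes.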

The gap between what exists in the index and what the system returns is not a failure — it is the architecture functioning as intended. Yet that gap carries ethical weight nobody has been asked to account for, and in high-stakes domains — hiring, healthcare triage, criminal justice — the weight falls on specific people, not abstract metrics.

The Reasonable Case for Controlled Imprecision

The engineering defense of approximate search is sound, and it deserves to be stated at full strength before we complicate it. Exact nearest neighbor search does not scale. At billions of vectors, brute-force comparison becomes computationally absurd. Libraries like Faiss, DiskANN, and ScaNN exist because they solved a real problem — making similarity search fast enough to be useful at the scale modern applications demand.

The production recall sweet spot sits around 90 to 95 percent, and pushing from 95 to 99 percent may triple query latency with minimal perceived quality improvement (Proptimise AI). For a music recommendation engine or an e-commerce search bar, that tradeoff is not just reasonable — it is elegant. The missing 5 to 10 percent of results, in those contexts, costs almost nothing.

The assumption embedded in that tradeoff is that what gets dropped does not matter much. For most applications, the assumption holds. The question is what happens when it does not.

What the Missing Results Actually Are

Here is where the comfortable narrative develops a crack. The assumption that dropped results distribute randomly — that approximation shaves a thin, uniform layer off the bottom of every query — does not survive contact with the data.

In one HNSW benchmark, recall at k=40 measured 0.73, meaning 27 percent of true nearest neighbors were absent from the result set (OpenSource Connections). That figure comes from a single dataset, and different data distributions yield different gaps. But the more troubling finding is not the aggregate number — it is the pattern beneath it. Certain embeddings produce what researchers describe as “unfortunate” graph representations, causing specific documents to vanish from results regardless of how many neighbors you request. The disappearance is not random. It is structural.
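For readers unfamiliar with the metric behind that figure, recall@k is just the fraction of the k true nearest neighbors that appear in the returned set. A minimal definition (the ID lists and numbers below are illustrative, not taken from the cited benchmark):

```python
def recall_at_k(true_neighbors, returned, k):
    """Fraction of the k true nearest neighbors present in the returned set.

    Both arguments are document-ID lists ordered by distance; only the
    top-k of each are compared, a common benchmark convention.
    """
    truth = set(true_neighbors[:k])
    found = truth & set(returned[:k])
    return len(found) / k

# A result set containing 29 of the 40 true neighbors scores
# 29/40 = 0.725, close to the 0.73 figure cited above.
truth = list(range(40))
returned = list(range(29)) + list(range(100, 111))  # 11 wrong answers
print(recall_at_k(truth, returned, k=40))
```

The aggregate number hides the structural pattern the benchmark found: recall can be 0.73 on average while specific documents never appear at all.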

Locality-sensitive hashing research confirms the pattern theoretically. Standard LSH does not guarantee equal return probability for equidistant neighbors — closer points are systematically favored, which means the algorithm violates individual fairness by design (Aumüller et al.). Fair alternatives exist in theory, and LSH-based structures can be made equitable without significant efficiency loss, but practical deployment of fair approximate nearest neighbor search is not yet mainstream.
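The angle-dependence at the heart of that result can be demonstrated with a few lines of numpy. The simulation below uses sign-random-projection (hyperplane) LSH, not the exact constructions studied by Aumüller et al., and the vectors are contrived for illustration: two points at the same Euclidean distance from a query collide with the query's hash at very different rates, because collision probability for this family tracks the angle between vectors, not the distance.

```python
import numpy as np

rng = np.random.default_rng(1)

q = np.array([1.0, 0.0])
a = np.array([1.0, 0.5])   # offset perpendicular to q, distance 0.5
b = np.array([1.5, 0.0])   # offset along q, also distance 0.5

def collision_rate(x, y, trials=20000):
    """Fraction of random hyperplanes on which x and y fall on the same side.

    For sign-random-projection LSH the expected rate is 1 - angle(x, y)/pi,
    so it depends on the angle between the vectors, not on how far apart
    they are in Euclidean terms.
    """
    planes = rng.normal(size=(trials, 2))
    return np.mean(np.sign(planes @ x) == np.sign(planes @ y))

print(collision_rate(q, a))  # roughly 0.85: a sometimes hashes away from q
print(collision_rate(q, b))  # 1.0: b is collinear with q and always collides
```

Two equidistant candidates, two very different probabilities of being returned — the "systematic favoring" the research describes, in its simplest form.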

What this means is that the results a system drops are not a random sample of irrelevance. They are a systematic exclusion, shaped by the geometry of the index and concentrated in the queries where precision matters most — rare terminology, niche domains, edge-case formulations that do not cluster neatly in vector space. Vector search struggles specifically with specialized terminology common in legal, healthcare, and scientific contexts.

The Library That Edits Itself

When a library acquires books but its catalog system surfaces only popular titles efficiently, we do not say the library has censored anything. We say the catalog has a bias — invisible to the patron who trusts the results they receive. The patron does not see the books the system never returned. They experience completeness where there is, in fact, a curated absence.

Approximate retrieval in high-stakes domains operates the same way. A RAG system built on dense vector search will miss exact regulatory codes, part numbers, and domain-specific jargon — the most common accuracy failure in production RAG systems (Redis Engineering). The system does not flag the omission. It returns its best approximate answer with the same confidence it would display for a perfect match. The user — a clinician, a judge, a hiring manager — receives no signal that something was missing.

Combining keyword and vector search improves retrieval recall by a factor of 3 to 3.5 and end-to-end answer accuracy by 11 to 15 percent on complex reasoning tasks (Redis Engineering). Hybrid approaches are not exotic. They exist. The question is why they are not the default in domains where the cost of a missed result is borne by a person, not a metric.
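One common way to combine the two rankers is reciprocal rank fusion (RRF), which needs only the ranked ID lists, not comparable scores. The sketch below is a generic RRF implementation with hypothetical document IDs — it is not the Redis pipeline behind the figures above — showing how a keyword match the vector ranker dropped entirely can still surface in the fused list.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked ID lists into one.

    Each document scores sum(1 / (k + rank)) over the rankings it appears
    in; k = 60 is the constant used in the original RRF paper and damps
    the influence of any single ranker.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical result lists: the keyword ranker surfaces the exact
# regulatory code ("doc_reg") that the vector ranker misses entirely.
keyword_ranking = ["doc_reg", "doc_a", "doc_b"]
vector_ranking = ["doc_a", "doc_b", "doc_c"]

# doc_reg survives fusion despite being absent from the vector ranking.
print(reciprocal_rank_fusion([keyword_ranking, vector_ranking]))
```

The design point is that fusion degrades gracefully: a document only one ranker finds is demoted, not deleted — which is precisely the property a high-stakes retrieval layer needs.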

Approximation as Invisible Policy

Thesis: when approximate retrieval operates in domains where missing information causes harm to individuals, it functions as an unaudited policy decision — and should be governed as one.

The EU AI Act designates employment, education, essential services, healthcare triage, law enforcement, and justice as high-risk domains requiring conformity assessments by August 2026 (EU AI Act, Annex III). The Act does not mention approximate nearest neighbor search directly — but AI systems using ANN retrieval in these domains inherit the Act’s compliance obligations. The retrieval layer is part of the system, and the system is what gets assessed.

The DIREC research project — a DKK 3.57 million initiative running from 2022 to 2025 — studied precisely this intersection: how ANN approximation biases affect discrimination discovery, diversity maximization, and algorithmic fairness in decision-making (DIREC). That this research exists signals that the question has entered academic consciousness. What it has not yet entered is engineering practice — the space where indexes are configured, recall thresholds are set, and nobody writes down who decided what recall threshold was good enough.

The Obligations We Have Not Named

If approximate search in high-stakes contexts is a policy decision, then several uncomfortable obligations follow. Who is responsible for characterizing the recall profile of a retrieval system — not as an aggregate number, but as a distribution across query types and populations? Who audits whether the documents that vanish are disproportionately relevant to marginalized groups, rare medical conditions, or minority legal precedents?

These are not rhetorical flourishes. They are governance gaps — persistent because the engineering frame treats recall as a performance metric rather than an accountability surface. The teams that tune index parameters rarely sit in the same room as the teams responsible for the decisions those indexes inform.

Where This Argument Is Weakest

Intellectual honesty requires naming the limits of this position. No peer-reviewed study has directly quantified harm caused by ANN recall gaps in healthcare, legal, or hiring contexts. The ethical risk argument presented here is inferential — constructed from the documented properties of approximate search combined with the known sensitivity of high-stakes domains. It is a plausible chain of reasoning, not an empirical finding. If recall gaps in these domains turn out to distribute benignly, or if hybrid retrieval becomes the default before harm materializes at scale, this argument loses much of its force.

The Question That Remains

We build retrieval systems that are approximate by design and place them at the center of decisions about people’s health, employment, and legal standing. The systems work — in the aggregate sense that engineers measure. But the aggregate is not where harm occurs. Harm occurs at the edges, in the specific query that returned four results when there should have been five, in the regulatory precedent the system never surfaced, in the resume that existed in the index but never appeared in the results.

Who is accountable for the results that were never returned — and how would anyone know they were missing?

Disclaimer

This article is for educational purposes only and does not constitute professional advice. Consult qualified professionals for decisions in your specific situation.

AI-assisted content, human-reviewed. Images AI-generated.
