Garbage In, Garbage Out: The Ethical Cost of RAG Parsing Errors

The Hard Truth
Imagine a clinician reads a confident AI summary of a patient’s chart and acts on it. The model did not hallucinate. It read exactly what the parser handed it — a malformed table, a footnote spliced into the wrong paragraph, a redaction misread as a number. Whose mistake is the patient living with?
Most of the public conversation about knowledge graphs for RAG and retrieval systems treats the document parsing and extraction step as mere plumbing: a preprocessing chore solved by libraries and forgotten. That framing is convenient, and it is also where the ethical risk hides. Long before a model writes a sentence, a parser has already decided what counts as a sentence at all.
We Have Made Reading Itself a Hidden Variable
The whole field has been arguing about hallucinations for two years, as if the danger lived inside the model. But the danger of high-stakes RAG begins earlier: the moment a PDF, a scan, a contract, or a clinical note is converted into tokens. That conversion is not neutral transcription. It is interpretation. Tables collapse into prose, footnotes attach to the wrong paragraph, redactions become numbers, multi-column layouts shuffle into nonsense.
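To see how un-neutral that conversion is, consider a toy two-column page. The sketch below uses invented (x, y, text) spans standing in for a real extractor's output; the coordinates and the X_SPLIT column boundary are illustrative assumptions, not any particular parser's behavior. A naive top-to-bottom sort interleaves the columns and destroys a dosage instruction; sorting within columns first preserves it.

```python
# Illustrative only: invented (x, y, text) spans standing in for a real
# PDF extractor's output. y grows downward, as in most PDF coordinate dumps.
spans = [
    (40, 100, "The patient was"),   (320, 100, "Dosage: 5 mg"),
    (40, 120, "discharged with"),   (320, 120, "twice daily, not"),
    (40, 140, "no complications."), (320, 140, "to exceed 10 mg."),
]

# Naive extraction: sort by vertical position alone.
naive = " ".join(t for _, _, t in sorted(spans, key=lambda s: (s[1], s[0])))
print(naive)
# -> "The patient was Dosage: 5 mg discharged with twice daily, not ..."
# The columns interleave; the dosage sentence is destroyed.

# Column-aware extraction: bucket spans by column first, then read top-down.
X_SPLIT = 200  # assumed column boundary for this toy layout
left = [t for x, _, t in sorted(spans, key=lambda s: s[1]) if x < X_SPLIT]
right = [t for x, _, t in sorted(spans, key=lambda s: s[1]) if x >= X_SPLIT]
print(" ".join(left + right))
# -> "The patient was discharged with no complications. Dosage: 5 mg ..."
```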
In legal, medical, and financial contexts, those quiet distortions are the upstream cause of decisions that hurt people. And almost nobody outside the engineering team sees them happen. The user sees a confident answer, not the raw extraction. The patient or defendant or retail investor sees nothing at all.
So the ethical question is not “does this AI hallucinate?” but something harder to govern: who is accountable when the machine reads correctly, but the document was read wrong before the machine ever saw it?
The Story We Tell Ourselves About Parsers
The conventional wisdom is reasonable. Document parsing has improved dramatically. Modern parsers can carve structure out of difficult layouts that would have defeated tools from a few years ago. Vendor benchmarks tell an encouraging story: Docling reports table-extraction accuracy in the high nineties, and Unstructured publishes leading scores on its own SCORE-Bench. Independent comparisons such as the LlamaIndex ParseBench round out the picture.
Engineers therefore treat parsing as a solved class of problem — pick a tool, configure chunking, move on to the more interesting work of retrievers and re-rankers. Lawyers and clinicians, who do not see the pipeline, inherit this confidence at a remove. They are told the system is “trained on legal documents” or “evaluated for clinical use,” and they reasonably assume the document layer behaves like a faithful scribe.
This story is not wrong. It is just incomplete in a specific, dangerous way. A 95% accurate parser is impressive in aggregate. In a single life-changing document, the 5% is where someone’s freedom or treatment lives.
The Assumption Underneath the Pipeline
The hidden assumption is that parser accuracy is a uniform property: that if a tool scores well on a benchmark, it behaves reasonably on your documents. The data does not support this. The Applied AI PDF Benchmark found parser accuracy varying by more than 55 points across document types, with legal contracts landing around 95% and academic papers dropping into the 40–60% range. The same tool can be excellent on one corpus and unusable on another.
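The practical consequence is that aggregate benchmark scores are not transferable: you have to measure the parser on your own corpus, broken out by document type. Below is a minimal evaluation-harness sketch under stated assumptions: you hold a small set of hand-checked (doc_type, raw, expected_text) triples, parse_fn is whatever parser you are auditing, and difflib's character-level ratio is one cheap stand-in for a real extraction metric.

```python
import difflib
from collections import defaultdict

def parse_accuracy(parse_fn, samples):
    """Score a parser per document type against hand-checked ground truth.

    samples: iterable of (doc_type, raw_document, expected_text) triples.
    parse_fn: callable taking a raw document, returning extracted text.
    Returns mean similarity per doc_type, in [0.0, 1.0].
    """
    scores = defaultdict(list)
    for doc_type, raw, expected in samples:
        extracted = parse_fn(raw)
        # Character-level similarity as a cheap proxy for extraction fidelity.
        sim = difflib.SequenceMatcher(None, extracted, expected).ratio()
        scores[doc_type].append(sim)
    return {t: sum(v) / len(v) for t, v in scores.items()}

# Usage sketch: a parser that tops a benchmark can still fail your corpus.
# report = parse_accuracy(my_parser, labeled_samples)
# for doc_type, acc in sorted(report.items(), key=lambda kv: kv[1]):
#     print(f"{doc_type}: {acc:.1%}")
```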
Worse, errors are not independently distributed. Industry analyses of PDF parsers describe the cascade plainly: a single layout-detection mistake near the top of a document can corrupt entire downstream sections, because chunking and retrieval both inherit the broken structure. The pipeline does not gracefully degrade. It silently propagates.
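The cascade is easy to reproduce in miniature. In this toy sketch (the heading/body labels are invented, not any real layout model's output), a heading-based chunker sees the same contract page twice; the only difference is one heading misclassified as body text, and the two clauses silently merge into a single chunk that every downstream retrieval hit inherits.

```python
def chunk_by_headings(lines):
    """Group (label, text) lines into chunks, starting a new chunk at each heading."""
    chunks, current = [], []
    for label, text in lines:
        if label == "heading" and current:
            chunks.append(" ".join(current))
            current = []
        current.append(text)
    if current:
        chunks.append(" ".join(current))
    return chunks

page = [
    ("heading", "1. Indemnification."),
    ("body",    "Party A shall indemnify Party B for all claims."),
    ("heading", "2. Limitation of Liability."),
    ("body",    "Liability is capped at fees paid."),
]

# One layout-detection mistake: the second heading is read as body text.
corrupted = [(lbl if i != 2 else "body", txt) for i, (lbl, txt) in enumerate(page)]

print(chunk_by_headings(page))       # two chunks, one per clause
print(chunk_by_headings(corrupted))  # one merged chunk: the liability cap is
# now embedded inside the indemnification clause, and every retrieval hit
# on either clause inherits the confusion.
```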
Now layer in the clinical evidence. A 2026 medRxiv review of real-world clinical RAG concluded that most suboptimal responses are attributable to faulty source-document retrieval, not to model hallucination. That is the field telling itself, quietly, that the dominant failure mode is upstream of the model — and that we have spent two years aiming our safety effort at the wrong layer.
The Stanford RegLab study of leading legal AI tools found end-to-end hallucination rates between 17% and 33%, with Lexis+ AI accurate on 65% of queries and Westlaw AI-Assisted Research accurate on only 42% (Stanford HAI). Parsing is one root cause among several — chunking, retriever, generator all contribute — but in a domain where one wrong citation can sink a case, end-to-end accuracy this brittle is not a routine engineering shortfall. It is a governance crisis dressed in benchmark language.
What the Archivists Already Knew
There is an older profession that has thought about this for centuries — archivists, paleographers, medical records librarians, court reporters. They never assumed transcription was neutral. They built professional ethics around the act of rendering a source faithfully: provenance chains, version control, redundant transcription, certified copies. When stakes were high, they did not trust a single reader. They triangulated.
Software has mostly forgotten this lineage. We treat ingest as a pipeline stage rather than as a custodial act. But the moral structure is the same. To convert a document into machine-readable form is to take custody of someone’s record — their contract, their diagnosis, their prospectus — and become responsible for what survives the conversion. Document parsing is not an engineering problem with ethical side effects. It is an ethics problem that happens to require engineering. We just outsourced the ethics by calling it a library.
Thesis: Parsing Is Triage
Thesis: in any high-stakes RAG system, document parsing is the moral choke point — the place where what the system can and cannot say truthfully is decided long before the model speaks.
Once you accept that, several uncomfortable things follow. The question “is the model accurate?” is the wrong unit of evaluation. The right question is whether the document, as the system actually saw it, still carries the meaning the source intended. Almost no commercial pipeline reports this. Users see polished answers; nobody shows them the messy intermediate representation.
Regulatory frameworks already point this direction without naming it. NIST’s Generative AI Profile (NIST AI 600-1, July 2024) calls out information integrity and retrieval poisoning as named risks under the AI RMF. The OWASP Top 10 for LLM Applications added “Vector and Embedding Weaknesses” as a distinct RAG-specific category in its 2025 update. FINRA’s 2026 Annual Regulatory Oversight Report identifies summarization and information extraction from unstructured documents as the top GenAI use case at member firms. The EU AI Act, with its high-risk obligations now phased to December 2, 2027 for Annex III standalone systems and August 2, 2028 for medical devices (European Commission), will make these expectations enforceable rather than aspirational.
The early case law of AI in law was not really about parsing. Mata v. Avianca (Justia, S.D.N.Y., June 22, 2023) sanctioned attorneys $5,000 for citing fabricated cases generated by direct ChatGPT use, no RAG in sight. The next generation of cases will be subtler — systems that parsed real documents, retrieved real passages, and still produced wrong answers because the document layer quietly deformed the source. As far as the public record shows, no enforcement action yet hangs harm specifically on a document-parsing error. The absence is not reassurance. It is a clock.
What This Costs Us in Courtrooms, Clinics, and Trading Floors
What follows from accepting parsing as a moral choke point is not a checklist. It is a different posture toward what these systems are.
It means asking, before a single retrieval call, whether the documents you are ingesting can be parsed reliably enough for the stakes of the use case — and refusing the project if the answer is no. It means surfacing the parsing layer to the people who bear the consequences: the clinician should be able to see how the chart was extracted, the lawyer how the contract was segmented, the compliance officer which fields the system could not confidently read. It means treating “the parser failed silently” as a reportable incident, not a backlog ticket.
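One way to operationalize that posture is to make the parse a first-class, inspectable artifact rather than a transient step. A minimal sketch, assuming your parser can report per-region confidence; the ParseRecord shape, its field names, and the 0.9 floor are illustrative choices, not a standard.

```python
from dataclasses import dataclass, field

@dataclass
class ParseRecord:
    """Custody record for one ingested document: what was read, how, and how well."""
    source_id: str
    parser_name: str
    parser_version: str
    extracted_text: str
    region_confidences: dict  # region id -> parser-reported confidence
    flags: list = field(default_factory=list)

CONFIDENCE_FLOOR = 0.9  # illustrative threshold; set it to the stakes of the use case

def ingest(record: ParseRecord) -> ParseRecord:
    """Refuse to pass a silently degraded parse downstream."""
    low = {r: c for r, c in record.region_confidences.items() if c < CONFIDENCE_FLOOR}
    if low:
        record.flags.append(f"low-confidence regions: {sorted(low)}")
        # A reportable incident, not a backlog ticket: stop, surface, review.
        raise ValueError(
            f"{record.source_id}: parser {record.parser_name} "
            f"{record.parser_version} could not confidently read {sorted(low)}"
        )
    return record
```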
It also means the harder cultural shift: accepting that the ethics of high-stakes RAG cannot be delegated to model providers, retrieval libraries, or the optimism of vendor benchmarks. The custodial responsibility belongs to whoever puts the system in front of a person whose life it affects, and who answers when the parser was wrong, the output was confident, and the decision has already been made.
Where This Argument Could Break
This argument leans on the claim that parsing errors dominate the upstream causes of high-stakes RAG failure. That claim is directionally supported by clinical research and benchmark variance, but a clean global statistic does not yet exist publicly — vendor and community benchmarks differ in methodology and corpus. If future independent evaluation showed that, in mature pipelines, parser errors fall well below retriever and generator errors, the moral weight I am placing on the parsing layer would need to redistribute. The custodial duty would not vanish, but its center of gravity would move.
The Question That Remains
We built systems that read for us at industrial scale and called the reading “preprocessing.” If the most consequential mistakes happen in the layer we treat as plumbing, the question is no longer technical. It is: who is willing to be the named human accountable for what the parser did to the document — before the model ever saw it?
Disclaimer
This article is for educational purposes only and does not constitute professional advice. Consult qualified professionals for decisions in your specific situation.
Ethically, Alan
AI-assisted content, human-reviewed. Images AI-generated.