ALAN · Opinion · 10 min read

When the Graph Decides What's True: Bias in Knowledge Graph RAG

Knowledge graph nodes and edges arranged like a courtroom diagram, suggesting a system that quietly decides which facts count.

The Hard Truth

Vector RAG was accused of being a black box. The industry’s answer was to add a graph — nodes, edges, schemas, citations you can trace. But what if the graph isn’t more accountable than the vectors? What if it just dresses old judgments in formal clothes and makes them harder to argue with?

A Knowledge Graph feels like progress. You can point at a node, follow an edge, read the relationship in plain language — “Marie Curie, Field, Physics” — and the answer arrives with what looks like a paper trail. Compared to a vector store coughing up a text fragment with a similarity score attached, this feels grown-up. It feels governable. That feeling is doing more work than the architecture deserves.

The Question Hidden Inside the Schema

There is a question that almost never appears in the architecture diagrams: who decides which entities and relationships make it into the graph in the first place? Schema design is presented as a neutral technical task — pick your node types, define your edges, run entity extraction over your corpus, validate. But every choice in that pipeline is a choice about what the system will be able to think about. An entity type that is not defined cannot be retrieved. A relationship that is not modelled cannot be reasoned over. A community that the community-detection algorithm failed to surface, for all practical purposes, does not exist for the model downstream.
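The filtering described above can be made concrete with a minimal sketch. The schema, entity names, and confidence threshold here are all hypothetical, invented for illustration — the point is only that exclusions happen silently, with no record left for anyone downstream:

```python
# A minimal sketch (hypothetical schema and entities) of how a fixed
# schema silently filters extraction output: anything outside the
# allowed types never enters the graph, and nothing records the loss.

ALLOWED_NODE_TYPES = {"Person", "Organisation", "Field"}  # the schema's choice

def ingest(extracted_entities):
    graph_nodes = []
    for name, etype, confidence in extracted_entities:
        if etype not in ALLOWED_NODE_TYPES:
            continue  # dropped: no log, no trace, no downstream visibility
        if confidence < 0.8:
            continue  # threshold choice, equally invisible later
        graph_nodes.append((name, etype))
    return graph_nodes

extracted = [
    ("Marie Curie", "Person", 0.97),
    ("Physics", "Field", 0.95),
    ("Solvay Conference", "Event", 0.91),        # no Event type in schema
    ("Radium Institute", "Organisation", 0.62),  # below threshold
]

print(ingest(extracted))  # only the first two entities survive ingestion
```

Both `continue` statements are locally defensible engineering decisions; neither leaves an artefact that a later audit could interrogate.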

This is not a bug to be patched. It is the nature of formalisation. The trouble is that knowledge-graph RAG pipelines now scale this formalisation across enterprise sensemaking, public-sector summarisation, and consumer assistants. The schema becomes infrastructure. And infrastructure, once it ossifies, decides what is true by deciding what is even visible.

The Case for the Graph as Honest Architecture

The steelman is real, and I want to be fair to it. Vendors and researchers who advocate Microsoft GraphRAG and similar systems are not naive about hallucination. They argue, with reason, that a retrieval path through a graph is auditable in a way that a Multi-Hop Reasoning chain over raw embeddings is not. The retrieval trace becomes an explicit sequence of nodes and edges that can be logged, replayed, and aligned with W3C provenance standards (Neo4j Blog). On its own BenchmarkQED sensemaking suite, Microsoft Research has reported accuracy on the order of 86% versus roughly 32% for baseline RAG — a gap that is hard to ignore.

If you have ever tried to defend a vector-RAG output to a regulator or an internal auditor, the appeal of a Neo4j substrate or its peers is obvious. A graph gives you something to point at. You can show the path. You can store it. You can argue about it.

That is genuine progress, and it would be intellectually dishonest to pretend otherwise. The question is what kind of progress it is.

What the Audit Trail Actually Audits

Here is the assumption hiding inside the steelman: that an auditable path is the same as an accountable answer. It is not. An audit trail tells you which nodes the system traversed. It does not tell you why those nodes exist, who decided their boundaries, whose relationships were considered worth modelling, or which entities were quietly dropped during extraction because they fell below a confidence threshold or did not fit the schema.

The provenance argument is, for now, mostly a property of the architecture rather than a measured outcome of deployments. The 2025 ACM Web Science study on Wikidata representations found that the canonical public knowledge graph systematically separates Global North from Global South entities — not because anyone designed it to, but because the contributors, sources, and modelling priorities shaped what got encoded. An ACL Anthology survey documents that Wikidata contains more than four times as many male as female entities. Knowledge-graph embeddings inherit and propagate these patterns: research surfaced by Amazon Science finds that KGE models encode associations in which men are more likely to be predicted as “bankers” and women as “homekeepers”, and the same pattern repeats across religion, ethnicity, and nationality.

A Cypher query log is not going to surface any of that. The audit trail audits the traversal, not the worldview the graph was built to express.
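To make the traversal-versus-worldview distinction tangible, here is a toy sketch (invented graph and log format, not any vendor's API) of what such a log actually contains. The record is replayable and citable, yet every question about why these nodes and relations exist lives outside it:

```python
# A sketch of why a traversal log answers "which path?" but never
# "why these nodes?": the record contains only the hops taken, not the
# schema decisions or dropped entities behind them.

graph = {
    ("Marie Curie", "FIELD"): "Physics",
    ("Physics", "SUBFIELD"): "Radioactivity",
}

def traverse(start, relations):
    """Follow a chain of relations and return a citable path string."""
    path, node = [start], start
    for rel in relations:
        node = graph[(node, rel)]
        path.append(f"-[{rel}]->")
        path.append(node)
    return " ".join(path)

audit_log = traverse("Marie Curie", ["FIELD", "SUBFIELD"])
print(audit_log)
# The log says nothing about who defined FIELD as a relation, or which
# entities never made it into `graph` in the first place.
```

Everything the log omits — the schema, the extraction thresholds, the entities that were never encoded — is exactly what the bias research above is about.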

And then we layer LLMs on top. Recent surveys of bias in LLM-augmented KG systems are blunt: extraction pipelines compound social bias, hallucination, prompt sensitivity, and coverage gaps on top of whatever structural bias the underlying graph already had. Each layer is locally defensible. The composition is not.

A Different History Tells a Different Story

There is a useful historical parallel that the field rarely invokes. Bureaucracies have been building knowledge graphs for centuries — they just called them registers, taxonomies, classifications, codes. The colonial census, the medical nosology, the labour ministry’s job catalogue, the credit bureau’s category tree. Each was sold as a neutral instrument for organising what already existed. Each, in practice, decided what counted as a person, a profession, a disease, a risk.

Historians and sociologists of classification — Bowker and Star most famously — have shown that these systems do not merely describe; they constitute. Once a classification is wired into administrative infrastructure, it becomes the ground truth that downstream institutions defer to. The classification’s contingencies harden into facts. Disputing a classification means disputing the institution, which means most people stop disputing.

GraphRAG is not separate from this lineage. It is the latest chapter in it, written at machine speed and machine scale, with fewer of the slow human checkpoints that occasionally forced earlier systems to reform.

The Thesis the Vendors Won’t Frame

Thesis: GraphRAG is not necessarily more accountable than vector RAG; under realistic deployment conditions it is often just more confidently wrong, because the formal apparatus of nodes and edges launders contested choices into infrastructure.

This is not a rejection of the technology. It is a rejection of the framing. The provenance benefits are real and worth pursuing, but they are potential — they materialise only when the deployment actually logs provenance, exposes the schema to scrutiny, treats extraction confidence as a first-class governance signal, and lets downstream users contest the modelling. Independent analysis at GraphRAG-Bench (ICLR'26) found that on many real-world tasks the graph variant underperforms vanilla RAG; the benefit is task-specific, not a free lunch. Even LightRAG’s efficiency gains, reported by its own authors, do not change the underlying epistemic question — they just make the contested infrastructure cheaper to run.

When a system can cite a path through a graph that itself encodes a four-to-one gender skew, the citation makes the bias more authoritative, not less. Confidence has been added. Accountability has not.

Substrate caveats worth holding alongside this argument:

  • Neo4j CVEs (2026): Recent advisories — CVE-2026-1337 (XSS via unicode in query logs), CVE-2026-1622 (info disclosure), CVE-2026-1497 (auth bypass in composite-db namespace resolution) — are reminders that the trust assumptions go deeper than the schema. The substrate has its own security posture.
  • Microsoft GraphRAG pipeline shift: The original exhaustive community-summarisation pipeline is being supplanted by LazyGraphRAG for cost reasons. Older guides describe a system that is partially outdated.

The Questions We Owe the People Downstream

I do not think the answer is to abandon graphs. Graphs are useful precisely because they force explicitness. The answer is to take the explicitness seriously — to treat schema choices, extraction thresholds, and community partitions as governance artefacts, not as engineering details.

What would it mean for a deployment to do that honestly? It would mean publishing the schema and the extraction prompts. It would mean logging not only the retrieval path but the entities that were considered and dropped. It would mean periodic external review of who is over- and under-represented in the graph relative to the population the system is meant to serve. It would mean treating coverage gaps as findings worth disclosing, not embarrassments to hide.
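One of those prescriptions — logging the entities that were considered and dropped — can be sketched directly. The record shape below is hypothetical, not drawn from any existing framework; the design point is that a drop becomes a first-class, reviewable artefact rather than a silent `continue`:

```python
# A sketch of "logging the entities considered and dropped" as a
# governance artefact (hypothetical record shape): every exclusion is
# recorded with its reason, so periodic review can ask what is being
# left out of the graph, and why.

import json

DROP_LOG = []

def record_drop(name, etype, reason, confidence=None):
    DROP_LOG.append({
        "entity": name,
        "type": etype,
        "reason": reason,          # e.g. "type_not_in_schema"
        "confidence": confidence,  # kept so thresholds can be audited later
    })

record_drop("Solvay Conference", "Event", "type_not_in_schema", 0.91)
record_drop("Radium Institute", "Organisation", "below_threshold", 0.62)

# An external reviewer now has data, not just an absence.
print(json.dumps(DROP_LOG, indent=2))
```

The cost of such a log is trivial; as the next paragraph argues, the resistance to it is institutional, not technical.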

None of this is technically hard. Most of it is institutionally hard, because it transfers power from the team that builds the graph to the people the graph is built about.

Where This Argument Is Most Vulnerable

I want to name where I could be wrong. If a generation of GraphRAG deployments emerges in which provenance is genuinely logged, schemas are open, and extraction biases are tracked and remediated — and if independent benchmarks show that this transparency materially reduces harmful outputs in production — then my critique softens considerably. The architecture would have earned its accountability claim rather than borrowing it from marketing copy. I have not seen that yet, but I would welcome being shown it.

I would also reconsider if the field developed convincing evidence that LLM-augmented extraction can surface and disclose its own coverage failures — not just compound them silently. That would be a meaningful shift in what these systems are.

The Question That Remains

The graph is not the problem. The pretence that the graph is neutral is the problem. If we cannot answer who decided which entities count, we have not built a more accountable system — we have only built one whose unaccountability is harder to see.

Disclaimer

This article is for educational purposes only and does not constitute professional advice. Consult qualified professionals for decisions in your specific situation.

AI-assisted content, human-reviewed. Images AI-generated.