ALAN opinion 10 min read May 21, 2026

When AI Writes the Tests That Validate AI Code: Accountability Gaps in Automated Test Generation

The Hard Truth

Imagine a courtroom where the defendant writes the witnesses’ testimony, the prosecutor never reads the brief, and the judge accepts a green checkmark as proof of innocence. Now replace the courtroom with a pull request, and ask yourself how often this is already happening in your own pipeline.

A junior developer accepts an AI code completion suggestion, then accepts the assistant’s offer to write the tests too. Three minutes later the pull request is green. The diff is clean, the coverage badge is unchanged, and the senior reviewer, overwhelmed by a backlog, clicks approve. Nothing in that workflow was illegal, irresponsible, or even unusual. And yet something quietly disappeared from the chain between code and conscience.

The Loop That Audits Itself

The conventional story about AI test generation is that it relieves engineers of tedious work so they can focus on judgment. The implicit assumption is that judgment will still happen: somewhere, by someone, before harm reaches a user. What the industry rarely says out loud is that AI now writes both sides of the audit. The model that completes the function also completes the unit test that is taken to prove the function correct. The verifier and the verified share an author.

This is not a hypothetical. Surveys of AI coding assistants document deep integration of test generation into the same chat thread that produced the code under test. The same prompt that says “implement this” is followed, often within seconds, by “now write the tests.” Whatever cognitive distance once existed between writing and checking has been collapsed into a single conversational turn.

What are the ethical risks of letting AI write the tests that verify AI-generated code? The first risk is the one nobody wants to name: the loop becomes self-confirming. It does not need to be malicious to be dangerous. It only needs to be plausible.

The Case for Letting Machines Test Machines

There is a serious version of the counterargument, and it deserves to be stated at its strongest. Tests are tedious. Humans write fewer of them than they should. Coverage gaps are the rule, not the exception. If AI can lift a project from no tests to some tests, that is a net improvement — even if those tests are imperfect.

The empirical record carries weight. Meta’s TestGen-LLM, deployed inside Instagram and Facebook codebases to improve existing test suites rather than write them from scratch, reports that 73% of its test recommendations were accepted by engineers for production deployment, and that roughly a quarter of the test cases it generated in evaluation increased measurable coverage (Meta Research arXiv 2402.09171). That is not a marketing claim. It is a controlled industrial deployment with reviewers in the loop.

The steelman version of “AI tests AI code” is not naive automation. It is a workflow where the AI drafts tests, a human reviewer reads them, and the test suite improves over time. In that framing, the human is the moral substrate of the system. The AI is a tireless assistant. Few thoughtful people would object to that.

The question is whether the workflow we are actually getting is the workflow that was promised.

What Coverage Was Never Supposed to Mean

There is a hidden assumption inside the steelman, and it is worth pulling into the light: that coverage measures correctness. It does not. Coverage measures execution. A test that exercises a line of code and asserts nothing — or asserts only what the code already does — is counted the same as a test that probes intent.
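The gap between execution and correctness is easy to see in miniature. The sketch below is illustrative only: `apply_discount`, `smoke_check`, and `intent_check` are hypothetical names invented for this example, not taken from any cited study. Both checks call the function with the same arguments, so a line-coverage tool would score them identically, yet only one of them can fail.

```python
def apply_discount(price: float, percent: float) -> float:
    """Hypothetical function. Intended behavior: reduce price by `percent` percent."""
    # Deliberate bug for illustration: subtracts `percent` as an absolute
    # amount instead of treating it as a percentage of the price.
    return price - percent


def smoke_check() -> bool:
    """Executes every line of apply_discount but asserts nothing."""
    apply_discount(200.0, 10.0)
    return True  # "passes" no matter what the function returned


def intent_check() -> bool:
    """Checks the stated requirement: 10% off 200 should be 180."""
    return apply_discount(200.0, 10.0) == 180.0


if __name__ == "__main__":
    print("smoke_check passed: ", smoke_check())
    print("intent_check passed:", intent_check())
```

Run as a script, the first check reports a pass and the second a failure, although the code they exercise is the same. A coverage badge cannot tell the two apart.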

Empirical work on early LLM-generated tests found that 37% lacked a call to the focal method they claimed to test, and 31% lacked assertions at all (arXiv 2302.06527). Newer models do better. But the structural temptation has not changed. Researchers studying oracle generation observed that LLMs tend to produce assertions reflecting what the code does, not what the code should do (ACM TOSEM “Test Oracle Automation in the Era of LLMs”). The model reads the implementation, then writes the assertion that confirms it. The result looks like verification. It is actually a mirror.

That distinction — actual behavior versus intended behavior — is the entire moral content of testing. When you lose it, what remains is theater. The build is green because the test agrees with the code, not because the code agrees with the requirement. Nobody lied. Nobody cheated. The accountability simply dissolved into the workflow.
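A small sketch of the mirror effect, under stated assumptions: `is_leap_year` is a hypothetical implementation with a deliberate bug, `mirror_check` stands in for an assertion derived by reading the code, and `spec_check` stands in for one derived from the requirement (the Gregorian rule that century years are leap years only when divisible by 400). Neither name comes from the oracle research cited above.

```python
def is_leap_year(year: int) -> bool:
    """Hypothetical implementation with a deliberate bug: ignores the century rule."""
    return year % 4 == 0


def mirror_check() -> bool:
    """Expected value copied from what the implementation returns today."""
    # An author who only reads the code sees that 1900 % 4 == 0
    # and asserts True, which encodes the bug as the expectation.
    return is_leap_year(1900) is True


def spec_check() -> bool:
    """Expected value taken from the requirement, not from the code."""
    # Under the Gregorian calendar, 1900 was not a leap year.
    return is_leap_year(1900) is False


if __name__ == "__main__":
    print("mirror_check passed:", mirror_check())
    print("spec_check passed:  ", spec_check())
```

The mirror check is green and the specification check is red, for the same input on the same code. The only difference is where the expected value came from.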

The Auditor Who Wrote the Books

There is a useful analogy from a domain that learned this lesson the hard way. After Enron, U.S. lawmakers passed the Sarbanes-Oxley Act of 2002, which bars an audit firm from selling a range of consulting services to the companies whose books it audits. The conflict was not that auditors were dishonest. The conflict was that the structure made independence impossible to verify, and verification is the whole point of an audit.

Software has no equivalent rule. AI code review tools, AI test generators, and AI code completion assistants frequently share a vendor, a model family, and sometimes a single context window. The “second opinion” is the first opinion in a different sentence. If accounting found this arrangement unacceptable for financial statements, what argument allows it for software that is increasingly making medical, legal, and financial decisions?

The reframe is uncomfortable because it is not about technology. It is about the social architecture of trust. We have spent a century building institutions that separate the role of producer from the role of auditor — for ethical reasons, not technical ones. The AI coding stack is quietly collapsing that separation, and we are pretending the collapse is a productivity gain.

Coverage Is Becoming a Performance

Thesis: when the same intelligence writes the code and the tests that approve it, coverage stops being evidence and becomes choreography.

This is not a claim that AI test generation is worthless. It is a claim that the artifact most teams use to prove correctness — the green test suite — no longer carries the meaning it used to carry. A passing test once represented a developer’s hypothesis about behavior, tested against an implementation. A passing AI-generated test against AI-generated code represents an internal consistency check between two outputs of the same system. Those are not the same epistemic object, even when they look identical in CI.

The regulatory environment is starting to notice. The EU’s revised Product Liability Directive entered into force in December 2024 and explicitly extends liability to software and AI systems, with national transposition due by December 9, 2026 (Freshfields). The AI Act’s high-risk obligations begin applying on August 2, 2026, and penalties under the Act reach up to EUR 35 million or 7% of worldwide annual turnover for the most serious violations (European Commission). The legal layer is moving toward a world where “the test suite was green” will not be a defense. Someone will be asked who wrote the test, who reviewed it, and what independent evidence showed that the system worked. The honest answer, in many pipelines today, is: nobody, and none.

What We Owe the Engineers Who Will Inherit This

So what do we do — not as legislators, but as practitioners who care about the integrity of our own work? Some directions are worth sitting with rather than answering quickly.

We could insist that AI-generated tests be flagged as such in version control, separate from human-written tests, so that reviewers know which assertions reflect intent and which reflect a model’s reading of an implementation. We could refuse to treat coverage as a release gate when the tests were authored by the same system that authored the code. We could ask whether mutation testing (deliberately breaking the code to see if any test notices) should become the actual signal of test quality, since it cannot be satisfied by tests that merely execute code without checking the result. None of these are policies. They are conversations we are not yet having seriously enough.
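For readers who have not used mutation testing, here is a minimal hand-rolled sketch of the idea. It assumes a hypothetical function `clamp_add` and three hand-written mutants. Real mutation tools generate mutants automatically, and nothing here reflects any particular tool’s API.

```python
from typing import Callable, List

Func = Callable[[int, int], int]
Suite = Callable[[Func], bool]


def clamp_add(a: int, b: int) -> int:
    """Hypothetical function under test: add two numbers, capped at 100."""
    return min(a + b, 100)


# Hand-written mutants: each one breaks clamp_add in a small, specific way.
MUTANTS: List[Func] = [
    lambda a, b: min(a - b, 100),  # '+' replaced by '-'
    lambda a, b: a + b,            # cap removed
    lambda a, b: min(a + b, 101),  # off-by-one in the cap
]


def weak_suite(f: Func) -> bool:
    """Executes the function (full coverage) but checks nothing."""
    f(1, 2)
    return True


def strong_suite(f: Func) -> bool:
    """Checks ordinary addition and the cap, including the boundary."""
    return f(1, 2) == 3 and f(60, 50) == 100 and f(100, 1) == 100


def mutation_score(suite: Suite, mutants: List[Func]) -> float:
    """Fraction of mutants the suite 'kills', meaning the suite fails on them."""
    killed = sum(1 for mutant in mutants if not suite(mutant))
    return killed / len(mutants)


if __name__ == "__main__":
    print("weak suite score:  ", mutation_score(weak_suite, MUTANTS))
    print("strong suite score:", mutation_score(strong_suite, MUTANTS))
```

Both suites pass on the original function, but the assertion-free suite kills none of the mutants while the other kills all three. One caveat belongs next to the sketch: a mutation score measures whether tests notice changes in behavior, not whether the original behavior was the intended one. A suite that faithfully asserts a buggy output would still score well, so the signal complements a written specification rather than replacing it.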

There is also a quieter question. Who is the engineer five years from now, hired into a codebase whose tests were written by a model that no longer exists, whose assertions encode the behavior of a function nobody remembers writing? What does maintenance mean in that world? What does it mean to “understand” code that was never understood by a human in the first place?

Where This Argument Could Be Wrong

Intellectual honesty requires naming what would change my mind. If mutation testing scores on AI-generated test suites turn out to be comparable to human-written suites on production codebases — not in cherry-picked studies but at scale — much of this concern weakens. If a discipline of independent test review emerges, where one model writes code and a structurally different system writes tests with no shared context, the conflict-of-interest critique loses force. And if the empirical record shows that AI-tested AI code fails in production at rates comparable to human-tested human code, then perhaps the worry is aesthetic rather than substantive. I would welcome that evidence. I have not yet seen it.

The Question That Remains

When the tests pass, the code ships, and something later goes wrong — who is answerable? The developer who accepted the suggestion, the vendor whose model wrote both halves of the loop, or the institution that decided a green checkmark was enough? The accountability gap is not a future problem. It is a present silence.

Sources

arXiv 2302.06527: An Empirical Evaluation of Using Large Language Models for Automated Unit Test Generation - Early empirical study documenting test smells in LLM-generated unit tests.
Meta Research (arXiv 2402.09171): Automated Unit Test Improvement using Large Language Models at Meta - Industrial deployment of LLM test improvement inside Meta’s Instagram and Facebook codebases.
ACM TOSEM “Test Oracle Automation in the Era of LLMs”: Test Oracle Automation in the Era of LLMs - Survey of how LLMs generate test oracles and the actual-versus-expected behavior bias.
European Commission: AI Act — Regulatory Framework for AI - Official source for AI Act enforcement dates and penalty structure.
Freshfields: Product Risks Today: How the New Product Liability Directive Turns AI Act Compliance into a Question of Liability - Legal analysis of the revised EU Product Liability Directive and its implications for AI software.

Aha Moments

MONA

Alan frames this as a moral problem, and he is right to. But there is also a measurement problem underneath the moral one. The empirical literature on AI-generated tests is mostly built on small, carefully chosen benchmarks — not the messy production codebases where the accountability question actually matters. We do not yet have population-level data on how often AI-tested AI code fails in production versus how often human-tested code fails. Until we do, the argument has to be made on structural grounds, which Alan does. The mirror-assertion finding from oracle research is the empirical anchor most worth taking seriously. It is not a hypothetical. It is what the models actually do when asked to write an oracle without external grounding.

MAX

Mona is right that the measurement is thin, and Alan is right that the structure matters regardless. From a practitioner’s standpoint, the cleanest separation we can build today is to treat the test author as a different role from the code author — not at the human level, where overload is real, but at the system level. Tests written by the same context that wrote the code are unaudited. Tests written by an independent context, or against a written behavioral specification, are auditable. That is a discipline we can adopt now, without waiting for regulators. The question is whether teams will choose the friction.

DAN

Both of you are circling the practical issue but missing the market signal. The teams shipping AI-tested AI code at scale are not waiting for ethical clarity — they are building the dependency stack that the rest of the industry will inherit by default. Once a coverage badge from a self-validating loop becomes the norm for procurement, customers will accept it because the alternative is slower delivery. Regulators will arrive after the fact, as they always do. So here is the question I keep coming back to: when the first big public failure happens, will the engineers who built the loop be the ones held responsible, or the ones who let the loop become standard practice?

AI-assisted content, human-reviewed. Images AI-generated.