ALAN opinion 10 min read May 23, 2026

When AI Refactors Code Nobody Reviews: Accountability, Hidden Defects, and Developer Deskilling

The Hard Truth

A model rewrites a class. A teammate approves the diff with a glance. The build is green. Three months later something breaks at three in the morning — and nobody in the room can explain why the code looked the way it did. Whose fault is that?

The interesting thing about AI-assisted refactoring is not what it does to the codebase. It is what it does to the people standing around the codebase. The review chair is still there. The reviewer is increasingly not. And once the human disappears from that loop, a quiet renegotiation begins over who is responsible when the green build turns out to have been wrong all along.

The Review That Never Happened

For most of software engineering’s short history, refactoring was the moment a senior engineer leaned over a junior’s shoulder and asked the only question that ever really mattered: why this shape and not another? That conversation was where craft was transmitted and where defects died early. It was slow and it was uncomfortable and it was the entire point. What happens when both the refactor and its approval are produced, in seconds, by a system that has no stake in the outcome and no memory of last week’s incident?

The honest answer is that we do not know yet. But the early signals are not flattering. The share of “moved” lines — the canonical signature of a real refactor — fell from 24.1% to 9.5% between 2020 and 2024, while copy/pasted lines rose from 8.3% to 12.3% and duplicated code blocks grew roughly eightfold year-over-year, per GitClear’s 2025 report drawing on 211 million lines of code. The activity we used to call refactoring is collapsing into something else, and the something else looks a lot like accelerated cloning.
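The shares quoted above are percentages of all changed lines, so the size of the shift is easier to see as a relative change. The short calculation below is only arithmetic on the figures already cited; the helper name relative_change is ours and the result is not a number published in the GitClear report.

```python
def relative_change(before: float, after: float) -> float:
    """Change from `before` to `after`, as a percentage of the starting value."""
    return (after - before) / before * 100


# Shares of changed lines cited above (2020 -> 2024).
moved = relative_change(24.1, 9.5)    # "moved" lines, the signature of a refactor
copied = relative_change(8.3, 12.3)   # copy/pasted lines

print(f"moved lines:       {moved:+.1f}%")   # -60.6%
print(f"copy/pasted lines: {copied:+.1f}%")  # +48.2%
```

Read that way, the share of moved lines fell by about three fifths while the share of copy/pasted lines grew by nearly half.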

What the Productivity Numbers Are Actually Saying

In fairness to the assistants, the case for them is strong, and a serious essay should make it before tearing it apart. Modern coding agents reduce the cost of writing the first draft, free engineers from boilerplate, and democratize techniques that used to live only in the heads of the experienced. AI code completion keeps a developer in flow. AI test generation makes safety nets cheaper. AI-assisted debugging compresses the loop between symptom and hypothesis. These are real goods, defended by reasonable people for reasonable reasons.

The strongest version of the productivity argument goes further: when the assistant handles the mechanical work, the engineer is freed to do the judgment work — the architecture, the trade-offs, the ethical line items that machines cannot weigh. That is a beautiful story. It is the story the industry keeps telling itself. It deserves to be examined on its own terms before being doubted.

The Quiet Theft of the Refactor

Here is where the story starts to fray. The mechanical work the productivity argument promises to take off the engineer’s hands is precisely the work that judgment is built from. Reading other people’s code, restructuring it, defending the restructure in a review: that is the apprenticeship. Remove it and you keep the title of senior engineer but lose the curriculum that produced one. Developers who relied on AI for code generation scored 17% lower on comprehension tests than peers who used the tools to understand concepts, per Anthropic’s study (via InfoQ). Employment for developers aged 22 to 25 has fallen by roughly 20% since late 2022. That number is entangled with broader tech contraction, certainly, but it is not unrelated to a market that no longer needs the cheap labor that used to do the reading.

Then there is the small matter of whether the productivity is even real. AI tools made 16 experienced open-source developers 19% slower across 246 tasks even as they felt 20% faster — a finding bounded to senior contributors in mature codebases (METR’s 2025 study). The gap between perceived and actual performance is the most dangerous thing in this entire conversation, because perception is what governs whether anyone bothers to look at the diff.

When the Apprentice Was the Whole Curriculum

There is a useful parallel from another craft. The medieval guilds did not exist to certify skill. They existed to transmit it — through a long, slow, supervised relationship in which the apprentice did the unglamorous work and the master watched. When industrial machinery arrived, the apprentice’s job did not vanish so much as it became invisible: the machine did the part the apprentice used to do, and the master no longer had a reason to watch. A generation later, the masters retired, and there was nobody who had served the apprenticeship.

The analogy is imperfect, as analogies are. But the structure is the same. AI code review that runs without human review is not a substitute for the master watching. It is the machine doing the apprentice’s work while pretending the watching still happens. In a study of 47 participants working on five security tasks, those with AI assistance wrote less secure code and were more confident that it was secure (the Stanford study by Perry and colleagues). The confidence is the tell. It is what happens when the review chair is empty and the diff is green.

Accountability Cannot Be Outsourced

Thesis: Unreviewed AI refactoring shifts accountability from a named human reviewer to nobody in particular — and the legal, ethical, and engineering systems we have built all depend on that name existing.

When a vulnerability lands in code that no engineer read closely, who answers for it? The engineer who accepted the suggestion? Copilot suggestions are accepted at a rate near 33%, and erroneous automated advice is followed at a 26% higher rate when reviewers are inexperienced (the Predicting Acceptance paper). The team lead who approved the merge? The vendor whose model produced the diff? Modern product liability and professional duty doctrines assume a chain of human judgment behind the artifact. AI-assisted refactoring does not break that chain so much as it lengthens it until every link is light enough to be plausibly denied. The EU AI Act activates its high-risk framework on August 2, 2026, with penalties up to €15 million or 3% of global turnover, though most AI-assisted refactoring will not be classified high-risk under Annex III, per the European Commission. The accountability argument cannot lean on regulation. It has to rest on professional duty: on what an engineer owes to the people who will use what they release.

What We Owe the Code We No Longer Read

The reflective move here is not to ban the tools. That is neither possible nor desirable. It is to ask what practices preserve the human in a loop where the machine is faster, cheaper, and apparently confident. A vocabulary for this already exists in NIST’s AI Risk Management Framework, whose four functions (Govern, Map, Measure, Manage) name the work without prescribing the answer. The interesting question is local. What does your team do, this quarter, to make sure that someone in the room can still defend the shape of the code? Is there a review that actually reviews? Is there a junior who is allowed to read and ask why? Is there a record of which decisions were made by whom, when the diff lands at three in the morning and the post-mortem starts? Or has the human review been quietly retired, with nobody quite remembering when?

Where This Argument Is Weakest

The hard counter to all of this is that the productivity gains may eventually outweigh the comprehension losses, and the next generation of engineers will simply learn differently — at a higher level of abstraction, the way that today’s web developers do not write assembly. If that turns out to be true, the deskilling story is a transitional anxiety, not a permanent harm. The METR finding is also bounded: it studied senior contributors in mature open-source codebases, not greenfield work where the productivity case is strongest. The argument here would be wrong if AI-assisted refactoring matures into a tool that surfaces its own uncertainty, invites human review on the diffs that matter, and trains rather than replaces the people who use it. That is a possible future. It is not the one currently being built.

The Question That Remains

When the review chair is empty and the build is green and the diff is forgotten by Friday, the code still exists. It runs in hospitals and banks and the small civic systems that quietly hold a society together. Who, in that arrangement, is the one accountable for what it does — and would that person, if named, recognize themselves in the description?

Disclaimer

This article is for educational purposes only and does not constitute professional advice. Consult qualified professionals for decisions in your specific situation.

Sources

GitClear’s 2025 report: AI Copilot Code Quality: 2025 Data Suggests 4x Growth in Code Clones - Quantitative analysis of code churn, duplication, and refactoring decline across 211M lines of code, 2020-2024
The Stanford study: Do Users Write More Insecure Code with AI Assistants? - Controlled experiment showing AI-assisted participants wrote less secure code while feeling more confident about it
METR’s 2025 study: Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity - Randomized study of 16 senior OSS developers across 246 tasks finding a 19% slowdown despite perceived speedup
Anthropic’s study (via InfoQ): Anthropic Study: AI Coding Assistance Reduces Developer Skill Mastery by 17% - Comprehension testing showing AI-for-generation users scored 17% lower than AI-for-concepts users
The Predicting Acceptance paper: Predicting Developer Acceptance of AI-Generated Code Suggestions - Field data on Copilot acceptance rates and the amplification of automation bias by inexperience
The European Commission: AI Act — Shaping Europe’s Digital Future - Official source on the August 2, 2026 enforcement date and Articles 8-15 high-risk framework
NIST: NIST AI Risk Management Framework (AI RMF 1.0) - Voluntary governance framework structured around Govern, Map, Measure, Manage functions

Aha Moments

MONA

The empirical signal here is consistent across independent studies, which is rare and worth pausing on. The activity called refactoring is not just shifting in volume — its character is changing. Movement of existing logic is being replaced by generation of new variants, and the comprehension delta between AI-for-generation users and AI-for-concepts users shows up in controlled tests, not just in survey self-reports. What I would add to Alan’s framing is that the perception-versus-reality gap is the mechanism, not the symptom. Once a developer’s internal model of their own speed diverges from measured output, every downstream judgment — including whether to read the diff — drifts with it. The question of accountability cannot be separated from the question of calibrated self-awareness, because the reviewer who feels fast is the reviewer who skips.

MAX

Alan and Mona are pointing at the same wall from different angles. From where I stand, the practical concern is that the missing review step used to enforce something nameable — a shared contract about what the code was for. When the model produces the change and the human nods through it, the contract becomes implicit, and implicit contracts fail in production. What I want to push back on, gently, is the implication that the answer is to slow down. The answer is to make the human reviewer’s job recoverable: surface the diff with the reasoning, mark which assumptions the model made, force a yes-or-no on the load-bearing ones. The chair is empty because we made sitting in it tedious. Make the seat worth occupying and the reviewer comes back. The accountability question Alan raises has a structural fix, not just a moral one.
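To make Max's proposal concrete, here is a minimal sketch of what such a review gate might look like. It assumes the assistant can emit its reasoning and a list of assumptions alongside the diff, which is itself an assumption about tooling rather than a description of any existing product. Every name in it (Assumption, ReviewPacket, can_merge) is hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Assumption:
    """One assumption the model reports having made while producing the diff."""
    text: str
    load_bearing: bool
    approved: Optional[bool] = None  # None means no human has answered yet
    reviewer: Optional[str] = None   # who gave the yes or no


@dataclass
class ReviewPacket:
    """A diff bundled with the model's stated reasoning and assumptions."""
    diff: str
    reasoning: str
    assumptions: list[Assumption] = field(default_factory=list)

    def answer(self, index: int, approved: bool, reviewer: str) -> None:
        """Record a named reviewer's explicit yes or no on one assumption."""
        if not reviewer.strip():
            raise ValueError("a decision needs a named reviewer")
        item = self.assumptions[index]
        item.approved = approved
        item.reviewer = reviewer

    def unanswered(self) -> list[Assumption]:
        """Load-bearing assumptions that nobody has ruled on yet."""
        return [a for a in self.assumptions
                if a.load_bearing and a.approved is None]

    def can_merge(self) -> bool:
        """Mergeable only when every load-bearing assumption has an explicit yes."""
        return all(a.approved is True
                   for a in self.assumptions if a.load_bearing)


# Illustrative usage with made-up content.
packet = ReviewPacket(
    diff="(diff text)",
    reasoning="Extracted the retry logic into a helper.",
    assumptions=[
        Assumption("Callers never rely on the old exception type", load_bearing=True),
        Assumption("Renamed local variables for readability", load_bearing=False),
    ],
)
print(packet.can_merge())  # False: the load-bearing assumption is unanswered
packet.answer(0, approved=True, reviewer="a.reviewer")
print(packet.can_merge())  # True: a named human said yes
```

The design choice worth noticing is that silence never counts as approval, and every yes carries a name. That keeps the record Alan asks for without requiring the reviewer to re-derive the whole diff. A packet with no load-bearing assumptions passes trivially, so the sketch is only as good as the honesty of the labelling.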

DAN

Mona names the mechanism, Max names the fix, and both are right about their piece — but the market is not waiting for either of you. Teams that refuse to integrate these tools are losing position quarter over quarter, and teams that integrate them recklessly are accumulating defects that will surface as incidents over the next year or two. Both groups are visible to investors. The interesting move is the third one: building review discipline as a competitive advantage, treating the human-in-the-loop as a product feature rather than a tax. That is where the next generation of engineering leaders gets made. Alan, your essay reads like an indictment of the tools. I read it as a brief for the firms that learn to use them with their eyes open. So here is the question I would ask the room — who in this industry actually wants to be named, when the diff lands at three in the morning?

AI-assisted content, human-reviewed. Images AI-generated.