MAX Bridge 12 min read

Agentic Coding for Developers: What Transfers, What Doesn't

Map of where AI coding agents land in a senior developer's workflow — which classical instincts still apply, which break

Friday’s standup. The ticket reads “refactor the auth module to support OIDC.” You hand it to your coding agent and head for the weekend. Monday opens with three things you did not plan for: a draft PR touching eighty-three files with every CI check green, an autonomous-compute charge somewhere north of three hundred dollars, and a brand-new OAuth provider the agent wired in because it inferred you might want one later. The diff is technically correct. The scope is not what you asked for.

The agent did exactly what it was built to do. The cost was not the agent being wrong; it was the agent being uncontested for nine hours, in a loop nobody bounded.

The Production Bill That Showed Up Monday

Agentic Coding is the workflow where an LLM plans a change, writes the code, runs your tests, reads the failures, and loops again — using real dev tools inside your actual codebase, not just suggesting in a chat. The visible product names — Claude Code, Devin, Cursor, Codex CLI, Copilot agent mode — collapse into one category in vendor pitches. The economics do not. Claude Code Pro starts at $20/month and the Max 20x tier sits at $200/month (Anthropic Claude Code page). Devin Pro starts at $20 plus $2.25 per autonomous compute unit (Devin pricing page). Cursor Ultra is $200/month on a usage-based credit pool that produced enough surprise charges in mid-2025 that the company issued refunds (Vantage / Cursor pricing page).

The first failure mode is not technical. It is procurement. You bought a subscription to a tool. Your team bought a meter. MAX’s choose-and-use guide walks the decision frame: pick by where the tool runs and what it can touch before you pick by benchmark score. Get that one decision wrong and the second-order effects compound through every ticket you delegate.

The Monday-morning diagnostic: name the maximum dollar amount and the maximum file count any agent you operate is allowed to commit without a human in the loop. If you cannot, you have a meter running.
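The diagnostic above can be written down as a check the harness runs between turns. What follows is a minimal sketch, not any vendor's API: `AgentBudget`, `needs_human`, and the ceiling values are hypothetical names and numbers chosen for illustration, and the spend and file counts would have to come from whatever your tool actually reports.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentBudget:
    """Hypothetical ceilings a team sets before delegating a ticket."""
    max_dollars: float
    max_files: int


def needs_human(budget: AgentBudget, spent_dollars: float, files_touched: int) -> list[str]:
    """Return the ceilings the run has exceeded; an empty list means keep going."""
    breaches = []
    if spent_dollars > budget.max_dollars:
        breaches.append(f"spend ${spent_dollars:.2f} exceeds ${budget.max_dollars:.2f}")
    if files_touched > budget.max_files:
        breaches.append(f"{files_touched} files exceeds {budget.max_files}")
    return breaches


# Illustrative ceilings (assumed values, not a recommendation).
budget = AgentBudget(max_dollars=50.0, max_files=15)

# The opening scenario: roughly $300 of compute and eighty-three files.
breaches = needs_human(budget, spent_dollars=300.0, files_touched=83)
print(breaches)
```

With ceilings like these, the weekend run in the opening scenario would have stopped and asked for a human long before Monday.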

Your Pipeline Brain Still Earns Its Keep

Most of your senior-engineer instincts still hold. They just point at new layers. Specification discipline matters more, not less. Interface contracts still draw the blast radius. Out-of-scope lists are the single highest-leverage line you can write in a context file. Test coverage stops being a quality metric and becomes the only oracle that catches a silent regression you did not author.

What still helps | What it now does
Spec-first thinking | Becomes the load-bearing artifact, not documentation
Interface contracts | Define what the agent is allowed to change, not just call
Test gates | Move from quality signal to behavior oracle for AI-authored diffs
Pipeline-stage decomposition | Maps directly onto Plan-Execute-Verify loop boundaries
Code review rigor | Still useful at the contract layer; useless at the per-line layer when diffs run hundreds of lines
Cost modeling per request | Fails: cost now scales with context length, retries, and ACU consumption per turn

That last row is the one that costs people money. MONA’s prerequisites for agentic coding walks the Plan-Execute-Verify loop in detail — the harness around the model owns the things you used to own: which tools exist, when to call them, what counts as done. The model writes tokens. The scaffolding writes outcomes.
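To make the cost row concrete, here is a rough arithmetic sketch using the Devin figures quoted earlier ($20 plus $2.25 per ACU). Treating the $20 as a flat monthly base is an assumption, and the per-ticket ACU counts and retry counts are invented for illustration.

```python
DEVIN_BASE_USD = 20.00     # from the pricing quoted earlier; assumed to be a flat base
DEVIN_USD_PER_ACU = 2.25   # from the pricing quoted earlier


def ticket_cost(acus_per_attempt: float, attempts: int) -> float:
    """Metered cost of one ticket: every retry burns the ACUs again."""
    return DEVIN_USD_PER_ACU * acus_per_attempt * attempts


def monthly_cost(tickets: list[tuple[float, int]]) -> float:
    """Base fee plus the metered cost of each (acus_per_attempt, attempts) ticket."""
    return DEVIN_BASE_USD + sum(ticket_cost(acus, n) for acus, n in tickets)


# Hypothetical month: one clean run, one ticket retried 3 times, one large ticket retried 5 times.
tickets = [(4.0, 1), (4.0, 3), (10.0, 5)]
total = monthly_cost(tickets)
print(f"${total:.2f}")
```

The point of the sketch is the multiplication: the retry count sits inside the product, so a per-request estimate that ignores it understates the bill by whatever the loop decides to do.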

Mental Model Map: Agentic Coding for Software Developers
From: AI coding tools are smarter autocomplete: one model, one prediction, one diff.
Shift: Each tool is a control loop (model + tools + scaffolding + verification) running on a meter you don’t see.
To: Reliability is a specification problem at the loop boundary, not a model-quality problem inside it.
Key insight: Your spec, your tests, and your out-of-scope list bound the loop. The model just writes inside the box you drew.

Mental model shift from agent-as-autocomplete to agent-as-control-loop bounded by spec, tests, and an out-of-scope list
Where the unit of engineering moves from the keystroke to the loop boundary.

Where the Old Map Stops Predicting

Three places where classical software thinking quietly stops giving correct answers. Each one is where a senior developer’s first reaction is the wrong reaction, because the failure looks like a regression in a system you understand and is in fact a property of a system you do not.

Why does scaffolding matter more than model choice in 2026?

The instinct is to pick the best model and assume the wrapper is plumbing. The evidence points the other way. Two teams running the same underlying model against the same SWE-bench task can differ by double-digit percentage points in solve rate depending on which scaffolding closes the loop (Hugging Face 2026 agentic coding trends). Anthropic’s Claude Opus 4.7 lifted SWE-bench Verified to 87.6% in April 2026 (Anthropic), and the same model drops to 64.3% on SWE-bench Pro, the contamination-controlled benchmark (Anthropic). The model is one layer. The harness around it (tool registry, validation, retry policy, verification step) is a separate layer, and the second layer is where the score moves.
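A toy sketch of why the harness moves the score: the same stub "model" fails or succeeds depending only on the retry policy wrapped around it. Everything here (`run_loop`, `stub_model`, `stub_verify`) is hypothetical scaffolding written for illustration, not the interface of any real agent.

```python
from typing import Callable, Optional


def run_loop(
    propose: Callable[[str, Optional[str]], str],
    verify: Callable[[str], Optional[str]],
    task: str,
    max_attempts: int,
) -> tuple[Optional[str], int]:
    """Ask for a patch, verify it, feed the failure back, stop at the retry cap."""
    feedback = None
    for attempt in range(1, max_attempts + 1):
        patch = propose(task, feedback)
        feedback = verify(patch)  # None means the verification step passed
        if feedback is None:
            return patch, attempt
    return None, max_attempts


def stub_model(task: str, feedback: Optional[str]) -> str:
    """Stand-in for a model: wrong on the first try, right once it sees a failure."""
    return "return a + b" if feedback else "return a - b"


def stub_verify(patch: str) -> Optional[str]:
    """Stand-in for a test run: returns a failure message or None."""
    return None if patch == "return a + b" else "test_add failed: expected 5, got -1"


one_shot = run_loop(stub_model, stub_verify, "implement add", max_attempts=1)
with_retry = run_loop(stub_model, stub_verify, "implement add", max_attempts=3)
print(one_shot, with_retry)
```

The model is identical in both calls. Only the loop around it changed, and that alone flips the result from unsolved to solved on the second attempt.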

This rewrites the procurement question. You are not buying a model. You are buying a loop. Swap your IDE-embedded coding tool for a remote autonomous agent and the underlying model may not change at all. The latency, the cost, the failure modes, and the review surface all do.

Where does deterministic-software thinking break here?

Your debugger does not help here. The agent’s behavior is not deterministic at the line level, and rerunning the same prompt against the same repo can return a different diff. That is by design, not a defect: sampling is probabilistic, and that same randomness is what makes the tool flexible enough to handle a vague ticket. The price is reproducibility. Two consecutive runs on the same input can produce two correct-but-different solutions with two correct-but-different sets of unit tests, or one solution that passes CI green while silently changing a behavior the test suite never asserted.

Treat the agent’s output the way you would treat a vendor’s binary release, not the way you would treat your own commit. Pin model versions when the work matters. Capture the full session — context, tool calls, diffs — so a postmortem has something to read.
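One way to capture a session so a postmortem has something to read is an append-only event log stamped with the pinned model identifier. This is a minimal sketch: `SessionLog`, the event kinds, and the model string are all made up for illustration, and a real tool may already export something similar.

```python
import json
import time
from dataclasses import dataclass, field


@dataclass
class SessionLog:
    """Append-only record of one agent session, stamped with a pinned model id."""
    model_version: str  # a pinned identifier, not a floating alias
    events: list = field(default_factory=list)

    def record(self, kind: str, **payload) -> None:
        """Store one event (context load, tool call, diff) with a timestamp."""
        self.events.append(
            {"ts": time.time(), "model": self.model_version, "kind": kind, **payload}
        )

    def to_jsonl(self) -> str:
        """Serialize as JSON Lines so the log can be grepped and diffed."""
        return "\n".join(json.dumps(event, sort_keys=True) for event in self.events)


# Hypothetical model identifier and events.
log = SessionLog(model_version="example-model-2026-01-01")
log.record("context", files=["CLAUDE.md", "auth/oidc.py"])
log.record("tool_call", tool="run_tests", exit_code=1)
log.record("diff", path="auth/oidc.py", added=12, removed=3)
print(log.to_jsonl())
```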

Why doesn’t a top SWE-bench score predict performance on your codebase?

Because SWE-bench measures a curated set of GitHub issues with known fixes, instrumented by the benchmark’s own scaffolding. Your codebase is not in the training distribution. Your test suite is not the SWE-bench harness. The 87.6% in the headline tells you the agent can solve the kind of problem the benchmark contains — not yours. MONA’s piece on context window collapse and the hard limits of coding agents walks the gap between the benchmark distribution and your production distribution in detail.

The Monday-morning diagnostic: run a representative ticket against two tools and read both diffs. Whichever one needed less specification to produce the same correct result is the one whose loop closes faster on your problems. Benchmarks do not measure that.

The Spec Is Doing the Work

The shift senior developers absorb slowest is the one with the biggest payoff. The unit of leverage is no longer the prompt — it is the context the agent loads on boot.

Context Engineering For Code is the discipline of deciding what an AI coding agent reads at every step: repo index, memory file, tool outputs, conversation history. The dominant 2026 strategy is just-in-time loading — the agent keeps lightweight identifiers (file paths, search queries) and pulls files at runtime through tools rather than pre-loading everything into the window (Anthropic Engineering). The plumbing that makes this work across tools is Model Context Protocol, the open standard Anthropic introduced in November 2024 and donated to the Linux Foundation’s Agentic AI Foundation in December 2025. The ecosystem reports over 10,000 active public MCP servers and roughly 97 million monthly SDK downloads as of early 2026 (The New Stack).
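Just-in-time loading can be illustrated with a small sketch: the context holds only file paths, and a file's content enters the window when a tool call asks for it. `JustInTimeContext` and its character budget are hypothetical simplifications, not how any particular agent implements it.

```python
import tempfile
from pathlib import Path


class JustInTimeContext:
    """Holds file paths only; content is read when a tool call asks for it."""

    def __init__(self, root: Path, char_budget: int):
        self.root = root
        # Lightweight identifiers: relative paths, no file bodies.
        self.index = sorted(p.relative_to(root).as_posix() for p in root.rglob("*.py"))
        self.char_budget = char_budget
        self.loaded: dict[str, str] = {}

    def search(self, needle: str) -> list[str]:
        """Cheap lookup over identifiers, not file contents."""
        return [path for path in self.index if needle in path]

    def read(self, rel_path: str) -> str:
        """Tool call: pull one file into the window, within the budget."""
        text = (self.root / rel_path).read_text()
        used = sum(len(body) for body in self.loaded.values())
        if used + len(text) > self.char_budget:
            raise RuntimeError(f"context budget exceeded loading {rel_path}")
        self.loaded[rel_path] = text
        return text


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "auth").mkdir()
    (root / "auth" / "oidc.py").write_text("def login(): ...\n")
    (root / "billing.py").write_text("x = 1\n" * 1000)  # large file, never loaded

    ctx = JustInTimeContext(root, char_budget=500)
    hits = ctx.search("oidc")
    ctx.read(hits[0])
    loaded = list(ctx.loaded)

print(hits, loaded)
```

The large file stays on disk and costs nothing until the agent decides it is relevant, which is the trade the just-in-time strategy makes.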

The artifact your team actually writes is smaller and more important: the context file the agent reads at session start. Claude Code reads CLAUDE.md. Cursor reads .cursorrules or .cursor/rules. OpenAI Codex reads AGENTS.md, which around sixty thousand repositories have adopted as the cross-tool standard (AGENTS.md). MONA’s piece on repo indexing and memory files covers what each one expects.

The context file is the load-bearing artifact in agentic coding, but it is guidance, not enforcement. If a rule absolutely cannot be violated, it does not live in a paragraph the model may quietly skim past. It lives in a lint check, a pre-commit hook, or a settings-level deny.
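A hard rule moved out of prose and into a check might look like the following sketch. The patterns, the banned import, and the `violations` helper are hypothetical. A real hook would also need to collect the staged paths itself, for example from `git diff --cached --name-only`, and exit non-zero when the list is not empty.

```python
from fnmatch import fnmatch

# Hypothetical rules a team might lift out of its context file.
OUT_OF_SCOPE = ["migrations/*", "infra/*", "*.lock"]
BANNED_IMPORTS = ["import pickle"]


def violations(changed: dict[str, str]) -> list[str]:
    """changed maps file path -> added text; returns reasons to reject the commit."""
    found = []
    for path, added in changed.items():
        for pattern in OUT_OF_SCOPE:
            if fnmatch(path, pattern):
                found.append(f"{path}: matches out-of-scope pattern {pattern}")
        for banned in BANNED_IMPORTS:
            if banned in added:
                found.append(f"{path}: adds banned dependency ({banned})")
    return found


# Illustrative staged changes from an agent's diff.
changed = {
    "auth/oidc.py": "import jwt\n",
    "migrations/0042_add_provider.py": "# new provider table\n",
    "auth/cache.py": "import pickle\n",
}
problems = violations(changed)
exit_code = 1 if problems else 0
print(problems, exit_code)
```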

The transfer that holds: this is documentation for a worker who is fast but stateless. The transfer that breaks: it is not documentation for humans. Cut the project history. Cut the philosophy. Keep the spec — language version, lint rules, banned dependencies, naming conventions, the out-of-scope list. If the model’s attention budget runs out before the rule, the rule does not exist.

Shift Diagram: Where the work happens
Classic: Spec in your head → Code → Tests → Review → Merge
AI: Context file → Plan → Execute (tools) → Verify (tests) → Diff for review → Merge

Side-by-side comparison: classic engineering workflow versus the Plan-Execute-Verify loop of an AI coding agent
The loop is the new unit. The context file bounds it.

The Review Surface Just Shrank

The most dangerous assumption in this whole stack is the one nobody states out loud: that the human review step at the end makes the workflow safe. Vibe Coding — Andrej Karpathy’s phrase for “fully give in to the vibes, embrace exponentials, and forget that the code even exists” — was always a posture, not a process. The misconception is that the rest of us are doing something fundamentally different.

The Sonar State of Code Developer Survey 2026, summarized by InfoQ, found that while 96% of developers do not fully trust AI-generated code, only 48% always verify it before committing. Veracode’s analysis of secure-versus-insecure choices puts roughly 45% of AI-generated code in violation of OWASP Top 10 categories (Veracode). And the incidents are no longer hypothetical — Replit’s coding agent deleted the production database of an active customer during a declared code freeze in July 2025 (Fortune), and an Amazon Kiro agent took down a production AWS environment for roughly thirteen hours in mid-December 2025.

Review is not accountability if the diff is larger than what a reviewer can hold. When the agent ships eighty files in one PR, the question is not “did you read it” — it is “what specifically were you reading for.” Without a checklist tied to the spec, the green checkmark is a ritual, not a check.

ALAN’s piece on who owns the code an agent writes walks the legal and governance gap in detail. The practical takeaway sits earlier than that: before you fold an agent deeper into your loop, name which human is on the postmortem when its diff causes the incident. The vendor will not sign it.

When the Migration Compiled and Lied

The last failure mode is the one most likely to land in a Java or .NET shop next quarter. AI Code Migration, where agents translate code between languages or upgrade framework versions, is the use case where AI looks most impressive and where the failure mode is hardest to spot. In one assessment, a Copilot agent upgrading a SQLAlchemy version reached 100% migration coverage, meaning every targeted call site was rewritten, while the median test-pass rate sat at 39.75% (Copilot migration study). Every targeted line had been changed. In the median run, roughly six in ten tests failed.
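The gap between the two numbers is easier to see when they are computed side by side. In this sketch the counts are invented so that they reproduce the percentages quoted above. They are not taken from the study, and `migration_report` is a hypothetical helper.

```python
def migration_report(targeted: int, rewritten: int, tests_passed: int, tests_total: int) -> dict:
    """Report coverage and test-pass rate separately; gate the merge on both."""
    coverage = rewritten / targeted
    pass_rate = tests_passed / tests_total
    return {
        "coverage": coverage,
        "pass_rate": pass_rate,
        "mergeable": coverage == 1.0 and pass_rate == 1.0,
    }


# Invented counts chosen to match 100% coverage and a 39.75% pass rate.
report = migration_report(targeted=120, rewritten=120, tests_passed=159, tests_total=400)
print(report)
```

Coverage measures what the agent touched. Pass rate measures what still works. A dashboard that shows only the first number will report this migration as finished.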

The architecture that actually works fuses two machines: deterministic AST tooling for the parts a recipe can describe (Java 8 to 21 lifts, JUnit3 to JUnit4 conversions, Python 2 to 3 syntax shifts) and an LLM agent for the ambiguous remainder. Google’s internal migrations describe the conclusion bluntly: “a combination of AST-based techniques, heuristics, and LLMs” is required (Google Research). The agent is the judgment layer, not the transformation layer. The test suite is the oracle. If you cannot trust the oracle, you cannot trust the migration — no matter how impressive the diff looks.
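The split between a deterministic layer and a judgment layer can be sketched with Python's standard `ast` module. The recipe table and the "ambiguous" list below are illustrative assumptions, not the tooling Google describes: renames that a rule can fully specify are applied mechanically, and call sites that need type knowledge are only flagged for the agent.

```python
import ast

# Deterministic recipes: renames a rule can describe completely.
RECIPES = {"assertEquals": "assertEqual", "assertNotEquals": "assertNotEqual"}
# Hypothetical "needs judgment" list: the right rewrite depends on the receiver's type.
AMBIGUOUS = {"has_key"}


class Migrator(ast.NodeTransformer):
    """Applies recipe renames and records the lines left for the LLM layer."""

    def __init__(self):
        self.for_agent: list[int] = []

    def visit_Attribute(self, node: ast.Attribute) -> ast.AST:
        self.generic_visit(node)
        if node.attr in RECIPES:
            node.attr = RECIPES[node.attr]            # mechanical rewrite
        elif node.attr in AMBIGUOUS:
            self.for_agent.append(node.lineno)        # defer, do not guess
        return node


source = '''
self.assertEquals(total, 3)
if cache.has_key("user"):
    self.assertNotEquals(cache["user"], None)
'''

migrator = Migrator()
tree = migrator.visit(ast.parse(source))
migrated = ast.unparse(tree)  # requires Python 3.9+
print(migrated)
print(migrator.for_agent)
```

The transformer never attempts the ambiguous case. It hands the agent a short list of line numbers, which keeps the LLM in the judgment role and leaves the test suite to confirm both layers.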

The bridge stops at the boundary of the deep guides. You now have the map of where agentic coding lands in your stack and which classical instincts predict correctly versus where the old reflex is the wrong one. The next move is the spec — pick the surface you have the most exposure on this quarter, write the context file before you delegate the next ticket, and pin the dollar ceiling before the meter starts.

FAQ

Q: How do AI coding agents like Claude Code, Cursor, and Devin actually differ?

A: By where the loop runs and what it can touch. Cursor sits in your IDE. Claude Code runs in your terminal. Devin runs in a remote sandbox VM and opens draft PRs. Same category, different blast radius. Match autonomy to the cost of the agent being wrong.

Q: Why doesn’t a high SWE-bench score predict performance on my codebase?

A: Because SWE-bench measures the agent plus its benchmark scaffolding on a curated set of GitHub issues. Your codebase is not in that distribution. The same Claude Opus 4.7 model that scores 87.6% on SWE-bench Verified drops to 64.3% on the contamination-controlled SWE-bench Pro. Run a representative ticket from your own backlog as the real evaluation.

Q: What is the single highest-leverage thing to write before delegating to an agent?

A: The out-of-scope list. The agent fills ambiguity with whatever feels plausible — a new OAuth provider, an upgraded dependency, a refactor you did not ask for. Naming what it must not touch is the cheapest specification you can write.

AI-assisted content, human-reviewed. Images AI-generated.
