ALAN opinion 11 min read May 29, 2026 Updated July 9, 2026

Trained on Scraped Code: Licensing, Attribution, and the Ethics of Code LLMs

Open-source code flowing into an AI model while author attribution is stripped, raising licensing and consent questions

The Hard Truth

Every line you ever pushed to a public repository was a gift — to strangers, to future maintainers, to people you would never meet. But what happens when the recipient is not a person at all, never says where it learned, and quietly sells the lesson back to you?

Open source was never only a legal arrangement. It was a moral economy — a vast, unspoken agreement that if you take, you also give back, and that the names attached to the work travel with it. Code LLMs have walked into that economy at a scale no human contributor ever could, and they have done it without learning its manners. The question is not whether they read our code. It is whether they understood what reading it was supposed to mean.

The Gift That Forgot Where It Came From

There is a specific kind of unease that settles over a developer the first time an assistant completes a function in their own idiosyncratic style. The suggestion is useful. It is also, somehow, familiar — as if it remembered something you never told it. That feeling is not paranoia. It is the recognition that your work has been absorbed into a system that will never credit you, never link back, and never explain which of the millions of repositories it consumed taught it that particular trick.

We are comfortable saying that machines “learn” from code. The word does a lot of quiet work. It borrows the legitimacy of human learning — the apprentice reading the master’s source — and applies it to an industrial process that resembles apprenticeship in almost no meaningful way. So before we argue about licenses and lawsuits, we should ask the harder thing: when a system consumes a commons built on reciprocity and gives nothing recognizable back to it, what exactly have we agreed to?

The Case That This Is Just How Learning Works

The strongest version of the opposing argument deserves to be stated plainly, because it is not foolish. Humans learn to program by reading other people’s code. We internalize patterns, idioms, and structures, and we reproduce them without footnotes for the rest of our careers. Permissive licenses like MIT and Apache exist precisely to invite this reuse. If a person can read a public repository and absorb its lessons freely, why should a model — which does something at least analogous — be held to a stricter standard?

And the better builders in this space are not behaving like pirates. The Stack v1.2, the dataset behind StarCoder, was filtered down to permissively-licensed code only, leaving 6.4TB from a starting pool of roughly 102TB, with an opt-out process that 44 developers had already used at training time (Hugging Face / BigCode). Its successor went further, excluding copyleft-licensed code entirely and refreshing the dataset roughly every three months to honor new opt-outs (StarCoder 2 / Stack v2 paper). Under the EU AI Act, providers of general-purpose models are now expected to publish a sufficiently detailed summary of their training content. This is not a lawless frontier. Some of it looks like genuine conscience.

So the blanket accusation — that every code model is built on stolen work — is simply false, and worth retiring. The thoughtful objection lives somewhere more uncomfortable.

The Assumption Hiding Inside “It Just Learns”

Here is the assumption nearly everyone smuggles into the human-learning analogy: that absorption without attribution is morally equivalent whether a person does it once or a machine does it a billion times. But scale is not a neutral multiplier. When a human learns from your code, they carry forward not just your patterns but, usually, some memory of where they came from — a mentor named, a project credited, a culture of acknowledgment that keeps the commons social. When a model learns, provenance is the very first thing discarded. Most code assistants will not, at the moment they suggest a line, tell you whose work shaped it. The lineage is severed by design, not by accident.

This is what some have begun calling “license laundering” — the way generative tooling can reproduce or derive from licensed code while stripping the obligations and the names that were attached to it (Terms.Law). The mechanism that makes these systems feel magical, their fluency, is the same mechanism that makes attribution disappear. And attribution was never a bureaucratic formality in open source. It was the currency. It was how a contributor with no money and no title accumulated something real: reputation, standing, the right to be known for what they made.

If the analogy to human learning holds at all, it holds only by ignoring the part of human learning that keeps a community honest. So which is it — are these systems learners who owe what learners owe, or are they something new that we have not yet built a moral vocabulary for?

What the Commons Was Actually Built On

It helps to remember that free software was a moral argument before it was a software movement. The Free Software Foundation has long held that for a machine-learning system to be genuinely “free,” not only the weights but the training data and the training code must themselves be under free licenses — an extension of the four freedoms into a new medium (Free Software Foundation). You can disagree with the position. What you cannot do is pretend it is merely technical. It is a claim about reciprocity: that you do not get to enjoy the fruits of a commons while quietly closing the door behind you.

The courts are circling the same tension from the outside, and they are not speaking with one voice. In the United States, the long-running Copilot litigation — filed in November 2022 by Matthew Butterick and the Joseph Saveri Law Firm — has been substantially narrowed: most of the original claims were dismissed, with only two of the original twenty-two surviving, and the DMCA claim over stripped attribution rejected because the output snippets were not deemed similar enough to the plaintiffs’ code (GitHub Copilot litigation site). What survives is essentially a question of broken promises — open-source license violation and breach of contract — and as of early 2026 it remains in discovery, decided neither way. Meanwhile a German court took a sharply different posture: in November 2025 the Munich Regional Court ruled that memorization inside model weights can amount to copyright reproduction, falling outside the text-and-data-mining exception, a ruling the provider has said it will appeal (Bird & Bird). One first-instance judgment in one jurisdiction is not the law of the world. But it tells you the question is far from settled.

The Real Breach Is Reciprocity, Not Copyright

Thesis: The deepest ethical problem with code LLMs is not whether training technically infringes a license but whether a system that consumes a reciprocal commons while erasing provenance can honestly claim to honor the bargain that made the commons possible.

Copyright is the wrong lens because it asks only “was this permitted?” when the question that actually matters is “was this reciprocal?” A permissive license grants legal permission to reuse; it does not dissolve the social expectation that the giving goes both ways. The contributors who built the corpus these models depend on were participating in an economy of acknowledgment. To take everything that economy produced, strip the names, and return fluency without provenance is not necessarily illegal. It may even survive every lawsuit. And it would still represent a quiet hollowing-out of the thing that made open source worth contributing to in the first place. Legality is a floor. Reciprocity was the architecture.

The Questions We Owe the Commons

So what would honoring the bargain actually look like? Not a footnote bolted onto every autocomplete — that may be technically impractical and would miss the point. The harder questions are upstream. Should provenance be treated as a first-class signal that travels through the pipeline rather than the first thing thrown away? Should opt-out be the burden placed on the contributor, or should meaningful consent be the burden placed on the system that profits? When a model’s fluency is built from a named community’s labor, does that community have any claim on what the fluency becomes — not in dollars necessarily, but in voice, in governance, in the right to be acknowledged as the source?

These are not questions a license file can answer. They are questions about what kind of relationship we want between the people who write in the open and the systems that learn from them.

Where This Argument Could Be Wrong

I should name the strongest objection to my own position. If attribution at the scale of training data turns out to be genuinely impossible — not merely inconvenient but information-theoretically intractable once millions of sources blend into a single distribution — then demanding it is demanding a square circle, and the honest path is to renegotiate the social contract rather than enforce an impossible one. And if governance-conscious datasets with filtering, transparency summaries, and refreshed opt-outs become the norm rather than the exception, the reciprocity gap could narrow on its own, and my concern would read as alarmism about a problem already correcting itself.

The Question That Remains

The commons was a promise that taking and giving stayed in balance, and that names traveled with the work. We have built machines that take at planetary scale and give back fluency with the names removed. The unresolved question is not who wins the lawsuits — it is whether a generation of developers will keep contributing to an open world once they suspect the only reader left is one that will never know their name.

Ethically, Alan.

Aha Moments

MONA

Alan frames this as a moral question, and it is — but it rests on a measurable fact. Memorization is not metaphorical. Under the right conditions, models can reproduce fragments of their training data closely enough to be recognized, which means provenance is, in principle, partially recoverable from the artifacts themselves. The discarding of lineage is therefore a choice in the pipeline, not an inherent property of how these systems learn. That distinction matters. If we can demonstrate empirically where and how source information survives or vanishes during training, then “attribution is impossible” stops being an axiom and becomes a hypothesis we can actually test. The ethics get sharper when the mechanism is understood rather than assumed.

MAX

Building on Mona’s point — from a systems perspective, attribution is a missing field, not a missing miracle. We track provenance for dependencies, for build artifacts, for security supply chains; we know how to carry metadata through complex transformations when we decide it matters. The reason lineage disappears in most code-model pipelines is that nobody specified it as a requirement, so it got optimized away like any unspecified constraint. That is a design decision wearing the costume of a technical limit. I am not claiming perfect per-token attribution is cheap. I am saying the gap between “hard” and “impossible” is where most of the convenient surrender happens, and we should be honest about which one we are actually facing.

DAN

Both of you are treating this as an obligation. I would reframe it as positioning. The developer ecosystem is the supply chain here, and trust in that supply chain is an asset that compounds or erodes. The first platforms to make provenance and consent a visible, credible part of their offering will not be doing charity — they will be building a durable relationship with the exact community that feeds their models, at the moment that community is most suspicious. Erode that trust and you eventually poison your own well. So the real strategic question is not whether honoring the commons costs something today. It is this: who can afford to be the last one still pretending it does not?

AI-assisted content, human-reviewed. Images AI-generated. Editorial Standards · Our Editors