Online Experimentation
Also known as: A/B testing, online controlled experiment, OCE
- Online experimentation assigns live users randomly to control and treatment groups to measure the causal effect of a change. It is the standard method for validating prompt variants, model versions, or retrieval configurations in production AI systems, producing statistically reliable effect estimates rather than correlational observations.
Online experimentation is the practice of testing changes on live production traffic by randomly assigning real users to control and treatment groups to measure causal effects.
What It Is
Knowing that something improved after a change is not the same as knowing the change caused it. Think of it like a clinical trial: a drug isn’t approved by giving it to everyone and observing the population afterward. Patients are split randomly into drug and placebo groups, and the difference between the groups is the evidence that the drug works. Online experimentation applies the same logic to software: run both versions simultaneously with users randomly assigned to each, so that any outcome difference larger than random variation would explain can be attributed to the change itself.
According to Analytics ToolKit Glossary, an online controlled experiment — the formal name for this practice — divides live traffic into a control group, which sees the current behavior, and a treatment group, which sees the changed version. Because users are randomly assigned, any difference in outcomes between the groups can be attributed to the treatment rather than to pre-existing differences in who each group contains. Analytics ToolKit Glossary also describes its dual purpose: testing whether an effect exists at all (hypothesis testing) and estimating how large that effect is (effect estimation). According to ScienceDirect (JOSS 2024), the terms A/B testing and online controlled experimentation are fully synonymous across industries.
Where this gets complicated is in AI quality measurement. Traditional online experimentation was built for binary outcomes: a user clicks or doesn’t, subscribes or doesn’t, completes a purchase or doesn’t. The outcome is deterministic and observable at the network layer. According to Cameron Wolfe (Deep Learning Focus), LLM output distributions are non-stationary and high-variance: the same prompt can produce different-quality responses across requests, and text quality does not reduce to a single number. That gap between a clean binary signal and a multi-dimensional quality score is why measuring LLM output quality within an online experiment demands larger sample sizes, longer run times, and a dedicated scoring pipeline. This is the core problem examined in the article this glossary term supports.
How It’s Used in Practice
The most common version in AI product work is prompt variant testing. You have a prompt that generates customer responses, content summaries, or code suggestions. You want to know whether a revised prompt produces better output. Traffic is split at the application layer: some requests go to the existing prompt, others to the candidate. Quality scores accumulate on both branches, and a statistical test runs when enough data is collected. Platforms like LangSmith, Arize Phoenix, and Braintrust provide this kind of experiment management alongside the evaluation infrastructure needed to score LLM outputs at inference time.
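The application-layer split described above can be sketched with deterministic hashing, so a given user always lands on the same prompt variant. This is a minimal illustration, not any platform's API: the function names, the experiment name, and both prompt templates are hypothetical.

```python
import hashlib

def assign_variant(user_id: str, experiment: str, treatment_share: float = 0.5) -> str:
    """Deterministically map a user to 'control' or 'treatment'."""
    # Salting with the experiment name keeps assignments independent across experiments.
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).digest()
    # Use the first 8 bytes of the hash to pick one of 10,000 buckets.
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    return "treatment" if bucket < treatment_share * 10_000 else "control"

# Hypothetical prompt templates, for illustration only.
PROMPTS = {
    "control": "Summarize the following ticket:\n{ticket}",
    "treatment": "Summarize the following ticket in three bullet points:\n{ticket}",
}

def build_prompt(user_id: str, ticket: str) -> tuple[str, str]:
    """Return the assigned variant and the prompt to send to the model."""
    variant = assign_variant(user_id, "summary-prompt-v2")
    return variant, PROMPTS[variant].format(ticket=ticket)
```

Hashing on the user ID rather than drawing a random number per request keeps each user's experience consistent and makes the assignment reproducible when quality scores are later joined back to variants.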
A second scenario is model version comparison. When an LLM provider releases a new version, switching all traffic at once is risky — and informally monitoring the results afterward gives you correlation, not causation. An experiment routes a portion of traffic to the new model, collects quality scores and latency metrics on both, then produces a statistically grounded decision before full rollout.
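One way to turn the collected scores into a statistically grounded decision is to estimate the difference in mean quality between the two branches, with a confidence interval and a p-value. The sketch below uses a large-sample normal approximation and only the Python standard library; the function name and the sample scores are illustrative assumptions, not real results.

```python
from math import sqrt
from statistics import NormalDist, mean, variance

def compare_means(control, treatment, alpha=0.05):
    """Estimate (treatment - control) with a normal-approximation CI and two-sided p-value."""
    diff = mean(treatment) - mean(control)
    # Standard error of the difference, allowing unequal variances per branch.
    se = sqrt(variance(control) / len(control) + variance(treatment) / len(treatment))
    z = diff / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    margin = NormalDist().inv_cdf(1 - alpha / 2) * se
    return {"diff": diff, "ci": (diff - margin, diff + margin), "p_value": p_value}

# Made-up quality scores on a 0-1 scale, purely to show the call shape.
old_model_scores = [0.61, 0.70, 0.55, 0.66, 0.72, 0.58, 0.64, 0.69]
new_model_scores = [0.68, 0.74, 0.60, 0.71, 0.77, 0.63, 0.70, 0.73]
result = compare_means(old_model_scores, new_model_scores)
print(result)
```

The normal approximation is only reasonable with many observations per branch. For small samples, a t-based test such as `scipy.stats.ttest_ind(..., equal_var=False)` is the safer choice. The same function can be applied to latency or any other per-request metric.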
Pro Tip: Calculate your required sample size before you start, not after. LLM output quality scores carry far more variance than click-through rates, which means experiments need to run much longer than most teams expect. An experiment that feels complete after a few days may still lack the observations needed to distinguish a real quality difference from background noise.
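A rough sample-size estimate can be computed up front with the standard two-sample formula for comparing means, n per group = 2(z₁₋α/₂ + z_power)² σ² / δ². The sketch below assumes equal group sizes, a roughly normal metric, and a standard deviation taken from historical scores; the function and parameter names are illustrative.

```python
from math import ceil
from statistics import NormalDist

def sample_size_per_group(std_dev, min_detectable_diff, alpha=0.05, power=0.8):
    """Observations needed in EACH group to detect a given difference in means."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance threshold
    z_power = NormalDist().inv_cdf(power)          # quantile for the desired power
    n = 2 * (z_alpha + z_power) ** 2 * (std_dev / min_detectable_diff) ** 2
    return ceil(n)

# Doubling the standard deviation roughly quadruples the required sample size.
baseline = sample_size_per_group(std_dev=1.0, min_detectable_diff=0.1)
noisier = sample_size_per_group(std_dev=2.0, min_detectable_diff=0.1)
print(baseline, noisier)
```

Because the required sample grows with the square of the standard deviation, a noisy quality score can need several times the traffic of a low-variance metric to detect the same difference.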
When to Use / When Not
| Scenario | Use | Avoid |
|---|---|---|
| Comparing prompt variants on a high-traffic production endpoint | ✅ | |
| Validating a new model version before full rollout | ✅ | |
| Checking whether a quality scoring method is itself reliable | ✅ | |
| Low-traffic endpoint where sample sizes cannot reach significance | | ❌ |
| Quick post-deploy check to confirm nothing is broken | | ❌ |
| Measuring outcomes when user assignment cannot be controlled | | ❌ |
Common Misconception
Myth: Any online experimentation setup built for click-through rate testing transfers directly to LLM output quality measurement.
Reality: According to Cameron Wolfe (Deep Learning Focus), LLM output distributions are non-stationary and high-variance. Standard impression-to-binary-click infrastructure doesn’t translate — you need a text quality scoring pipeline running at inference time, and you need considerably more traffic to detect the same effect size.
One Sentence to Remember
Online experimentation is how you establish that a change caused an outcome rather than merely coincided with one, but in LLM systems, building a reliable quality signal for the experiment to measure is often a harder problem than running the experiment itself.
FAQ
Q: What is the difference between online and offline experimentation? A: Offline experimentation tests changes against historical or synthetic datasets. Online experimentation tests against real users in production — the only way to confirm that a change actually affects real user behavior, not just benchmark scores.
Q: Why do LLM quality experiments require more traffic than CTR experiments? A: LLM output quality scores have much higher variance than binary click signals. Higher variance means the statistical test needs more observations to reliably distinguish a real effect from background noise.
Q: Is shadow testing the same as online experimentation? A: Shadow testing routes live traffic to both model versions but serves only one response to users. It measures computational behavior, not user outcomes — a useful precursor to online experimentation, not a substitute.
Sources
- Analytics ToolKit Glossary: What is an Online Controlled Experiment (OCE)? - Definition and dual-purpose framework for online controlled experiments
- ScienceDirect (JOSS 2024): A/B testing: A systematic literature review - Confirms A/B testing and online controlled experimentation are synonymous across industries
Expert Takes
Random assignment in online controlled experiments neutralizes confounding variables — that’s the mechanism that makes causal inference possible. Without it, you’re measuring correlations between users who happened to see the new model and users who didn’t. The problem with LLM outputs is the outcome variable itself. Text quality is multi-dimensional and high-variance compared to a binary click signal. More variance in the outcome means more users needed to detect the same effect size.
The hard part of online experimentation for LLM systems isn’t traffic splitting — most production routing infrastructure handles that. It’s building a quality signal you can actually compare. Click-through rates are observable at the network layer. Text quality is not. You need a scoring pipeline that runs at inference time — whether that’s LLM-as-judge, human annotation, or task-specific metrics — and that pipeline’s own reliability becomes a second variable in your experiment.
Most teams deploying LLM features never run a real experiment. They release, monitor for disasters, and call it validated. That’s not validation — it’s hope. The teams building durable AI products have internal experimentation capability that tells them what actually moved the needle and why. That advantage compounds over time. The teams without it are still debating which model version performs better based on anecdote.
Online experimentation tells you the treatment group got “better” responses — but who defined better, and at what cost? The quality metric chosen to run the experiment encodes values. An LLM that wins on task completion might lose on honesty. A model that scores higher on engagement might be doing so by being more agreeable rather than more accurate. The experiment gives statistical confidence in the outcome you measured, not necessarily the one that matters.