Scaling Laws

Also known as: neural scaling laws, compute scaling, Chinchilla scaling

Empirical power-law relationships showing how a language model’s performance improves predictably as you increase model size, training data, or compute budget, helping teams forecast results and choose the right pretrained model before committing resources.

What It Is

If you’re choosing a pretrained transformer model — decoder-only or otherwise — the first question is usually “how big does it need to be?” Scaling laws give you a data-driven answer instead of a guess. They describe the repeatable, mathematical relationship between three inputs (parameters, training tokens, and compute) and one output (model quality, measured as cross-entropy loss). Cross-entropy loss is simply a number that captures how surprised the model is by the next word — lower loss means better predictions.

Think of it like a recipe yield curve. If you know that doubling flour and doubling oven time produces 1.8 times more bread (not double), you can plan your bakery’s output without baking every possible batch first. Scaling laws do the same for model training: they let you predict how good a model will be at a given size without actually running the full experiment.

The concept took shape in two landmark studies. The 2020 paper “Scaling Laws for Neural Language Models” by Kaplan et al. showed that loss decreases as a smooth power law when you increase any of the three factors independently, and that larger models are more sample-efficient: they reach lower loss with fewer training tokens relative to their size. This finding accelerated the trend toward building ever-larger models.

Then came a significant correction. The 2022 “Chinchilla” paper by Hoffmann et al. at DeepMind demonstrated that many large models were under-trained relative to their parameter count. The study proposed a roughly 20-to-1 ratio of training tokens to parameters for compute-optimal training: a model with one billion parameters, for example, should see about twenty billion tokens during training. This single insight shifted industry practice away from simply making models larger and toward balancing size with the volume of training data.
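The Chinchilla paper expressed this trade-off as a parametric loss fit. Below is a minimal sketch using the fitted constants reported in that paper; treat the exact numbers as illustrative, since exponents vary across studies and model families.

```python
# Sketch of the parametric loss fit reported in the Chinchilla paper:
# L(N, D) = E + A / N**alpha + B / D**beta, where N is the parameter
# count and D is the number of training tokens.

def predicted_loss(params: float, tokens: float) -> float:
    """Predicted cross-entropy loss for `params` parameters trained on
    `tokens` tokens, per the Chinchilla parametric fit."""
    E = 1.69                 # irreducible loss of the data distribution
    A, alpha = 406.4, 0.34   # parameter-count term
    B, beta = 410.7, 0.28    # training-token term
    return E + A / params**alpha + B / tokens**beta

# Two models at the same training compute (roughly 6 * N * D FLOPs):
# one at the ~20:1 token-to-parameter ratio, one with 10x the parameters
# but a tenth of the data. The balanced model reaches lower predicted loss.
balanced = predicted_loss(1e9, 20e9)        # 1B params, 20B tokens
under_trained = predicted_loss(10e9, 2e9)   # 10B params, 2B tokens
```

Both configurations cost the same to train under the 6·N·D approximation, which is exactly why the correction mattered: the extra parameters buy nothing if the data term dominates the loss.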

For anyone selecting a decoder-only transformer today, scaling laws explain why you can look at published benchmarks and make informed trade-offs: a smaller model trained on the right amount of data can match or outperform a larger but under-trained model, often at a fraction of the inference cost.

How It’s Used in Practice

The most common place you’ll encounter scaling laws is during model selection. When a team evaluates pretrained decoder-only transformers — deciding between a smaller model trained with more data versus a larger model trained with less — scaling laws provide the framework for that comparison. Instead of running expensive head-to-head experiments, engineers plot expected loss against their available compute budget and pick the model that lands on the most favorable point of the curve.
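That budgeting exercise can be sketched in a few lines. The sketch leans on two rules of thumb from the scaling-law literature: training compute is roughly 6 × N × D FLOPs, and the compute-optimal ratio is roughly D ≈ 20 × N. Both are approximations, so treat the output as a planning estimate rather than a spec.

```python
import math

# Given a FLOPs budget C, with C ≈ 6 * N * D and D ≈ 20 * N, substitution
# gives C ≈ 120 * N**2, so N ≈ sqrt(C / 120). The constants 6 and 20 are
# rules of thumb, not exact values.

def compute_optimal_size(flops_budget: float) -> tuple[float, float]:
    """Return (parameters, training_tokens) that roughly saturate a
    FLOPs budget at a ~20 tokens-per-parameter ratio."""
    params = math.sqrt(flops_budget / 120)
    tokens = 20 * params
    return params, tokens

# Example: ~1.2e20 FLOPs is the compute for a 1B-parameter model trained
# on 20B tokens (6 * 1e9 * 20e9); inverting the budget recovers those numbers.
n, d = compute_optimal_size(1.2e20)
```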

This matters outside the lab, too. Product managers reviewing vendor pitches will often see claims like “trained at optimal compute” or “Chinchilla-optimal.” These phrases reference the scaling law research directly, meaning the model was trained with a token-to-parameter ratio that maximizes quality per dollar of compute spent.

Pro Tip: When comparing two pretrained models of different sizes, don’t assume the bigger one wins. Check how many tokens each was trained on relative to its parameter count. A model that followed compute-optimal scaling (roughly twenty training tokens per parameter) often outperforms a model with twice the parameters but half the recommended training data.
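As a concrete illustration of that check, compare two made-up model cards by tokens per parameter rather than raw size (the names and numbers below are invented for illustration, not real releases):

```python
# Hypothetical spec-sheet comparison: judge models by training tokens
# per parameter, not parameter count alone.

def tokens_per_param(params: float, training_tokens: float) -> float:
    """Training tokens seen per model parameter."""
    return training_tokens / params

model_a = {"params": 7e9, "tokens": 2e12}    # smaller, trained on far more data
model_b = {"params": 70e9, "tokens": 300e9}  # 10x larger, far less data

ratio_a = tokens_per_param(model_a["params"], model_a["tokens"])  # ~286
ratio_b = tokens_per_param(model_b["params"], model_b["tokens"])  # ~4.3
# Model A is well past the ~20:1 compute-optimal ratio; Model B is far
# below it, so its extra parameters may not translate into better quality.
```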

When to Use / When Not

| Scenario | Use | Avoid |
| --- | --- | --- |
| Choosing between pretrained models of different sizes | ✓ | |
| Estimating compute budget for a new training run | ✓ | |
| Predicting exact benchmark scores on a specific task | | ✓ |
| Comparing models trained on very different data distributions | | ✓ |
| Planning hardware procurement for a training cluster | ✓ | |
| Fine-tuning a small model on a narrow domain dataset | | ✓ |

Common Misconception

Myth: A model with more parameters always performs better than a smaller one. Reality: Performance depends on the balance between parameters and training data. A smaller model trained on the right volume of tokens can outperform a much larger model that was under-trained. The Chinchilla study proved that size alone is not the deciding factor — the ratio of data to parameters matters just as much.

One Sentence to Remember

Scaling laws tell you that doubling a model’s parameters without proportionally increasing its training data is a waste of compute — balance both and you can predict performance before you spend a dollar.

FAQ

Q: Do scaling laws apply to fine-tuned models or only to pretraining? A: The original research focused on pretraining loss. Fine-tuning involves smaller datasets and different dynamics, so the same power-law curves do not transfer directly to fine-tuning outcomes.

Q: Is bigger always better according to scaling laws? A: No. The Chinchilla study showed that a smaller model trained on proportionally more data often matches a larger under-trained model while costing less to run at inference time.

Q: Can scaling laws predict performance on a specific downstream task? A: They predict aggregate loss (cross-entropy), not task-specific accuracy. A model with lower pretraining loss generally performs better, but results on any single benchmark can vary.

Expert Takes

Scaling laws formalize what was previously intuition: loss follows a power-law function of compute, data, and parameters. The exponents differ across studies, which means the curves are directionally reliable but not precision instruments. The Chinchilla correction showed that prior assumptions about optimal model size were off by a significant margin, favoring more data per parameter than earlier work suggested.

When you pick a decoder-only transformer, scaling laws are your spec sheet for cost-versus-quality. A model trained at the right token-to-parameter ratio gives you better inference economics — lower latency, smaller hardware footprint, same output quality. If a vendor can’t tell you their model’s training ratio, that’s a red flag worth investigating before you sign anything.

Scaling laws changed how AI companies allocate capital. Before Chinchilla, the strategy was straightforward: build the biggest model you can afford. After Chinchilla, the winning move became training smaller models on massive datasets, then deploying them cheaply. The companies that absorbed this lesson early now run leaner infrastructure with competitive performance.

The assumption behind scaling laws is that more data and more compute always improve a model. But the data itself carries biases, omissions, and cultural gaps that scale right along with performance. A model that is “compute-optimal” is not bias-optimal. The efficiency gains are real, but they don’t address what the model learned — only how efficiently it learned it.