Emergent Abilities
Also known as: Emergent Properties, Emergent Capabilities, Emergence in LLMs
Emergent abilities are capabilities that appear in large language models only after reaching a certain scale of training compute and parameters, absent in smaller models but present in larger ones.
What It Is
If you’re planning how much compute to spend on a training run, scaling laws and Chinchilla-optimal ratios give you a reassuring picture: add more data and parameters, and training loss decreases along a smooth, predictable curve. Emergent abilities are the wrinkle in that picture. They describe skills that don’t show up in smaller models, then seem to appear once a model crosses a size threshold — making it hard to predict what you’ll get for your training budget.
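The smooth part of that picture can be made concrete. Below is a minimal sketch of the parametric loss fit from Hoffmann et al. (2022), L(N, D) = E + A/N^α + B/D^β. The constants are the published fitted values (treat them as approximate), and the (N, D) pairs in the loop are illustrative, not from any real training run:

```python
# Chinchilla-style parametric loss fit (Hoffmann et al., 2022):
#   L(N, D) = E + A / N**alpha + B / D**beta
# Constants are the published fitted values; treat as approximate.
E, A, B, ALPHA, BETA = 1.69, 406.4, 410.7, 0.34, 0.28

def predicted_loss(n_params: float, n_tokens: float) -> float:
    """Predicted training loss for N parameters trained on D tokens."""
    return E + A / n_params**ALPHA + B / n_tokens**BETA

# Each tenfold scale-up moves loss down a smooth curve -- no jumps anywhere.
for n, d in [(1e9, 2e10), (1e10, 2e11), (1e11, 2e12)]:
    print(f"N={n:.0e} D={d:.0e} -> loss ~ {predicted_loss(n, d):.3f}")
```

The point of the sketch is what the curve does not contain: loss declines gradually toward the irreducible term E, with no threshold where anything qualitatively changes.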
Think of it like heating water. Temperature rises smoothly, degree by degree — that’s your training loss going down. But at a specific point, the water boils. That phase transition — gradual input producing a sudden qualitative change — captures what researchers first observed with emergent abilities.
Wei et al.’s 2022 paper cataloged dozens of tasks where models below a certain size performed at chance level, while models above that threshold showed meaningful capability. Tasks like multi-step arithmetic, code generation, and chain-of-thought reasoning appeared to “switch on” rather than improve gradually.
Schaeffer et al., in a study presented at NeurIPS 2023, argued that over 92% of claimed emergent abilities relied on discontinuous evaluation metrics — scoring methods like exact string matching that give zero credit for partial answers. When researchers switched to smoother metrics that awarded partial credit, the abrupt jumps often disappeared, replaced by gradual improvement curves. This suggests some apparent emergence reflects how we measure performance, not something fundamental inside the model.
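A toy sketch shows how metric choice alone can manufacture an apparent jump, in the spirit of Schaeffer et al.’s argument. Assume (purely for illustration — the curve and numbers below are invented, not measured) that per-token accuracy improves smoothly with model size. Exact match on a multi-token answer multiplies those per-token probabilities, so sequence-level accuracy stays near zero until late and then shoots up, while the partial-credit view stays smooth:

```python
# Illustrative sketch (not the paper's code): smooth per-token improvement
# looks "emergent" under an all-or-nothing metric. All numbers are assumed.
import math

def per_token_accuracy(params: float) -> float:
    """Assumed smooth logistic improvement in log-parameter space."""
    return 1 / (1 + math.exp(-(math.log10(params) - 9)))

ANSWER_LEN = 10  # tokens in a multi-step answer (assumed)

for params in [1e7, 1e8, 1e9, 1e10, 1e11]:
    p = per_token_accuracy(params)
    exact_match = p ** ANSWER_LEN  # zero credit unless every token is right
    partial_credit = p             # average per-token credit, smooth by design
    print(f"{params:>8.0e}  partial={partial_credit:.3f}  exact={exact_match:.3f}")
```

Both columns come from the same underlying curve; only the scoring rule differs. That is the core of the mirage critique: the “phase transition” can live in the metric, not the model.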
The debate matters for training budget decisions. If a capability truly emerges only at scale, you might need to commit to a large training run before seeing any return. If improvement is gradual, you can track progress incrementally, course-correct, and potentially reach your target with a smaller model plus better prompting or fine-tuning.
How It’s Used in Practice
When teams plan large model training runs, emergence shapes budget decisions. If your target capability — say, reliable multi-step reasoning — is believed to emerge only at a certain model size, you face a stark choice: invest enough compute to cross that threshold, or accept the capability won’t appear. This makes scaling laws and Chinchilla-optimal ratios critical — they define the most efficient path to reach the size where desired behavior might appear.
Most product teams encounter emergence indirectly. They notice a newer, larger model handles a task their previous model couldn’t — not incrementally better, but qualitatively different. Understanding the emergence debate helps explain why upgrading model generations sometimes feels like unlocking an entirely new tool rather than getting a modest improvement.
Pro Tip: Before committing budget to reach a specific model scale, check whether your target capability was measured with a pass/fail metric in the original research. If so, a smaller model with better prompting or fine-tuning might get you there at a fraction of the cost.
When to Use / When Not
| Scenario | Use | Avoid |
|---|---|---|
| Planning training compute for a new foundation model | ✅ | |
| Deciding whether to fine-tune a smaller model vs. scaling up | ✅ | |
| Explaining why a model upgrade unlocked a new capability | ✅ | |
| Setting expectations for what a training run will produce | ✅ | |
| Predicting the exact model size where a specific skill will appear | | ❌ |
| Assuming all capability jumps require bigger models | | ❌ |
Common Misconception
Myth: Emergent abilities prove that large language models develop entirely new skills at specific size thresholds, similar to a phase transition in physics.
Reality: According to Schaeffer et al., many claimed emergent abilities disappear when measured with graded scoring instead of all-or-nothing metrics. The “sudden jump” often reflects the evaluation method, not the model’s internal capabilities. Some abilities do improve faster at scale, but the clean phase-transition story oversimplifies what’s actually happening.
One Sentence to Remember
Scaling laws predict loss curves well, but predicting when specific capabilities appear — and whether that appearance is real or a measurement artifact — remains an open question that should inform how you allocate training compute.
FAQ
Q: Are emergent abilities real or just a measurement artifact?
A: The honest answer is “both, depending on the case.” Some capabilities do improve faster at scale, but many apparent sudden jumps disappear when measured with smoother, graded metrics instead of binary pass/fail scoring.
Q: How do emergent abilities relate to scaling laws?
A: Scaling laws predict smooth improvements in training loss. Emergent abilities describe cases where task performance appears to jump unexpectedly, creating tension between predictable loss curves and unpredictable capability thresholds.
Q: Can you predict which abilities will emerge at what scale?
A: Not reliably. Researchers have documented patterns after the fact, but no current framework predicts in advance which specific capabilities will appear at a given model size, making large training investments inherently uncertain.
Sources
- Wei et al.: Emergent Abilities of Large Language Models - Foundational 2022 paper cataloging tasks where capabilities appeared only at specific model scales
- Schaeffer et al.: Are Emergent Abilities of Large Language Models a Mirage? - NeurIPS 2023 paper challenging emergence claims by showing metric choice drives apparent discontinuities
Expert Takes
Emergence became the field’s most attractive narrative: train big enough and qualitatively new behaviors appear for free. But the Schaeffer critique exposed a measurement flaw so clean it should make everyone pause. When smooth metrics replace discontinuous ones, most jumps flatten into gradual slopes. The real question isn’t whether models improve at scale — they do. It’s whether the improvement is ever truly discontinuous, or just poorly measured.
For teams deciding how to allocate a training budget, emergence introduces genuine planning risk. If the capability you need sits behind a threshold you can’t predict, your only option is to overshoot and hope. A better approach: define your target task, verify which metric the original emergence claim used, and test whether prompting strategies or fine-tuning on a smaller model get you close enough. That diagnostic step alone can save months of wasted compute.
Emergence is the pitch that sold massive-scale training runs. The story was simple — scale up and new abilities just appear. Now that story has cracks. Teams deciding where to invest training compute need to separate the hype from what the data actually shows. The organizations that win will be the ones asking hard questions about measurement before they write the check, not the ones chasing phase transitions on faith.
The emergence debate carries a question most teams skip: what happens when we build systems around capabilities we can’t predict or explain? If a model gains an ability at some unknown threshold, it might also gain behaviors nobody tested for. The rush to reach emergence thresholds rarely includes equal investment in understanding what else changed at that scale. We optimize for capability while remaining largely blind to unintended side effects of the same scaling process.