
From Loss Functions to Reward Hacking: Prerequisites and Technical Limits of Reward Models
Reward models compress human preference into a scalar signal. Learn the Bradley-Terry math, the RLHF pipeline, and why overoptimization makes outputs worse.
Each topic below is a key concept in this domain. Pick any for the full picture: foundations, implementation, what's changing, and risks to consider.
Fine-tuning takes a pre-trained large language model and trains it further on a smaller, task-specific dataset so it …
Pre-training is the foundational phase where a large language model learns language patterns from massive text corpora …
A reward model is a neural network trained on human preference comparisons to score language model outputs by quality. …
Reinforcement Learning from Human Feedback (RLHF) is an alignment technique that fine-tunes large language models using …
Scaling laws are empirical relationships that predict how large language model performance changes as you increase model …
MONA's articles build your mental model — how things work, why they work that way, and what intuition to develop.
Updated Mar 26, 2026
Concepts covered

Reward models turn human preferences into scores that guide LLM alignment. Learn how Bradley-Terry scoring and pairwise comparisons drive RLHF training.
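
As a concrete taste of the Bradley-Terry setup, here is a minimal sketch in plain NumPy: the probability that the preferred response wins is a sigmoid of the score difference, and training minimizes the negative log-likelihood of the human's choice. The function name and toy scores are illustrative, not taken from any particular codebase.

```python
import numpy as np

def bradley_terry_nll(r_chosen: np.ndarray, r_rejected: np.ndarray) -> float:
    """Negative log-likelihood of preferences under the Bradley-Terry model.

    P(chosen beats rejected) = sigmoid(r_chosen - r_rejected), so the loss
    per comparison is -log sigmoid(r_chosen - r_rejected).
    """
    margin = r_chosen - r_rejected
    # -log sigmoid(x) computed stably as log(1 + exp(-x))
    return float(np.mean(np.logaddexp(0.0, -margin)))

# Toy example: three preference pairs scored by a reward model.
r_chosen = np.array([1.2, 0.3, 2.0])     # scores of preferred responses
r_rejected = np.array([0.4, 0.5, -1.0])  # scores of rejected responses
print(bradley_terry_nll(r_chosen, r_rejected))  # shrinks as margins grow
```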

Scaling laws predict how AI models improve with compute, but power-law exponents guarantee diminishing returns. Learn where the ceilings are — and why.

Scaling laws predict LLM performance from model size, data, and compute via power-law curves. Learn the math behind Kaplan, Chinchilla, and Densing Law.
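
The parametric form behind those curves is a sum of power laws in parameter count N and training tokens D. Sketched below is the Chinchilla fit from Hoffmann et al. (2022); the constants are the commonly cited published values and should be treated as approximate:

```latex
% Chinchilla parametric loss fit (Hoffmann et al., 2022)
\[
  L(N, D) \;=\; E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}},
  \qquad
  E \approx 1.69,\; A \approx 406.4,\; B \approx 410.7,\;
  \alpha \approx 0.34,\; \beta \approx 0.28
\]
```

Exponents well below 1 are where the diminishing returns come from: scaling parameters tenfold shrinks the A/N^α term by only about 10^0.34 ≈ 2.2x.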

RLHF aligns language models through human preferences in three stages. Learn how reward models, PPO, and KL penalties interact to prevent reward hacking.
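
A minimal sketch of the KL-penalized reward used in the PPO stage, assuming per-token log-probabilities from the policy and the frozen reference model are already computed; the variable names and the beta value are illustrative:

```python
import numpy as np

def shaped_reward(rm_score: float,
                  policy_logprobs: np.ndarray,
                  ref_logprobs: np.ndarray,
                  beta: float = 0.1) -> float:
    """Reward-model score minus a KL penalty against the reference policy.

    The per-token KL estimate log pi(y_t) - log pi_ref(y_t) is summed over
    the response; beta controls how hard the policy is pulled back toward
    the reference model, which is what limits reward hacking.
    """
    kl_estimate = float(np.sum(policy_logprobs - ref_logprobs))
    return rm_score - beta * kl_estimate

# Toy example: a response the reward model likes, but whose tokens have
# drifted noticeably from the reference model's distribution.
policy_lp = np.array([-0.2, -0.1, -0.3])
ref_lp = np.array([-1.0, -0.9, -1.1])
print(shaped_reward(rm_score=2.5, policy_logprobs=policy_lp, ref_logprobs=ref_lp))
```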

Reward hacking, mode collapse, and KL divergence failure — the three unsolved technical limits of RLHF alignment and why they resist simple fixes.

RLHF uses human preferences and reward models to train language models to follow instructions. Learn the three-stage PPO pipeline, why it works, and what replaced it.
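
For reference, the PPO stage of that pipeline maximizes the clipped surrogate objective of Schulman et al. (2017), where r_t(θ) is the probability ratio between the current and previous policy and Â_t the advantage estimate:

```latex
\[
  L^{\mathrm{CLIP}}(\theta)
  = \mathbb{E}_t\!\left[
      \min\!\Big(
        r_t(\theta)\,\hat{A}_t,\;
        \operatorname{clip}\!\big(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\,\hat{A}_t
      \Big)
    \right],
  \qquad
  r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}
\]
```

The clip keeps each update close to the previous policy; the KL penalty against the original reference model is applied separately, inside the reward signal itself.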

Pre-training pipelines run from data curation to checkpointing. Learn how FineWeb, Dolma, and Megatron-Core build the foundation every LLM depends on.

Pre-training compute grows 4-5x yearly while data runs out. Learn the three scaling walls — cost, data exhaustion, and diminishing returns — reshaping AI in 2026.

Pre-training teaches LLMs to predict text, not understand it — yet prediction at scale produces something that resembles comprehension. Here's the mechanism.
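
To make "predict text" concrete, here is a minimal sketch of the pre-training objective, cross-entropy over next tokens; the logits array is a random stand-in for model outputs, not a real model:

```python
import numpy as np

def next_token_loss(logits: np.ndarray, targets: np.ndarray) -> float:
    """Average cross-entropy of predicting token t+1 from tokens <= t.

    logits: (seq_len, vocab_size) unnormalized scores from the model
    targets: (seq_len,) integer ids of the actual next tokens
    """
    # log-softmax, computed stably
    logits = logits - logits.max(axis=-1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    return float(-log_probs[np.arange(len(targets)), targets].mean())

# Toy example: 3 positions, vocabulary of 5 tokens.
rng = np.random.default_rng(0)
logits = rng.normal(size=(3, 5))
targets = np.array([2, 0, 4])
print(next_token_loss(logits, targets))  # pre-training minimizes this at scale
```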

Fine-tuning can destroy what your LLM already knows. Learn why catastrophic forgetting and overfitting define the hard technical limits of model adaptation.

LoRA, QLoRA, and full fine-tuning each change different parts of an LLM. Learn which method fits your GPU budget, data size, and quality requirements.
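
A minimal sketch of the LoRA idea, assuming a frozen weight matrix W and a rank-r adapter in the usual W + (α/r)·A·B parameterization; the dimensions and names here are illustrative and not tied to any specific library:

```python
import numpy as np

def lora_forward(x, W, A, B, alpha: float = 16.0):
    """Forward pass with a LoRA adapter: y = x @ (W + (alpha/r) * A @ B).

    W is frozen; only A (d_in x r) and B (r x d_out) are trained, so the
    trainable parameter count scales with r instead of d_in * d_out.
    """
    r = A.shape[1]
    return x @ W + (alpha / r) * (x @ A @ B)

d_in, d_out, r = 64, 64, 8
rng = np.random.default_rng(0)
W = rng.normal(size=(d_in, d_out))      # frozen pre-trained weight
A = rng.normal(size=(d_in, r)) * 0.01   # small random init, trainable
B = np.zeros((r, d_out))                # zero init, so training starts at W
x = rng.normal(size=(2, d_in))
print(lora_forward(x, W, A, B).shape)   # (2, 64); equals x @ W at init
```

QLoRA keeps the same adapter math but stores W in 4-bit precision; full fine-tuning instead updates every entry of W directly.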

Fine-tuning adapts pre-trained LLMs by updating weights on task-specific data. Learn how gradient descent reshapes model behavior without starting from scratch.
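
In symbols, that reshaping is ordinary gradient descent on the task loss, with learning rate η and task dataset D_task:

```latex
\[
  \theta_{t+1} = \theta_t - \eta \,\nabla_{\theta}\, L\big(\theta_t;\, \mathcal{D}_{\mathrm{task}}\big)
\]
```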