LLM Foundations

Core mechanics of large language models — training, inference, tokenization, and the mathematics of next-token prediction.

Power-law curves on logarithmic axes showing predictable scaling patterns across neural network model sizes
MONA explainer 10 min

What Are Scaling Laws and How Power-Law Curves Predict LLM Performance

Scaling laws predict LLM performance from model size, data, and compute via power-law curves. Learn the math behind …

Geometric visualization of power-law curves approaching asymptotic ceilings on a logarithmic grid
MONA explainer 11 min

Diminishing Returns, Data Exhaustion, and the Hard Technical Limits of Neural Scaling

Scaling laws predict how AI models improve with compute, but power-law exponents guarantee diminishing returns. Learn …

Human preference rankings flowing through a reward model to reshape large language model alignment
MONA explainer 10 min

What Is RLHF and How Human Preferences Train Large Language Models to Follow Instructions

RLHF uses human preferences and reward models to train language models to follow instructions. Learn the three-stage PPO …

Abstract diverging optimization paths visualizing reward signal failure during RLHF alignment training
MONA explainer 10 min

Reward Hacking, Mode Collapse, and the Unsolved Technical Limits of RLHF Alignment

Reward hacking, mode collapse, and KL divergence failure — the three unsolved technical limits of RLHF alignment and why …

Diagram showing the three-stage RLHF training pipeline with reward signal flows and KL divergence constraint loops
MONA explainer 10 min

From Reward Modeling to KL Penalties: Every Stage of the RLHF Training Pipeline Explained

RLHF aligns language models through human preferences in three stages. Learn how reward models, PPO, and KL penalties …

Neural network absorbing streams of raw text as layered language structure crystallizes from prediction patterns
MONA explainer 9 min

What Is Pre-Training and How LLMs Learn Language from Raw Text at Scale

Pre-training teaches LLMs to predict text, not understand it — yet prediction at scale produces something that resembles …

Abstract visualization of exponential compute curves flattening against a finite data boundary
MONA explainer 10 min

Scaling Walls, Data Exhaustion, and the Technical Limits of Pre-Training in 2026

Pre-training compute grows 4-5x yearly while data runs out. Learn the three scaling walls — cost, data exhaustion, and …

Data flowing through filtering and deduplication stages into a distributed training cluster producing model checkpoints
MONA explainer 10 min

From Data Curation to Checkpoints: The Building Blocks of a Modern Pre-Training Pipeline

Pre-training pipelines run from data curation to checkpointing. Learn how FineWeb, Dolma, and Megatron-Core build the …

Neural network weight connections fracturing as new training data overwrites prior knowledge during model adaptation
MONA explainer 10 min

Catastrophic Forgetting, Overfitting, and the Hard Technical Limits of LLM Fine-Tuning

Fine-tuning can destroy what your LLM already knows. Learn why catastrophic forgetting and overfitting define the hard …

Weight matrices with gradient arrows converging toward a specialized probability distribution for task-specific outputs
MONA explainer 10 min

What Is Fine-Tuning and How Gradient Updates Adapt Pre-Trained LLMs to Specific Tasks

Fine-tuning adapts pre-trained LLMs by updating weights on task-specific data. Learn how gradient descent reshapes model …

Weight matrices with highlighted low-rank decomposition pathways showing parameter-efficient adaptation of a large language
MONA explainer 10 min

LoRA vs. QLoRA vs. Full Fine-Tuning: Methods, Trade-Offs, and What You Need to Know First

LoRA, QLoRA, and full fine-tuning each change different parts of an LLM. Learn which method fits your GPU budget, data …

Diagram of raw text splitting into subword tokens through three parallel algorithmic pathways
MONA explainer 11 min

What Is Tokenizer Architecture and How BPE, WordPiece, and Unigram Encode Text for LLMs

Tokenizer architecture determines how LLMs read text. Learn how BPE, WordPiece, and Unigram split text into subword …

Neural network projecting words into a geometric vector space with visible distance relationships between meaning clusters
MONA explainer 9 min

What Is an Embedding and How Neural Networks Encode Meaning into Vectors

Embeddings turn words into vector coordinates where distance reflects similarity in meaning. Learn the geometry, training mechanics, and …

Abstract geometric visualization of attention weight matrices connecting token sequences through parallel pathways
MONA explainer 10 min

Self-Attention vs. Cross-Attention vs. Causal Masking: Attention Variants and Their Limits

Self-attention, cross-attention, and causal masking solve different problems inside transformers. Learn the math, …

Sequential chains breaking apart into parallel attention grids with quadratic scaling curves rising behind them
MONA explainer 10 min

Prerequisites for Understanding Transformers: From RNNs to Quadratic Scaling Limits

Understand why RNNs failed, how transformer self-attention gains parallelism at quadratic cost, and what these …

Geometric visualization of multi-head attention connecting tokens across transformer layers with positional encoding waves
MONA explainer 9 min

Multi-Head Attention, Positional Encoding, and the Encoder-Decoder Structure Explained

Multi-head attention, positional encoding, and encoder-decoder structure: the three core mechanisms of the original transformer, …

Fractured subword fragments orbiting a merge tree with gaps revealing non-Latin script disparity
MONA explainer 10 min

Glitch Tokens, Fertility Gaps, and the Unsolved Technical Limits of Subword Tokenization

BPE tokenizers produce glitch tokens and penalize non-Latin scripts with fertility gaps. Learn where the math breaks — …

Abstract visualization of vectors in high-dimensional space with measurement rulers overlaid on a geometric grid
MONA explainer 9 min

Dense vs. Sparse, Cosine vs. Dot Product, and the Technical Limits of Vector Representations

Dense vs. sparse embeddings encode meaning differently. Learn how cosine similarity, dot product, and Euclidean distance …

Abstract geometric visualization of query key and value vectors converging through a scaled dot-product attention matrix
MONA explainer 10 min

Attention Mechanism Explained: How Queries, Keys, and Values Power Modern AI

Attention mechanisms let neural networks weigh input relevance dynamically. Learn how queries, keys, and values compute …

Geometric visualization of attention matrices expanding quadratically as sequence length grows
MONA explainer 10 min

Why Transformers Hit a Wall: Quadratic Scaling and the Memory Bottleneck

Transformer self-attention scales quadratically with sequence length. Understand the O(n²) memory wall, KV cache costs, …

Geometric matrix grid expanding quadratically with heat-map intensity fading at the edges to visualize attention cost scaling
MONA explainer 9 min

Why Standard Attention Breaks at Long Contexts: The O(n²) Bottleneck and Attention Sinks

Standard attention scales quadratically with sequence length. Learn why O(n²) breaks at long contexts, what attention …

Geometric attention matrix with query-key vectors converging across a sequence of tokens
MONA explainer 10 min

What Is the Transformer Architecture and How Self-Attention Really Works

The transformer architecture powers nearly every major LLM. Learn how self-attention computes token relationships, why …

Geometric visualization of vector spaces and matrix operations underlying transformer attention mechanisms
MONA explainer 10 min

Prerequisites for Understanding Transformers: From Embeddings to Matrix Multiplication

Master the math behind transformers: embeddings, matrix multiplication, positional encoding, and multi-head attention …

Geometric visualization of vector spaces converging through dot product alignment into attention weight distributions
MONA explainer 9 min

From Embeddings to Attention: The Math You Need Before Studying Transformers

Master the math behind attention mechanisms — dot products, softmax, QKV matrices, and multi-head projections — before …

Abstract geometric visualization of weighted token connections flowing through a neural attention grid
MONA explainer 9 min

Attention Mechanism: Scaled Dot-Product, Self vs Cross

Transformers use weighted averaging, not human-like focus: scaled dot-product, self-attention vs cross-attention, and …