LLM Foundations

Core mechanics of large language models — training, inference, tokenization, and the mathematics of next-token prediction.

Parallel attention connections replacing sequential recurrence in transformer neural network architecture
MONA explainer 10 min

What Is Transformer Architecture and How Self-Attention Replaced Recurrence

Transformers replaced sequential recurrence with parallel self-attention. Understand QKV computation, multi-head …
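
A quick preview of the QKV computation the explainer covers: a minimal single-head self-attention in numpy, with toy sizes and random weights (illustrative sketch only, not code from the article).

```python
# Minimal single-head self-attention: every token attends to every other
# token in one matrix product, which is what lets transformers drop
# step-by-step recurrence.
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """X: (seq_len, d_model); Wq/Wk/Wv: (d_model, d_head)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv                 # project tokens to queries, keys, values
    scores = Q @ K.T / np.sqrt(K.shape[-1])          # (seq_len, seq_len) relevance scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over each row
    return weights @ V                               # weighted mix of value vectors

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))                          # 5 tokens, toy model width 8
Wq, Wk, Wv = (rng.normal(size=(8, 4)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)           # (5, 4)
```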

Diagram of raw text splitting into subword tokens through three parallel algorithmic pathways
MONA explainer 11 min

What Is Tokenizer Architecture and How BPE, WordPiece, and Unigram Encode Text for LLMs

Tokenizer architecture determines how LLMs read text. Learn how BPE, WordPiece, and Unigram split text into subword …
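
The merge-based idea behind BPE fits in a short sketch: start from characters and repeatedly fuse the most frequent adjacent pair. This is a toy illustration over a made-up four-word corpus, not a production tokenizer.

```python
# Toy BPE training loop: count adjacent symbol pairs, merge the most
# frequent one, repeat. Real tokenizers add byte fallback, pre-tokenization
# rules, and much larger corpora.
from collections import Counter

corpus = ["lower", "lowest", "newer", "wider"]
words = [list(w) + ["</w>"] for w in corpus]         # characters plus an end-of-word marker

def most_frequent_pair(words):
    pairs = Counter()
    for w in words:
        for a, b in zip(w, w[1:]):
            pairs[(a, b)] += 1
    return pairs.most_common(1)[0][0] if pairs else None

merges = []
for _ in range(10):                                  # learn up to 10 merges
    pair = most_frequent_pair(words)
    if pair is None:
        break
    merges.append(pair)
    merged = pair[0] + pair[1]
    for w in words:                                  # apply the new merge everywhere
        i = 0
        while i < len(w) - 1:
            if (w[i], w[i + 1]) == pair:
                w[i:i + 2] = [merged]
            else:
                i += 1

print(merges)                                        # merge rules, applied in order on new text
```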

Neural network projecting words into a geometric vector space with visible distance relationships between meaning clusters
MONA explainer 9 min

What Is an Embedding and How Neural Networks Encode Meaning into Vectors

Embeddings turn words into vector coordinates where distance reflects similarity in meaning. Learn the geometry, training mechanics, and …
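
A toy illustration of that geometry, using hand-picked three-dimensional vectors (real embeddings are learned and have hundreds or thousands of dimensions): related words end up with high cosine similarity.

```python
# With a toy embedding table, "nearby" vectors stand for related words.
import numpy as np

emb = {
    "cat": np.array([0.9, 0.8, 0.1]),
    "dog": np.array([0.8, 0.9, 0.2]),
    "car": np.array([0.1, 0.2, 0.9]),
}

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine(emb["cat"], emb["dog"]))   # high: related meanings
print(cosine(emb["cat"], emb["car"]))   # low: unrelated meanings
```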

Abstract geometric visualization of attention weight matrices connecting token sequences through parallel pathways
MONA explainer 10 min

Self-Attention vs. Cross-Attention vs. Causal Masking: Attention Variants and Their Limits

Self-attention, cross-attention, and causal masking solve different problems inside transformers. Learn the math, …
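
Causal masking in particular is a one-line trick: set every score above the diagonal to negative infinity before the softmax, so no token can attend to positions after it. A minimal numpy sketch (illustrative only):

```python
# Causal mask: scores above the diagonal become -inf, so after softmax each
# row of attention weights covers only the current and earlier positions.
import numpy as np

seq_len = 4
scores = np.random.default_rng(0).normal(size=(seq_len, seq_len))
mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)   # True above the diagonal
scores[mask] = -np.inf

weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
print(np.round(weights, 2))   # lower-triangular: row i attends only to positions <= i
```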

Sequential chains breaking apart into parallel attention grids with quadratic scaling curves rising behind them
MONA explainer 10 min

Prerequisites for Understanding Transformers: From RNNs to Quadratic Scaling Limits

Understand why RNNs failed, how transformer self-attention buys parallelism at quadratic cost, and what these …
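
The quadratic cost is easy to feel with back-of-envelope arithmetic: the attention matrix holds one score per (query, key) pair, so doubling the context quadruples the work. A tiny illustration:

```python
# One attention score per (query, key) pair means n * n scores per head per layer.
for n in (1_024, 8_192, 131_072):
    print(f"context {n:>7,} tokens -> {n * n:>17,} attention scores per head per layer")
```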

Geometric visualization of multi-head attention connecting tokens across transformer layers with positional encoding waves
MONA explainer 9 min

Multi-Head Attention, Positional Encoding, and the Encoder-Decoder Structure Explained

Multi-head attention, positional encoding, and encoder-decoder structure: the three mechanisms of the original transformer design, …
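
As a taste of one of the three, here is the sinusoidal positional encoding from the original "Attention Is All You Need" paper, written out with toy sizes (illustrative sketch):

```python
# Sinusoidal positional encoding: sine on even dimensions, cosine on odd
# dimensions, with frequencies that fall as the dimension index grows.
import numpy as np

def positional_encoding(seq_len, d_model):
    pos = np.arange(seq_len)[:, None]                  # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]               # (1, d_model/2)
    angles = pos / np.power(10000, 2 * i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                       # even dims: sine
    pe[:, 1::2] = np.cos(angles)                       # odd dims: cosine
    return pe

print(positional_encoding(6, 8).shape)                 # (6, 8); added to the token embeddings
```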

Fractured subword fragments orbiting a merge tree with gaps revealing non-Latin script disparity
MONA explainer 10 min

Glitch Tokens, Fertility Gaps, and the Unsolved Technical Limits of Subword Tokenization

BPE tokenizers produce glitch tokens and penalize non-Latin scripts with fertility gaps. Learn where the math breaks — …
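
Fertility itself is a simple ratio: the average number of subword tokens produced per word. A sketch of how you might measure it, where `tokenize` is a stand-in for any trained tokenizer (hypothetical placeholder, not a specific library call):

```python
# Fertility = tokens emitted per whitespace-separated word. A value near 1
# means most words stay whole; scripts underrepresented in the tokenizer's
# training data typically score much higher, which is the "fertility gap".
def fertility(sentences, tokenize):
    """tokenize(text) -> list of subword tokens (any real tokenizer works here)."""
    n_tokens = sum(len(tokenize(s)) for s in sentences)
    n_words = sum(len(s.split()) for s in sentences)
    return n_tokens / n_words
```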

Abstract visualization of vectors in high-dimensional space with measurement rulers overlaid on a geometric grid
MONA explainer 9 min

Dense vs. Sparse, Cosine vs. Dot Product, and the Technical Limits of Vector Representations

Dense vs. sparse embeddings encode meaning differently. Learn how cosine similarity, dot product, and Euclidean distance …
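
A toy comparison of the three metrics on two vectors that point the same way but differ in length shows why the choice matters (illustrative values only):

```python
# Cosine similarity ignores magnitude, dot product rewards it, and Euclidean
# distance measures absolute offset, so the "right" metric depends on how the
# embeddings were trained and normalized.
import numpy as np

u = np.array([1.0, 2.0, 3.0])
v = 2 * u                                                 # same direction, twice the length

print(u @ v)                                              # dot product: 28.0
print(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))    # cosine: 1.0 (identical direction)
print(np.linalg.norm(u - v))                              # Euclidean: ~3.74 (not zero)
```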

Abstract geometric visualization of query key and value vectors converging through a scaled dot-product attention matrix
MONA explainer 10 min

Attention Mechanism Explained: How Queries, Keys, and Values Power Modern AI

Attention mechanisms let neural networks weigh input relevance dynamically. Learn how queries, keys, and values compute …
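
The computation the teaser refers to is usually written as scaled dot-product attention, in the form given in "Attention Is All You Need": queries are compared to keys, the scores are scaled and normalized, and the result weights the values.

```latex
\[
\operatorname{Attention}(Q, K, V) =
  \operatorname{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right) V
\]
```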

Geometric visualization of attention matrices expanding quadratically as sequence length grows
MONA explainer 10 min

Why Transformers Hit a Wall: Quadratic Scaling and the Memory Bottleneck

Transformer self-attention scales quadratically with sequence length. Understand the O(n²) memory wall, KV cache costs, …
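
Back-of-envelope KV-cache sizing makes the memory wall concrete. The model dimensions below are illustrative assumptions (roughly a 7B-class decoder: 32 layers, 4096 hidden width, 16-bit values), not figures from the article:

```python
# The cache stores one key and one value vector per token per layer, so it
# grows linearly with context length even though attention compute is quadratic.
layers, d_model, bytes_per_value = 32, 4096, 2

def kv_cache_bytes(seq_len):
    return 2 * layers * d_model * bytes_per_value * seq_len   # 2 = keys + values

for n in (4_096, 32_768, 131_072):
    print(f"{n:>7,} tokens -> {kv_cache_bytes(n) / 2**30:.1f} GiB per sequence")
```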

Geometric matrix grid expanding quadratically with heat-map intensity fading at the edges to visualize attention cost scaling
MONA explainer 9 min

Why Standard Attention Breaks at Long Contexts: The O(n²) Bottleneck and Attention Sinks

Standard attention scales quadratically with sequence length. Learn why the O(n²) cost becomes prohibitive at long contexts, what attention …

Geometric attention matrix with query-key vectors converging across a sequence of tokens
MONA explainer 10 min

What Is the Transformer Architecture and How Self-Attention Really Works

The transformer architecture powers every major LLM. Learn how self-attention computes token relationships, why …

Abstract geometric visualization of weighted token connections flowing through a neural attention grid
MONA explainer 9 min

What Is the Attention Mechanism: Scaled Dot-Product, Self-Attention, and Cross-Attention Explained

Understand how the attention mechanism works inside transformers. Covers scaled dot-product attention, self-attention vs …

Geometric visualization of vector spaces and matrix operations underlying transformer attention mechanisms
MONA explainer 10 min

Prerequisites for Understanding Transformers: From Embeddings to Matrix Multiplication

Master the math behind transformers: embeddings, matrix multiplication, positional encoding, and multi-head attention …
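
The first of those steps, the embedding lookup, is just matrix indexing; a toy sketch with made-up sizes:

```python
# Token ids index rows of a learned matrix, turning a sequence of integers
# into the (seq_len, d_model) array that all later matrix multiplications use.
import numpy as np

vocab_size, d_model = 50, 8
E = np.random.default_rng(0).normal(size=(vocab_size, d_model))   # embedding table

token_ids = np.array([3, 17, 42, 7])          # output of the tokenizer
X = E[token_ids]                               # (4, 8): one row per token
print(X.shape)
```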

Geometric visualization of vector spaces converging through dot product alignment into attention weight distributions
MONA explainer 9 min

From Embeddings to Attention: The Math You Need Before Studying Transformers

Master the math behind attention mechanisms — dot products, softmax, QKV matrices, and multi-head projections — before …
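
Of those pieces, softmax is the one most worth seeing in numbers: it turns raw dot-product scores into weights that sum to one. A tiny sketch with toy values (subtracting the maximum is the standard overflow guard and does not change the result, because softmax is shift-invariant):

```python
# Softmax over a vector of dot-product scores.
import numpy as np

scores = np.array([2.0, 1.0, 0.1])
weights = np.exp(scores - scores.max())
weights /= weights.sum()
print(weights, weights.sum())   # roughly [0.66 0.24 0.10], sums to 1
```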