GLM-5, FineWeb2, and the 28-Trillion-Token Race: Pre-Training Breakthroughs Reshaping AI in 2026

TL;DR
- The shift: Pre-training token counts have tripled in eighteen months, but the real competition has moved to data quality, synthetic augmentation, and post-training investment.
- Why it matters: Labs pouring compute into raw scale are hitting a data ceiling that open datasets like FineWeb2 and synthetic pipelines were built to solve.
- What’s next: Pre-training becomes a commodity input. Post-training becomes the moat.
A year ago, a 15-trillion-token pre-training run was frontier-grade. Today, three separate labs have blown past that number, and the one that grabbed headlines did it without a single NVIDIA chip.
The game did not just get bigger. It changed shape.
The Token Arms Race Already Has a Ceiling
Thesis: The pre-training scale war peaked — what follows is a data quality fight, not a data quantity fight.
Zhipu AI’s GLM-5 launched on February 11, 2026. The specs: 744 billion parameters in a mixture-of-experts architecture, 40 billion active per token, trained on 28.5 trillion tokens (HuggingFace Blog). Up from 23 trillion in GLM-4.5.
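For scale, here is a back-of-the-envelope sketch of what those mixture-of-experts numbers imply. It uses the common ~6ND rule of thumb for training FLOPs; that constant, and treating the 40 billion active parameters as a dense equivalent, are simplifying assumptions rather than anything Zhipu has reported.

```python
# Back-of-the-envelope MoE math using GLM-5's reported numbers.
# The 6*N*D rule of thumb and the dense-equivalent treatment of
# active parameters are simplifying assumptions.
total_params = 744e9     # total parameters (reported)
active_params = 40e9     # parameters active per token (reported)
tokens = 28.5e12         # pre-training tokens (reported)

active_fraction = active_params / total_params
train_flops = 6 * active_params * tokens  # forward + backward, rough

print(f"Active fraction per token: {active_fraction:.1%}")       # ~5.4%
print(f"Approximate training compute: {train_flops:.1e} FLOPs")  # ~6.8e24
```

Only about 5% of the parameters fire per token, which is how a 744-billion-parameter model trains on a compute budget closer to that of a 40-billion-parameter dense model.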
GLM-5 is not the biggest run. Not close.
Qwen3 hit 36 trillion tokens across 119 languages back in April 2025 (Qwen Blog). Meta’s Llama 4 Scout claimed roughly 40 trillion multimodal tokens around the same period — though multimodal counts mix text, image, and video data, making direct comparison with text-only runs imprecise (Meta AI Blog).
The numbers keep climbing. The ceiling is already visible.
Epoch AI estimates the total stock of available human-generated text at roughly 300 trillion tokens — but with a confidence interval spanning 100 trillion to 1,000 trillion tokens (Epoch AI). Even the generous end has a finish line.
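How close is that finish line? A toy extrapolation from the article's own numbers, assuming the roughly 3x-per-18-months growth in per-run token counts simply continues and that runs draw only unique human text (both strong simplifications, not a forecast):

```python
import math

# Toy projection: years until the largest single run matches Epoch AI's
# median stock estimate, if the ~3x-per-18-months trend continued.
# Assumes runs use only unique human text -- a strong simplification.
current_run = 40e12      # largest reported 2025/26 run, in tokens
median_stock = 300e12    # Epoch AI median estimate, in tokens
growth, period_years = 3.0, 1.5

years = period_years * math.log(median_stock / current_run, growth)
print(f"Largest run reaches the median stock in ~{years:.1f} years")  # ~2.8
```

Under three years at trend, and that is against the median estimate, not the pessimistic one.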
That constraint makes FineWeb2 strategic. HuggingFace’s open dataset spans approximately 3 trillion words across over 1,000 languages and 5 billion documents, drawn from 96 CommonCrawl snapshots (HuggingFace). It is not a training run. It is infrastructure: the data deduplication and curation engineering that will separate serious labs from the ones that just rent more GPUs.
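As a concrete illustration of that curation engineering: FineWeb-style pipelines publicly document MinHash-based near-duplicate detection as a core dedup step. Here is a minimal, dependency-free sketch of the idea; the shingle size, hash count, and helper names are arbitrary choices for illustration, not FineWeb2’s actual configuration.

```python
import hashlib

def shingles(text: str, n: int = 5):
    """Yield word n-grams ("shingles"), the unit of near-duplicate comparison."""
    words = text.lower().split()
    for i in range(max(len(words) - n + 1, 1)):
        yield " ".join(words[i:i + n])

def minhash_signature(text: str, num_hashes: int = 64) -> list[int]:
    """For each hash seed, keep the minimum shingle hash seen in the document."""
    return [
        min(
            int.from_bytes(
                hashlib.blake2b(f"{seed}:{s}".encode(), digest_size=8).digest(), "big"
            )
            for s in shingles(text)
        )
        for seed in range(num_hashes)
    ]

def jaccard_estimate(sig_a: list[int], sig_b: list[int]) -> float:
    """Fraction of matching signature slots estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

doc_a = "the quick brown fox jumps over the lazy dog near the river bank"
doc_b = "the quick brown fox jumps over the lazy dog near the river bend"
print(jaccard_estimate(minhash_signature(doc_a), minhash_signature(doc_b)))  # ~0.8
```

Documents whose estimated similarity crosses a threshold get clustered and collapsed to one survivor; at web scale the signatures feed a locality-sensitive hashing index rather than pairwise comparison.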
From Token Volume to Training Architecture
Pre-training has evolved far beyond the masked language modeling objectives that defined earlier architectures. Today’s frontier runs are autoregressive, multimodal, and multilingual. The frameworks have changed too.
The traditional stack of Megatron-LM and DeepSpeed on NVIDIA clusters is no longer the only path to frontier performance. GLM-5 ran on 100,000 Huawei Ascend 910B chips using the MindSpore framework. Zero NVIDIA dependency. It scored 77.8% on SWE-bench Verified and 86.0% on GPQA-Diamond, though both numbers are self-reported and pending independent verification (HuggingFace Blog).
Those benchmarks matter less than the hardware signal. A frontier model trained entirely off NVIDIA silicon is a supply-chain fact the industry cannot ignore.
Qwen3 leaned into synthetic augmentation — using Qwen2.5-Math and Qwen2.5-Coder to generate textbooks, QA pairs, and code that fed back into pre-training (Qwen Blog). Research presented at EMNLP 2025 supports this direction: the optimal blend appears to be roughly 30% rephrased synthetic data combined with 70% natural web text (EMNLP 2025).
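A minimal sketch of what that blend looks like operationally, assuming document-level sampling; real pipelines weight at the token level and stream from sharded storage, and mix_stream below is a hypothetical helper, not code from any lab:

```python
import random

def mix_stream(natural_docs, synthetic_docs, synthetic_ratio=0.30, seed=0):
    """Interleave two corpora so roughly synthetic_ratio of output is synthetic.

    Hypothetical illustration of a ~30/70 synthetic/natural blend;
    stops when either source runs out.
    """
    rng = random.Random(seed)
    natural, synthetic = iter(natural_docs), iter(synthetic_docs)
    while True:
        source = synthetic if rng.random() < synthetic_ratio else natural
        try:
            yield next(source)
        except StopIteration:
            return

# Illustrative usage: 70 natural web documents, 30 rephrased-synthetic ones.
blend = list(mix_stream([f"web_{i}" for i in range(70)],
                        [f"syn_{i}" for i in range(30)]))
print(len(blend), sum(doc.startswith("syn") for doc in blend))
```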
And post-training investment has exploded. The fine-tuning and RLHF stages that used to be afterthoughts now consume more budget than the base training run. Llama 3.1’s post-training costs exceeded $50 million, up from under $1 million for the original Llama (Interconnects).
The pattern: labs are shifting resources from feeding models more data to making existing data work harder.
Who Moves Up
Open-weight labs own the momentum. GLM-5 ships under MIT license at $1.00 per million input tokens and $3.20 per million output tokens through Z.ai (HuggingFace Blog). Qwen3 is Apache 2.0. The accessibility gap between frontier and open models is collapsing — and open-weight pricing is now the benchmark closed providers must answer.
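At those rates the per-request arithmetic is simple; the helper below is a hypothetical illustration using the published Z.ai prices, not an official client.

```python
# Illustrative cost math at GLM-5's published Z.ai rates (USD per million tokens).
INPUT_PER_M, OUTPUT_PER_M = 1.00, 3.20

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Cost of one API call at the published input/output token rates."""
    return (input_tokens / 1e6) * INPUT_PER_M + (output_tokens / 1e6) * OUTPUT_PER_M

# A 4,000-token prompt with a 1,000-token completion:
print(f"${request_cost(4_000, 1_000):.4f}")  # $0.0072
```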
Hardware diversifiers gain leverage. GLM-5 on Huawei Ascend proves that frontier training works without NVIDIA silicon. Any lab still locked into a single chip vendor just received a pricing negotiation tool it did not have six months ago.
Data curation teams, the engineers building scaling-law-aware pipelines that optimize for token quality rather than token count, are the new competitive edge. FineWeb2 is the proof: better filtering beats bigger crawls.
Who Gets Left Behind
Scale-only strategies are hitting diminishing returns. If your pre-training plan is “more tokens, same pipeline,” you are spending more to learn less. The labs that matter already moved on.
NVIDIA-only infrastructure carries concentration risk that is now optional, not mandatory. GLM-5 is the proof point. You are either diversifying your training stack or paying a premium for the privilege of dependency.
Closed-model pricing faces pressure it did not face a year ago. When an MIT-licensed frontier model costs $1.00 per million input tokens, the margin conversation changes for every API provider charging multiples of that.
What Happens Next
Base case (most likely): Pre-training token counts plateau as labs hit data ceiling constraints. Competition shifts decisively to post-training depth, synthetic data generation, and inference efficiency. Open-weight models close the remaining capability gap with closed ones. Signal to watch: A major lab publicly deprioritizes pre-training scale in favor of post-training investment. Timeline: Mid-2026 to early 2027.
Bull case: Synthetic data pipelines crack the data ceiling. Pre-training runs exceed 100 trillion effective tokens using high-quality synthetic-natural blends, unlocking another capability jump. Signal: Benchmark gains from synthetic-augmented runs that consistently outperform natural-data-only baselines. Timeline: Late 2026.
Bear case: Synthetic data quality degrades at scale — subtle distribution drift, mode collapse, factual contamination. Models grow without getting smarter. The industry burns compute on diminishing returns. Signal: Flagship models scoring lower on reasoning benchmarks despite higher token counts. Timeline: 2027.
Frequently Asked Questions
Q: What are the largest and most notable LLM pre-training runs completed in 2025 and 2026? A: Llama 4 Scout leads with roughly 40 trillion multimodal tokens, followed by Qwen3 at 36 trillion text tokens across 119 languages, and GLM-5 at 28.5 trillion text tokens on non-NVIDIA hardware. Llama 4’s count includes image and video data, so direct comparison with text-only runs is imprecise.
Q: How is the industry shift from pre-training to post-training changing AI development in 2026? A: Post-training budgets have grown from under $1 million for the original Llama to over $50 million for Llama 3.1. Labs now invest more engineering hours in fine-tuning, RLHF, and evaluation than in raw pre-training scale — making post-training the new competitive surface.
Q: Will synthetic data replace human-written text in future pre-training datasets? A: Not entirely. Current research points to an optimal blend of roughly 30% synthetic content and 70% natural web text. Synthetic data supplements the shrinking supply of human-written text but introduces quality risks at scale that remain under active study.
The Bottom Line
The 28-trillion-token headline belongs to GLM-5 — but the structural story is that raw pre-training scale stopped being the edge. Data curation, synthetic augmentation, hardware diversification, and post-training depth are the surfaces where the next round of competition plays out. You are either building those capabilities or watching others build them.