GLM-5, FineWeb2, and the 28-Trillion-Token Race: Pre-Training Breakthroughs Reshaping AI in 2026

TL;DR
- The shift: Pre-training token counts have tripled in eighteen months, but the real competition has moved to data quality, synthetic augmentation, and post-training investment.
- Why it matters: Labs pouring compute into raw scale are hitting a data ceiling that open datasets like FineWeb2 and synthetic pipelines were built to solve.
- What’s next: Pre-training becomes a commodity input. Post-training becomes the moat.
A year ago, a 15-trillion-token pre-training run was frontier-grade. Today, three separate labs have blown past that number, and the one that grabbed headlines did it without a single NVIDIA chip.
The game did not just get bigger. It changed shape.
The Token Arms Race Already Has a Ceiling
Thesis: The pre-training scale war peaked — what follows is a data quality fight, not a data quantity fight.
Zhipu AI’s GLM-5 launched on February 11, 2026. The specs: 744 billion parameters in a mixture-of-experts architecture, 40 billion active per token, trained on 28.5 trillion tokens (HuggingFace Blog). Up from 23 trillion in GLM-4.5.
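To put those specs in proportion, here is a back-of-envelope sketch using only the figures quoted above. The compute line relies on the common "6 × parameters × tokens" rule of thumb applied to active parameters; that is an assumption of this sketch, not a number Zhipu AI has published.

```python
# Figures quoted in the article for GLM-5.
TOTAL_PARAMS = 744e9    # total parameters in the mixture-of-experts model
ACTIVE_PARAMS = 40e9    # parameters active per token
TRAIN_TOKENS = 28.5e12  # pre-training tokens

# Share of the network that does work on any single token.
active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS

# Training tokens seen per active parameter.
tokens_per_active_param = TRAIN_TOKENS / ACTIVE_PARAMS

# ASSUMPTION: rule-of-thumb estimate, FLOPs ~ 6 * active params * tokens.
approx_train_flops = 6 * ACTIVE_PARAMS * TRAIN_TOKENS

print(f"active share of parameters: {active_fraction:.1%}")            # 5.4%
print(f"tokens per active parameter: {tokens_per_active_param:.1f}")   # 712.5
print(f"approximate training FLOPs: {approx_train_flops:.2e}")         # 6.84e+24
```

Only about one parameter in nineteen fires per token, which is why a 744-billion-parameter model can be trained on a budget closer to that of a 40-billion-parameter dense model.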
GLM-5 is not the biggest run. Not close.
Qwen3 hit 36 trillion tokens across 119 languages back in April 2025 (Qwen Blog). Meta’s Llama 4 Scout claimed roughly 40 trillion multimodal tokens around the same period — though multimodal counts mix text, image, and video data, making direct comparison with text-only runs imprecise (Meta AI Blog).
The numbers keep climbing. The ceiling is already visible.
Epoch AI estimates the total stock of available human-generated text at roughly 300 trillion tokens — but with a confidence interval spanning 100 trillion to 1,000 trillion tokens (Epoch AI). Even the generous end has a finish line.
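A rough way to see how close the ceiling is: divide each run's token count by the low, central, and high ends of the Epoch AI estimate. This is a scale comparison under stated assumptions, not a measure of unique text consumed, since training tokens can include repeated passes, synthetic data, and (for Llama 4 Scout) non-text modalities.

```python
# Epoch AI stock estimate quoted above, in tokens.
STOCK_ESTIMATES = {"low": 100e12, "central": 300e12, "high": 1000e12}

# Reported pre-training token counts quoted in this article.
RUNS = {
    "GLM-5": 28.5e12,
    "Qwen3": 36e12,
    "Llama 4 Scout (multimodal)": 40e12,
}

def share_of_stock(run_tokens: float, stock_tokens: float) -> float:
    """Return the run size as a fraction of the estimated text stock."""
    return run_tokens / stock_tokens

for name, tokens in RUNS.items():
    shares = ", ".join(
        f"{label}: {share_of_stock(tokens, stock):.1%}"
        for label, stock in STOCK_ESTIMATES.items()
    )
    print(f"{name} -> {shares}")
```

Against the central estimate, a single GLM-5-sized run already equals roughly a tenth of the stock; against the low end, it is more than a quarter.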
That constraint makes FineWeb2 strategic. HuggingFace’s open dataset spans approximately 3 trillion words across over 1,000 languages and 5 billion documents, drawn from 96 CommonCrawl snapshots (HuggingFace). It is not a training run. It is infrastructure: the data deduplication and curation engineering that will separate serious labs from ones that just rent more GPUs.
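For readers who have not built one, the simplest form of deduplication looks like the sketch below. It is an illustrative exact-match pass, not FineWeb2's actual pipeline; the `normalize` and `dedupe` helpers are hypothetical names, and production systems typically add fuzzy matching to catch near-duplicates as well.

```python
import hashlib
import re

def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants hash identically."""
    return re.sub(r"\s+", " ", text).strip().lower()

def dedupe(docs):
    """Keep the first occurrence of each normalized document, in order."""
    seen = set()
    unique = []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique

crawl = ["Hello   world", "hello world", "A different page"]
print(dedupe(crawl))  # ['Hello   world', 'A different page']
```

Hashing a normalized copy while returning the original text keeps the corpus untouched; only the comparison is fuzzy about case and spacing.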
From Token Volume to Training Architecture
Pre-training has evolved far beyond the masked language modeling objectives that defined earlier architectures. Today’s frontier runs are autoregressive, multimodal, and multilingual. The frameworks changed too.
The traditional stack (Megatron-LM and DeepSpeed on NVIDIA clusters) is no longer the only path to frontier performance. GLM-5 ran on 100,000 Huawei Ascend 910B chips using the MindSpore framework. Zero NVIDIA dependency. It scored 77.8% on SWE-bench Verified and 86.0% on GPQA-Diamond; both are self-reported numbers, pending independent verification (HuggingFace Blog).
Those benchmarks matter less than the hardware signal. A frontier model trained entirely off NVIDIA silicon is a supply-chain fact the industry cannot ignore.
Qwen3 leaned into synthetic augmentation — using Qwen2.5-Math and Qwen2.5-Coder to generate textbooks, QA pairs, and code that fed back into pre-training (Qwen Blog). Research presented at EMNLP 2025 supports this direction: the optimal blend appears to be roughly 30% rephrased synthetic data combined with 70% natural web text (EMNLP 2025).
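What a 30/70 blend means in practice can be sketched in a few lines. This is a minimal illustration, assuming the ratio is applied per document (the research may well measure it in tokens); the `blend` function and the placeholder document names are hypothetical.

```python
import random

def blend(natural, synthetic, total, synthetic_share=0.30, seed=0):
    """Sample a shuffled mix of `total` documents with the given synthetic share."""
    n_synthetic = round(total * synthetic_share)
    n_natural = total - n_synthetic
    if n_synthetic > len(synthetic) or n_natural > len(natural):
        raise ValueError("not enough documents to build the requested mix")
    rng = random.Random(seed)  # fixed seed keeps the mix reproducible
    mix = rng.sample(natural, n_natural) + rng.sample(synthetic, n_synthetic)
    rng.shuffle(mix)
    return mix

# Placeholder corpora standing in for real web text and rephrased synthetic text.
natural_docs = [f"web-{i}" for i in range(1000)]
synthetic_docs = [f"syn-{i}" for i in range(1000)]

mix = blend(natural_docs, synthetic_docs, total=200)
synthetic_count = sum(doc.startswith("syn-") for doc in mix)
print(synthetic_count / len(mix))  # 0.3
```

Sampling without replacement and raising an error on a shortfall are deliberate: silently repeating synthetic documents to hit the ratio would skew the blend the ratio is meant to control.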
And post-training investment has exploded. The fine-tuning and RLHF stages that used to be afterthoughts now command serious budgets of their own. Llama 3.1’s post-training costs exceeded $50 million, up from under $1 million for the original Llama (Interconnects).
The pattern: labs are shifting resources from feeding models more data to making existing data work harder.
Who Moves Up
Open-weight labs own the momentum. GLM-5 ships under MIT license at $1.00 per million input tokens and $3.20 per million output tokens through Z.ai (HuggingFace Blog). Qwen3 is Apache 2.0. The accessibility gap between frontier and open models is collapsing — and open-weight pricing is now the benchmark closed providers must answer.
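To see what those list prices mean for a budget, the sketch below prices a hypothetical monthly workload at the quoted Z.ai rates. The token volumes are invented for illustration, and the calculation ignores any discounts, caching, or tiering a provider might offer.

```python
# Z.ai list prices for GLM-5 quoted above, in US dollars per million tokens.
INPUT_PRICE_PER_M = 1.00
OUTPUT_PRICE_PER_M = 3.20

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the cost in dollars for the given input and output token counts."""
    return (
        input_tokens * INPUT_PRICE_PER_M + output_tokens * OUTPUT_PRICE_PER_M
    ) / 1_000_000

# HYPOTHETICAL workload: 2 billion input tokens and 500 million output tokens.
monthly = request_cost(2_000_000_000, 500_000_000)
print(f"${monthly:,.2f}")  # $3,600.00
```

Swapping in another provider's per-million rates makes the multiple explicit for the same workload.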
Hardware diversifiers gain leverage. GLM-5 on Huawei Ascend proves that frontier training works without NVIDIA silicon. Any lab still locked into a single chip vendor just received a pricing negotiation tool it did not have six months ago.
Data curation teams, the engineers building scaling-law-aware pipelines that optimize for token quality rather than token count, are the new competitive edge. FineWeb2 is the proof: better filtering beats bigger crawls.
Who Gets Left Behind
Scale-only strategies are hitting diminishing returns. If your pre-training plan is “more tokens, same pipeline,” you are spending more to learn less. The labs that matter already moved on.
NVIDIA-only infrastructure carries concentration risk that is now optional, not mandatory. GLM-5 is the proof point. You are either diversifying your training stack or paying a premium for the privilege of dependency.
Closed-model pricing faces pressure it did not face a year ago. When an MIT-licensed frontier model costs $1.00 per million input tokens, the margin conversation changes for every API provider charging multiples of that.
What Happens Next
Base case (most likely): Pre-training token counts plateau as labs hit data ceiling constraints. Competition shifts decisively to post-training depth, synthetic data generation, and inference efficiency. Open-weight models close the remaining capability gap with closed ones. Signal to watch: A major lab publicly deprioritizes pre-training scale in favor of post-training investment. Timeline: Mid-2026 to early 2027.
Bull case: Synthetic data pipelines crack the data ceiling. Pre-training runs exceed 100 trillion effective tokens using high-quality synthetic-natural blends, unlocking another capability jump. Signal: Benchmark gains from synthetic-augmented runs that consistently outperform natural-data-only baselines. Timeline: Late 2026.
Bear case: Synthetic data quality degrades at scale — subtle distribution drift, mode collapse, factual contamination. Models grow without getting smarter. The industry burns compute on diminishing returns. Signal: Flagship models scoring lower on reasoning benchmarks despite higher token counts. Timeline: 2027.
Frequently Asked Questions
Q: What are the largest and most notable LLM pre-training runs completed in 2025 and 2026?
A: Llama 4 Scout leads with roughly 40 trillion multimodal tokens, followed by Qwen3 at 36 trillion text tokens across 119 languages, and GLM-5 at 28.5 trillion text tokens on non-NVIDIA hardware. Llama 4’s count includes image and video data, so direct comparison with text-only runs is imprecise.
Q: How is the industry shift from pre-training to post-training changing AI development in 2026?
A: Post-training budgets have grown from under $1 million for the original Llama to over $50 million for Llama 3.1. Labs now invest more engineering hours in fine-tuning, RLHF, and evaluation than in raw pre-training scale, making post-training the new competitive surface.
Q: Will synthetic data replace human-written text in future pre-training datasets?
A: Not entirely. Current research points to an optimal blend of roughly 30% synthetic content and 70% natural web text. Synthetic data supplements the shrinking supply of human-written text but introduces quality risks at scale that remain under active study.
The Bottom Line
The 28-trillion-token headline belongs to GLM-5 — but the structural story is that raw pre-training scale stopped being the edge. Data curation, synthetic augmentation, hardware diversification, and post-training depth are the surfaces where the next round of competition plays out. You are either building those capabilities or watching others build them.
AI-assisted content, human-reviewed. Images AI-generated.