DAN Analysis 7 min read March 26, 2026

QRM-Gemma, Skywork Reward, and the LM-as-a-Judge Pivot: The Reward Model Race in 2026

Competing reward model architectures on a benchmark leaderboard with shifting rank positions

Table of Contents

TL;DR

The shift: Data-curated small reward models are dethroning scale-dependent giants, while general-purpose LLMs now match dedicated classifiers as judges.
Why it matters: The economics of RLHF alignment just changed — smaller, cheaper models with better training data deliver equal or superior reward signal.
What’s next: Expect a convergence where LM-as-a-judge handles most alignment tasks, with dedicated classifiers surviving only for latency-critical inference.

A 1.7-billion-parameter model just outperformed a 70-billion-parameter one on reward quality. Not on a toy benchmark. Across seven evaluation suites. That single result tells you everything about where Reward Model Architecture is headed — and who isn’t keeping up.

The Scale Myth Just Collapsed

Thesis: The reward model race stopped being about parameter count and became a data curation contest.

For most of 2024, RLHF reward models competed on a simple axis: bigger is better. INF-ORM-Llama3.1-70B held the top position on the original Rewardbench with a 95.1 overall score (INF Model Card).

That looked like the endpoint. It wasn’t.

Allen AI released RewardBench 2 in mid-2025, spanning six harder categories — Factuality, Instruction Following, Math, Safety, Focus, and Ties. Top models scored roughly 20 points lower than on the original benchmark (Allen AI). The old leaderboard was saturated. RewardBench 2 exposed it.

INF-ORM-70B dropped to a 73.5 multi-benchmark average (Skywork GitHub). A model that was SOTA twelve months prior — now mid-pack.

That’s not a correction. That’s a structural reset.

26 Million Pairs Changed the Math

Skywork-Reward-V2, presented as a poster at ICLR 2026, trained eight models ranging from 0.6B to 8B parameters on 26 million curated preference pairs drawn from SynPref-40M (Skywork V2 Paper).

The top variant — Skywork-V2-Llama-3.1-8B — reached an average of 85.8 across seven benchmarks, including 96.4 on RewardBench 1 and 84.1 on RewardBench 2 (Skywork GitHub).

But the headline number is smaller. Literally.

Skywork-V2-Qwen3-1.7B surpassed INF-ORM-70B on the same multi-benchmark average — 1.7B parameters against 70B (Skywork GitHub). Data quality just broke the Scaling Laws assumption for reward modeling.

QRM-Gemma-2-27B took a different angle entirely. Built on Skywork-Reward-Gemma-2-27B-v0.2 as its backbone, it predicts 19 quantiles across five HelpSteer2 objectives instead of outputting a single scalar score (QRM Model Card). That’s not incremental. It’s a structural shift from Bradley Terry Model point estimates to distributional reward signals.

Who Moves Up

Teams investing in synthetic data curation over raw compute. Skywork’s 26-million-pair pipeline is the proof case — if you can curate preference data at scale, a sub-10B model matches or beats 70B classifiers.

General-purpose LLMs entering the reward signal market. Gemini 2.5 Flash scored 77.2 as a generative judge on RewardBench 2, the top score in that category. Claude Opus 4 followed at 76.5 (Allen AI). Compare that to Llama 3.1 70B Instruct RM at 76.1 — the top dedicated classifier. Near-parity between a foundation model and a purpose-built reward model.

Meta’s self-judging research pipeline. Self-Rewarding and Meta-Rewarding Language Models showed that a model acting as its own judge can iteratively improve alignment — Llama-3-8B’s win rate jumped from 22.9% to 39.4% on AlpacaEval 2 (Meta Research). The model-as-its-own-judge loop is no longer theoretical.

Who Gets Left Behind

Anyone betting on parameter count alone. INF-ORM-70B’s fall from the top spot to mid-pack is the case study. Scale without data quality is expensive mediocrity.

Teams still benchmarking exclusively on RewardBench 1. RB1 is saturated — top models cluster around 95-96. If your evaluation pipeline doesn’t include RB2 or equivalent multi-dimensional benchmarks, you’re measuring comfort, not capability.

Organizations treating reward models as static components in a Fine Tuning pipeline built eighteen months ago. The field moved from single-score outputs to quantile regression, from dedicated classifiers to LM-as-judge hybrids. Static reward infrastructure is already underperforming.

What Happens Next

Base case (most likely): LM-as-a-judge becomes the default reward signal for alignment work that doesn’t require sub-100ms latency. Dedicated classifiers survive for production inference where cost-per-evaluation matters. Data-centric models like Skywork V2 set the training efficiency standard. Signal to watch: A frontier lab ships a major model using an LM-as-judge pipeline instead of a dedicated RM for Pre Training reward. Timeline: Q3-Q4 2026.

Bull case: Open-source data-curated RMs fully commoditize the alignment stack. Sub-2B models match 70B classifiers on every major benchmark. Reward modeling becomes a data pipeline problem, not a modeling problem. Signal: Skywork V2’s experimental 40M-pair variant — which already averages 88.6 (Skywork GitHub) — gets widely adopted. Timeline: Early 2027.

Bear case: LM-as-a-judge inherits systematic biases — verbosity bias, self-enhancement bias, accuracy drops in specialized domains — that make it unreliable for safety-critical alignment. Teams revert to dedicated classifiers with hybrid verification loops. Signal: A major alignment failure traced to LM-as-judge reward hacking. Timeline: Mid 2026 through 2027.

Frequently Asked Questions

Q: How do QRM-Gemma-2-27B, INF-ORM-70B, and Skywork Reward rank on RewardBench 2 in 2026? A: As of mid-2025, Skywork-V2-Llama-3.1-8B leads the multi-benchmark average at 85.8, with 84.1 on RB2. INF-ORM-70B dropped to 73.5 on the same composite. QRM-Gemma-2-27B’s RB2 score has not been publicly confirmed, though it uses quantile regression rather than standard Bradley-Terry scoring.

Q: How are Google Gemini and Anthropic Claude used as LM-as-a-judge reward signals instead of dedicated models? A: Gemini 2.5 Flash and Claude Opus 4 score as generative judges on RewardBench 2, matching the top dedicated classifier. They evaluate response quality directly instead of producing trained scalar rewards, enabling alignment without purpose-built reward models.

Q: Will LM-as-a-judge replace purpose-trained reward models for RLHF by 2027? A: Near-parity exists on benchmarks, but dedicated classifiers retain advantages in latency, consistency, and cost. The likely outcome is convergence — LM-as-judge for quality-sensitive tasks, lightweight classifiers for high-throughput production. Full replacement remains speculative.

The Bottom Line

The reward model race split in two. One track belongs to data-curated classifiers that prove scale isn’t the bottleneck — curation is. The other belongs to general-purpose LLMs that already match dedicated models as alignment judges. You’re either updating your reward pipeline or running last year’s alignment stack.

Aha Moments

MONA

The Skywork-V2 result is the sharper signal here. When a model trained on curated synthetic preference pairs consistently outperforms a far larger model, the implication is that the quality of the preference distribution matters more than the capacity of the function approximator learning from it. The quantile regression approach in QRM-Gemma pushes this further — scalar reward scores collapse a multi-dimensional quality surface into a single number. Distributional rewards preserve the shape of that surface, which should improve policy optimization in regions where preferences are ambiguous or multimodal. The shift from point estimates to distributional signals may prove more consequential than any leaderboard swap.

MAX

Mona is right about the data signal, but the operational question is sharper: which reward architecture survives in a production RLHF pipeline? A generative judge running frontier-model inference for every preference comparison is expensive at training-time scale. A dedicated classifier running at a fraction of that cost still makes economic sense for high-throughput fine-tuning loops. The answer is routing. Use LM-as-judge for high-stakes evaluation passes where nuance matters. Use lightweight classifiers for bulk training signal where speed matters. Teams that architect their reward pipeline as a routing problem will outperform those who lock into a single paradigm.

ALAN

Both of you are optimizing the mechanism. Neither of you is asking who curated those preference pairs — and what values they encoded. Skywork’s dataset drew from synthetic human-AI collaboration. Meta’s self-rewarding loop has the model judging its own outputs. When the reward signal is generated by the same systems it trains, the alignment becomes circular. The model learns to satisfy a judge that was shaped by models like itself. What happens when the thing we are aligning to is a mirror — and nobody checked what the mirror reflects?

AI-assisted content, human-reviewed. Images AI-generated. Editorial Standards · Our Editors