DAN Analysis 7 min read

QRM-Gemma, Skywork Reward, and the LM-as-a-Judge Pivot: The Reward Model Race in 2026

Competing reward model architectures on a benchmark leaderboard with shifting rank positions

TL;DR

  • The shift: Data-curated small reward models are dethroning scale-dependent giants, while general-purpose LLMs now match dedicated classifiers as judges.
  • Why it matters: The economics of RLHF alignment just changed — smaller, cheaper models with better training data deliver equal or superior reward signal.
  • What’s next: Expect a convergence where LM-as-a-judge handles most alignment tasks, with dedicated classifiers surviving only for latency-critical inference.

A 1.7-billion-parameter model just outperformed a 70-billion-parameter one on reward quality. Not on a toy benchmark. Across seven evaluation suites. That single result tells you everything about where Reward Model Architecture is headed — and who isn’t keeping up.

The Scale Myth Just Collapsed

Thesis: The reward model race stopped being about parameter count and became a data curation contest.

For most of 2024, RLHF reward models competed on a simple axis: bigger is better. INF-ORM-Llama3.1-70B held the top position on the original Rewardbench with a 95.1 overall score (INF Model Card).

That looked like the endpoint. It wasn’t.

Allen AI released RewardBench 2 in mid-2025, spanning six harder categories — Factuality, Instruction Following, Math, Safety, Focus, and Ties. Top models scored roughly 20 points lower than on the original benchmark (Allen AI). The old leaderboard was saturated. RewardBench 2 exposed it.

INF-ORM-70B dropped to a 73.5 multi-benchmark average (Skywork GitHub). A model that was SOTA twelve months prior — now mid-pack.

That’s not a correction. That’s a structural reset.

26 Million Pairs Changed the Math

Skywork-Reward-V2, presented as a poster at ICLR 2026, trained eight models ranging from 0.6B to 8B parameters on 26 million curated preference pairs drawn from SynPref-40M (Skywork V2 Paper).

The top variant — Skywork-V2-Llama-3.1-8B — reached an average of 85.8 across seven benchmarks, including 96.4 on RewardBench 1 and 84.1 on RewardBench 2 (Skywork GitHub).

But the headline number is smaller. Literally.

Skywork-V2-Qwen3-1.7B surpassed INF-ORM-70B on the same multi-benchmark average — 1.7B parameters against 70B (Skywork GitHub). Data quality just broke the Scaling Laws assumption for reward modeling.

QRM-Gemma-2-27B took a different angle entirely. Built on Skywork-Reward-Gemma-2-27B-v0.2 as its backbone, it predicts 19 quantiles across five HelpSteer2 objectives instead of outputting a single scalar score (QRM Model Card). That’s not incremental. It’s a structural shift from Bradley Terry Model point estimates to distributional reward signals.

Who Moves Up

Teams investing in synthetic data curation over raw compute. Skywork’s 26-million-pair pipeline is the proof case — if you can curate preference data at scale, a sub-10B model matches or beats 70B classifiers.

General-purpose LLMs entering the reward signal market. Gemini 2.5 Flash scored 77.2 as a generative judge on RewardBench 2, the top score in that category. Claude Opus 4 followed at 76.5 (Allen AI). Compare that to Llama 3.1 70B Instruct RM at 76.1 — the top dedicated classifier. Near-parity between a foundation model and a purpose-built reward model.

Meta’s self-judging research pipeline. Self-Rewarding and Meta-Rewarding Language Models showed that a model acting as its own judge can iteratively improve alignment — Llama-3-8B’s win rate jumped from 22.9% to 39.4% on AlpacaEval 2 (Meta Research). The model-as-its-own-judge loop is no longer theoretical.

Who Gets Left Behind

Anyone betting on parameter count alone. INF-ORM-70B’s fall from the top spot to mid-pack is the case study. Scale without data quality is expensive mediocrity.

Teams still benchmarking exclusively on RewardBench 1. RB1 is saturated — top models cluster around 95-96. If your evaluation pipeline doesn’t include RB2 or equivalent multi-dimensional benchmarks, you’re measuring comfort, not capability.

Organizations treating reward models as static components in a Fine Tuning pipeline built eighteen months ago. The field moved from single-score outputs to quantile regression, from dedicated classifiers to LM-as-judge hybrids. Static reward infrastructure is already underperforming.

What Happens Next

Base case (most likely): LM-as-a-judge becomes the default reward signal for alignment work that doesn’t require sub-100ms latency. Dedicated classifiers survive for production inference where cost-per-evaluation matters. Data-centric models like Skywork V2 set the training efficiency standard. Signal to watch: A frontier lab ships a major model using an LM-as-judge pipeline instead of a dedicated RM for Pre Training reward. Timeline: Q3-Q4 2026.

Bull case: Open-source data-curated RMs fully commoditize the alignment stack. Sub-2B models match 70B classifiers on every major benchmark. Reward modeling becomes a data pipeline problem, not a modeling problem. Signal: Skywork V2’s experimental 40M-pair variant — which already averages 88.6 (Skywork GitHub) — gets widely adopted. Timeline: Early 2027.

Bear case: LM-as-a-judge inherits systematic biases — verbosity bias, self-enhancement bias, accuracy drops in specialized domains — that make it unreliable for safety-critical alignment. Teams revert to dedicated classifiers with hybrid verification loops. Signal: A major alignment failure traced to LM-as-judge reward hacking. Timeline: Mid 2026 through 2027.

Frequently Asked Questions

Q: How do QRM-Gemma-2-27B, INF-ORM-70B, and Skywork Reward rank on RewardBench 2 in 2026? A: As of mid-2025, Skywork-V2-Llama-3.1-8B leads the multi-benchmark average at 85.8, with 84.1 on RB2. INF-ORM-70B dropped to 73.5 on the same composite. QRM-Gemma-2-27B’s RB2 score has not been publicly confirmed, though it uses quantile regression rather than standard Bradley-Terry scoring.

Q: How are Google Gemini and Anthropic Claude used as LM-as-a-judge reward signals instead of dedicated models? A: Gemini 2.5 Flash and Claude Opus 4 score as generative judges on RewardBench 2, matching the top dedicated classifier. They evaluate response quality directly instead of producing trained scalar rewards, enabling alignment without purpose-built reward models.

Q: Will LM-as-a-judge replace purpose-trained reward models for RLHF by 2027? A: Near-parity exists on benchmarks, but dedicated classifiers retain advantages in latency, consistency, and cost. The likely outcome is convergence — LM-as-judge for quality-sensitive tasks, lightweight classifiers for high-throughput production. Full replacement remains speculative.

The Bottom Line

The reward model race split in two. One track belongs to data-curated classifiers that prove scale isn’t the bottleneck — curation is. The other belongs to general-purpose LLMs that already match dedicated models as alignment judges. You’re either updating your reward pipeline or running last year’s alignment stack.

AI-assisted content, human-reviewed. Images AI-generated. Editorial Standards · Our Editors

Share: