DAN Analysis

From ChatGPT's PPO to DeepSeek's GRPO: How RLHF Alternatives Reshaped Alignment Through 2026

Diverging alignment pipelines branching away from a single reinforcement learning origin point

TL;DR

  • The shift: Classical RLHF with PPO is no longer the default alignment method — reward-model-free alternatives now dominate
  • Why it matters: DPO and GRPO cut the most expensive stage of the alignment pipeline, making high-quality alignment accessible beyond billion-dollar compute budgets
  • What’s next: The alignment stack is fragmenting by use case — offline DPO for general alignment, online GRPO for reasoning, RLAIF for safety at scale

The method that made ChatGPT possible is being retired from the front lines. Not because it failed — because something cheaper arrived. Classical PPO (Proximal Policy Optimization)-based alignment required training a separate reward model, managing KL-divergence penalties, and burning compute on a notoriously unstable training loop. By early 2026, the industry routed around all of it.
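The instability starts with that KL penalty: the policy earns a score from the reward model but is penalized for drifting from a frozen reference model. A minimal sketch of the per-token trade-off — the numbers and the `beta` coefficient are illustrative, not any lab's actual values:

```python
def kl_penalized_reward(rm_score, policy_logprob, ref_logprob, beta=0.1):
    """Per-token reward shape used in classical RLHF: reward-model score
    minus a KL penalty that keeps the policy near the reference model.
    beta is a tunable coefficient (value here is illustrative)."""
    kl_estimate = policy_logprob - ref_logprob  # simple per-token KL estimate
    return rm_score - beta * kl_estimate

# Drifting away from the reference model erodes the effective reward:
base = kl_penalized_reward(rm_score=1.0, policy_logprob=-2.0, ref_logprob=-2.0)
drifted = kl_penalized_reward(rm_score=1.0, policy_logprob=-1.0, ref_logprob=-2.0)
```

Tuning `beta` is exactly the balancing act the article describes: too low and the policy reward-hacks, too high and it never moves.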

The Reward Model Was Always the Wrong Bottleneck

Thesis: The alignment field didn’t fix PPO — it eliminated the reward model that PPO depended on.

OpenAI’s InstructGPT paper laid the blueprint in March 2022. Three stages: supervised fine-tuning, reward-model training on preference data, then PPO optimization. A 1.3B-parameter InstructGPT was preferred by human raters over the 175B GPT-3 (Ouyang et al.).

That was the proof of concept.

The problem was everything that came after it.

Reward models introduce their own failure cascade. They are expensive to train. They are prone to reward hacking. They drift as the policy model improves, demanding constant recalibration. Every lab that shipped RLHF at scale hit the same wall: the reward model was the weakest link, not the optimizer on top of it.

The solution was not a better reward model. It was no reward model at all.

Three Independent Bets, One Shared Conclusion

Stanford moved first. Rafailov et al. published Direct Preference Optimization in May 2023 — a closed-form reformulation that treats the language model itself as an implicit reward function. No separate model. No RL loop. DPO exceeded PPO on sentiment control and matched or exceeded it on summarization and dialogue (Rafailov et al.).
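The core of DPO fits in a few lines: for each preference pair, push the policy's log-ratio up on the chosen response and down on the rejected one through a logistic loss. A toy single-pair sketch — the log-probabilities are made up, and `beta` is the paper's free temperature parameter:

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one preference pair, given sequence log-probabilities
    under the policy (pi_*) and the frozen reference model (ref_*).
    No reward model: the implicit reward is beta * log(pi / ref)."""
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log sigmoid(margin)

# Loss shrinks as the policy prefers the chosen response more than the reference does:
worse = dpo_loss(-5.0, -5.0, -5.0, -5.0)   # no preference learned yet
better = dpo_loss(-4.0, -6.0, -5.0, -5.0)  # chosen upweighted, rejected downweighted
```

This is what "treats the language model itself as the reward function" means in practice: the only trainable object in the loss is the policy.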

DeepSeek took a different path. GRPO — Group Relative Policy Optimization — kept online RL but eliminated the critic model, replacing it with group-normalized rewards (Shao et al.). A 7B DeepSeekMath model hit 51.7% on the MATH benchmark, approaching GPT-4 territory.
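The critic's job — estimating a baseline for the advantage — is handled by the group itself: sample several completions per prompt, then standardize their rewards within the group. A minimal sketch of that normalization (group size and reward values are illustrative):

```python
import statistics

def grpo_advantages(rewards):
    """Group-relative advantages in the spirit of GRPO: standardize each
    sampled completion's reward against its group's mean and std.
    This baseline replaces PPO's learned critic/value model."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against a zero-variance group
    return [(r - mean) / std for r in rewards]

# Four completions for one prompt, scored by a verifiable checker (1 = correct):
advs = grpo_advantages([1.0, 0.0, 0.0, 1.0])
```

Correct completions get positive advantage, incorrect ones negative — no extra model, just arithmetic over the group.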

Then came DeepSeek-R1 in January 2025. GRPO again — and the R1-Zero variant skipped the supervised fine-tuning phase entirely, relying on verifiable rewards alone (DeepSeek). The compute savings over classical PPO were substantial, though the exact reduction varies by model size and setup.

Google validated the third path: RLAIF. Lee et al. showed that AI-generated feedback achieves comparable alignment performance to human feedback (Lee et al.). Anthropic scaled this into Constitutional AI, expanding its governing constitution dramatically by 2026.

Three research tracks. Three different architectures. One conclusion: the reward model is optional.

That’s not a refinement. That’s a structural break.

Who Moves Up

Open-source alignment teams gained the most ground. GRPO and DPO made high-quality alignment possible without massive annotation budgets. The OpenRLHF framework now supports PPO, GRPO, REINFORCE++, and DPO out of the box. HuggingFace’s TRL library ships GRPOTrainer as its primary online RL trainer (HuggingFace Docs).

Meta hedged intelligently. Llama 3 combined supervised fine-tuning, DPO, and PPO across six iterative rounds — with DPO requiring less compute at larger model scales. That hybrid approach is becoming the default playbook for frontier-scale alignment.

Labs focused on reasoning — math, coding, formal verification — are the clearest GRPO beneficiaries. Verifiable rewards let you skip human annotation entirely for tasks with checkable answers. That changes the economics of the entire pipeline.
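For such tasks the "reward model" can be an ordinary checker. A deliberately minimal sketch — real pipelines normalize answers far more carefully (symbolic equality for math, unit tests for code), but exact match captures the economics:

```python
def verifiable_reward(model_answer: str, reference_answer: str) -> float:
    """Binary reward for tasks with checkable answers: no human rater,
    no learned reward model, just a verifier. Exact string match after
    whitespace stripping is the simplest possible version."""
    return 1.0 if model_answer.strip() == reference_answer.strip() else 0.0

r = verifiable_reward(" 42 ", "42")
```

A function like this costs nothing per call, never drifts, and cannot be reward-hacked in the way a learned reward model can — which is why checkable domains moved to GRPO-style training first.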

Who Gets Squeezed

Annotation providers built for classical RLHF face a narrowing market. Scale AI and Surge AI built significant operations around human preference annotation — Surge AI provided domain-expert annotation across math, coding, law, and medicine for Anthropic’s Claude. The demand is not disappearing, but it is shifting from volume to specialization. DPO needs preference pairs, not reward scores. RLAIF replaces human raters with model-generated feedback for a growing share of safety work.

PPO itself is being sidelined in the tooling stack. TRL now marks PPOTrainer as experimental while GRPOTrainer holds the primary slot.

Compatibility note:

  • TRL PPOTrainer: Marked experimental as of TRL v0.29.1 (March 2026). GRPOTrainer is now the recommended online RL trainer. Teams running PPO-based pipelines should evaluate migration paths.

Anyone still building alignment infrastructure around reward-model-first pipelines is investing in the architecture the field just moved past.

What Happens Next

Base case (most likely): DPO remains the default for general alignment. GRPO dominates reasoning-heavy tasks. Classical PPO persists in legacy pipelines but loses share every quarter. Signal to watch: Major labs removing PPO from their public alignment documentation. Timeline: Already underway through 2026.

Bull case: GRPO-style methods enable a new class of small, specialized models that match frontier performance on narrow reasoning tasks — collapsing the cost gap between open-source and proprietary alignment. Signal: Sub-10B models consistently matching frontier models on domain-specific benchmarks. Timeline: Late 2026 to mid-2027.

Bear case: Reward-model-free methods hit a ceiling on complex, ambiguous tasks where human judgment cannot be reduced to verifiable rewards. Labs quietly bring back PPO for the hardest alignment problems. Signal: Frontier labs re-investing in reward model research after a period of public de-emphasis. Timeline: 2027, if at all.

Frequently Asked Questions

Q: How did OpenAI use RLHF to align ChatGPT and GPT-4? A: OpenAI’s InstructGPT pipeline used three stages — supervised fine-tuning, reward model training on human preferences, and PPO optimization. A 1.3B InstructGPT was preferred over the 175B GPT-3. OpenAI has not published detailed alignment specifics for GPT-4 or the o-series reasoning models.

Q: What real-world results have Scale AI and Surge AI delivered for RLHF annotation pipelines? A: Scale AI operates a major RLHF data annotation platform serving LLM training at scale. Surge AI provided domain-expert human annotation across math, coding, law, and medicine plus red teaming for Anthropic’s Claude training pipeline.

Q: How are DPO, GRPO, and RLAIF replacing classical RLHF with PPO in 2026? A: DPO eliminates the reward model via closed-form preference optimization. GRPO keeps online RL but removes the critic model. RLAIF substitutes AI-generated feedback for human annotation. All three reduce the cost and instability of classical PPO-based alignment pipelines.

The Bottom Line

The alignment stack splintered — and that is the feature, not the bug. PPO proved the concept. DPO, GRPO, and RLAIF proved you do not need a reward model to ship it. The teams building on reward-model-free methods already have a cost advantage that compounds with every training run. You are either adapting your alignment infrastructure or subsidizing a workflow the field optimized away.


AI-assisted content, human-reviewed. Images AI-generated. Editorial Standards · Our Editors
