feat: PPO with MCore by bg51717 · Pull Request #2530 · NVIDIA-NeMo/RL

bg51717 · 2026-05-19T14:43:35Z

What does this PR do?

Adds full Proximal Policy Optimization (PPO) support to NeMo-RL with an actor-critic architecture. The policy (actor) and value function (critic) both run on the Megatron-Core (mcore) backend and are trained on the same rollouts, with advantages computed by Generalized Advantage Estimation (GAE). GPU/CPU offloading lets both models be colocated on the same set of GPUs as vLLM generation.

Backend support: This PR implements PPO on the Megatron-Core backend only. The DTensor/FSDP2 backend is not yet supported for the value model. All recipes and tests use megatron_cfg.enabled: true.

Issues

close #2048

Summary of Changes

PPO Training Algorithm

Complete PPO training loop with critic-before-actor update order
Multiple training steps per rollout (steps_per_epoch)
Configurable critic warmup (policy_training_start_step) — trains value model alone before policy updates begin
Dynamic sampling support
Colocated architecture with GPU memory management via model offloading

Generalized Advantage Estimation (GAE)

Token-level GAE with carry-forward masking for correct multi-turn/padding handling
Token-level KL penalty in rewards (configurable coefficient and KL type: k1/k3)
VAPO decoupled GAE: separate lambda for value returns vs. policy advantages
Length-adaptive lambda: lambda_policy = 1 - 1/(alpha * response_length)
Reward whitening
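
For reviewers who want a concrete picture of the GAE items above, here is a minimal pure-Python sketch for a single sequence. It is not the code in this PR: the function names, the mask convention (1 = trainable response token, 0 = padding or non-generated token), and the reading of "carry-forward masking" as "masked positions pass the running value and advantage through unchanged" are all assumptions made for illustration. Only the length-adaptive formula is taken directly from the list above.

```python
from typing import List, Tuple


def masked_gae(
    rewards: List[float],
    values: List[float],
    mask: List[int],
    gamma: float = 1.0,
    lam: float = 0.95,
) -> Tuple[List[float], List[float]]:
    """Token-level GAE over one sequence (illustrative, hypothetical helper).

    mask[t] == 1 marks a trainable token; 0 marks a position to skip.
    """
    T = len(rewards)
    advantages = [0.0] * T
    next_value = 0.0  # value bootstrapped from the next valid token
    next_adv = 0.0    # running GAE accumulator
    for t in reversed(range(T)):
        if mask[t]:
            delta = rewards[t] + gamma * next_value - values[t]
            next_adv = delta + gamma * lam * next_adv
            next_value = values[t]
            advantages[t] = next_adv
        # Masked positions fall through: next_value and next_adv are
        # carried forward unchanged, so they never leak into the estimate.
    returns = [a + v if m else 0.0 for a, v, m in zip(advantages, values, mask)]
    return advantages, returns


def length_adaptive_lambda(response_length: int, alpha: float) -> float:
    """lambda_policy = 1 - 1/(alpha * response_length), as stated in the PR."""
    return 1.0 - 1.0 / (alpha * response_length)


# Decoupled GAE: one lambda for policy advantages, another for value returns.
rewards = [0.0, 0.0, 1.0]
values = [0.5, 0.5, 0.5]
mask = [1, 0, 1]
lam_policy = length_adaptive_lambda(response_length=100, alpha=0.05)
policy_adv, _ = masked_gae(rewards, values, mask, lam=lam_policy)
_, value_returns = masked_gae(rewards, values, mask, lam=1.0)
```

The two calls at the end show the decoupling: the actor consumes `policy_adv`, while the critic regresses onto `value_returns` computed with its own lambda. The specific `alpha` and lambda values here are placeholders, not the recipe defaults.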

Value Model (Megatron-Core backend)

LM backbone + scalar value head, reusing Megatron-Core policy infrastructure
Supports TP/PP/DP parallelism, distributed optimizer, sequence packing
GPU/CPU offloading for colocated execution with policy and vLLM
Full checkpoint save/load including value head weights
Clipped MSE value loss with configurable loss scale and clip range
VAPO NLL auxiliary loss on correct samples
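
As an illustration of the clipped MSE value loss listed above, the sketch below uses the common PPO formulation: the new value prediction is clipped to stay within `clip_range` of the old one, and the larger of the clipped and unclipped squared errors is taken. The function name, argument names, and default values are hypothetical, and the PR's implementation may differ in reduction or scaling.

```python
from typing import List


def clipped_value_loss(
    values: List[float],
    old_values: List[float],
    returns: List[float],
    mask: List[int],
    clip_range: float = 0.2,
    loss_scale: float = 0.5,
) -> float:
    """Masked mean of the clipped squared error (illustrative only)."""
    total = 0.0
    count = 0
    for v, v_old, ret, m in zip(values, old_values, returns, mask):
        if not m:
            continue  # skip padding / non-trainable tokens
        # Keep the new prediction within clip_range of the old one.
        v_clipped = v_old + max(-clip_range, min(clip_range, v - v_old))
        # Pessimistic choice: take the larger of the two errors.
        total += max((v - ret) ** 2, (v_clipped - ret) ** 2)
        count += 1
    return loss_scale * total / max(count, 1)
```

Taking the maximum means a prediction that moves far from the old value gets no credit for landing closer to the return, which limits how much the critic can change in one update.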

Shared Algorithm Improvements

Refactored clipped PG loss to support both GRPO and PPO
Added Reinforce++ and raw-reward advantage estimators
GSM8K answer extraction and verification environment
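
To make the clipped PG loss concrete, here is a per-token sketch that combines the asymmetric clipping and dual-clip options named in the base config. This is the widely used formulation, not necessarily the refactored code in this PR; the parameter names (`eps_low`, `eps_high`, `dual_clip_c`) and their defaults are assumptions for illustration.

```python
import math


def clipped_pg_loss_token(
    logprob: float,
    old_logprob: float,
    advantage: float,
    eps_low: float = 0.2,
    eps_high: float = 0.28,
    dual_clip_c: float = 3.0,
) -> float:
    """Clipped surrogate loss for one token (illustrative only)."""
    ratio = math.exp(logprob - old_logprob)
    # Asymmetric clipping: separate lower and upper bounds on the ratio.
    clipped_ratio = max(1.0 - eps_low, min(1.0 + eps_high, ratio))
    loss = -min(ratio * advantage, clipped_ratio * advantage)
    # Dual-clip: for negative advantages, cap the loss at c * |A| so a very
    # large ratio cannot produce an unbounded update.
    if advantage < 0:
        loss = min(loss, -dual_clip_c * advantage)
    return loss
```

The same function serves GRPO and PPO because the two differ in how `advantage` is estimated, not in the shape of the surrogate objective.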

Configuration and Recipes

Base config: examples/configs/ppo_math_1B_megatron.yaml (DAPO-style PPO: no KL penalty, asymmetric clipping, dual-clip, reward scaling)
ppo-dsr1-7b-math-8n8g-megatron — DeepSeek-R1-7B on DAPOMath-17K, 8 nodes, KL penalty + importance sampling
ppo-qwen2.5-1.5b-gsm8k-1n8g-megatron — Qwen2.5-1.5B-Instruct on GSM8K, 1 node, VAPO decoupled GAE

Tests

Unit tests: 17 tests for GAE computation, value loss, advantage estimator factory; 78 tests for Megatron model setup
Functional test: End-to-end PPO training on 2 GPUs with metric assertions on ratio clipping
Nightly tests (2 recipes):
- 8-node DeepSeek-R1-7B on DAPOMath, 40 steps, checks reward > 0.3 and accuracy > 0.42 at step 40
- 1-node Qwen2.5-1.5B on GSM8K, 100 steps, checks reward > 0.85 and accuracy > 0.7 at step 100
Reference config snapshot tests for all algorithms

Documentation

Algorithm overview: key differences from GRPO
In-depth guide: value model, GAE, VAPO decoupled GAE, training loop, loss functions, configuration, and metrics

Architecture

PPO Training Loop (mcore)

1. Generate responses (vLLM, colocated)
2. Score with environment (math verification)
3. Value inference → per-token V(s_t)
4. Policy logprobs → π_θ(a|s)
5. GAE → advantages A_t, returns R_t
6. Train critic (MSE on returns)
7. Train actor (clipped surrogate objective)
8. Steps 6-7 repeat steps_per_epoch times
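
The update order in steps 6-8, together with the critic warmup controlled by policy_training_start_step, can be sketched as a small scheduling function. The function itself is hypothetical, and two details are assumptions: that the actor starts training once the step index is greater than or equal to policy_training_start_step, and that critic and actor updates interleave within each repeat.

```python
from typing import List


def ppo_update_schedule(
    step: int,
    steps_per_epoch: int,
    policy_training_start_step: int,
) -> List[str]:
    """Return the ordered updates performed for one rollout (illustrative)."""
    updates: List[str] = []
    for _ in range(steps_per_epoch):
        updates.append("critic")  # critic is always updated first
        if step >= policy_training_start_step:  # assumed boundary condition
            updates.append("actor")
    return updates


# During warmup only the critic trains; afterwards both alternate.
warmup = ppo_update_schedule(step=0, steps_per_epoch=2, policy_training_start_step=5)
normal = ppo_update_schedule(step=5, steps_per_epoch=2, policy_training_start_step=5)
```

Under these assumptions `warmup` contains only critic updates, while `normal` alternates critic and actor, matching the critic-before-actor order described in the summary.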

Experimental Results

GSM8K: Qwen2.5-1.5B-Instruct, 1 node x 8 GPUs

val:accuracy over steps — shows convergence on GSM8K test set
train/reward over steps — shows reward progression

DAPOMath-17K: DeepSeek-R1-7B, 8 nodes x 8 GPUs

Metrics to screenshot from wandb (project: nemo-rl, run: ppo-dsr1-7b-math-8n8g-megatron):

val:accuracy (AIME 2024) over steps — shows convergence on competition math
train/reward over steps — shows reward progression

Before your PR is "Ready for review"

Pre checks:

Make sure you read and followed Contributor guidelines
Did you write any new necessary tests?
Did you run the unit tests and functional tests locally? Visit our Testing Guide for how to run tests
Did you add or update any necessary documentation? Visit our Document Development Guide for how to write, build and test the docs.

copy-pr-bot · 2026-05-19T14:43:42Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

bg51717 · 2026-05-21T10:42:37Z

/ok to test 50e878e