
feat: PPO with MCore #2530

Merged
terrykong merged 66 commits into NVIDIA-NeMo:main from bg51717:ppo
Jun 10, 2026

Conversation

@bg51717

@bg51717 bg51717 commented May 19, 2026

Contributor

What does this PR do?

Adds full Proximal Policy Optimization (PPO) support to NeMo-RL with an actor-critic architecture, using the Megatron-Core (mcore) backend for both the policy and value models. The policy (actor) and value function (critic) are trained jointly, with advantages computed by Generalized Advantage Estimation (GAE). Both models use GPU/CPU offloading so that they can be colocated on the same set of GPUs as vLLM generation.

Backend support: This PR implements PPO on the Megatron-Core backend only. The DTensor/FSDP2 backend is not yet supported for the value model. All recipes and tests use megatron_cfg.enabled: true.

Issues

Closes #2048

Summary of Changes

PPO Training Algorithm

  • Complete PPO training loop with critic-before-actor update order
  • Multiple training steps per rollout (steps_per_epoch)
  • Configurable critic warmup (policy_training_start_step) — trains value model alone before policy updates begin
  • Dynamic sampling support
  • Colocated architecture with GPU memory management via model offloading

Generalized Advantage Estimation (GAE)

  • Token-level GAE with carry-forward masking for correct multi-turn/padding handling
  • Token-level KL penalty in rewards (configurable coefficient and KL type: k1/k3)
  • VAPO decoupled GAE: separate lambda for value returns vs. policy advantages
  • Length-adaptive lambda: lambda_policy = 1 - 1/(alpha * response_length)
  • Reward whitening
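
The GAE items above can be sketched for a single sequence as follows. This is a minimal pure-Python illustration under stated assumptions, not the PR's implementation: the function names (`kl_estimate`, `masked_gae`, `length_adaptive_lambda`), the default values, and the reading of "carry-forward masking" (masked tokens are skipped, and the bootstrap value and running advantage pass through them unchanged) are all hypothetical. The k1/k3 formulas follow the commonly used sample-based KL estimators and may differ in detail from what NeMo-RL computes.

```python
import math
from typing import List, Sequence, Tuple


def kl_estimate(logp: float, ref_logp: float, kind: str = "k1") -> float:
    """Per-token KL estimate between policy and reference (assumed definitions)."""
    log_ratio = ref_logp - logp  # log(pi_ref / pi_theta) for the sampled token
    if kind == "k1":
        return -log_ratio
    if kind == "k3":
        return math.exp(log_ratio) - 1.0 - log_ratio  # non-negative by construction
    raise ValueError(f"unknown KL type: {kind}")


def length_adaptive_lambda(alpha: float, response_length: int) -> float:
    """lambda_policy = 1 - 1 / (alpha * response_length), as stated in the PR."""
    return 1.0 - 1.0 / (alpha * response_length)


def masked_gae(
    rewards: Sequence[float],
    values: Sequence[float],
    mask: Sequence[int],
    gamma: float = 1.0,
    lam: float = 0.95,
) -> Tuple[List[float], List[float]]:
    """Token-level GAE for one sequence; masked positions get zero advantage/return."""
    advantages = [0.0] * len(rewards)
    returns = [0.0] * len(rewards)
    next_value = 0.0   # value of the next unmasked token (0 after the last token)
    running_adv = 0.0  # GAE accumulator
    for t in reversed(range(len(rewards))):
        if not mask[t]:
            continue  # carry next_value and running_adv forward unchanged
        delta = rewards[t] + gamma * next_value - values[t]
        running_adv = delta + gamma * lam * running_adv
        next_value = values[t]
        advantages[t] = running_adv
        returns[t] = running_adv + values[t]
    return advantages, returns


# Decoupled GAE: one lambda for value targets, another for policy advantages.
rewards = [0.0, 0.0, 0.0, 1.0]
values = [0.4, 9.9, 0.5, 0.6]
mask = [1, 0, 1, 1]  # position 1 is padding or a non-trainable token
lam_policy = length_adaptive_lambda(alpha=0.05, response_length=100)
_, value_targets = masked_gae(rewards, values, mask, lam=1.0)
policy_advantages, _ = masked_gae(rewards, values, mask, lam=lam_policy)
```

With `lam=1.0` and `gamma=1.0` the value targets reduce to the Monte Carlo return on every unmasked token, while the policy advantages use the shorter-horizon, length-dependent lambda. The out-of-range value at the masked position never enters either computation. The example values for `alpha` and the value-side lambda are illustrative only.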

Value Model (Megatron-Core backend)

  • LM backbone + scalar value head, reusing Megatron-Core policy infrastructure
  • Supports TP/PP/DP parallelism, distributed optimizer, sequence packing
  • GPU/CPU offloading for colocated execution with policy and vLLM
  • Full checkpoint save/load including value head weights
  • Clipped MSE value loss with configurable loss scale and clip range
  • VAPO NLL auxiliary loss on correct samples
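
For readers unfamiliar with the clipped MSE value loss, here is a small sketch of the standard PPO formulation. It assumes the usual definition (the new value prediction is clipped to stay within `clip_range` of the rollout-time prediction, and the larger of the clipped and unclipped squared errors is taken). The function name, argument names, and defaults are hypothetical and are not taken from the PR.

```python
from typing import Sequence


def clipped_value_loss(
    values: Sequence[float],
    old_values: Sequence[float],
    returns: Sequence[float],
    mask: Sequence[int],
    clip_range: float = 0.2,
    loss_scale: float = 0.5,
) -> float:
    """Masked mean of max(unclipped, clipped) squared error, times loss_scale."""
    total = 0.0
    count = 0
    for v, v_old, ret, m in zip(values, old_values, returns, mask):
        if not m:
            continue  # ignore padding / non-trainable tokens
        # Keep the new prediction within clip_range of the rollout-time prediction.
        v_clipped = v_old + max(-clip_range, min(clip_range, v - v_old))
        total += max((v - ret) ** 2, (v_clipped - ret) ** 2)
        count += 1
    return loss_scale * total / max(count, 1)


# The new prediction hits the return exactly but moved 1.0 away from the old
# prediction, so the clipped branch (error 0.8) determines the loss.
example_loss = clipped_value_loss([1.0], [0.0], [1.0], [1])
```

Taking the maximum of the two errors makes the loss pessimistic: a value update that moves far from the old prediction cannot lower the loss below what the clipped prediction achieves.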

Shared Algorithm Improvements

  • Refactored clipped PG loss to support both GRPO and PPO
  • Added Reinforce++ and raw-reward advantage estimators
  • GSM8K answer extraction and verification environment
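
The clipped policy-gradient loss shared by GRPO and PPO can be illustrated per token as below. This is a sketch of the textbook clipped surrogate with asymmetric clip bounds and optional dual-clip, written under the assumption that the two algorithms differ only in how the advantage is estimated. The function name, parameter names, and default clip values are hypothetical and not taken from the refactored NeMo-RL loss.

```python
import math
from typing import Optional


def clipped_pg_token_loss(
    logp: float,
    old_logp: float,
    advantage: float,
    clip_low: float = 0.2,
    clip_high: float = 0.28,
    dual_clip_c: Optional[float] = None,
) -> float:
    """Per-token clipped surrogate loss (to be minimized)."""
    ratio = math.exp(logp - old_logp)  # pi_theta / pi_old for the sampled token
    clipped_ratio = max(1.0 - clip_low, min(1.0 + clip_high, ratio))
    # Pessimistic choice between the unclipped and clipped surrogate.
    loss = max(-ratio * advantage, -clipped_ratio * advantage)
    # Dual-clip: bound the loss for negative advantages when the ratio is very large.
    if dual_clip_c is not None and advantage < 0:
        loss = min(loss, -dual_clip_c * advantage)
    return loss


on_policy = clipped_pg_token_loss(0.0, 0.0, 1.0)                      # ratio = 1
upper_clipped = clipped_pg_token_loss(1.0, 0.0, 1.0)                  # ratio = e, clipped to 1.28
dual_clipped = clipped_pg_token_loss(2.0, 0.0, -1.0, dual_clip_c=3.0) # bounded at c * |A|
```

Setting `clip_high` larger than `clip_low` gives the asymmetric clipping mentioned for the DAPO-style base config; setting them equal recovers the symmetric PPO clip.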

Configuration and Recipes

  • Base config: examples/configs/ppo_math_1B_megatron.yaml (DAPO-style PPO: no KL penalty, asymmetric clipping, dual-clip, reward scaling)
  • ppo-dsr1-7b-math-8n8g-megatron — DeepSeek-R1-7B on DAPOMath-17K, 8 nodes, KL penalty + importance sampling
  • ppo-qwen2.5-1.5b-gsm8k-1n8g-megatron — Qwen2.5-1.5B-Instruct on GSM8K, 1 node, VAPO decoupled GAE

Tests

  • Unit tests: 17 tests for GAE computation, value loss, advantage estimator factory; 78 tests for Megatron model setup
  • Functional test: End-to-end PPO training on 2 GPUs with metric assertions on ratio clipping
  • Nightly tests (2 recipes):
    • 8-node DeepSeek-R1-7B on DAPOMath, 40 steps, checks reward > 0.3 and accuracy > 0.42 at step 40
    • 1-node Qwen2.5-1.5B on GSM8K, 100 steps, checks reward > 0.85 and accuracy > 0.7 at step 100
  • Reference config snapshot tests for all algorithms

Documentation

  • Algorithm overview: key differences from GRPO
  • In-depth guide: value model, GAE, VAPO decoupled GAE, training loop, loss functions, configuration, and metrics

Architecture

PPO Training Loop (mcore)

  1. Generate responses (vLLM, colocated)
  2. Score with environment (math verification)
  3. Value inference → per-token V(s_t)
  4. Policy logprobs → π_θ(a|s)
  5. GAE → advantages A_t, returns R_t
  6. Train critic (MSE on returns)
  7. Train actor (clipped surrogate objective)
  8. Repeat steps 6-7 steps_per_epoch times
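
The control flow of the loop above, including the critic-before-actor order and the critic warmup, can be sketched as follows. The callables are placeholders that only record their names; `run_ppo_step` and `make_logging_ops` are hypothetical helpers, and the exact comparison used for `policy_training_start_step` (here `step >= policy_training_start_step`) is an assumption rather than something the PR description states.

```python
from typing import Callable, Dict, List

ROLLOUT_PHASES = ("generate", "score", "value_inference", "policy_logprobs", "compute_gae")


def run_ppo_step(
    step: int,
    ops: Dict[str, Callable[[], None]],
    steps_per_epoch: int,
    policy_training_start_step: int,
) -> None:
    """One PPO iteration: rollout phases once, then repeated critic/actor updates."""
    for name in ROLLOUT_PHASES:  # steps 1-5 in the list above
        ops[name]()
    for _ in range(steps_per_epoch):  # steps 6-7, repeated
        ops["train_critic"]()  # critic is always updated first
        if step >= policy_training_start_step:  # assumed warmup rule
            ops["train_actor"]()


def make_logging_ops(log: List[str]) -> Dict[str, Callable[[], None]]:
    """Placeholder operations that append their own name to `log`."""
    names = ROLLOUT_PHASES + ("train_critic", "train_actor")
    return {name: (lambda n=name: log.append(n)) for name in names}


warmup_log: List[str] = []
run_ppo_step(0, make_logging_ops(warmup_log), steps_per_epoch=2, policy_training_start_step=5)

full_log: List[str] = []
run_ppo_step(5, make_logging_ops(full_log), steps_per_epoch=2, policy_training_start_step=5)
```

During warmup only the critic is trained, so `warmup_log` contains no `train_actor` entries; once the start step is reached, each inner iteration runs the critic update followed by the actor update.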

Experimental Results

GSM8K: Qwen2.5-1.5B-Instruct, 1 node x 8 GPUs

[Screenshot 2026-05-21 18 36 47]
  1. val:accuracy over steps — shows convergence on GSM8K test set
  2. train/reward over steps — shows reward progression

DAPOMath-17K: DeepSeek-R1-7B, 8 nodes x 8 GPUs

[Screenshot 2026-05-21 18 38 29]

Metrics to screenshot from wandb (project: nemo-rl, run: ppo-dsr1-7b-math-8n8g-megatron):

  1. val:accuracy (AIME 2024) over steps — shows convergence on competition math
  2. train/reward over steps — shows reward progression

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you run the unit tests and functional tests locally? Visit our Testing Guide for how to run tests
  • Did you add or update any necessary documentation? Visit our Document Development Guide for how to write, build and test the docs.

@copy-pr-bot

copy-pr-bot Bot commented May 19, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions github-actions Bot added the Documentation label ("Improvements or additions to documentation") May 19, 2026
@bg51717 bg51717 added the CI:Lfast label ("Runs a fast test suite and re-use nightly `main` container (but sync dependencies to PRs version)") and removed the Documentation label May 21, 2026
@bg51717 bg51717 marked this pull request as ready for review May 21, 2026 10:41
@bg51717 bg51717 requested review from a team as code owners May 21, 2026 10:41
@bg51717

bg51717 commented May 21, 2026

Contributor Author

/ok to test 50e878e

hXl3s and others added 10 commits May 23, 2026 08:10
Signed-off-by: bg51717 <biguo@nvidia.com>
Signed-off-by: bg51717 <biguo@nvidia.com>
Signed-off-by: bg51717 <biguo@nvidia.com>
Signed-off-by: bg51717 <biguo@nvidia.com>
Signed-off-by: bg51717 <biguo@nvidia.com>
Signed-off-by: bg51717 <biguo@nvidia.com>
Signed-off-by: bg51717 <biguo@nvidia.com>
Signed-off-by: Gerald Shen <geshen@nvidia.com>
Signed-off-by: bg51717 <biguo@nvidia.com>
Signed-off-by: Gerald Shen <geshen@nvidia.com>
Signed-off-by: bg51717 <biguo@nvidia.com>
bg51717 added 8 commits June 3, 2026 23:51
…missing keys

Signed-off-by: bg51717 <biguo@nvidia.com>
…n value model

Signed-off-by: bg51717 <biguo@nvidia.com>
…n value model

Signed-off-by: bg51717 <biguo@nvidia.com>
…tron value model

Signed-off-by: bg51717 <biguo@nvidia.com>
Signed-off-by: bg51717 <biguo@nvidia.com>
Signed-off-by: bg51717 <biguo@nvidia.com>
Signed-off-by: bg51717 <biguo@nvidia.com>
@bg51717

bg51717 commented Jun 5, 2026

Contributor Author

/ok to test 47d84e6

Comment thread nemo_rl/algorithms/ppo.py Outdated
@bg51717

bg51717 commented Jun 6, 2026

Contributor Author

/ok to test a53ce1a

@bg51717

bg51717 commented Jun 6, 2026

Contributor Author

/ok to test 47d5e37

Comment thread nemo_rl/algorithms/ppo.py Outdated
Comment thread nemo_rl/models/automodel/setup.py
Comment thread nemo_rl/algorithms/ppo.py Outdated
Comment thread nemo_rl/algorithms/ppo.py Outdated
Comment thread nemo_rl/algorithms/ppo.py Outdated
Comment thread nemo_rl/algorithms/ppo.py Outdated
Comment thread nemo_rl/models/policy/workers/megatron_policy_worker.py Outdated
Comment thread nemo_rl/models/value/workers/megatron_value_worker.py Outdated
bg51717 added 2 commits June 6, 2026 10:18
…pstream

Signed-off-by: bg51717 <biguo@nvidia.com>
@bg51717

bg51717 commented Jun 7, 2026

Contributor Author

/ok to test 4d9ae3f

@bg51717

bg51717 commented Jun 9, 2026

Contributor Author

/ok to test cc0b381

@bg51717

bg51717 commented Jun 9, 2026

Contributor Author

/ok to test 40a67cc

Signed-off-by: bg51717 <biguo@nvidia.com>
Comment thread .github/workflows/cicd-main.yml
Comment thread nemo_rl/algorithms/ppo.py Outdated
Comment thread nemo_rl/models/policy/workers/megatron_policy_worker.py Outdated
Comment thread tests/functional/ppo_megatron.sh
Comment thread docs/guides/ppo.md Outdated
Comment thread nemo_rl/algorithms/ppo.py
Comment thread nemo_rl/algorithms/ppo.py Outdated
bg51717 added 3 commits June 10, 2026 02:30
Signed-off-by: bg51717 <biguo@nvidia.com>
# Conflicts:
#	nemo_rl/algorithms/loss/loss_functions.py
#	nemo_rl/models/policy/workers/megatron_policy_worker.py
Signed-off-by: bg51717 <biguo@nvidia.com>
@bg51717

bg51717 commented Jun 10, 2026

Contributor Author

/ok to test a567bfb

@yuki-97 yuki-97 left a comment

Contributor

@bg51717 LGTM, thanks so much for the efforts!

Labels

  • CI:Lfast: Runs a fast test suite and re-use nightly `main` container (but sync dependencies to PRs version)
  • CI: Relating to CI
  • Documentation: Improvements or additions to documentation

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[mcore] PPO

6 participants