feat: basic ppo training implementation by hXl3s · Pull Request #2027 · NVIDIA-NeMo/RL

hXl3s · 2026-02-26T14:47:32Z

DO NOT MERGE! WORK IN PROGRESS!

What does this PR do?

This PR adds a basic Proximal Policy Optimization (PPO) training loop to NeMo RL.

What is added:

Support for a value model. The value model currently runs as a separate worker; the case where the value model is just a head on the policy is not covered yet.
A PPO training loop and an example of training a math model (convergence not tested yet).
Basic logging and validation during PPO training.
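
For orientation, a PPO loop of this kind usually combines an advantage estimate built from the value model's predictions with a clipped policy objective. The sketch below shows generalized advantage estimation (GAE) and the standard clipped surrogate in plain Python. The function names and the `gamma`, `lam`, and `clip_eps` defaults are illustrative assumptions and are not taken from this PR's code.

```python
import math
from typing import List, Sequence


def gae_advantages(rewards: Sequence[float], values: Sequence[float],
                   gamma: float = 1.0, lam: float = 0.95) -> List[float]:
    """Generalized advantage estimation over one finished episode.

    values[t] is the value model's estimate at step t. The value after the
    final step is treated as 0 (assumption: the episode terminates there).
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        next_value = values[t + 1] if t + 1 < len(values) else 0.0
        delta = rewards[t] + gamma * next_value - values[t]  # TD residual
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


def ppo_clip_loss(logp_new: Sequence[float], logp_old: Sequence[float],
                  advantages: Sequence[float], clip_eps: float = 0.2) -> float:
    """Mean clipped surrogate objective, negated so that lower is better."""
    total = 0.0
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(lp_new - lp_old)  # pi_new / pi_old for this token
        clipped = max(min(ratio, 1.0 + clip_eps), 1.0 - clip_eps)
        total += min(ratio * adv, clipped * adv)
    return -total / len(advantages)
```

With a sparse terminal reward, as in a math-correctness task, `gae_advantages([0, 0, 1], values)` propagates the final reward back through the sequence, and the value model's estimates serve as the baseline.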

Issues

No direct issue

closes #2047

Usage

uv run example/run_ppo.py

Before your PR is "Ready for review"

Pre checks:

Make sure you read and followed Contributor guidelines
Did you write any new necessary tests?
Did you run the unit tests and functional tests locally? Visit our Testing Guide for how to run tests
Did you add or update any necessary documentation? Visit our Document Development Guide for how to write, build and test the docs.

Additional Information

...

github-actions · 2026-02-26T14:48:53Z

✅ Submodule Fast-Forward Check Results

Check based on commit: 08aa60d (PR #2027 from lukaszp/ppo)

✅ Submodules that are properly updated:

Automodel: ✅ PR branch is ahead of main branch (fast-forward)

All submodule changes look good! ✨

github-actions · 2026-02-26T14:52:07Z

⚠️ File Consistency Check

Check based on commit: 08aa60d (PR #2027 from lukaszp/ppo)

⚠️ DTensor Policy Worker Synchronization Warning

The file nemo_rl/models/policy/workers/dtensor_policy_worker_v2.py was modified in this PR, but nemo_rl/models/policy/workers/dtensor_policy_worker.py was not updated.

Why this matters:
These files contain related DTensor policy worker implementations that should be kept synchronized to ensure consistency across different versions.

Action required:

Please review if the changes in nemo_rl/models/policy/workers/dtensor_policy_worker_v2.py should also be applied to nemo_rl/models/policy/workers/dtensor_policy_worker.py
Update nemo_rl/models/policy/workers/dtensor_policy_worker.py if necessary to maintain consistency
If the files are intentionally different, please add a comment in the PR explaining why

Files to check:

Modified: nemo_rl/models/policy/workers/dtensor_policy_worker_v2.py
Not modified: nemo_rl/models/policy/workers/dtensor_policy_worker.py

This check ensures that related file implementations remain synchronized across the codebase. If you believe this warning is incorrect or the files should intentionally differ, please add a comment explaining the reasoning.
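
A check like this can be sketched as a lookup over the set of files a PR modifies. The code below is only an illustration of the idea: the `PAIRED_FILES` table and the `unsynchronized_pairs` helper are hypothetical, and the rule that a warning fires when exactly one file of a pair changes is an assumption, not the bot's documented behavior.

```python
from typing import Dict, Iterable, List, Tuple

# Files expected to change together (paths taken from the warning above).
PAIRED_FILES: Dict[str, str] = {
    "nemo_rl/models/policy/workers/dtensor_policy_worker_v2.py":
        "nemo_rl/models/policy/workers/dtensor_policy_worker.py",
}


def unsynchronized_pairs(modified: Iterable[str]) -> List[Tuple[str, str]]:
    """Return (modified, not_modified) for each pair with only one side changed."""
    changed = set(modified)
    findings = []
    for first, second in PAIRED_FILES.items():
        if first in changed and second not in changed:
            findings.append((first, second))
        elif second in changed and first not in changed:
            findings.append((second, first))
    return findings
```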

github-actions · 2026-02-26T15:37:05Z

✅ Submodule Fast-Forward Check Results

Check based on commit: efd71bb (PR #2027 from lukaszp/ppo)

✅ Submodules that are properly updated:

Automodel: ✅ PR branch is ahead of main branch (fast-forward)

All submodule changes look good! ✨

copy-pr-bot · 2026-03-10T13:48:59Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Signed-off-by: Gerald Shen <geshen@nvidia.com>

Signed-off-by: Lukasz Pierscieniewski <lukaszp@nvidia.com>

github-actions · 2026-03-31T12:50:16Z

❌ Submodule Fast-Forward Check Failed

Check based on commit: 061fa41 (PR #2027 from lukaszp/ppo)

❌ Submodules that need attention:

Automodel: ❌ Commits have DIVERGED from a common ancestor
TARGET (main branch): https://github.com/NVIDIA-NeMo/Automodel/commits/92635e74f4fb16784268b9a9fd7b7d6a83fff6c5/
CURRENT (PR #2027 from lukaszp/ppo): https://github.com/NVIDIA-NeMo/Automodel/commits/519201d11b8dba3088c759df952d87295793e020/

Please ensure all submodule commits are fast-forwards of the main branch before merging.
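
The distinction between "fast-forward" and "diverged" comes down to commit ancestry: the PR's submodule commit is a fast-forward only if the main branch's commit is one of its ancestors. The sketch below illustrates this on a toy commit graph. The `parents` mapping and both function names are hypothetical, and how the bot actually performs the check is not shown on this page. In a real repository the same question can be asked with `git merge-base --is-ancestor <target> <current>`.

```python
from typing import Dict, List


def is_ancestor(parents: Dict[str, List[str]], ancestor: str, commit: str) -> bool:
    """Return True if `ancestor` is reachable from `commit` via parent links."""
    stack, seen = [commit], set()
    while stack:
        node = stack.pop()
        if node == ancestor:
            return True
        if node in seen:
            continue
        seen.add(node)
        stack.extend(parents.get(node, []))
    return False


def classify(parents: Dict[str, List[str]], target: str, current: str) -> str:
    """Classify `current` (PR pin) relative to `target` (main branch pin)."""
    if target == current:
        return "up-to-date"
    if is_ancestor(parents, target, current):
        return "fast-forward"  # PR pin is ahead of main
    if is_ancestor(parents, current, target):
        return "behind"        # PR pin is older than main
    return "diverged"          # only a common ancestor is shared
```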

Signed-off-by: Lukasz Pierscieniewski <lukaszp@nvidia.com>

github-actions · 2026-03-31T12:58:29Z

✅ Submodule Fast-Forward Check Results

Check based on commit: ed36260 (PR #2027 from lukaszp/ppo)

✅ Submodules that are properly updated:

Automodel: ✅ PR branch is ahead of main branch (fast-forward)

All submodule changes look good! ✨

Signed-off-by: Lukasz Pierscieniewski <lukaszp@nvidia.com>

github-actions · 2026-04-01T13:50:34Z

✅ Submodule Fast-Forward Check Results

Check based on commit: 17e5433 (PR #2027 from lukaszp/ppo)

✅ Submodules that are properly updated:

Automodel: ✅ PR branch is ahead of main branch (fast-forward)

All submodule changes look good! ✨

Signed-off-by: Lukasz Pierscieniewski <lukaszp@nvidia.com>

github-actions · 2026-04-13T16:03:13Z

✅ Submodule Fast-Forward Check Results

Check based on commit: 11da7f3 (PR #2027 from lukaszp/ppo)

✅ Submodules that are properly updated:

Automodel: ✅ PR branch is ahead of main branch (fast-forward)

All submodule changes look good! ✨

github-actions · 2026-04-15T12:57:07Z

✅ Submodule Fast-Forward Check Results

Check based on commit: 553f741 (PR #2027 from lukaszp/ppo)

✅ Submodules that are properly updated:

Automodel: ✅ PR branch is ahead of main branch (fast-forward)

All submodule changes look good! ✨

Job 12246898 failed at value worker __init__ with:

TypeError: setup_model_and_optimizer() got an unexpected keyword argument 'distributed_manager'. Did you mean 'distributed_context'?

Path A's worker (current ablation HEAD) documents 3 signature-drift fixes its own __init__ needed against this fork's APIs. Apply the same 3 fixes to the pre-Path-A worker (mechanical mirror of Path A's notes, no semantic divergence):

1. setup_model_and_optimizer(distributed_manager=...) -> setup_model_and_optimizer(distributed_context=...) (this fork renamed the kwarg in PR NVIDIA-NeMo#2027)

2. ModelAndOptimizerState unpacking: 11 slots -> 10
Drop self.model_state_dict_keys -- this fork's ModelAndOptimizerState NamedTuple (nemo_rl/models/automodel/config.py) has 10 fields and does not include model_state_dict_keys.
Without this: ValueError: not enough values to unpack (expected 11, got 10)

3. RuntimeConfig unpacking: 12 slots -> 13
Add _runtime_sampling_params slot before _runtime_is_reward_model. This fork's RuntimeConfig has 13 fields with sampling_params right before is_reward_model. bg51717/ppo stripped sampling_params from train()/get_values() call sites too (which is what we want and what the pre-Path-A worker already has), so we discard the value into _runtime_sampling_params rather than assigning self.sampling_params.
Without this: ValueError: too many values to unpack (expected 12)

All 3 fixes are mechanical and documented verbatim in Path A's worker NOTE comments (commit fb38a59 lines 297-339).
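
The two failure modes described in this commit message, a renamed keyword argument and positional unpacking of a NamedTuple whose field count changed, can be reproduced in a few lines. The sketch below uses a hypothetical 3-field `State` tuple and a stand-in `setup` function; neither mirrors the real `ModelAndOptimizerState` or `setup_model_and_optimizer`. It also shows that reading fields by name does not depend on the tuple's length.

```python
from typing import NamedTuple, Optional


class State(NamedTuple):
    """Hypothetical stand-in for a multi-field state tuple."""
    model: str
    optimizer: str
    scheduler: str


def setup(distributed_context: Optional[object] = None) -> State:
    """Stand-in whose kwarg was renamed from `distributed_manager`."""
    return State("model", "optimizer", "scheduler")


def unpack_old_layout() -> None:
    # Written for an older 4-field layout; raises ValueError
    # ("not enough values to unpack") against the 3-field tuple.
    model, optimizer, scheduler, state_dict_keys = setup()


def call_with_old_kwarg() -> State:
    # The old keyword name is rejected with TypeError.
    return setup(distributed_manager=object())  # type: ignore[call-arg]


def load_by_name() -> State:
    # Accessing fields by name keeps working when fields are added or removed
    # elsewhere in the tuple, as long as the names used here still exist.
    state = setup(distributed_context=object())
    assert state.model == "model" and state.optimizer == "optimizer"
    return state
```

Positional unpacking is compact, but it ties every call site to the exact field count, which is why each upstream change surfaced as a separate crash at startup.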

… calls

Job 12247164 reached value-worker init (3 prior signature fixes worked), got into the rollout/get_values cycle, then crashed with:

TypeError: forward_with_post_processing_fn() got an unexpected keyword argument 'cfg'

Pre-Path-A worker passed `cfg=self.cfg` to two functions that this fork no longer accepts it on (PR NVIDIA-NeMo#2027 stripped the param):

- automodel_forward_backward (train() microbatch loop, line 437)
- forward_with_post_processing_fn (get_values() forward, line 561)

LossPostProcessor.__init__ and ScorePostProcessor.__init__ DO still accept cfg, so those call sites are left alone.

This is the same pattern as the previous setup_model_and_optimizer + ModelAndOptimizerState/RuntimeConfig fixes -- pre-Path-A worker was inherited from upstream bg51717/ppo and never re-tested against this fork's APIs. Path A worked around it by writing a custom microbatch loop that didn't go through these functions at all.
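
Because this kind of drift only shows up when the call is actually reached, it can help to compare the keyword names a call site passes against the callee's signature before launching a job. The sketch below does that with the standard `inspect` module. The `unsupported_kwargs` helper and the two stubs are hypothetical stand-ins, not the real NeMo RL functions, and this is a diagnostic idea rather than what the commit itself did (it simply removed the `cfg` argument from the two call sites).

```python
import inspect
from typing import Callable, Iterable, List


def unsupported_kwargs(fn: Callable, kwarg_names: Iterable[str]) -> List[str]:
    """Return the names in `kwarg_names` that `fn` would reject as keywords."""
    params = inspect.signature(fn).parameters
    # A **kwargs parameter accepts any keyword name.
    if any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()):
        return []
    return [name for name in kwarg_names if name not in params]


def forward_stub(model, batch):
    """Stand-in for a forward helper that no longer takes `cfg`."""
    return model, batch


class PostProcessorStub:
    """Stand-in for a post-processor whose __init__ still takes `cfg`."""

    def __init__(self, cfg):
        self.cfg = cfg
```

Calling `unsupported_kwargs(forward_stub, ["cfg"])` reports `cfg` as rejected, while the same query against `PostProcessorStub` reports nothing, which matches the split described above between the forward helpers and the post-processor constructors.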

…o#2027)

Repin the Automodel submodule from 26108096 to 6eb5e862 ("fix: Propagate torch_dtype to sub-configs correctly", from NVIDIA-NeMo/Automodel#2027) as a temporary pin.

Note: 6eb5e862 is an unmerged PR commit (an older force-pushed revision of NVIDIA-NeMo#2027, not its current head and not on main) and predates the Nemotron-Omni RADIO post-load patches in 26108096. It still pins transformers==5.5.0 in its own metadata, so the transformers override stays consistent. The refreshed uv.lock reflects the reverse-delta (drops the later s3 / msc extras and the wandb>=0.26.1 pin).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: root <zhangyuekai@foxmail.com>

Bump the Automodel submodule to 5dcc9abe9 ("fix: Propagate torch_dtype to sub-configs correctly", NVIDIA-NeMo/Automodel#2027).

This is the oldest commit on Automodel main that carries the NVIDIA-NeMo#2027 torch_dtype-propagation fix, so it is reachable by a plain `git submodule update` (unlike the orphaned, force-pushed PR-head revision of the same change, which lives in Automodel's pre-rewrite history and is on no upstream branch). It pins transformers==5.5.0 in its own metadata, keeping the transformers override consistent. uv.lock refreshed accordingly.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: root <zhangyuekai@foxmail.com>

hXl3s added 6 commits February 19, 2026 12:02

feat(ppo): Implementation scaffolding

8b33a6c

feat(ppo): advantage estimator fix

165f994

fix(ppo): fix advantage computation

dcf0e9d

feat(ppo): add support for value model and reward whitening loss

f8a3bb3

feat: update automodel repo

1104b8c

fix: remove debug print

08aa60d

revert: accidentaly removed reference model

efd71bb

fix: advantage computation and better logging

81fc298

hXl3s and others added 18 commits March 16, 2026 13:48

chore: random tests for convergence

3e08a0e

fix reward bug

df77d5b

Signed-off-by: Gerald Shen <geshen@nvidia.com>

fix value

5636ff8

Signed-off-by: Gerald Shen <geshen@nvidia.com>

mcore

03dcc9c

Signed-off-by: Gerald Shen <geshen@nvidia.com>

fix

ef6f6b1

Signed-off-by: Gerald Shen <geshen@nvidia.com>

fix

7f8a9bb

Signed-off-by: Gerald Shen <geshen@nvidia.com>

fix

b5615ac

Signed-off-by: Gerald Shen <geshen@nvidia.com>

check

bb23763

Signed-off-by: Gerald Shen <geshen@nvidia.com>

fix21

30ae120

Signed-off-by: Gerald Shen <geshen@nvidia.com>

add

e34960e

Signed-off-by: Gerald Shen <geshen@nvidia.com>

offload

8bf1019

Signed-off-by: Gerald Shen <geshen@nvidia.com>

fix

d016778

Signed-off-by: Gerald Shen <geshen@nvidia.com>

fix

26b7453

Signed-off-by: Gerald Shen <geshen@nvidia.com>

fix config

0f01d5f

Signed-off-by: Gerald Shen <geshen@nvidia.com>

fix

0efa823

Signed-off-by: Gerald Shen <geshen@nvidia.com>

fix

10a5c99

Signed-off-by: Gerald Shen <geshen@nvidia.com>

fix

0eefa8a

Signed-off-by: Gerald Shen <geshen@nvidia.com>

match more things to verl

b391a31

Signed-off-by: Gerald Shen <geshen@nvidia.com>

Merge branch 'geshen/ppo' into lukaszp/ppo

d04b2b4

hXl3s force-pushed the lukaszp/ppo branch 2 times, most recently from 8f01c5b to 24e1db0 Compare March 30, 2026 14:56

hXl3s added 2 commits March 30, 2026 17:13

fix: after merge fixes

b6cd301

Signed-off-by: Lukasz Pierscieniewski <lukaszp@nvidia.com>

fix: more post merge issues

c65b9fb

Signed-off-by: Lukasz Pierscieniewski <lukaszp@nvidia.com>

hXl3s force-pushed the lukaszp/ppo branch from 24e1db0 to c65b9fb Compare March 30, 2026 15:21

hXl3s added 2 commits March 31, 2026 08:53

Merge branch 'main' into lukaszp/ppo

4e6792a

Signed-off-by: Lukasz Pierscieniewski <lukaszp@nvidia.com>

fix: Update automodel dependnecy

061fa41

Signed-off-by: Lukasz Pierscieniewski <lukaszp@nvidia.com>

fix: correct automodel

ed36260

Signed-off-by: Lukasz Pierscieniewski <lukaszp@nvidia.com>

fix: More after merge fixes

17e5433

Signed-off-by: Lukasz Pierscieniewski <lukaszp@nvidia.com>

hXl3s added 3 commits April 9, 2026 13:33

Merge remote-tracking branch 'origin/main' into lukaszp/ppo

b95786d

fix: Updated gym

852f610

Signed-off-by: Lukasz Pierscieniewski <lukaszp@nvidia.com>

feat: Convergence recipe v1

11da7f3

fix: Remove forced dropout to value model

553f741

anwithk mentioned this pull request May 27, 2026

[NeMo RL] v0.7.0 Release Roadmap #2591

Open

hXl3s wants to merge 44 commits into NVIDIA-NeMo:main from hXl3s:lukaszp/ppo

hXl3s commented Feb 26, 2026 • edited by terrykong

2 participants
