feat: basic ppo training implementation #2027
Signed-off-by: Gerald Shen <geshen@nvidia.com>
8f01c5b to 24e1db0
Signed-off-by: Lukasz Pierscieniewski <lukaszp@nvidia.com>
❌ Submodule Fast-Forward Check Failed

Check based on commit: 061fa41 (PR #2027 from

❌ Submodules that need attention:

Automodel: ❌ Commits have DIVERGED from a common ancestor

Please ensure all submodule commits are fast-forwards of the main branch before merging.
Job 12246898 failed at value worker __init__ with:

  TypeError: setup_model_and_optimizer() got an unexpected keyword argument 'distributed_manager'. Did you mean 'distributed_context'?

Path A's worker (current ablation HEAD) documents 3 signature-drift fixes its own __init__ needed against this fork's APIs. Apply the same 3 fixes to the pre-Path-A worker (mechanical mirror of Path A's notes, no semantic divergence):

1. setup_model_and_optimizer(distributed_manager=...) -> setup_model_and_optimizer(distributed_context=...)
   (this fork renamed the kwarg in PR NVIDIA-NeMo#2027)

2. ModelAndOptimizerState unpacking: 11 slots -> 10
   Drop self.model_state_dict_keys -- this fork's ModelAndOptimizerState NamedTuple (nemo_rl/models/automodel/config.py) has 10 fields and does not include model_state_dict_keys.
   Without this: ValueError: not enough values to unpack (expected 11, got 10)

3. RuntimeConfig unpacking: 12 slots -> 13
   Add _runtime_sampling_params slot before _runtime_is_reward_model. This fork's RuntimeConfig has 13 fields with sampling_params right before is_reward_model. bg51717/ppo stripped sampling_params from train()/get_values() call sites too (which is what we want and what the pre-Path-A worker already has), so we discard the value into _runtime_sampling_params rather than assigning self.sampling_params.
   Without this: ValueError: too many values to unpack (expected 12)

All 3 fixes are mechanical and documented verbatim in Path A's worker NOTE comments (commit fb38a59 lines 297-339).
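The two failure modes described in this commit message (a renamed keyword argument and a NamedTuple that changed length) can be reproduced in isolation. The sketch below is only an illustration: `RuntimeConfigSketch`, `setup_sketch`, and the `model_name` field are hypothetical stand-ins, not the fork's real 13-field `RuntimeConfig` or its `setup_model_and_optimizer`. Only the names `sampling_params`, `is_reward_model`, `distributed_manager`, and `distributed_context` come from the message above.

```python
from typing import Any, NamedTuple, Optional, Tuple


class RuntimeConfigSketch(NamedTuple):
    """Hypothetical 3-field stand-in for a tuple that gained sampling_params."""

    model_name: str  # assumed field, for illustration only
    sampling_params: Optional[Any]
    is_reward_model: bool


def setup_sketch(*, distributed_context: str) -> RuntimeConfigSketch:
    """Hypothetical setup function that only accepts the renamed kwarg."""
    return RuntimeConfigSketch("demo", {"temperature": 1.0}, False)


def old_style_call() -> str:
    """Call with the stale kwarg name and report the resulting error text."""
    try:
        setup_sketch(distributed_manager="dm")  # stale name on purpose
    except TypeError as exc:
        return str(exc)
    return "ok"


def old_style_unpack(cfg: RuntimeConfigSketch) -> str:
    """Unpack with one slot too few, as a caller written for the old tuple would."""
    try:
        name, is_reward_model = cfg  # 2 targets for 3 values
    except ValueError as exc:
        return str(exc)
    return "ok"


def new_style_unpack(cfg: RuntimeConfigSketch) -> Tuple[str, bool]:
    """Unpack every slot, discarding the unused one into a throwaway name."""
    name, _runtime_sampling_params, is_reward_model = cfg
    return name, is_reward_model


if __name__ == "__main__":
    cfg = setup_sketch(distributed_context="ctx")
    print(old_style_call())
    print(old_style_unpack(cfg))
    print(new_style_unpack(cfg))
```

Unpacking positionally keeps the call site short but ties it to the exact field count. Reading fields by attribute (`cfg.is_reward_model`) would survive an added field, at the cost of diverging from the style the message describes.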
… calls

Job 12247164 reached value-worker init (3 prior signature fixes worked), got into the rollout/get_values cycle, then crashed with:

  TypeError: forward_with_post_processing_fn() got an unexpected keyword argument 'cfg'

Pre-Path-A worker passed `cfg=self.cfg` to two functions that no longer accept it in this fork (PR NVIDIA-NeMo#2027 stripped the param):

- automodel_forward_backward (train() microbatch loop, line 437)
- forward_with_post_processing_fn (get_values() forward, line 561)

LossPostProcessor.__init__ and ScorePostProcessor.__init__ DO still accept cfg, so those call sites are left alone.

This is the same pattern as the previous setup_model_and_optimizer + ModelAndOptimizerState/RuntimeConfig fixes -- pre-Path-A worker was inherited from upstream bg51717/ppo and never re-tested against this fork's APIs. Path A worked around it by writing a custom microbatch loop that didn't go through these functions at all.
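The commit itself just deletes the stale `cfg=` argument from the two call sites. As a complementary idea (not something this commit does), a call site can be audited ahead of time with `inspect.signature`, so a removed parameter shows up before a cluster job crashes. In the sketch below, `forward_sketch` and `PostProcessorSketch` are hypothetical stand-ins for a function that dropped `cfg` and a class that kept it.

```python
import inspect
from typing import Any, Callable


def accepts_kwarg(fn: Callable[..., Any], name: str) -> bool:
    """Return True if fn can be called with the keyword argument `name`."""
    params = inspect.signature(fn).parameters
    param = params.get(name)
    if param is not None and param.kind in (
        inspect.Parameter.POSITIONAL_OR_KEYWORD,
        inspect.Parameter.KEYWORD_ONLY,
    ):
        return True
    # A **kwargs catch-all accepts any keyword name.
    return any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values())


def forward_sketch(model: Any, batch: Any) -> Any:
    """Hypothetical function whose cfg parameter was removed."""
    return batch


class PostProcessorSketch:
    """Hypothetical class whose __init__ still takes cfg."""

    def __init__(self, cfg: dict) -> None:
        self.cfg = cfg


if __name__ == "__main__":
    print(accepts_kwarg(forward_sketch, "cfg"))
    print(accepts_kwarg(PostProcessorSketch, "cfg"))
```

A check like this could run in a unit test on CPU, which is cheaper than discovering the mismatch after a job has already reached the rollout stage.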
…o#2027)

Repin the Automodel submodule from 26108096 to 6eb5e862 ("fix: Propagate torch_dtype to sub-configs correctly", from NVIDIA-NeMo/Automodel#2027) as a temporary pin.

Note: 6eb5e862 is an unmerged PR commit (an older force-pushed revision of NVIDIA-NeMo#2027, not its current head and not on main) and predates the Nemotron-Omni RADIO post-load patches in 26108096. It still pins transformers==5.5.0 in its own metadata, so the transformers override stays consistent. The refreshed uv.lock reflects the reverse-delta (drops the later s3 / msc extras and the wandb>=0.26.1 pin).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: root <zhangyuekai@foxmail.com>
Bump the Automodel submodule to 5dcc9abe9 ("fix: Propagate torch_dtype to
sub-configs correctly", NVIDIA-NeMo/Automodel#2027). This is the oldest
commit on Automodel main that carries the NVIDIA-NeMo#2027 torch_dtype-propagation
fix, so it is reachable by a plain `git submodule update` (unlike the
orphaned, force-pushed PR-head revision of the same change, which lives in
Automodel's pre-rewrite history and is on no upstream branch).
It pins transformers==5.5.0 in its own metadata, keeping the transformers
override consistent. uv.lock refreshed accordingly.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: root <zhangyuekai@foxmail.com>
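Whether a pinned commit is reachable from a branch, which is the property this commit message relies on, can be checked with `git merge-base --is-ancestor`. That command exits with 0 when the first commit is an ancestor of the second and 1 when it is not. The wrapper below is a minimal sketch: the function name, the `repo` default, and the injectable `run` parameter are assumptions made for illustration and are not part of this repository's tooling.

```python
import subprocess
from typing import Any, Callable


def is_ancestor(
    commit: str,
    branch: str,
    repo: str = ".",
    run: Callable[..., Any] = subprocess.run,
) -> bool:
    """Return True if `commit` is reachable from `branch` in the repo at `repo`."""
    result = run(
        ["git", "-C", repo, "merge-base", "--is-ancestor", commit, branch],
        check=False,
    )
    if result.returncode == 0:
        return True
    if result.returncode == 1:
        return False
    # Any other exit status means git itself failed (e.g. unknown revision).
    raise RuntimeError(f"git merge-base failed with status {result.returncode}")
```

Calling it with the submodule checkout as `repo`, the pinned hash as `commit`, and the remote main branch as `branch` would answer the reachability question; those argument values depend on the local checkout and are left to the reader.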
DO NOT MERGE! WORK IN PROGRESS!
What does this PR do ?
This PR adds a basic Proximal Policy Optimization (PPO) training loop to NeMo-RL.
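For readers less familiar with PPO, the clipped surrogate objective at its core can be sketched in a few lines. This is a generic, dependency-free illustration of the textbook formula, not the code added by this PR. The function name, the per-sample list inputs, and the default `clip_eps=0.2` are assumptions chosen for the example.

```python
import math
from typing import Sequence


def ppo_clipped_loss(
    new_logprobs: Sequence[float],
    old_logprobs: Sequence[float],
    advantages: Sequence[float],
    clip_eps: float = 0.2,
) -> float:
    """Mean negative clipped surrogate objective over a batch of samples."""
    n = len(advantages)
    if n == 0 or len(new_logprobs) != n or len(old_logprobs) != n:
        raise ValueError("inputs must be non-empty and of equal length")
    total = 0.0
    for new_lp, old_lp, adv in zip(new_logprobs, old_logprobs, advantages):
        # Probability ratio between the updated and the behaviour policy.
        ratio = math.exp(new_lp - old_lp)
        clipped = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
        # Take the more pessimistic of the unclipped and clipped terms.
        total += min(ratio * adv, clipped * adv)
    return -total / n
```

The `min` over the clipped and unclipped terms removes any incentive to push the ratio outside `[1 - clip_eps, 1 + clip_eps]` when that would increase the objective, which is what bounds the size of each policy update.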
What is added:
Issues
No direct issue
closes #2047
Usage
Before your PR is "Ready for review"
Pre checks:
Additional Information