Checkpointing fixes by ashors1 · Pull Request #9 · NVIDIA-NeMo/RL

ashors1 · 2025-03-21T03:17:40Z

What does this PR do?

Converts relative paths to absolute paths before saving or loading checkpoints. This works around Ray workers having different working directories, where the same relative path can resolve to a different location on each worker.
Adds the LR scheduler state to the checkpoint.
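
The two changes above can be illustrated with a minimal, standard-library-only sketch. It is not the code from this PR: `to_absolute`, `ToyScheduler`, `save_checkpoint`, `load_checkpoint`, and the `state.pkl` file name are all hypothetical names chosen for illustration. `ToyScheduler` only mimics the `state_dict()` / `load_state_dict()` protocol that PyTorch LR schedulers expose, so the example runs without PyTorch or Ray.

```python
import os
import pickle


def to_absolute(path: str) -> str:
    """Resolve a possibly-relative path against the caller's working directory."""
    return os.path.abspath(os.path.expanduser(path))


class ToyScheduler:
    """Hypothetical stand-in for an LR scheduler (not a NeMo RL or PyTorch class)."""

    def __init__(self, base_lr: float = 0.1, gamma: float = 0.5) -> None:
        self.base_lr = base_lr
        self.gamma = gamma
        self.last_step = 0

    def step(self) -> None:
        self.last_step += 1

    def current_lr(self) -> float:
        # Exponential decay: base_lr * gamma ** last_step
        return self.base_lr * self.gamma ** self.last_step

    def state_dict(self) -> dict:
        return dict(self.__dict__)

    def load_state_dict(self, state: dict) -> None:
        self.__dict__.update(state)


def save_checkpoint(ckpt_dir: str, model_state: dict, scheduler: ToyScheduler) -> str:
    """Save model and scheduler state; return the absolute directory used."""
    # Resolve once, up front, so the resolved path (not the relative one)
    # is what gets handed to any other process.
    ckpt_dir = to_absolute(ckpt_dir)
    os.makedirs(ckpt_dir, exist_ok=True)
    payload = {"model": model_state, "scheduler": scheduler.state_dict()}
    with open(os.path.join(ckpt_dir, "state.pkl"), "wb") as f:
        pickle.dump(payload, f)
    return ckpt_dir


def load_checkpoint(ckpt_dir: str, scheduler: ToyScheduler) -> dict:
    """Restore the scheduler in place and return the saved model state."""
    ckpt_dir = to_absolute(ckpt_dir)
    with open(os.path.join(ckpt_dir, "state.pkl"), "rb") as f:
        payload = pickle.load(f)
    scheduler.load_state_dict(payload["scheduler"])
    return payload["model"]
```

In this sketch the driver would call `save_checkpoint("ckpts/step_2", ...)` and pass the returned absolute path to workers, so a worker whose working directory differs still reads the same files. Restoring the scheduler state means a resumed run continues from the saved step instead of restarting the schedule.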

Changelog

Please update the CHANGELOG.md under next version with high level changes in this PR.

Usage

You can potentially add a usage example below

# Add a code snippet demonstrating how to use this

Before your PR is "Ready for review"

Pre checks:

Make sure you read and followed Contributor guidelines
Did you write any new necessary tests?
Did you add or update any necessary documentation? Make sure to also update the NeMo Framework User Guide which contains the tutorials

Checklist when contributing

TBD

Additional Information

Related to # (issue)

Signed-off-by: ashors1 <ashors@nvidia.com>

Update gemma4 recipe to use fusedAdam and update Automodel

Comments addressed: #3, #5, NVIDIA-NeMo#7, NVIDIA-NeMo#8, NVIDIA-NeMo#9, NVIDIA-NeMo#10, NVIDIA-NeMo#11.
- Rename _load_M -> _get_sparse_projection_matrix and _load_dense_projection -> _get_topk_projection (later removed in favor of module-level cache helpers below).
- Drop unused alignment_student_spans / alignment_teacher_spans from the cross-tokenizer batch payload.
- Remove NRL_XTOKEN_LOSS_DUMP_DIR debug-dump side effect.
- Move Fp32SparseMM, chunk_average_log_probs, valid_chunk_mask to a new shared module nemo_rl/algorithms/x_token/utils.py.
- Extract projection-file parsing into utils.parse_projection_file; tokenalign.py and loss_functions.py both go through it.
- Move per-instance projection-matrix caches to process-local caches in utils.get_sparse_projection_matrix / get_topk_projection. The driver no longer holds large CUDA tensors; each Ray worker fills its own cache on first loss call.
Signed-off-by: Adithya Hanasoge <avenkateshha@nvidia.com>

make checkpoint paths absolute, save lr scheduler state to checkpoint

ad32cb6

Signed-off-by: ashors1 <ashors@nvidia.com>

ashors1 requested review from SahilJain314 and parthchadha March 21, 2025 03:19

SahilJain314 approved these changes Mar 21, 2025

SahilJain314 merged commit c98ab97 into main Mar 21, 2025

SahilJain314 deleted the ashors/ckpt-fixes branch March 21, 2025 03:31

KiddoZhu pushed a commit that referenced this pull request May 6, 2025

Checkpointing fixes (#9)

19a084a

Signed-off-by: ashors1 <ashors@nvidia.com>

brluobt mentioned this pull request Mar 2, 2026

GRPO Quick Start guide missing prerequisites, env vars, and troubleshooting for single-node H20 setup #2043

Open

copy-pr-bot Bot pushed a commit that referenced this pull request May 19, 2026

Merge pull request #9 from jQizhang/gemma4-support-518

bd275dd

Update gemma4 recipe to use fusedAdam and update Automodel
