feat: reinforcer initial commit by terrykong · Pull Request #3 · NVIDIA-NeMo/RL

terrykong · 2025-03-20T21:54:10Z

Co-authored-by: Sahil Jain <sahil.jain5125@gmail.com>
Co-authored-by: Parth Chadha <parth29@gmail.com>
Co-authored-by: Terry Kong <terryk@nvidia.com>
Co-authored-by: Anna Shors <ashors@nvidia.com>
Co-authored-by: Gerald Shen <geshen@nvidia.com>
Co-authored-by: Yuki Huang <yukih@nvidia.com>
Co-authored-by: Hemil Desai <hemild@nvidia.com>
Co-authored-by: Yi-Fu Wu <yifuw@nvidia.com>
Co-authored-by: ahmadki <ahmadki@users.noreply.github.com>
Co-authored-by: Nathan McKimpson <nmckimpson@nvidia.com>
Co-authored-by: Charlie Truong <chtruong@nvidia.com>
Signed-off-by: Terry Kong <terryk@nvidia.com>

* add Qwen3.5-35B megatron nightly tests (LLM + VLM)
* set PYTORCH_CUDA_ALLOC_CONF expandable_segments to True to reduce OOM
* change test reward threshold

---------

Signed-off-by: alexchiu <qiuzhaopeng@foxmail.com>
Co-authored-by: alexchiu <qiuzhaopeng@foxmail.com>
Signed-off-by: jQizhang <larkz@nvidia.com>
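As a rough illustration of the allocator setting mentioned above: PyTorch reads ``PYTORCH_CUDA_ALLOC_CONF`` as a comma-separated list of ``option:value`` pairs, and the variable has to be set before the first CUDA allocation. How the nightly test scripts in this PR actually set it is not shown here; the helper name ``with_expandable_segments`` below is hypothetical.

```python
import os


def with_expandable_segments(current: str) -> str:
    """Return an allocator config string with expandable_segments forced to True.

    Hypothetical helper: keeps any other comma-separated options already present.
    """
    opts = [
        opt
        for opt in current.split(",")
        if opt and not opt.startswith("expandable_segments:")
    ]
    opts.append("expandable_segments:True")
    return ",".join(opts)


# Set the variable before torch touches CUDA, otherwise it has no effect.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = with_expandable_segments(
    os.environ.get("PYTORCH_CUDA_ALLOC_CONF", "")
)
```

Merging into the existing value, rather than overwriting it, avoids silently dropping options such as ``max_split_size_mb`` that a user may already have configured.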

…patch TQ path

Closes Issues #3 and #4 raised in PR review of the data-plane stack.

Issue #3: single-``KVBatchMeta`` path returned rows in scrambled order.

``shard_keys_by_seqlen`` sorts by sequence length and strides (``order[r::dp_world_size]``) to balance per-rank token totals. The worker logprob aggregators (``_aggregate_logprob_results``) then concatenate per-rank outputs in rank order via ``BatchedDataDict.from_batches``, without inverting the seqlen-strided permutation. Result: ``policy.get_logprobs(KVBatchMeta(...))`` returned rows in [order[0], order[d], order[2d], …, order[1], order[1+d], …] order, not the caller's ``meta.keys`` order. Silent correctness bug (test_seqpack_legacy_equals_tq didn't catch it because the sync path calls ``policy.get_logprobs(BatchedDataDict)``: legacy passthrough, no sharder).

Fix:

* ``shard_keys_by_seqlen`` records ``_dp_original_indices`` per shard in ``extra_info`` (the ``idx`` list it computed).
* ``dp_dispatch`` reconstructs the concat-position → input-index permutation from the shards' ``extra_info``, then applies the inverse via ``BatchedDataDict.reorder_data`` after ``aggregate``.
* The reorder is gated on ``is_meta and not is_meta_list``: for ``list[KVBatchMeta]`` the driver controls ordering (PR 0 ``fan_out_per_rank_metas``) and the decorator must not undo it.
* Skipped silently if the result isn't a BatchedDataDict (e.g. ``train`` returns a plain dict; order doesn't apply).

Issue #4: TQ path silently dropped legacy training semantics.

The decorator's TQ branch returns ``aggregate(results)`` directly and never enters ``Policy.train``'s body, so the FLOPs accumulation at lm_policy.py around the ``flops_tracker`` block, plus the ``num_ranks`` and ``theoretical_tflops`` fields, were missing from results when the trainer called ``policy.train(KVBatchMeta)`` or ``policy.train(list[KVBatchMeta])``. Same gap for the missing GBS / DP divisibility assertion.

Fix (additive; no signature changes to the existing aggregate callables):

* ``dp_dispatch`` adds a basic divisibility assertion on the TQ path: ``total_meta_size % dp_size == 0`` (legacy path enforces this via ``shard_by_batch_size(batch_size=gbs)``; TQ path skips that call site).
* ``dp_dispatch`` looks up ``self._dp_post_<method_name>`` after ``aggregate``. If defined, calls ``post(aggregated, raw_results, shards=shards)`` and uses its return value. Convention-based: opt-in per Policy method, no decorator boilerplate.
* ``Policy._dp_post_train`` recovers FLOPs from ``meta.sequence_lengths`` on each shard (driver-pre-balanced for ``list[KVBatchMeta]``, sharder-strided for single ``KVBatchMeta``), records ``total_flops``, ``num_ranks``, ``theoretical_tflops``, the same fields the legacy body produces.

Backward-compat: existing tests in tests/data_plane/unit/test_shard_parity.py and test_dispatch.py don't check ``extra_info`` shape on sharder output or assert on aggregate-method return type other than what's already returned, so the additive fields and gated reorder are transparent. The legacy ``policy.train(BatchedDataDict)`` path is unchanged: it keeps building results inline and never enters the new hook.

Async-on-TQ (PR 4) and grpo_sync (PR 0) both use the ``list[KVBatchMeta]`` path, so they inherit the FLOPs fix automatically via the post-hook. The reorder fix is only meaningful for callers that pass single ``KVBatchMeta``, primarily future logprob/reference-logprob TQ wiring; flagged in commit message of #3 above.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com>
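To make the Issue #3 reordering concrete, here is a minimal sketch using plain Python lists in place of ``KVBatchMeta`` and ``BatchedDataDict``. The function names ``shard_by_seqlen`` and ``restore_input_order`` are illustrative stand-ins, not the repository's actual ``shard_keys_by_seqlen`` or ``reorder_data`` implementations; only the ``order[r::dp_world_size]`` striding and the idea of inverting the concat-position → input-index permutation come from the commit message.

```python
from typing import List


def shard_by_seqlen(seq_lens: List[int], dp_world_size: int) -> List[List[int]]:
    """Sort row indices by sequence length, then deal them out with a stride."""
    order = sorted(range(len(seq_lens)), key=lambda i: seq_lens[i])
    # Shard r receives order[r], order[r + d], order[r + 2d], ...
    return [order[r::dp_world_size] for r in range(dp_world_size)]


def restore_input_order(concat_rows: list, shard_indices: List[List[int]]) -> list:
    """Undo the sharding permutation after rank-order concatenation."""
    # concat_to_input[p] is the original row index that ended up at position p.
    concat_to_input = [i for idx in shard_indices for i in idx]
    restored = [None] * len(concat_rows)
    for pos, original in enumerate(concat_to_input):
        restored[original] = concat_rows[pos]
    return restored


seq_lens = [7, 2, 9, 4]  # rows 0..3 in the caller's order
shards = shard_by_seqlen(seq_lens, dp_world_size=2)

# Each "rank" returns one output per row it received; a label stands in for logprobs.
per_rank = [[f"row{i}" for i in idx] for idx in shards]
concat = [row for rank_out in per_rank for row in rank_out]

restored = restore_input_order(concat, shards)
print(concat)    # scrambled relative to the caller's order
print(restored)  # back in the caller's order
```

With these inputs the shards are ``[[1, 0], [3, 2]]``, so rank-order concatenation yields rows 1, 0, 3, 2; applying the inverse permutation puts them back as rows 0, 1, 2, 3.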
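The convention-based post-hook and the divisibility check from the Issue #4 fix can be sketched roughly as follows. This is a toy under stated assumptions: ``ToyPolicy``, ``dispatch``, ``run_shard``, and the dict-shaped results are hypothetical and do not mirror the real ``dp_dispatch`` decorator; only the ``_dp_post_<method_name>`` lookup, the ``post(aggregated, raw_results, shards=shards)`` call shape, and the ``total_meta_size % dp_size == 0`` assertion are taken from the commit message.

```python
from typing import Any, Callable, Dict, List


class ToyPolicy:
    """Hypothetical stand-in for a Policy that opts in to a post-hook for ``train``."""

    dp_size = 2

    def _dp_post_train(self, aggregated, raw_results, shards=None):
        # Add a field that a plain aggregate would not produce.
        out = dict(aggregated)
        out["num_ranks"] = len(raw_results)
        return out


def dispatch(
    obj: Any,
    method_name: str,
    shards: List[List[int]],
    run_shard: Callable[[List[int]], Dict[str, float]],
    aggregate: Callable[[List[Dict[str, float]]], Dict[str, float]],
) -> Dict[str, float]:
    """Run one call per shard, aggregate, then apply an optional post-hook."""
    total_meta_size = sum(len(s) for s in shards)
    assert total_meta_size % obj.dp_size == 0, (
        f"{total_meta_size} rows are not divisible by dp_size={obj.dp_size}"
    )

    raw_results = [run_shard(s) for s in shards]
    aggregated = aggregate(raw_results)

    # Opt-in by naming convention: methods without a hook are left untouched.
    post = getattr(obj, f"_dp_post_{method_name}", None)
    if post is not None:
        aggregated = post(aggregated, raw_results, shards=shards)
    return aggregated


def mean_loss(results: List[Dict[str, float]]) -> Dict[str, float]:
    return {"loss": sum(r["loss"] for r in results) / len(results)}


policy = ToyPolicy()
toy_shards = [[1, 0], [3, 2]]
fake_worker = lambda shard: {"loss": float(len(shard))}

train_out = dispatch(policy, "train", toy_shards, fake_worker, mean_loss)
logprob_out = dispatch(policy, "get_logprobs", toy_shards, fake_worker, mean_loss)
```

Here ``train_out`` carries the extra ``num_ranks`` field because ``_dp_post_train`` exists, while ``logprob_out`` is the bare aggregate since no ``_dp_post_get_logprobs`` is defined.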

* grpo_sync.py: remove unused batch_cache = None (leftover from grpo.py-style dynamic sampling; grpo_sync threads survivors through pending_meta / pending_slice).
* TQPolicy: rename _dp_client -> dp_client and _tq_partition_id -> tq_partition_id. They are read from grpo_sync.py in 7 places, so the underscore prefix was misleading. Constructor kwarg tq_partition_id already matched the new attribute name.
* Update README + data_plane_api_lifecycle docs example snippets.

Per yuki-97 PR review (#3, #4).

Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com>