docs: GRPO documentation and Configuration cleanup#7
Merged
Conversation
bbc08f2 to
3cdffe3
Compare
terrykong
reviewed
Mar 21, 2025
terrykong
reviewed
Mar 21, 2025
terrykong
reviewed
Mar 21, 2025
terrykong
reviewed
Mar 21, 2025
terrykong
reviewed
Mar 21, 2025
parthchadha
reviewed
Mar 21, 2025
parthchadha
previously approved these changes
Mar 21, 2025
7a001c0 to
b99a400
Compare
terrykong
previously approved these changes
Mar 21, 2025
Signed-off-by: Sahil Jain <sahilj@nvidia.com>
Signed-off-by: Sahil Jain <sahilj@nvidia.com>
Signed-off-by: Sahil Jain <sahilj@nvidia.com>
Signed-off-by: Sahil Jain <sahilj@nvidia.com>
Signed-off-by: Sahil Jain <sahilj@nvidia.com>
Signed-off-by: Sahil Jain <sahilj@nvidia.com>
b99a400 to
bb7b255
Compare
parthchadha
approved these changes
Mar 21, 2025
KiddoZhu
pushed a commit
that referenced
this pull request
May 6, 2025
Signed-off-by: Sahil Jain <sahilj@nvidia.com>
sharonyu-115
referenced
this pull request
in sharonyu-115/RL
Apr 27, 2026
Validated in jobs 11363285 (bake) + 11363286 (inference):
greedy generation against DeepSeek-V4-Flash-Base produced coherent text
("The capital of France is" -> " Paris. The capital of Germany is
Berlin..."). The original `_load_w13: size of tensor a (2048) must
match tensor b (16) at dim 0` blocker is now cleared at the vLLM half;
Automodel half still pending teammate's branch.
## What changed
### pyproject.toml + uv.lock — torch 2.10 rollback
Larkz's vllm wheel `vllm-0.19.2rc1.dev219+gc0879d948.cu129-cp313-cp313`
ships METADATA `Requires-Dist: torch==2.11.0` but its `_C.abi3.so` was
compiled in a `torch==2.10.0` venv (per larkz/nemorl-ds4/pyproject.toml
pin) and references the OLD 4-arg `MessageLogger(char const*, int, int,
bool)` constructor that torch 2.11's libc10 dropped. First bake (job
11362567) failed at venv rebuild with `undefined symbol:
_ZN3c1013MessageLoggerC1EPKciib`.
Reverted in 5 sites:
- project.dependencies torch 2.11.0 -> 2.10.0
- project.dependencies torchvision 0.26.0 -> 0.25.0
- dependency-groups.build torch 2.11.0 -> 2.10.0
- tool.uv.override-dependencies torch / torchaudio -> 2.10.0
- tool.uv.override-dependencies + explicit torchvision==0.25.0
uv lock confirmed: torch / torchaudio / torchvision rolled to 2.10.0+cu129
/ 2.10.0+cu129 / 0.25.0+cu129; nccl 2.28.9 -> 2.27.5 transitively.
### tools/patch_vllm_dsv4_base_fp8_quick.sh — post-#40860 anchors
Anchor #7 rewritten: upstream's rewritten `load_weights` now captures
the result before calling `self.model.finalize_mega_moe_weights()`, so
"return loader.load_weights(...)" no longer matches; new anchor
"loaded_params = loader.load_weights(...)".
Added 8th anchor: forces `use_mega_moe = False` when
`VLLM_DSV4_BASE_FP8=1`. Without this, post-load
`finalize_mega_moe_weights()` would invoke `experts.finalize_weights()`
on a non-MegaMoE FusedMoE layer (Base routes through `Fp8MoEMethod`,
not `DeepseekV4MegaMoEExperts`). Self-contained env gating means
recipes don't also need to override `moe_backend`.
Verified against the larkz wheel: 9 anchors apply cleanly,
`py_compile` clean.
### env_refresh_dsv4.sh — bake the Base patch
Now applies two patches over the freshly-built worker venv before
`--container-save`:
1. `tools/vllm_deepseek_v4_config_patch.py` (defensive superset of
upstream main's narrow rope kwargs fix)
2. `tools/patch_vllm_dsv4_base_fp8_quick.sh` (env-gated Base FP8
support; idempotent, no-op when `VLLM_DSV4_BASE_FP8` unset)
### tools/test_vllm_dsv4_base_inference.py (new)
Counterpart to test_vllm_dsv4_inference.py for Flash. Sets
`VLLM_DSV4_BASE_FP8=1` before vllm import, re-applies the patch script
defensively (idempotent), runs greedy generation through standalone
`vllm.LLM`. Ran clean against the new sqsh.
### slurm_jobs/dsv4_base_{bake,inference_test}.sub (new)
Submit scripts for the chained bake + inference flow:
- bake: 1 node 8 GPU 3:55 walltime, --container-save
- inference: 1 node 8 GPU 1:00 walltime, --dependency=afterok
### docs/model-readiness/deepseek-v4* (new + updated)
Tracks the full DSV4 bring-up. Bring-up status doc now reflects:
- vLLM half of Base blocker CLEARED
- Torch ABI rollback chronicled
- New sqsh artifact path
(.../nemo-rl-dsv4-vllm-c0879d-base-torch210-2026-04-27.sqsh ~99.8 GiB)
- Inference job 11363286 success, generated text logged
- Old broken sqsh marked for forensics
## What still blocks
Automodel-side Base FP8-block-quant state_dict_adapter is pending the
teammate's new branch. Once that lands, sqsh rebake + GRPO smoke can
follow. The vLLM-side recipe override `load_format: auto` may now be
removable since the new wheel includes the upstream dummy-load fix
(`finalize_weights()` 3 hits in main's deepseek_v4.py); validation
deferred until first GRPO smoke attempt.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Shuang Yu <shuangy@nvidia.com>
sharonyu-115
referenced
this pull request
in sharonyu-115/RL
Apr 27, 2026
Re-applies the in-scope subset of reverted ef7bfb4 (kept the docs/scripts out of git per user request). pyproject.toml + uv.lock — torch 2.11 -> 2.10: larkz's vllm wheel (vllm-0.19.2rc1.dev219+gc0879d948) ships METADATA "Requires-Dist: torch==2.11.0" but its _C.abi3.so was compiled in a torch==2.10 venv (per larkz/nemorl-ds4/pyproject.toml pin) and references the OLD 4-arg MessageLogger(char const*, int, int, bool) constructor that torch 2.11's libc10 dropped. Bake job 11362567 failed at venv rebuild with `undefined symbol _ZN3c1013MessageLoggerC1EPKciib`. Reverted in 5 sites (project deps, build group, override-deps). uv lock confirms torch / torchaudio / torchvision rolled to 2.10.0+cu129 / 2.10.0+cu129 / 0.25.0+cu129; nccl 2.28.9 -> 2.27.5 transitively. Bake job 11363285 + inference job 11363286 verified the fix end-to-end. tools/patch_vllm_dsv4_base_fp8_quick.sh — post-#40860 anchors: - Anchor #7 rewritten: upstream's rewritten load_weights now captures the result before calling self.model.finalize_mega_moe_weights(), so "return loader.load_weights(...)" no longer matches; new anchor "loaded_params = loader.load_weights(...)". - Added 8th anchor: forces use_mega_moe = False when VLLM_DSV4_BASE_FP8=1. Without this, post-load finalize_mega_moe_weights() would invoke experts.finalize_weights() on a non-MegaMoE FusedMoE layer (Base routes through Fp8MoEMethod, not DeepseekV4MegaMoEExperts). Self-contained env gating means recipes don't also need to override moe_backend. Verified: 9 anchors apply cleanly against the larkz wheel, py_compile clean, end-to-end inference passed in job 11363286. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Shuang Yu <shuangy@nvidia.com>
ZhiyuLi-Nvidia
added a commit
that referenced
this pull request
May 15, 2026
…ifecycle
Remove three actor-level wrappers (finish_generation, get_logger_metrics,
clear_logger_metrics) that had zero external callers. The actor's internal
code already calls self.policy_generation.{...} directly at the right
points inside rollout_to_tq; the wrappers added indirection without value.
Rewrite the rollout_to_tq docstring to list all six steps bundled into
the single Ray RPC (reset metrics -> rollout -> flatten -> TQ put ->
release GPU -> capture metrics), making the lifecycle visible without
having to read the method body.
Per yuki-97 PR review (#7, #8).
Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com>
ZhiyuLi-Nvidia
added a commit
that referenced
this pull request
May 15, 2026
…metadata
* ``_get_local_node_ip`` now rejects both link-local AND loopback
addresses. A host whose ``/etc/hosts`` maps the hostname to
127.0.0.1 would otherwise announce an unroutable address to
Mooncake peers, causing cross-node ``connection refused``.
Adds a regression test (``test_local_node_ip_skips_loopback``)
and updates the module docstring (previously noted loopback was
NOT skipped).
* ``_patch_tq_actor_runtime_env`` no longer hard-codes the
``TransferQueue`` git+SHA pin. It now reads it from nemo-rl's
installed metadata via ``importlib.metadata.requires("nemo-rl")``
(new ``_resolve_tq_pin`` helper). pyproject.toml is the single
source of truth; the two pins cannot drift. Adds a TODO to drop
the whole patch once the nightly container is published with TQ
baked in (at that point the injection becomes pure overhead).
* ``_tq()`` import-error hint no longer references a stale ``0.1.6``
version — points users to pyproject.toml instead.
Per terryk PR review (#7, #8).
Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com>
ZhiyuLi-Nvidia
added a commit
that referenced
this pull request
May 19, 2026
…ifecycle
Remove three actor-level wrappers (finish_generation, get_logger_metrics,
clear_logger_metrics) that had zero external callers. The actor's internal
code already calls self.policy_generation.{...} directly at the right
points inside rollout_to_tq; the wrappers added indirection without value.
Rewrite the rollout_to_tq docstring to list all six steps bundled into
the single Ray RPC (reset metrics -> rollout -> flatten -> TQ put ->
release GPU -> capture metrics), making the lifecycle visible without
having to read the method body.
Per yuki-97 PR review (#7, #8).
Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com>
ZhiyuLi-Nvidia
added a commit
that referenced
this pull request
May 19, 2026
…metadata
* ``_get_local_node_ip`` now rejects both link-local AND loopback
addresses. A host whose ``/etc/hosts`` maps the hostname to
127.0.0.1 would otherwise announce an unroutable address to
Mooncake peers, causing cross-node ``connection refused``.
Adds a regression test (``test_local_node_ip_skips_loopback``)
and updates the module docstring (previously noted loopback was
NOT skipped).
* ``_patch_tq_actor_runtime_env`` no longer hard-codes the
``TransferQueue`` git+SHA pin. It now reads it from nemo-rl's
installed metadata via ``importlib.metadata.requires("nemo-rl")``
(new ``_resolve_tq_pin`` helper). pyproject.toml is the single
source of truth; the two pins cannot drift. Adds a TODO to drop
the whole patch once the nightly container is published with TQ
baked in (at that point the injection becomes pure overhead).
* ``_tq()`` import-error hint no longer references a stale ``0.1.6``
version — points users to pyproject.toml instead.
Per terryk PR review (#7, #8).
Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com>
ZhiyuLi-Nvidia
added a commit
that referenced
this pull request
May 19, 2026
…ifecycle
Remove three actor-level wrappers (finish_generation, get_logger_metrics,
clear_logger_metrics) that had zero external callers. The actor's internal
code already calls self.policy_generation.{...} directly at the right
points inside rollout_to_tq; the wrappers added indirection without value.
Rewrite the rollout_to_tq docstring to list all six steps bundled into
the single Ray RPC (reset metrics -> rollout -> flatten -> TQ put ->
release GPU -> capture metrics), making the lifecycle visible without
having to read the method body.
Per yuki-97 PR review (#7, #8).
Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com>
ZhiyuLi-Nvidia
added a commit
that referenced
this pull request
May 19, 2026
…metadata
* ``_get_local_node_ip`` now rejects both link-local AND loopback
addresses. A host whose ``/etc/hosts`` maps the hostname to
127.0.0.1 would otherwise announce an unroutable address to
Mooncake peers, causing cross-node ``connection refused``.
Adds a regression test (``test_local_node_ip_skips_loopback``)
and updates the module docstring (previously noted loopback was
NOT skipped).
* ``_patch_tq_actor_runtime_env`` no longer hard-codes the
``TransferQueue`` git+SHA pin. It now reads it from nemo-rl's
installed metadata via ``importlib.metadata.requires("nemo-rl")``
(new ``_resolve_tq_pin`` helper). pyproject.toml is the single
source of truth; the two pins cannot drift. Adds a TODO to drop
the whole patch once the nightly container is published with TQ
baked in (at that point the injection becomes pure overhead).
* ``_tq()`` import-error hint no longer references a stale ``0.1.6``
version — points users to pyproject.toml instead.
Per terryk PR review (#7, #8).
Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com>
avenkateshha
added a commit
to avenkateshha/RL
that referenced
this pull request
May 20, 2026
Comments addressed: #3, #5, NVIDIA-NeMo#7, NVIDIA-NeMo#8, NVIDIA-NeMo#9, NVIDIA-NeMo#10, NVIDIA-NeMo#11. - Rename _load_M -> _get_sparse_projection_matrix and _load_dense_projection -> _get_topk_projection (later removed in favor of module-level cache helpers below). - Drop unused alignment_student_spans / alignment_teacher_spans from the cross-tokenizer batch payload. - Remove NRL_XTOKEN_LOSS_DUMP_DIR debug-dump side effect. - Move Fp32SparseMM, chunk_average_log_probs, valid_chunk_mask to a new shared module nemo_rl/algorithms/x_token/utils.py. - Extract projection-file parsing into utils.parse_projection_file; tokenalign.py and loss_functions.py both go through it. - Move per-instance projection-matrix caches to process-local caches in utils.get_sparse_projection_matrix / get_topk_projection. The driver no longer holds large CUDA tensors; each Ray worker fills its own cache on first loss call. Signed-off-by: Adithya Hanasoge <avenkateshha@nvidia.com>
ZhiyuLi-Nvidia
added a commit
that referenced
this pull request
May 22, 2026
…ifecycle
Remove three actor-level wrappers (finish_generation, get_logger_metrics,
clear_logger_metrics) that had zero external callers. The actor's internal
code already calls self.policy_generation.{...} directly at the right
points inside rollout_to_tq; the wrappers added indirection without value.
Rewrite the rollout_to_tq docstring to list all six steps bundled into
the single Ray RPC (reset metrics -> rollout -> flatten -> TQ put ->
release GPU -> capture metrics), making the lifecycle visible without
having to read the method body.
Per yuki-97 PR review (#7, #8).
Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com>
ZhiyuLi-Nvidia
added a commit
that referenced
this pull request
May 22, 2026
…metadata
* ``_get_local_node_ip`` now rejects both link-local AND loopback
addresses. A host whose ``/etc/hosts`` maps the hostname to
127.0.0.1 would otherwise announce an unroutable address to
Mooncake peers, causing cross-node ``connection refused``.
Adds a regression test (``test_local_node_ip_skips_loopback``)
and updates the module docstring (previously noted loopback was
NOT skipped).
* ``_patch_tq_actor_runtime_env`` no longer hard-codes the
``TransferQueue`` git+SHA pin. It now reads it from nemo-rl's
installed metadata via ``importlib.metadata.requires("nemo-rl")``
(new ``_resolve_tq_pin`` helper). pyproject.toml is the single
source of truth; the two pins cannot drift. Adds a TODO to drop
the whole patch once the nightly container is published with TQ
baked in (at that point the injection becomes pure overhead).
* ``_tq()`` import-error hint no longer references a stale ``0.1.6``
version — points users to pyproject.toml instead.
Per terryk PR review (#7, #8).
Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com>
ZhiyuLi-Nvidia
added a commit
that referenced
this pull request
May 23, 2026
…ifecycle
Remove three actor-level wrappers (finish_generation, get_logger_metrics,
clear_logger_metrics) that had zero external callers. The actor's internal
code already calls self.policy_generation.{...} directly at the right
points inside rollout_to_tq; the wrappers added indirection without value.
Rewrite the rollout_to_tq docstring to list all six steps bundled into
the single Ray RPC (reset metrics -> rollout -> flatten -> TQ put ->
release GPU -> capture metrics), making the lifecycle visible without
having to read the method body.
Per yuki-97 PR review (#7, #8).
Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com>
ZhiyuLi-Nvidia
added a commit
that referenced
this pull request
May 23, 2026
…metadata
* ``_get_local_node_ip`` now rejects both link-local AND loopback
addresses. A host whose ``/etc/hosts`` maps the hostname to
127.0.0.1 would otherwise announce an unroutable address to
Mooncake peers, causing cross-node ``connection refused``.
Adds a regression test (``test_local_node_ip_skips_loopback``)
and updates the module docstring (previously noted loopback was
NOT skipped).
* ``_patch_tq_actor_runtime_env`` no longer hard-codes the
``TransferQueue`` git+SHA pin. It now reads it from nemo-rl's
installed metadata via ``importlib.metadata.requires("nemo-rl")``
(new ``_resolve_tq_pin`` helper). pyproject.toml is the single
source of truth; the two pins cannot drift. Adds a TODO to drop
the whole patch once the nightly container is published with TQ
baked in (at that point the injection becomes pure overhead).
* ``_tq()`` import-error hint no longer references a stale ``0.1.6``
version — points users to pyproject.toml instead.
Per terryk PR review (#7, #8).
Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com>
avenkateshha
added a commit
to avenkateshha/RL
that referenced
this pull request
May 27, 2026
Comments addressed: #3, #5, NVIDIA-NeMo#7, NVIDIA-NeMo#8, NVIDIA-NeMo#9, NVIDIA-NeMo#10, NVIDIA-NeMo#11. - Rename _load_M -> _get_sparse_projection_matrix and _load_dense_projection -> _get_topk_projection (later removed in favor of module-level cache helpers below). - Drop unused alignment_student_spans / alignment_teacher_spans from the cross-tokenizer batch payload. - Remove NRL_XTOKEN_LOSS_DUMP_DIR debug-dump side effect. - Move Fp32SparseMM, chunk_average_log_probs, valid_chunk_mask to a new shared module nemo_rl/algorithms/x_token/utils.py. - Extract projection-file parsing into utils.parse_projection_file; tokenalign.py and loss_functions.py both go through it. - Move per-instance projection-matrix caches to process-local caches in utils.get_sparse_projection_matrix / get_topk_projection. The driver no longer holds large CUDA tensors; each Ray worker fills its own cache on first loss call. Signed-off-by: Adithya Hanasoge <avenkateshha@nvidia.com>
copy-pr-bot Bot
pushed a commit
that referenced
this pull request
Jun 7, 2026
Comments addressed: #3, #5, #7, #8, #9, #10, #11. - Rename _load_M -> _get_sparse_projection_matrix and _load_dense_projection -> _get_topk_projection (later removed in favor of module-level cache helpers below). - Drop unused alignment_student_spans / alignment_teacher_spans from the cross-tokenizer batch payload. - Remove NRL_XTOKEN_LOSS_DUMP_DIR debug-dump side effect. - Move Fp32SparseMM, chunk_average_log_probs, valid_chunk_mask to a new shared module nemo_rl/algorithms/x_token/utils.py. - Extract projection-file parsing into utils.parse_projection_file; tokenalign.py and loss_functions.py both go through it. - Move per-instance projection-matrix caches to process-local caches in utils.get_sparse_projection_matrix / get_topk_projection. The driver no longer holds large CUDA tensors; each Ray worker fills its own cache on first loss call. Signed-off-by: Adithya Hanasoge <avenkateshha@nvidia.com>
copy-pr-bot Bot
pushed a commit
that referenced
this pull request
Jun 7, 2026
Comments addressed: #3, #5, #7, #8, #9, #10, #11. - Rename _load_M -> _get_sparse_projection_matrix and _load_dense_projection -> _get_topk_projection (later removed in favor of module-level cache helpers below). - Drop unused alignment_student_spans / alignment_teacher_spans from the cross-tokenizer batch payload. - Remove NRL_XTOKEN_LOSS_DUMP_DIR debug-dump side effect. - Move Fp32SparseMM, chunk_average_log_probs, valid_chunk_mask to a new shared module nemo_rl/algorithms/x_token/utils.py. - Extract projection-file parsing into utils.parse_projection_file; tokenalign.py and loss_functions.py both go through it. - Move per-instance projection-matrix caches to process-local caches in utils.get_sparse_projection_matrix / get_topk_projection. The driver no longer holds large CUDA tensors; each Ray worker fills its own cache on first loss call. Signed-off-by: Adithya Hanasoge <avenkateshha@nvidia.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What does this PR do ?
Adds Quick-Start, Example Notebook, and Detailed Explanation documentation for GRPO