docs: GRPO documentation and Configuration cleanup by SahilJain314 · Pull Request #7 · NVIDIA-NeMo/RL

SahilJain314 · 2025-03-21T00:43:49Z

What does this PR do ?

Adds Quick-Start, Example Notebook, and Detailed Explanation documentation for GRPO

Signed-off-by: Sahil Jain <sahilj@nvidia.com>

Validated in jobs 11363285 (bake) + 11363286 (inference): greedy generation against DeepSeek-V4-Flash-Base produced coherent text ("The capital of France is" -> " Paris. The capital of Germany is Berlin..."). The original `_load_w13: size of tensor a (2048) must match tensor b (16) at dim 0` blocker is now cleared at the vLLM half; Automodel half still pending teammate's branch. ## What changed ### pyproject.toml + uv.lock — torch 2.10 rollback Larkz's vllm wheel `vllm-0.19.2rc1.dev219+gc0879d948.cu129-cp313-cp313` ships METADATA `Requires-Dist: torch==2.11.0` but its `_C.abi3.so` was compiled in a `torch==2.10.0` venv (per larkz/nemorl-ds4/pyproject.toml pin) and references the OLD 4-arg `MessageLogger(char const*, int, int, bool)` constructor that torch 2.11's libc10 dropped. First bake (job 11362567) failed at venv rebuild with `undefined symbol: _ZN3c1013MessageLoggerC1EPKciib`. Reverted in 5 sites: - project.dependencies torch 2.11.0 -> 2.10.0 - project.dependencies torchvision 0.26.0 -> 0.25.0 - dependency-groups.build torch 2.11.0 -> 2.10.0 - tool.uv.override-dependencies torch / torchaudio -> 2.10.0 - tool.uv.override-dependencies + explicit torchvision==0.25.0 uv lock confirmed: torch / torchaudio / torchvision rolled to 2.10.0+cu129 / 2.10.0+cu129 / 0.25.0+cu129; nccl 2.28.9 -> 2.27.5 transitively. ### tools/patch_vllm_dsv4_base_fp8_quick.sh — post-#40860 anchors Anchor #7 rewritten: upstream's rewritten `load_weights` now captures the result before calling `self.model.finalize_mega_moe_weights()`, so "return loader.load_weights(...)" no longer matches; new anchor "loaded_params = loader.load_weights(...)". Added 8th anchor: forces `use_mega_moe = False` when `VLLM_DSV4_BASE_FP8=1`. Without this, post-load `finalize_mega_moe_weights()` would invoke `experts.finalize_weights()` on a non-MegaMoE FusedMoE layer (Base routes through `Fp8MoEMethod`, not `DeepseekV4MegaMoEExperts`). Self-contained env gating means recipes don't also need to override `moe_backend`. Verified against the larkz wheel: 9 anchors apply cleanly, `py_compile` clean. ### env_refresh_dsv4.sh — bake the Base patch Now applies two patches over the freshly-built worker venv before `--container-save`: 1. `tools/vllm_deepseek_v4_config_patch.py` (defensive superset of upstream main's narrow rope kwargs fix) 2. `tools/patch_vllm_dsv4_base_fp8_quick.sh` (env-gated Base FP8 support; idempotent, no-op when `VLLM_DSV4_BASE_FP8` unset) ### tools/test_vllm_dsv4_base_inference.py (new) Counterpart to test_vllm_dsv4_inference.py for Flash. Sets `VLLM_DSV4_BASE_FP8=1` before vllm import, re-applies the patch script defensively (idempotent), runs greedy generation through standalone `vllm.LLM`. Ran clean against the new sqsh. ### slurm_jobs/dsv4_base_{bake,inference_test}.sub (new) Submit scripts for the chained bake + inference flow: - bake: 1 node 8 GPU 3:55 walltime, --container-save - inference: 1 node 8 GPU 1:00 walltime, --dependency=afterok ### docs/model-readiness/deepseek-v4* (new + updated) Tracks the full DSV4 bring-up. Bring-up status doc now reflects: - vLLM half of Base blocker CLEARED - Torch ABI rollback chronicled - New sqsh artifact path (.../nemo-rl-dsv4-vllm-c0879d-base-torch210-2026-04-27.sqsh ~99.8 GiB) - Inference job 11363286 success, generated text logged - Old broken sqsh marked for forensics ## What still blocks Automodel-side Base FP8-block-quant state_dict_adapter is pending the teammate's new branch. Once that lands, sqsh rebake + GRPO smoke can follow. The vLLM-side recipe override `load_format: auto` may now be removable since the new wheel includes the upstream dummy-load fix (`finalize_weights()` 3 hits in main's deepseek_v4.py); validation deferred until first GRPO smoke attempt. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Shuang Yu <shuangy@nvidia.com>

Re-applies the in-scope subset of reverted ef7bfb4 (kept the docs/scripts out of git per user request). pyproject.toml + uv.lock — torch 2.11 -> 2.10: larkz's vllm wheel (vllm-0.19.2rc1.dev219+gc0879d948) ships METADATA "Requires-Dist: torch==2.11.0" but its _C.abi3.so was compiled in a torch==2.10 venv (per larkz/nemorl-ds4/pyproject.toml pin) and references the OLD 4-arg MessageLogger(char const*, int, int, bool) constructor that torch 2.11's libc10 dropped. Bake job 11362567 failed at venv rebuild with `undefined symbol _ZN3c1013MessageLoggerC1EPKciib`. Reverted in 5 sites (project deps, build group, override-deps). uv lock confirms torch / torchaudio / torchvision rolled to 2.10.0+cu129 / 2.10.0+cu129 / 0.25.0+cu129; nccl 2.28.9 -> 2.27.5 transitively. Bake job 11363285 + inference job 11363286 verified the fix end-to-end. tools/patch_vllm_dsv4_base_fp8_quick.sh — post-#40860 anchors: - Anchor #7 rewritten: upstream's rewritten load_weights now captures the result before calling self.model.finalize_mega_moe_weights(), so "return loader.load_weights(...)" no longer matches; new anchor "loaded_params = loader.load_weights(...)". - Added 8th anchor: forces use_mega_moe = False when VLLM_DSV4_BASE_FP8=1. Without this, post-load finalize_mega_moe_weights() would invoke experts.finalize_weights() on a non-MegaMoE FusedMoE layer (Base routes through Fp8MoEMethod, not DeepseekV4MegaMoEExperts). Self-contained env gating means recipes don't also need to override moe_backend. Verified: 9 anchors apply cleanly against the larkz wheel, py_compile clean, end-to-end inference passed in job 11363286. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Shuang Yu <shuangy@nvidia.com>

…ifecycle Remove three actor-level wrappers (finish_generation, get_logger_metrics, clear_logger_metrics) that had zero external callers. The actor's internal code already calls self.policy_generation.{...} directly at the right points inside rollout_to_tq; the wrappers added indirection without value. Rewrite the rollout_to_tq docstring to list all six steps bundled into the single Ray RPC (reset metrics -> rollout -> flatten -> TQ put -> release GPU -> capture metrics), making the lifecycle visible without having to read the method body. Per yuki-97 PR review (#7, #8). Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com>

…metadata * ``_get_local_node_ip`` now rejects both link-local AND loopback addresses. A host whose ``/etc/hosts`` maps the hostname to 127.0.0.1 would otherwise announce an unroutable address to Mooncake peers, causing cross-node ``connection refused``. Adds a regression test (``test_local_node_ip_skips_loopback``) and updates the module docstring (previously noted loopback was NOT skipped). * ``_patch_tq_actor_runtime_env`` no longer hard-codes the ``TransferQueue`` git+SHA pin. It now reads it from nemo-rl's installed metadata via ``importlib.metadata.requires("nemo-rl")`` (new ``_resolve_tq_pin`` helper). pyproject.toml is the single source of truth; the two pins cannot drift. Adds a TODO to drop the whole patch once the nightly container is published with TQ baked in (at that point the injection becomes pure overhead). * ``_tq()`` import-error hint no longer references a stale ``0.1.6`` version — points users to pyproject.toml instead. Per terryk PR review (#7, #8). Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com>

…ifecycle Remove three actor-level wrappers (finish_generation, get_logger_metrics, clear_logger_metrics) that had zero external callers. The actor's internal code already calls self.policy_generation.{...} directly at the right points inside rollout_to_tq; the wrappers added indirection without value. Rewrite the rollout_to_tq docstring to list all six steps bundled into the single Ray RPC (reset metrics -> rollout -> flatten -> TQ put -> release GPU -> capture metrics), making the lifecycle visible without having to read the method body. Per yuki-97 PR review (#7, #8). Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com>

…metadata * ``_get_local_node_ip`` now rejects both link-local AND loopback addresses. A host whose ``/etc/hosts`` maps the hostname to 127.0.0.1 would otherwise announce an unroutable address to Mooncake peers, causing cross-node ``connection refused``. Adds a regression test (``test_local_node_ip_skips_loopback``) and updates the module docstring (previously noted loopback was NOT skipped). * ``_patch_tq_actor_runtime_env`` no longer hard-codes the ``TransferQueue`` git+SHA pin. It now reads it from nemo-rl's installed metadata via ``importlib.metadata.requires("nemo-rl")`` (new ``_resolve_tq_pin`` helper). pyproject.toml is the single source of truth; the two pins cannot drift. Adds a TODO to drop the whole patch once the nightly container is published with TQ baked in (at that point the injection becomes pure overhead). * ``_tq()`` import-error hint no longer references a stale ``0.1.6`` version — points users to pyproject.toml instead. Per terryk PR review (#7, #8). Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com>

…ifecycle Remove three actor-level wrappers (finish_generation, get_logger_metrics, clear_logger_metrics) that had zero external callers. The actor's internal code already calls self.policy_generation.{...} directly at the right points inside rollout_to_tq; the wrappers added indirection without value. Rewrite the rollout_to_tq docstring to list all six steps bundled into the single Ray RPC (reset metrics -> rollout -> flatten -> TQ put -> release GPU -> capture metrics), making the lifecycle visible without having to read the method body. Per yuki-97 PR review (#7, #8). Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com>

…metadata * ``_get_local_node_ip`` now rejects both link-local AND loopback addresses. A host whose ``/etc/hosts`` maps the hostname to 127.0.0.1 would otherwise announce an unroutable address to Mooncake peers, causing cross-node ``connection refused``. Adds a regression test (``test_local_node_ip_skips_loopback``) and updates the module docstring (previously noted loopback was NOT skipped). * ``_patch_tq_actor_runtime_env`` no longer hard-codes the ``TransferQueue`` git+SHA pin. It now reads it from nemo-rl's installed metadata via ``importlib.metadata.requires("nemo-rl")`` (new ``_resolve_tq_pin`` helper). pyproject.toml is the single source of truth; the two pins cannot drift. Adds a TODO to drop the whole patch once the nightly container is published with TQ baked in (at that point the injection becomes pure overhead). * ``_tq()`` import-error hint no longer references a stale ``0.1.6`` version — points users to pyproject.toml instead. Per terryk PR review (#7, #8). Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com>

Comments addressed: #3, #5, NVIDIA-NeMo#7, NVIDIA-NeMo#8, NVIDIA-NeMo#9, NVIDIA-NeMo#10, NVIDIA-NeMo#11. - Rename _load_M -> _get_sparse_projection_matrix and _load_dense_projection -> _get_topk_projection (later removed in favor of module-level cache helpers below). - Drop unused alignment_student_spans / alignment_teacher_spans from the cross-tokenizer batch payload. - Remove NRL_XTOKEN_LOSS_DUMP_DIR debug-dump side effect. - Move Fp32SparseMM, chunk_average_log_probs, valid_chunk_mask to a new shared module nemo_rl/algorithms/x_token/utils.py. - Extract projection-file parsing into utils.parse_projection_file; tokenalign.py and loss_functions.py both go through it. - Move per-instance projection-matrix caches to process-local caches in utils.get_sparse_projection_matrix / get_topk_projection. The driver no longer holds large CUDA tensors; each Ray worker fills its own cache on first loss call. Signed-off-by: Adithya Hanasoge <avenkateshha@nvidia.com>

…ifecycle Remove three actor-level wrappers (finish_generation, get_logger_metrics, clear_logger_metrics) that had zero external callers. The actor's internal code already calls self.policy_generation.{...} directly at the right points inside rollout_to_tq; the wrappers added indirection without value. Rewrite the rollout_to_tq docstring to list all six steps bundled into the single Ray RPC (reset metrics -> rollout -> flatten -> TQ put -> release GPU -> capture metrics), making the lifecycle visible without having to read the method body. Per yuki-97 PR review (#7, #8). Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com>

…metadata * ``_get_local_node_ip`` now rejects both link-local AND loopback addresses. A host whose ``/etc/hosts`` maps the hostname to 127.0.0.1 would otherwise announce an unroutable address to Mooncake peers, causing cross-node ``connection refused``. Adds a regression test (``test_local_node_ip_skips_loopback``) and updates the module docstring (previously noted loopback was NOT skipped). * ``_patch_tq_actor_runtime_env`` no longer hard-codes the ``TransferQueue`` git+SHA pin. It now reads it from nemo-rl's installed metadata via ``importlib.metadata.requires("nemo-rl")`` (new ``_resolve_tq_pin`` helper). pyproject.toml is the single source of truth; the two pins cannot drift. Adds a TODO to drop the whole patch once the nightly container is published with TQ baked in (at that point the injection becomes pure overhead). * ``_tq()`` import-error hint no longer references a stale ``0.1.6`` version — points users to pyproject.toml instead. Per terryk PR review (#7, #8). Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com>

…ifecycle Remove three actor-level wrappers (finish_generation, get_logger_metrics, clear_logger_metrics) that had zero external callers. The actor's internal code already calls self.policy_generation.{...} directly at the right points inside rollout_to_tq; the wrappers added indirection without value. Rewrite the rollout_to_tq docstring to list all six steps bundled into the single Ray RPC (reset metrics -> rollout -> flatten -> TQ put -> release GPU -> capture metrics), making the lifecycle visible without having to read the method body. Per yuki-97 PR review (#7, #8). Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com>

…metadata * ``_get_local_node_ip`` now rejects both link-local AND loopback addresses. A host whose ``/etc/hosts`` maps the hostname to 127.0.0.1 would otherwise announce an unroutable address to Mooncake peers, causing cross-node ``connection refused``. Adds a regression test (``test_local_node_ip_skips_loopback``) and updates the module docstring (previously noted loopback was NOT skipped). * ``_patch_tq_actor_runtime_env`` no longer hard-codes the ``TransferQueue`` git+SHA pin. It now reads it from nemo-rl's installed metadata via ``importlib.metadata.requires("nemo-rl")`` (new ``_resolve_tq_pin`` helper). pyproject.toml is the single source of truth; the two pins cannot drift. Adds a TODO to drop the whole patch once the nightly container is published with TQ baked in (at that point the injection becomes pure overhead). * ``_tq()`` import-error hint no longer references a stale ``0.1.6`` version — points users to pyproject.toml instead. Per terryk PR review (#7, #8). Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com>

Comments addressed: #3, #5, NVIDIA-NeMo#7, NVIDIA-NeMo#8, NVIDIA-NeMo#9, NVIDIA-NeMo#10, NVIDIA-NeMo#11. - Rename _load_M -> _get_sparse_projection_matrix and _load_dense_projection -> _get_topk_projection (later removed in favor of module-level cache helpers below). - Drop unused alignment_student_spans / alignment_teacher_spans from the cross-tokenizer batch payload. - Remove NRL_XTOKEN_LOSS_DUMP_DIR debug-dump side effect. - Move Fp32SparseMM, chunk_average_log_probs, valid_chunk_mask to a new shared module nemo_rl/algorithms/x_token/utils.py. - Extract projection-file parsing into utils.parse_projection_file; tokenalign.py and loss_functions.py both go through it. - Move per-instance projection-matrix caches to process-local caches in utils.get_sparse_projection_matrix / get_topk_projection. The driver no longer holds large CUDA tensors; each Ray worker fills its own cache on first loss call. Signed-off-by: Adithya Hanasoge <avenkateshha@nvidia.com>

Comments addressed: #3, #5, #7, #8, #9, #10, #11. - Rename _load_M -> _get_sparse_projection_matrix and _load_dense_projection -> _get_topk_projection (later removed in favor of module-level cache helpers below). - Drop unused alignment_student_spans / alignment_teacher_spans from the cross-tokenizer batch payload. - Remove NRL_XTOKEN_LOSS_DUMP_DIR debug-dump side effect. - Move Fp32SparseMM, chunk_average_log_probs, valid_chunk_mask to a new shared module nemo_rl/algorithms/x_token/utils.py. - Extract projection-file parsing into utils.parse_projection_file; tokenalign.py and loss_functions.py both go through it. - Move per-instance projection-matrix caches to process-local caches in utils.get_sparse_projection_matrix / get_topk_projection. The driver no longer holds large CUDA tensors; each Ray worker fills its own cache on first loss call. Signed-off-by: Adithya Hanasoge <avenkateshha@nvidia.com>

github-actions Bot added the Documentation Improvements or additions to documentation label Mar 21, 2025

SahilJain314 changed the title ~~Draft: GRPO documentation~~ Draft: GRPO documentation and Configuration cleanup Mar 21, 2025

SahilJain314 force-pushed the sahilj/more_docs branch 2 times, most recently from bbc08f2 to 3cdffe3 Compare March 21, 2025 04:02

SahilJain314 added the Run CICD label Mar 21, 2025