Skip to content

docs: GRPO documentation and Configuration cleanup#7

Merged
SahilJain314 merged 6 commits into
mainfrom
sahilj/more_docs
Mar 21, 2025
Merged

docs: GRPO documentation and Configuration cleanup#7
SahilJain314 merged 6 commits into
mainfrom
sahilj/more_docs

Conversation

@SahilJain314

Copy link
Copy Markdown
Contributor

What does this PR do ?

Adds Quick-Start, Example Notebook, and Detailed Explanation documentation for GRPO

@github-actions github-actions Bot added the Documentation Improvements or additions to documentation label Mar 21, 2025
@SahilJain314 SahilJain314 changed the title Draft: GRPO documentation Draft: GRPO documentation and Configuration cleanup Mar 21, 2025
@SahilJain314 SahilJain314 force-pushed the sahilj/more_docs branch 2 times, most recently from bbc08f2 to 3cdffe3 Compare March 21, 2025 04:02
Comment thread docs/design_docs/design_and_philosophy.md Outdated
Comment thread docs/guides/grpo.md
Comment thread docs/guides/sft.md Outdated
Comment thread docs/guides/grpo.md
Comment thread docs/guides/grpo.md
Comment thread docs/guides/grpo.md Outdated
Comment thread docs/guides/grpo.md Outdated
@SahilJain314 SahilJain314 changed the title Draft: GRPO documentation and Configuration cleanup docs: GRPO documentation and Configuration cleanup Mar 21, 2025
parthchadha
parthchadha previously approved these changes Mar 21, 2025
@SahilJain314 SahilJain314 force-pushed the sahilj/more_docs branch 2 times, most recently from 7a001c0 to b99a400 Compare March 21, 2025 07:29
terrykong
terrykong previously approved these changes Mar 21, 2025
Signed-off-by: Sahil Jain <sahilj@nvidia.com>
Signed-off-by: Sahil Jain <sahilj@nvidia.com>
Signed-off-by: Sahil Jain <sahilj@nvidia.com>
Signed-off-by: Sahil Jain <sahilj@nvidia.com>
Signed-off-by: Sahil Jain <sahilj@nvidia.com>
Signed-off-by: Sahil Jain <sahilj@nvidia.com>
@SahilJain314 SahilJain314 merged commit 8816926 into main Mar 21, 2025
@SahilJain314 SahilJain314 deleted the sahilj/more_docs branch March 21, 2025 07:39
KiddoZhu pushed a commit that referenced this pull request May 6, 2025
Signed-off-by: Sahil Jain <sahilj@nvidia.com>
sharonyu-115 referenced this pull request in sharonyu-115/RL Apr 27, 2026
Validated in jobs 11363285 (bake) + 11363286 (inference):
greedy generation against DeepSeek-V4-Flash-Base produced coherent text
("The capital of France is" -> " Paris. The capital of Germany is
Berlin..."). The original `_load_w13: size of tensor a (2048) must
match tensor b (16) at dim 0` blocker is now cleared at the vLLM half;
Automodel half still pending teammate's branch.

## What changed

### pyproject.toml + uv.lock — torch 2.10 rollback

Larkz's vllm wheel `vllm-0.19.2rc1.dev219+gc0879d948.cu129-cp313-cp313`
ships METADATA `Requires-Dist: torch==2.11.0` but its `_C.abi3.so` was
compiled in a `torch==2.10.0` venv (per larkz/nemorl-ds4/pyproject.toml
pin) and references the OLD 4-arg `MessageLogger(char const*, int, int,
bool)` constructor that torch 2.11's libc10 dropped. First bake (job
11362567) failed at venv rebuild with `undefined symbol:
_ZN3c1013MessageLoggerC1EPKciib`.

Reverted in 5 sites:
- project.dependencies torch 2.11.0 -> 2.10.0
- project.dependencies torchvision 0.26.0 -> 0.25.0
- dependency-groups.build torch 2.11.0 -> 2.10.0
- tool.uv.override-dependencies torch / torchaudio -> 2.10.0
- tool.uv.override-dependencies + explicit torchvision==0.25.0

uv lock confirmed: torch / torchaudio / torchvision rolled to 2.10.0+cu129
/ 2.10.0+cu129 / 0.25.0+cu129; nccl 2.28.9 -> 2.27.5 transitively.

### tools/patch_vllm_dsv4_base_fp8_quick.sh — post-#40860 anchors

Anchor #7 rewritten: upstream's rewritten `load_weights` now captures
the result before calling `self.model.finalize_mega_moe_weights()`, so
"return loader.load_weights(...)" no longer matches; new anchor
"loaded_params = loader.load_weights(...)".

Added 8th anchor: forces `use_mega_moe = False` when
`VLLM_DSV4_BASE_FP8=1`. Without this, post-load
`finalize_mega_moe_weights()` would invoke `experts.finalize_weights()`
on a non-MegaMoE FusedMoE layer (Base routes through `Fp8MoEMethod`,
not `DeepseekV4MegaMoEExperts`). Self-contained env gating means
recipes don't also need to override `moe_backend`.

Verified against the larkz wheel: 9 anchors apply cleanly,
`py_compile` clean.

### env_refresh_dsv4.sh — bake the Base patch

Now applies two patches over the freshly-built worker venv before
`--container-save`:
1. `tools/vllm_deepseek_v4_config_patch.py` (defensive superset of
   upstream main's narrow rope kwargs fix)
2. `tools/patch_vllm_dsv4_base_fp8_quick.sh` (env-gated Base FP8
   support; idempotent, no-op when `VLLM_DSV4_BASE_FP8` unset)

### tools/test_vllm_dsv4_base_inference.py (new)

Counterpart to test_vllm_dsv4_inference.py for Flash. Sets
`VLLM_DSV4_BASE_FP8=1` before vllm import, re-applies the patch script
defensively (idempotent), runs greedy generation through standalone
`vllm.LLM`. Ran clean against the new sqsh.

### slurm_jobs/dsv4_base_{bake,inference_test}.sub (new)

Submit scripts for the chained bake + inference flow:
- bake: 1 node 8 GPU 3:55 walltime, --container-save
- inference: 1 node 8 GPU 1:00 walltime, --dependency=afterok

### docs/model-readiness/deepseek-v4* (new + updated)

Tracks the full DSV4 bring-up. Bring-up status doc now reflects:
- vLLM half of Base blocker CLEARED
- Torch ABI rollback chronicled
- New sqsh artifact path
  (.../nemo-rl-dsv4-vllm-c0879d-base-torch210-2026-04-27.sqsh ~99.8 GiB)
- Inference job 11363286 success, generated text logged
- Old broken sqsh marked for forensics

## What still blocks

Automodel-side Base FP8-block-quant state_dict_adapter is pending the
teammate's new branch. Once that lands, sqsh rebake + GRPO smoke can
follow. The vLLM-side recipe override `load_format: auto` may now be
removable since the new wheel includes the upstream dummy-load fix
(`finalize_weights()` 3 hits in main's deepseek_v4.py); validation
deferred until first GRPO smoke attempt.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Shuang Yu <shuangy@nvidia.com>
sharonyu-115 referenced this pull request in sharonyu-115/RL Apr 27, 2026
Re-applies the in-scope subset of reverted ef7bfb4 (kept the
docs/scripts out of git per user request).

pyproject.toml + uv.lock — torch 2.11 -> 2.10:
larkz's vllm wheel (vllm-0.19.2rc1.dev219+gc0879d948) ships METADATA
"Requires-Dist: torch==2.11.0" but its _C.abi3.so was compiled in a
torch==2.10 venv (per larkz/nemorl-ds4/pyproject.toml pin) and references
the OLD 4-arg MessageLogger(char const*, int, int, bool) constructor that
torch 2.11's libc10 dropped.  Bake job 11362567 failed at venv rebuild
with `undefined symbol _ZN3c1013MessageLoggerC1EPKciib`.  Reverted in
5 sites (project deps, build group, override-deps).  uv lock confirms
torch / torchaudio / torchvision rolled to 2.10.0+cu129 / 2.10.0+cu129
/ 0.25.0+cu129; nccl 2.28.9 -> 2.27.5 transitively.  Bake job 11363285
+ inference job 11363286 verified the fix end-to-end.

tools/patch_vllm_dsv4_base_fp8_quick.sh — post-#40860 anchors:
- Anchor #7 rewritten: upstream's rewritten load_weights now captures
  the result before calling self.model.finalize_mega_moe_weights(), so
  "return loader.load_weights(...)" no longer matches; new anchor
  "loaded_params = loader.load_weights(...)".
- Added 8th anchor: forces use_mega_moe = False when
  VLLM_DSV4_BASE_FP8=1.  Without this, post-load
  finalize_mega_moe_weights() would invoke experts.finalize_weights()
  on a non-MegaMoE FusedMoE layer (Base routes through Fp8MoEMethod,
  not DeepseekV4MegaMoEExperts).  Self-contained env gating means
  recipes don't also need to override moe_backend.
Verified: 9 anchors apply cleanly against the larkz wheel,
py_compile clean, end-to-end inference passed in job 11363286.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Shuang Yu <shuangy@nvidia.com>
ZhiyuLi-Nvidia added a commit that referenced this pull request May 15, 2026
…ifecycle

Remove three actor-level wrappers (finish_generation, get_logger_metrics,
clear_logger_metrics) that had zero external callers. The actor's internal
code already calls self.policy_generation.{...} directly at the right
points inside rollout_to_tq; the wrappers added indirection without value.

Rewrite the rollout_to_tq docstring to list all six steps bundled into
the single Ray RPC (reset metrics -> rollout -> flatten -> TQ put ->
release GPU -> capture metrics), making the lifecycle visible without
having to read the method body.

Per yuki-97 PR review (#7, #8).

Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com>
ZhiyuLi-Nvidia added a commit that referenced this pull request May 15, 2026
…metadata

* ``_get_local_node_ip`` now rejects both link-local AND loopback
  addresses. A host whose ``/etc/hosts`` maps the hostname to
  127.0.0.1 would otherwise announce an unroutable address to
  Mooncake peers, causing cross-node ``connection refused``.
  Adds a regression test (``test_local_node_ip_skips_loopback``)
  and updates the module docstring (previously noted loopback was
  NOT skipped).

* ``_patch_tq_actor_runtime_env`` no longer hard-codes the
  ``TransferQueue`` git+SHA pin. It now reads it from nemo-rl's
  installed metadata via ``importlib.metadata.requires("nemo-rl")``
  (new ``_resolve_tq_pin`` helper). pyproject.toml is the single
  source of truth; the two pins cannot drift. Adds a TODO to drop
  the whole patch once the nightly container is published with TQ
  baked in (at that point the injection becomes pure overhead).

* ``_tq()`` import-error hint no longer references a stale ``0.1.6``
  version — points users to pyproject.toml instead.

Per terryk PR review (#7, #8).

Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com>
ZhiyuLi-Nvidia added a commit that referenced this pull request May 19, 2026
…ifecycle

Remove three actor-level wrappers (finish_generation, get_logger_metrics,
clear_logger_metrics) that had zero external callers. The actor's internal
code already calls self.policy_generation.{...} directly at the right
points inside rollout_to_tq; the wrappers added indirection without value.

Rewrite the rollout_to_tq docstring to list all six steps bundled into
the single Ray RPC (reset metrics -> rollout -> flatten -> TQ put ->
release GPU -> capture metrics), making the lifecycle visible without
having to read the method body.

Per yuki-97 PR review (#7, #8).

Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com>
ZhiyuLi-Nvidia added a commit that referenced this pull request May 19, 2026
…metadata

* ``_get_local_node_ip`` now rejects both link-local AND loopback
  addresses. A host whose ``/etc/hosts`` maps the hostname to
  127.0.0.1 would otherwise announce an unroutable address to
  Mooncake peers, causing cross-node ``connection refused``.
  Adds a regression test (``test_local_node_ip_skips_loopback``)
  and updates the module docstring (previously noted loopback was
  NOT skipped).

* ``_patch_tq_actor_runtime_env`` no longer hard-codes the
  ``TransferQueue`` git+SHA pin. It now reads it from nemo-rl's
  installed metadata via ``importlib.metadata.requires("nemo-rl")``
  (new ``_resolve_tq_pin`` helper). pyproject.toml is the single
  source of truth; the two pins cannot drift. Adds a TODO to drop
  the whole patch once the nightly container is published with TQ
  baked in (at that point the injection becomes pure overhead).

* ``_tq()`` import-error hint no longer references a stale ``0.1.6``
  version — points users to pyproject.toml instead.

Per terryk PR review (#7, #8).

Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com>
ZhiyuLi-Nvidia added a commit that referenced this pull request May 19, 2026
…ifecycle

Remove three actor-level wrappers (finish_generation, get_logger_metrics,
clear_logger_metrics) that had zero external callers. The actor's internal
code already calls self.policy_generation.{...} directly at the right
points inside rollout_to_tq; the wrappers added indirection without value.

Rewrite the rollout_to_tq docstring to list all six steps bundled into
the single Ray RPC (reset metrics -> rollout -> flatten -> TQ put ->
release GPU -> capture metrics), making the lifecycle visible without
having to read the method body.

Per yuki-97 PR review (#7, #8).

Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com>
ZhiyuLi-Nvidia added a commit that referenced this pull request May 19, 2026
…metadata

* ``_get_local_node_ip`` now rejects both link-local AND loopback
  addresses. A host whose ``/etc/hosts`` maps the hostname to
  127.0.0.1 would otherwise announce an unroutable address to
  Mooncake peers, causing cross-node ``connection refused``.
  Adds a regression test (``test_local_node_ip_skips_loopback``)
  and updates the module docstring (previously noted loopback was
  NOT skipped).

* ``_patch_tq_actor_runtime_env`` no longer hard-codes the
  ``TransferQueue`` git+SHA pin. It now reads it from nemo-rl's
  installed metadata via ``importlib.metadata.requires("nemo-rl")``
  (new ``_resolve_tq_pin`` helper). pyproject.toml is the single
  source of truth; the two pins cannot drift. Adds a TODO to drop
  the whole patch once the nightly container is published with TQ
  baked in (at that point the injection becomes pure overhead).

* ``_tq()`` import-error hint no longer references a stale ``0.1.6``
  version — points users to pyproject.toml instead.

Per terryk PR review (#7, #8).

Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com>
avenkateshha added a commit to avenkateshha/RL that referenced this pull request May 20, 2026
Comments addressed: #3, #5, NVIDIA-NeMo#7, NVIDIA-NeMo#8, NVIDIA-NeMo#9, NVIDIA-NeMo#10, NVIDIA-NeMo#11.

- Rename _load_M -> _get_sparse_projection_matrix and
  _load_dense_projection -> _get_topk_projection (later removed in
  favor of module-level cache helpers below).
- Drop unused alignment_student_spans / alignment_teacher_spans
  from the cross-tokenizer batch payload.
- Remove NRL_XTOKEN_LOSS_DUMP_DIR debug-dump side effect.
- Move Fp32SparseMM, chunk_average_log_probs, valid_chunk_mask to a
  new shared module nemo_rl/algorithms/x_token/utils.py.
- Extract projection-file parsing into utils.parse_projection_file;
  tokenalign.py and loss_functions.py both go through it.
- Move per-instance projection-matrix caches to process-local caches
  in utils.get_sparse_projection_matrix / get_topk_projection. The
  driver no longer holds large CUDA tensors; each Ray worker fills
  its own cache on first loss call.

Signed-off-by: Adithya Hanasoge <avenkateshha@nvidia.com>
ZhiyuLi-Nvidia added a commit that referenced this pull request May 22, 2026
…ifecycle

Remove three actor-level wrappers (finish_generation, get_logger_metrics,
clear_logger_metrics) that had zero external callers. The actor's internal
code already calls self.policy_generation.{...} directly at the right
points inside rollout_to_tq; the wrappers added indirection without value.

Rewrite the rollout_to_tq docstring to list all six steps bundled into
the single Ray RPC (reset metrics -> rollout -> flatten -> TQ put ->
release GPU -> capture metrics), making the lifecycle visible without
having to read the method body.

Per yuki-97 PR review (#7, #8).

Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com>
ZhiyuLi-Nvidia added a commit that referenced this pull request May 22, 2026
…metadata

* ``_get_local_node_ip`` now rejects both link-local AND loopback
  addresses. A host whose ``/etc/hosts`` maps the hostname to
  127.0.0.1 would otherwise announce an unroutable address to
  Mooncake peers, causing cross-node ``connection refused``.
  Adds a regression test (``test_local_node_ip_skips_loopback``)
  and updates the module docstring (previously noted loopback was
  NOT skipped).

* ``_patch_tq_actor_runtime_env`` no longer hard-codes the
  ``TransferQueue`` git+SHA pin. It now reads it from nemo-rl's
  installed metadata via ``importlib.metadata.requires("nemo-rl")``
  (new ``_resolve_tq_pin`` helper). pyproject.toml is the single
  source of truth; the two pins cannot drift. Adds a TODO to drop
  the whole patch once the nightly container is published with TQ
  baked in (at that point the injection becomes pure overhead).

* ``_tq()`` import-error hint no longer references a stale ``0.1.6``
  version — points users to pyproject.toml instead.

Per terryk PR review (#7, #8).

Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com>
ZhiyuLi-Nvidia added a commit that referenced this pull request May 23, 2026
…ifecycle

Remove three actor-level wrappers (finish_generation, get_logger_metrics,
clear_logger_metrics) that had zero external callers. The actor's internal
code already calls self.policy_generation.{...} directly at the right
points inside rollout_to_tq; the wrappers added indirection without value.

Rewrite the rollout_to_tq docstring to list all six steps bundled into
the single Ray RPC (reset metrics -> rollout -> flatten -> TQ put ->
release GPU -> capture metrics), making the lifecycle visible without
having to read the method body.

Per yuki-97 PR review (#7, #8).

Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com>
ZhiyuLi-Nvidia added a commit that referenced this pull request May 23, 2026
…metadata

* ``_get_local_node_ip`` now rejects both link-local AND loopback
  addresses. A host whose ``/etc/hosts`` maps the hostname to
  127.0.0.1 would otherwise announce an unroutable address to
  Mooncake peers, causing cross-node ``connection refused``.
  Adds a regression test (``test_local_node_ip_skips_loopback``)
  and updates the module docstring (previously noted loopback was
  NOT skipped).

* ``_patch_tq_actor_runtime_env`` no longer hard-codes the
  ``TransferQueue`` git+SHA pin. It now reads it from nemo-rl's
  installed metadata via ``importlib.metadata.requires("nemo-rl")``
  (new ``_resolve_tq_pin`` helper). pyproject.toml is the single
  source of truth; the two pins cannot drift. Adds a TODO to drop
  the whole patch once the nightly container is published with TQ
  baked in (at that point the injection becomes pure overhead).

* ``_tq()`` import-error hint no longer references a stale ``0.1.6``
  version — points users to pyproject.toml instead.

Per terryk PR review (#7, #8).

Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com>
avenkateshha added a commit to avenkateshha/RL that referenced this pull request May 27, 2026
Comments addressed: #3, #5, NVIDIA-NeMo#7, NVIDIA-NeMo#8, NVIDIA-NeMo#9, NVIDIA-NeMo#10, NVIDIA-NeMo#11.

- Rename _load_M -> _get_sparse_projection_matrix and
  _load_dense_projection -> _get_topk_projection (later removed in
  favor of module-level cache helpers below).
- Drop unused alignment_student_spans / alignment_teacher_spans
  from the cross-tokenizer batch payload.
- Remove NRL_XTOKEN_LOSS_DUMP_DIR debug-dump side effect.
- Move Fp32SparseMM, chunk_average_log_probs, valid_chunk_mask to a
  new shared module nemo_rl/algorithms/x_token/utils.py.
- Extract projection-file parsing into utils.parse_projection_file;
  tokenalign.py and loss_functions.py both go through it.
- Move per-instance projection-matrix caches to process-local caches
  in utils.get_sparse_projection_matrix / get_topk_projection. The
  driver no longer holds large CUDA tensors; each Ray worker fills
  its own cache on first loss call.

Signed-off-by: Adithya Hanasoge <avenkateshha@nvidia.com>
copy-pr-bot Bot pushed a commit that referenced this pull request Jun 7, 2026
Comments addressed: #3, #5, #7, #8, #9, #10, #11.

- Rename _load_M -> _get_sparse_projection_matrix and
  _load_dense_projection -> _get_topk_projection (later removed in
  favor of module-level cache helpers below).
- Drop unused alignment_student_spans / alignment_teacher_spans
  from the cross-tokenizer batch payload.
- Remove NRL_XTOKEN_LOSS_DUMP_DIR debug-dump side effect.
- Move Fp32SparseMM, chunk_average_log_probs, valid_chunk_mask to a
  new shared module nemo_rl/algorithms/x_token/utils.py.
- Extract projection-file parsing into utils.parse_projection_file;
  tokenalign.py and loss_functions.py both go through it.
- Move per-instance projection-matrix caches to process-local caches
  in utils.get_sparse_projection_matrix / get_topk_projection. The
  driver no longer holds large CUDA tensors; each Ray worker fills
  its own cache on first loss call.

Signed-off-by: Adithya Hanasoge <avenkateshha@nvidia.com>
copy-pr-bot Bot pushed a commit that referenced this pull request Jun 7, 2026
Comments addressed: #3, #5, #7, #8, #9, #10, #11.

- Rename _load_M -> _get_sparse_projection_matrix and
  _load_dense_projection -> _get_topk_projection (later removed in
  favor of module-level cache helpers below).
- Drop unused alignment_student_spans / alignment_teacher_spans
  from the cross-tokenizer batch payload.
- Remove NRL_XTOKEN_LOSS_DUMP_DIR debug-dump side effect.
- Move Fp32SparseMM, chunk_average_log_probs, valid_chunk_mask to a
  new shared module nemo_rl/algorithms/x_token/utils.py.
- Extract projection-file parsing into utils.parse_projection_file;
  tokenalign.py and loss_functions.py both go through it.
- Move per-instance projection-matrix caches to process-local caches
  in utils.get_sparse_projection_matrix / get_topk_projection. The
  driver no longer holds large CUDA tensors; each Ray worker fills
  its own cache on first loss call.

Signed-off-by: Adithya Hanasoge <avenkateshha@nvidia.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Documentation Improvements or additions to documentation

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants