feat: Auto research skill#2419
Conversation
fafbf11 to
ca90191
Compare
|
/claude review |
terrykong
left a comment
There was a problem hiding this comment.
Review: PR #2419 — feat: Auto research skill
Nice work — this skill provides a well-structured framework for running iterative RL experiments with git as the experiment journal. The exploration-ideas guide and git-workflow reference are thorough and practical. The safety guardrails in git-workflow.md (no stash/reset/overwrite without consent) are particularly good.
A few suggestions to align with existing repo conventions and improve consistency:
Directory naming mismatch
All other skill directories use hyphens (build-and-dependency, config-conventions, launch-nemo-rl, etc.), but this one uses an underscore (auto_research). The frontmatter name field is auto-research (with hyphen), creating an inconsistency. Consider renaming the directory to auto-research/ to match the convention.
Nemo-gym coverage
The PR description mentions guiding agents on research with "Nemo-RL and Nemo-gym", but the SKILL.md workflow (step 3) only references NeMo-RL paths (examples/run_grpo.py, nemo_rl/models/, etc.). The Nemo-gym entrypoints (examples/nemo_gym/) are not mentioned. Consider either adding Nemo-gym paths to the workflow or adjusting the PR description to match the actual scope.
See inline comments for additional suggestions.
Generated by Claude Code
Signed-off-by: Vinh Nguyen <vinhn@nvidia.com>
29bb26f to
7aee365
Compare
|
Thanks, @terrykong, for the prompt review. Fixed the reported issues and tightened all 3 skills. Add best practices and gotchas observed with Codex, but could happen with other agents. |
Signed-off-by: Vinh Nguyen <vinhn@nvidia.com>
Signed-off-by: Vinh Nguyen <vinhn@nvidia.com>
Signed-off-by: Vinh Nguyen <vinhn@nvidia.com>
Signed-off-by: Vinh Nguyen <vinhn@nvidia.com>
Signed-off-by: Vinh Nguyen <vinhn@nvidia.com>
Signed-off-by: Vinh Nguyen <vinhn@nvidia.com>
Signed-off-by: Vinh Nguyen <vinhn@nvidia.com>
Signed-off-by: Vinh Nguyen <vinhn@nvidia.com>
Signed-off-by: Vinh Nguyen <vinhn@nvidia.com>
Signed-off-by: Vinh Nguyen <vinhn@nvidia.com>
Signed-off-by: Vinh Nguyen <vinhn@nvidia.com>
Signed-off-by: Vinh Nguyen <vinhn@nvidia.com>
Signed-off-by: Vinh Nguyen <vinhn@nvidia.com>
Signed-off-by: Vinh Nguyen <vinhn@nvidia.com>
Signed-off-by: Vinh Nguyen <vinhn@nvidia.com>
|
/ok to test 9880d5e |
|
/ok to test 39de685 |
NVIDIA-NeMo#2419 workaround) Automodel's _restore_loaded_model_dtype (HF/force_hf load path) re-casts loaded params back to the bf16 checkpoint dtype, silently undoing NeMo-RL's intended torch_dtype=float32 master-weight load. With bf16 master weights, AdamW updates underflow and the policy never learns: grpo-nano-v2-12b reward[30] stuck ~0.18 (vs ~0.54) and sft-nanov3-30BA3B loss plateaus. Only force_hf models (NemotronH nano-v2/nano-v3) are affected; custom-impl models (gemma4, Llama) load via the DCP copy path that preserves fp32. Add _disable_automodel_checkpoint_dtype_restore() to no-op that restore before from_pretrained so the requested fp32 is honored. Validated: nano-v2-12b reward[30] 0.176 -> 0.541 PASS; nanov3-30BA3B-lora loss[20] 2.027 PASS. This is temporary until the automodel pin includes NVIDIA-NeMo/Automodel#2419 (rewrites _restore_loaded_model_dtype to honor an explicit torch_dtype). Add an obsolescence tripwire test that fails when NVIDIA-NeMo#2419 lands so the workaround is removed timely, plus an analogous tripwire for the existing Qwen-VL vision-tower key-mapping workaround (fires when transformers #45358 / >=5.6 reaches the pin). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Shuang Yu <shuangy@nvidia.com>
NVIDIA-NeMo#2419 workaround) Automodel's _restore_loaded_model_dtype (HF/force_hf load path) re-casts loaded params back to the bf16 checkpoint dtype, silently undoing NeMo-RL's intended torch_dtype=float32 master-weight load. With bf16 master weights, AdamW updates underflow and the policy never learns: grpo-nano-v2-12b reward[30] stuck ~0.18 (vs ~0.54) and sft-nanov3-30BA3B loss plateaus. Only force_hf models (NemotronH nano-v2/nano-v3) are affected; custom-impl models (gemma4, Llama) load via the DCP copy path that preserves fp32. Add _disable_automodel_checkpoint_dtype_restore() to no-op that restore before from_pretrained so the requested fp32 is honored. Validated: nano-v2-12b reward[30] 0.176 -> 0.541 PASS; nanov3-30BA3B-lora loss[20] 2.027 PASS. This is temporary until the automodel pin includes NVIDIA-NeMo/Automodel#2419 (rewrites _restore_loaded_model_dtype to honor an explicit torch_dtype). Add an obsolescence tripwire test that fails when NVIDIA-NeMo#2419 lands so the workaround is removed timely, plus an analogous tripwire for the existing Qwen-VL vision-tower key-mapping workaround (fires when transformers #45358 / >=5.6 reaches the pin). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Shuang Yu <shuangy@nvidia.com>
The automodel fp32-master-weight tripwire test (test_automodel_dtype_restore_workaround_still_needed) failed in CI as a false positive. _disable_automodel_checkpoint_dtype_restore() globally and irreversibly replaces _restore_loaded_model_dtype with a no-op; earlier setup_model_and_optimizer tests in the same process leave that no-op installed, so the tripwire exercised the no-op (which preserves fp32) instead of Automodel's real downgrading function. Stash the original on the no-op and have the test recover it via _nrl_original. Also pass requested_dtype=fp32 to the function when its signature accepts it, so the tripwire actually fires once Automodel NVIDIA-NeMo#2419 is pinned: the rewritten function honors the explicit fp32 request only via that new parameter (promote_types), not via hf_config/load_kwargs. Correct the Skywork reward baseline (-5.4062 -> -5.2500) to the value the CI build produces (also the historical pre-refresh value); the incorrect-answer score is sensitive to the transformers/torch/kernel build. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Shuang Yu <shuangy@nvidia.com>
The automodel fp32-master-weight tripwire test (test_automodel_dtype_restore_workaround_still_needed) failed in CI as a false positive. _disable_automodel_checkpoint_dtype_restore() globally and irreversibly replaces _restore_loaded_model_dtype with a no-op; earlier setup_model_and_optimizer tests in the same process leave that no-op installed, so the tripwire exercised the no-op (which preserves fp32) instead of Automodel's real downgrading function. Stash the original on the no-op and have the test recover it via _nrl_original. Also pass requested_dtype=fp32 to the function when its signature accepts it, so the tripwire actually fires once Automodel NVIDIA-NeMo#2419 is pinned: the rewritten function honors the explicit fp32 request only via that new parameter (promote_types), not via hf_config/load_kwargs. Correct the Skywork reward baseline (-5.4062 -> -5.2500) to the value the CI build produces (also the historical pre-refresh value); the incorrect-answer score is sensitive to the transformers/torch/kernel build. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Shuang Yu <shuangy@nvidia.com>
NVIDIA-NeMo#2419 workaround) Automodel's _restore_loaded_model_dtype (HF/force_hf load path) re-casts loaded params back to the bf16 checkpoint dtype, silently undoing NeMo-RL's intended torch_dtype=float32 master-weight load. With bf16 master weights, AdamW updates underflow and the policy never learns: grpo-nano-v2-12b reward[30] stuck ~0.18 (vs ~0.54) and sft-nanov3-30BA3B loss plateaus. Only force_hf models (NemotronH nano-v2/nano-v3) are affected; custom-impl models (gemma4, Llama) load via the DCP copy path that preserves fp32. Add _disable_automodel_checkpoint_dtype_restore() to no-op that restore before from_pretrained so the requested fp32 is honored. Validated: nano-v2-12b reward[30] 0.176 -> 0.541 PASS; nanov3-30BA3B-lora loss[20] 2.027 PASS. This is temporary until the automodel pin includes NVIDIA-NeMo/Automodel#2419 (rewrites _restore_loaded_model_dtype to honor an explicit torch_dtype). Add an obsolescence tripwire test that fails when NVIDIA-NeMo#2419 lands so the workaround is removed timely, plus an analogous tripwire for the existing Qwen-VL vision-tower key-mapping workaround (fires when transformers #45358 / >=5.6 reaches the pin). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Shuang Yu <shuangy@nvidia.com>
The automodel fp32-master-weight tripwire test (test_automodel_dtype_restore_workaround_still_needed) failed in CI as a false positive. _disable_automodel_checkpoint_dtype_restore() globally and irreversibly replaces _restore_loaded_model_dtype with a no-op; earlier setup_model_and_optimizer tests in the same process leave that no-op installed, so the tripwire exercised the no-op (which preserves fp32) instead of Automodel's real downgrading function. Stash the original on the no-op and have the test recover it via _nrl_original. Also pass requested_dtype=fp32 to the function when its signature accepts it, so the tripwire actually fires once Automodel NVIDIA-NeMo#2419 is pinned: the rewritten function honors the explicit fp32 request only via that new parameter (promote_types), not via hf_config/load_kwargs. Correct the Skywork reward baseline (-5.4062 -> -5.2500) to the value the CI build produces (also the historical pre-refresh value); the incorrect-answer score is sensitive to the transformers/torch/kernel build. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Shuang Yu <shuangy@nvidia.com>
NVIDIA-NeMo#2419 workaround) Automodel's _restore_loaded_model_dtype (HF/force_hf load path) re-casts loaded params back to the bf16 checkpoint dtype, silently undoing NeMo-RL's intended torch_dtype=float32 master-weight load. With bf16 master weights, AdamW updates underflow and the policy never learns: grpo-nano-v2-12b reward[30] stuck ~0.18 (vs ~0.54) and sft-nanov3-30BA3B loss plateaus. Only force_hf models (NemotronH nano-v2/nano-v3) are affected; custom-impl models (gemma4, Llama) load via the DCP copy path that preserves fp32. Add _disable_automodel_checkpoint_dtype_restore() to no-op that restore before from_pretrained so the requested fp32 is honored. Validated: nano-v2-12b reward[30] 0.176 -> 0.541 PASS; nanov3-30BA3B-lora loss[20] 2.027 PASS. This is temporary until the automodel pin includes NVIDIA-NeMo/Automodel#2419 (rewrites _restore_loaded_model_dtype to honor an explicit torch_dtype). Add an obsolescence tripwire test that fails when NVIDIA-NeMo#2419 lands so the workaround is removed timely, plus an analogous tripwire for the existing Qwen-VL vision-tower key-mapping workaround (fires when transformers #45358 / >=5.6 reaches the pin). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Shuang Yu <shuangy@nvidia.com>
The automodel fp32-master-weight tripwire test (test_automodel_dtype_restore_workaround_still_needed) failed in CI as a false positive. _disable_automodel_checkpoint_dtype_restore() globally and irreversibly replaces _restore_loaded_model_dtype with a no-op; earlier setup_model_and_optimizer tests in the same process leave that no-op installed, so the tripwire exercised the no-op (which preserves fp32) instead of Automodel's real downgrading function. Stash the original on the no-op and have the test recover it via _nrl_original. Also pass requested_dtype=fp32 to the function when its signature accepts it, so the tripwire actually fires once Automodel NVIDIA-NeMo#2419 is pinned: the rewritten function honors the explicit fp32 request only via that new parameter (promote_types), not via hf_config/load_kwargs. Correct the Skywork reward baseline (-5.4062 -> -5.2500) to the value the CI build produces (also the historical pre-refresh value); the incorrect-answer score is sensitive to the transformers/torch/kernel build. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Shuang Yu <shuangy@nvidia.com>
NVIDIA-NeMo#2419 workaround) Automodel's _restore_loaded_model_dtype (HF/force_hf load path) re-casts loaded params back to the bf16 checkpoint dtype, silently undoing NeMo-RL's intended torch_dtype=float32 master-weight load. With bf16 master weights, AdamW updates underflow and the policy never learns: grpo-nano-v2-12b reward[30] stuck ~0.18 (vs ~0.54) and sft-nanov3-30BA3B loss plateaus. Only force_hf models (NemotronH nano-v2/nano-v3) are affected; custom-impl models (gemma4, Llama) load via the DCP copy path that preserves fp32. Add _disable_automodel_checkpoint_dtype_restore() to no-op that restore before from_pretrained so the requested fp32 is honored. Validated: nano-v2-12b reward[30] 0.176 -> 0.541 PASS; nanov3-30BA3B-lora loss[20] 2.027 PASS. This is temporary until the automodel pin includes NVIDIA-NeMo/Automodel#2419 (rewrites _restore_loaded_model_dtype to honor an explicit torch_dtype). Add an obsolescence tripwire test that fails when NVIDIA-NeMo#2419 lands so the workaround is removed timely, plus an analogous tripwire for the existing Qwen-VL vision-tower key-mapping workaround (fires when transformers #45358 / >=5.6 reaches the pin). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Shuang Yu <shuangy@nvidia.com>
The automodel fp32-master-weight tripwire test (test_automodel_dtype_restore_workaround_still_needed) failed in CI as a false positive. _disable_automodel_checkpoint_dtype_restore() globally and irreversibly replaces _restore_loaded_model_dtype with a no-op; earlier setup_model_and_optimizer tests in the same process leave that no-op installed, so the tripwire exercised the no-op (which preserves fp32) instead of Automodel's real downgrading function. Stash the original on the no-op and have the test recover it via _nrl_original. Also pass requested_dtype=fp32 to the function when its signature accepts it, so the tripwire actually fires once Automodel NVIDIA-NeMo#2419 is pinned: the rewritten function honors the explicit fp32 request only via that new parameter (promote_types), not via hf_config/load_kwargs. Correct the Skywork reward baseline (-5.4062 -> -5.2500) to the value the CI build produces (also the historical pre-refresh value); the incorrect-answer score is sensitive to the transformers/torch/kernel build. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Shuang Yu <shuangy@nvidia.com>
What does this PR do ?
This PR adds an auto research skill that guides agents on how to do a prolonged research session with Nemo-RL and Nemo-gym. It sets some operating guidelines on how to form and test hypotheses, how to organize git branches, how to monitor and report progress, and how to explicitly check for the stopping conditions of the campaign.
Issues
N/A
Usage
You can prompt Codex, such as:
For this skill to be effective, Codex should have sufficient knowledge of the local operating environment (e.g. Slurm or local machine). A prerequisite to using the auto research skill is therefore, for the agent to be able to automatically run a baseline workload on the given environment.