feat: Auto research skill #2419

Merged
terrykong merged 21 commits into NVIDIA-NeMo:main from vinhngx:vinhn/autoresearch on May 20, 2026

@vinhngx vinhngx commented May 6, 2026

Contributor

What does this PR do?

This PR adds an auto research skill that guides agents through a prolonged research session with Nemo-RL and Nemo-gym. It sets operating guidelines on how to form and test hypotheses, how to organize git branches, how to monitor and report progress, and how to explicitly check the stopping conditions of the campaign.

Issues

N/A

Usage

You can prompt Codex with, for example:

Use the @skill/auto_research skill and train the Qwen-3-VL-2B-instruct model to high accuracy in the Nemo-gym circle click environment. Time budget: 5h

For this skill to be effective, Codex should have sufficient knowledge of the local operating environment (e.g. Slurm or local machine). A prerequisite for using the auto research skill is therefore that the agent can already run a baseline workload on the given environment without manual help.
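As a rough illustration of what an explicit stopping check for a time-boxed campaign (such as the 5h budget in the prompt above) could look like, here is a minimal Python sketch. The `CampaignBudget` class, its fields, and the target-accuracy threshold are hypothetical and are not taken from the skill itself.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class CampaignBudget:
    """Hypothetical helper: tracks a time budget and a target metric."""

    time_budget_s: float   # total wall-clock budget in seconds
    target_metric: float   # assumed goal, e.g. validation accuracy
    started_at: float      # campaign start time in seconds

    def should_stop(self, best_metric: float, now: float) -> tuple[bool, str]:
        """Return (stop?, reason) so the reason can be logged or reported."""
        if best_metric >= self.target_metric:
            return True, "target reached"
        if now - self.started_at >= self.time_budget_s:
            return True, "time budget exhausted"
        return False, "continue"


# Example: a 5 hour budget, checked one hour into the campaign.
budget = CampaignBudget(time_budget_s=5 * 3600, target_metric=0.9, started_at=0.0)
stop, reason = budget.should_stop(best_metric=0.5, now=3600.0)
```

Returning the reason alongside the decision keeps the stop condition explicit in progress reports rather than implied.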

@vinhngx vinhngx requested a review from a team as a code owner May 6, 2026 02:23
@copy-pr-bot

copy-pr-bot Bot commented May 6, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@vinhngx vinhngx force-pushed the vinhn/autoresearch branch from fafbf11 to ca90191 Compare May 6, 2026 02:28
@vinhngx vinhngx changed the title from "Auto research skill" to "feat: Auto research skill" May 6, 2026
@terrykong

Collaborator

/claude review

Comment thread skills/auto-research/SKILL.md Outdated
Comment thread skills/auto-research/SKILL.md
@svcnvidia-nemo-ci svcnvidia-nemo-ci added the waiting-on-maintainers Waiting on maintainers to respond label May 9, 2026

@terrykong terrykong left a comment

Collaborator

Review: PR #2419 — feat: Auto research skill

Nice work — this skill provides a well-structured framework for running iterative RL experiments with git as the experiment journal. The exploration-ideas guide and git-workflow reference are thorough and practical. The safety guardrails in git-workflow.md (no stash/reset/overwrite without consent) are particularly good.

A few suggestions to align with existing repo conventions and improve consistency:

Directory naming mismatch

All other skill directories use hyphens (build-and-dependency, config-conventions, launch-nemo-rl, etc.), but this one uses an underscore (auto_research). The frontmatter name field is auto-research (with hyphen), creating an inconsistency. Consider renaming the directory to auto-research/ to match the convention.
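A check like the one described above could be automated. The following is only a sketch: it assumes each skill's SKILL.md starts with a `---`-delimited frontmatter block containing a `name:` line, and the function names `frontmatter_name` and `naming_problems` are hypothetical, not existing repo tooling.

```python
from __future__ import annotations

import re


def frontmatter_name(skill_md_text: str) -> str | None:
    """Extract the `name:` value from a leading '---' frontmatter block."""
    lines = skill_md_text.splitlines()
    if not lines or lines[0].strip() != "---":
        return None
    for line in lines[1:]:
        if line.strip() == "---":  # end of frontmatter
            break
        match = re.match(r"name:\s*(\S+)\s*$", line)
        if match:
            return match.group(1).strip("'\"")
    return None


def naming_problems(dir_name: str, skill_md_text: str) -> list[str]:
    """List convention violations for one skill directory."""
    problems = []
    if "_" in dir_name:
        problems.append(f"directory '{dir_name}' uses an underscore")
    name = frontmatter_name(skill_md_text)
    if name is not None and name != dir_name:
        problems.append(f"frontmatter name '{name}' != directory '{dir_name}'")
    return problems


skill_md = "---\nname: auto-research\n---\n# Auto research\n"
issues = naming_problems("auto_research", skill_md)
```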

Nemo-gym coverage

The PR description mentions guiding agents on research with "Nemo-RL and Nemo-gym", but the SKILL.md workflow (step 3) only references NeMo-RL paths (examples/run_grpo.py, nemo_rl/models/, etc.). The Nemo-gym entrypoints (examples/nemo_gym/) are not mentioned. Consider either adding Nemo-gym paths to the workflow or adjusting the PR description to match the actual scope.

See inline comments for additional suggestions.

Generated by Claude Code

Comment thread skills/auto-research/SKILL.md Outdated
Comment thread skills/auto-research/SKILL.md Outdated
Comment thread skills/auto-research/SKILL.md Outdated
Comment thread skills/auto-research/references/git-workflow.md Outdated
Comment thread skills/auto-research/references/git-workflow.md
@svcnvidia-nemo-ci svcnvidia-nemo-ci added waiting-on-customer Waiting on the original author to respond and removed waiting-on-maintainers Waiting on maintainers to respond labels May 11, 2026
Signed-off-by: Vinh Nguyen <vinhn@nvidia.com>
@vinhngx vinhngx force-pushed the vinhn/autoresearch branch from 29bb26f to 7aee365 Compare May 12, 2026 01:09
@vinhngx

vinhngx commented May 12, 2026

Contributor Author

Thanks, @terrykong, for the prompt review. I fixed the reported issues and tightened all 3 skills. I also added best practices and gotchas that were observed with Codex but could occur with other agents.

vinhngx added 15 commits May 12, 2026 01:22
Signed-off-by: Vinh Nguyen <vinhn@nvidia.com>
@svcnvidia-nemo-ci svcnvidia-nemo-ci removed the waiting-on-maintainers Waiting on maintainers to respond label May 18, 2026
@yuki-97

yuki-97 commented May 19, 2026

Contributor

/ok to test 9880d5e

@chtruong814

Contributor

/ok to test 39de685

@terrykong terrykong merged commit 012bf17 into NVIDIA-NeMo:main May 20, 2026
42 checks passed
sharonyu-115 added a commit to sharonyu-115/RL that referenced this pull request Jun 11, 2026
NVIDIA-NeMo#2419 workaround)

Automodel's _restore_loaded_model_dtype (HF/force_hf load path) re-casts loaded
params back to the bf16 checkpoint dtype, silently undoing NeMo-RL's intended
torch_dtype=float32 master-weight load. With bf16 master weights, AdamW updates
underflow and the policy never learns: grpo-nano-v2-12b reward[30] stuck ~0.18
(vs ~0.54) and sft-nanov3-30BA3B loss plateaus. Only force_hf models (NemotronH
nano-v2/nano-v3) are affected; custom-impl models (gemma4, Llama) load via the DCP
copy path that preserves fp32.

Add _disable_automodel_checkpoint_dtype_restore() to no-op that restore before
from_pretrained so the requested fp32 is honored. Validated: nano-v2-12b reward[30]
0.176 -> 0.541 PASS; nanov3-30BA3B-lora loss[20] 2.027 PASS.

This is temporary until the automodel pin includes NVIDIA-NeMo/Automodel#2419
(rewrites _restore_loaded_model_dtype to honor an explicit torch_dtype). Add an
obsolescence tripwire test that fails when NVIDIA-NeMo#2419 lands so the workaround is removed
timely, plus an analogous tripwire for the existing Qwen-VL vision-tower
key-mapping workaround (fires when transformers #45358 / >=5.6 reaches the pin).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Shuang Yu <shuangy@nvidia.com>
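
To make the bf16 master-weight problem in the commit message above concrete, here is a small pure-Python sketch that emulates bfloat16 rounding and applies many small updates to a weight. It is an illustration only: the helper names, the starting weight of 1.0, and the update size of 1e-4 are assumptions, and real AdamW updates are more involved. The point is that an update smaller than half the spacing between adjacent bf16 values (2^-7 near 1.0) is rounded away on every step, while an fp32 weight accumulates it.

```python
import struct


def to_fp32(x: float) -> float:
    """Round a Python float (double) to the nearest float32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def to_bf16(x: float) -> float:
    """Round to the nearest bfloat16 value (round-to-nearest-even).

    Sketch only: NaN and overflow handling are omitted.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    bits = (bits + rounding_bias) & 0xFFFF0000  # keep the top 16 bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def simulate(steps: int, update: float, start: float = 1.0):
    """Apply the same additive update to a bf16 and an fp32 master weight."""
    w_bf16, w_fp32 = to_bf16(start), to_fp32(start)
    for _ in range(steps):
        w_bf16 = to_bf16(w_bf16 + update)
        w_fp32 = to_fp32(w_fp32 + update)
    return w_bf16, w_fp32


w_bf16, w_fp32 = simulate(steps=1000, update=1e-4)
print(f"bf16 master weight: {w_bf16}, fp32 master weight: {w_fp32:.4f}")
```

In this toy setting the bf16 weight never moves from its starting value, which is consistent with the "policy never learns" symptom described in the commit message.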
sharonyu-115 added a commit to sharonyu-115/RL that referenced this pull request Jun 12, 2026
The automodel fp32-master-weight tripwire test
(test_automodel_dtype_restore_workaround_still_needed) failed in CI as a
false positive. _disable_automodel_checkpoint_dtype_restore() globally and
irreversibly replaces _restore_loaded_model_dtype with a no-op; earlier
setup_model_and_optimizer tests in the same process leave that no-op
installed, so the tripwire exercised the no-op (which preserves fp32)
instead of Automodel's real downgrading function. Stash the original on the
no-op and have the test recover it via _nrl_original.

Also pass requested_dtype=fp32 to the function when its signature accepts
it, so the tripwire actually fires once Automodel NVIDIA-NeMo#2419 is pinned: the
rewritten function honors the explicit fp32 request only via that new
parameter (promote_types), not via hf_config/load_kwargs.

Correct the Skywork reward baseline (-5.4062 -> -5.2500) to the value the
CI build produces (also the historical pre-refresh value); the
incorrect-answer score is sensitive to the transformers/torch/kernel build.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Shuang Yu <shuangy@nvidia.com>

5 participants