Skip to content

[perf] fix: guard cuda_graph_scope validation against None#3249

Merged
malay-nagda merged 1 commit into
mainfrom
yuya/fix-cuda-graph-scope-none-crash
Apr 10, 2026
Merged

[perf] fix: guard cuda_graph_scope validation against None#3249
malay-nagda merged 1 commit into
mainfrom
yuya/fix-cuda-graph-scope-none-crash

Conversation

@yaoyu-33

@yaoyu-33 yaoyu-33 commented Apr 9, 2026

Copy link
Copy Markdown
Contributor

Summary

  • Fix TypeError: 'NoneType' object is not iterable in _set_cuda_graph_overrides() when cuda_graph_scope is not passed via CLI
  • Move TE scope validation after both cuda_graph_impl and cuda_graph_scope writes so it validates the effective recipe state, not the raw CLI input
  • When cuda_graph_scope is omitted, the recipe default from set_workload_base_configs() is now correctly used

Fixes #3247

Test plan

  • Run python scripts/performance/run_script.py --cuda_graph_impl transformer_engine -m gpt_oss -mr gpt_oss_120b -g b300 -c bf16 --task pretrain -ng 64 -gn 8 -tp 1 -pp 1 -ep 8 -mb 4 -gb 1280 --detach false without --cuda_graph_scope — should no longer crash
  • Run with explicit --cuda_graph_scope attn mlp — should still work
  • Run with invalid scope --cuda_graph_scope bogus — should raise AssertionError with a clear message

🤖 Generated with Claude Code

Summary by CodeRabbit

  • Bug Fixes
    • Improved CUDA graph scope validation to correctly handle recipe defaults and apply validation at the appropriate stage in the configuration process.

Move the TE scope validation after both `cuda_graph_impl` and
`cuda_graph_scope` writes so it checks the effective recipe state
instead of the raw CLI input.  When `cuda_graph_scope` is not passed
on the CLI the recipe default (set by set_workload_base_configs) is
used, and the old code crashed with `TypeError: 'NoneType' object is
not iterable` before reaching the None guard.

Fixes #3247

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
@copy-pr-bot

copy-pr-bot Bot commented Apr 9, 2026

Copy link
Copy Markdown

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@coderabbitai

coderabbitai Bot commented Apr 9, 2026

Copy link
Copy Markdown
Contributor

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 34f492f8-035e-4812-915c-33f3e5f73d7c

📥 Commits

Reviewing files that changed from the base of the PR and between 0dce3d4 and 3c54dfb.

📒 Files selected for processing (1)
  • scripts/performance/utils/overrides.py

📝 Walkthrough

Walkthrough

This change fixes a bug where _set_cuda_graph_overrides crashes with a TypeError when cuda_graph_impl is provided via CLI but cuda_graph_scope is not. The fix defers validation until after all overrides are written to the recipe, allowing default values to be used if the CLI argument is omitted.

Changes

Cohort / File(s) Summary
Validation Timing
scripts/performance/utils/overrides.py
Moved cuda_graph_scope validation from pre-override to post-override execution, checking the effective recipe state instead of raw CLI arguments to properly handle None defaults.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

🚥 Pre-merge checks | ✅ 6
✅ Passed checks (6 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly and concisely describes the main change: guarding cuda_graph_scope validation against None to fix a crash.
Linked Issues check ✅ Passed The PR correctly addresses issue #3247 by moving validation after overrides are applied, ensuring validation checks effective recipe state with proper None guards.
Out of Scope Changes check ✅ Passed All changes are directly scoped to fixing the cuda_graph_scope validation bug described in issue #3247 with no unrelated modifications.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Test Results For Major Changes ✅ Passed Minor bug fix in single file with adequate test coverage referenced in PR description.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch yuya/fix-cuda-graph-scope-none-crash

Comment @coderabbitai help to get the list of available commands and usage tips.

@yaoyu-33 yaoyu-33 added bug Something isn't working area:perf Performance optimizations and benchmarking needs-review PR is ready for code review and waiting on a reviewer labels Apr 9, 2026
@malay-nagda

Copy link
Copy Markdown
Contributor

/ok to test 3c54dfb

@malay-nagda malay-nagda added r0.4.0 Auto-cherrypick to release branch. Apply before merge; cherrypick happens after merge. 26.04.01 labels Apr 10, 2026
@malay-nagda malay-nagda merged commit fad15ab into main Apr 10, 2026
52 checks passed
@malay-nagda malay-nagda deleted the yuya/fix-cuda-graph-scope-none-crash branch April 10, 2026 07:11
svcnvidia-nemo-ci pushed a commit that referenced this pull request Apr 10, 2026
Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>
yaoyu-33 added a commit to conver334/Megatron-Bridge that referenced this pull request Apr 10, 2026
…Mo#3249)

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

26.04.01 area:perf Performance optimizations and benchmarking bug Something isn't working needs-review PR is ready for code review and waiting on a reviewer r0.4.0 Auto-cherrypick to release branch. Apply before merge; cherrypick happens after merge.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[bug] _set_cuda_graph_overrides crashes with TypeError when cuda_graph_scope is None

2 participants