[Dev] Add chunk-wise (whole-block) CUDA graph support for THD training#5258

Open
HaochenYuan wants to merge 15 commits into NVIDIA:dev from HaochenYuan:thd_chunk_wise_graph

@HaochenYuan

Contributor
  • I, the PR author, have personally reviewed every line of this PR.

What does this PR do ?

Summary

Adds a chunk-level CUDA graph granularity (--cuda-graph-impl local --cuda-graph-granularity chunk) for THD (packed-sequence) training. It captures an entire pipeline model chunk (a whole TransformerBlock) as a single Megatron-owned graph, with the forward and backward passes split so that they fit the 1F1B/VPP schedule. This PR is stacked on #4359 (layer-wise THD CUDA graphs), which provides the shared THD padding, RoPE, and padding-mask foundation.
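For readers unfamiliar with why THD padding matters for graph capture: a captured graph replays with fixed tensor shapes, while a packed batch has a different total token count every step. The description does not spell out how #4359 pads, so the following is only a minimal sketch of the general idea, assuming a fixed token capacity per micro-batch. The helper name pad_thd_batch and the capacity parameter are hypothetical and do not come from this PR.

```python
from itertools import accumulate


def pad_thd_batch(seq_lens, capacity):
    """Pad a packed (THD) batch to a fixed token capacity.

    Hypothetical helper for illustration only. Returns the cumulative
    sequence boundaries of the real tokens and a boolean mask that is
    True on padded positions.
    """
    total = sum(seq_lens)
    if total > capacity:
        raise ValueError(f"{total} tokens do not fit in capacity {capacity}")
    # Boundaries of the real sequences inside the packed token dimension.
    cu_seqlens = [0] + list(accumulate(seq_lens))
    # Every batch now has exactly `capacity` positions, so shapes are static.
    padding_mask = [False] * total + [True] * (capacity - total)
    return cu_seqlens, padding_mask


cu_seqlens, padding_mask = pad_thd_batch([3, 5, 2], capacity=16)
print(cu_seqlens)         # boundaries of the three real sequences
print(sum(padding_mask))  # number of padded positions
```

With every micro-batch padded to the same capacity, the captured kernels see the same shapes on every replay, and the mask tells downstream code which positions to ignore.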

Why chunk-wise, not the per-layer graphs in #4359

  • It captures the overhead that the per-layer graphs leave in eager mode. In THD + MoE/HybridEP training, much of the per-iteration cost lies around and between layers: the device syncs on dynamic token counts (int(max_num_tokens.item())), MoE dispatch/combine, and per-layer launch overhead. Graphing the whole chunk folds those CPU-GPU syncs and launches into one replay, recovering GPU utilization otherwise lost to PP bubbles and dispatcher stalls.
  • Single capture, dynamic micro-batch replay. A Megatron-owned slot model captures once at the schedule's maximum in-flight micro-batch count and replays for any runtime micro-batch count, so it is robust to micro-batch counts that change across iterations (4->6->8->...) without re-capture.
  • Paged-stash compatible (local graph backend only). Because Megatron owns the capture timing, it captures after paged stash is initialized.
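The slot idea in the second bullet can be illustrated with a toy simulation. This is a sketch under stated assumptions, not the PR's implementation: it assumes a non-interleaved 1F1B schedule, in which a stage with rank r out of P stages holds at most min(P - r, M) micro-batches in flight, and it models a "graph slot" as a plain integer. SlotPool and one_f_one_b_order are hypothetical names.

```python
from collections import deque


def one_f_one_b_order(num_microbatches, pp_size, pp_rank):
    """Forward/backward order for one stage of a non-interleaved 1F1B schedule."""
    warmup = min(pp_size - pp_rank - 1, num_microbatches)
    steps = [("F", i) for i in range(warmup)]
    fwd, bwd = warmup, 0
    while fwd < num_microbatches:   # steady state: one forward, one backward
        steps.append(("F", fwd))
        fwd += 1
        steps.append(("B", bwd))
        bwd += 1
    while bwd < num_microbatches:   # cooldown: drain remaining backwards
        steps.append(("B", bwd))
        bwd += 1
    return steps


class SlotPool:
    """Toy stand-in for a fixed set of slots captured once, up front."""

    def __init__(self, num_slots):
        self.free = deque(range(num_slots))
        self.owner = {}      # micro-batch id -> slot currently holding it
        self.replays = 0

    def forward(self, mb):
        slot = self.free.popleft()  # IndexError if the schedule needs more slots
        self.owner[mb] = slot
        self.replays += 1

    def backward(self, mb):
        self.free.append(self.owner.pop(mb))  # slot is reusable after backward
        self.replays += 1


def run_iteration(pool, num_microbatches, pp_size, pp_rank):
    """Replay one iteration and return the peak number of slots in use."""
    peak = 0
    for kind, mb in one_f_one_b_order(num_microbatches, pp_size, pp_rank):
        if kind == "F":
            pool.forward(mb)
        else:
            pool.backward(mb)
        peak = max(peak, len(pool.owner))
    return peak


PP_SIZE, PP_RANK = 4, 0
pool = SlotPool(PP_SIZE - PP_RANK)   # sized once from the schedule's max in-flight count
peaks = [run_iteration(pool, m, PP_SIZE, PP_RANK) for m in (4, 6, 8)]
print(peaks, pool.replays)
```

In this toy model the pool is sized once and the same slots serve iterations with 4, 6, and 8 micro-batches, because a slot is returned to the pool as soon as its backward pass has run. The real implementation additionally has to handle VPP interleaving and the actual graph buffers, which this sketch ignores.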

Validation
GB200, 2 nodes with 8 GPUs, full 27-layer Moonlight (MLA) + HybridEP + paged stash: the no-graph and chunk-graph runs are bit-exact across all iterations.
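For reference, "bit-exact" is a stricter check than comparing within a tolerance. A minimal sketch of such a comparison over per-iteration loss values, assuming the losses have been collected as Python floats; the function names and the numbers below are placeholders for illustration, not results from this PR:

```python
import struct


def float_bits(x):
    """Raw IEEE-754 double bytes of x, so 0.0 and -0.0 compare as different."""
    return struct.pack("<d", x)


def first_mismatch(reference, candidate):
    """Index of the first iteration whose value differs bitwise, or None."""
    if len(reference) != len(candidate):
        raise ValueError("runs have different numbers of iterations")
    for i, (a, b) in enumerate(zip(reference, candidate)):
        if float_bits(a) != float_bits(b):
            return i
    return None


# Placeholder values, not measurements.
no_graph = [10.5, 9.25, 8.125]
chunk_graph = list(no_graph)
perturbed = [10.5, 9.25, 8.125 + 2.0 ** -40]

print(first_mismatch(no_graph, chunk_graph))  # identical runs
print(first_mismatch(no_graph, perturbed))    # differs in the last iteration
```

Comparing raw bit patterns rather than using == also distinguishes 0.0 from -0.0 and treats identical NaN payloads as equal, which matches the usual meaning of a bit-exact run.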

⚠️ For major changes (either in lines of code or in its impact), please make sure to first share a design doc with the team. If you're unsure what's the best way to do so, contact @NVIDIA/mcore-oncall.

Issue tracking

For PRs from open-source community contributors:

  • New features: a linked issue is required. Please open a feature request and reference it here before submitting the PR.
  • Small updates (bug fixes, minor improvements): a linked issue is recommended and will accelerate the PR review process.

Linked issue:

Contribution process

Pre-checks

  • I have added relevant unit tests
  • I have added relevant functional tests
  • I have added proper typing to my code (see the Typing guidelines)
  • I have added relevant documentation
  • I have run the autoformatter.sh on my PR

Code review

Feel free to message or comment @NVIDIA/mcore-oncall to help accelerate your merge into main. The less complex your PR is, the faster it will be approved and merged!

All PRs start as draft. If you open a non-draft PR, it will be automatically converted to draft.

Step 1: Mark PR as "Ready for Review"

  1. When your PR is ready, click Ready for Review.
  2. An oncall reviewer is auto-assigned and expert reviewers are notified based on your changes.
    • Some PRs may jump straight to step 2. This is determined by .github/CODEOWNERS.

⚠️ Only mark as ready once merge-conflicts are resolved and the CI is passing.
Final Review might get declined if these requirements are not fulfilled.

Step 2: Final Review

For PRs that change megatron/core, once all expert reviewers have approved, the Final Review label is applied automatically and final reviewers are assigned.

For PRs outside megatron/core, this step is skipped.

Step 3: Approved

Once all required reviewers have approved, the Approved label is applied automatically.

Merge

Any member of mcore-engineers will be able to merge your PR.

@HaochenYuan requested review from a team as code owners June 10, 2026 05:03
@HaochenYuan added the Expert Review [deprecated] and dev branch labels Jun 10, 2026
@hxbai mentioned this pull request Jun 10, 2026
@dingqingy-nv added and removed the deepseekv4 label Jun 11, 2026
Labels

dev branch, Expert Review [deprecated], module: moe
