[Dev] Add chunk-wise (whole-block) CUDA graph support for THD training by HaochenYuan · Pull Request #5258 · NVIDIA/Megatron-LM

HaochenYuan · 2026-06-10T05:03:50Z

I, the PR author, have personally reviewed every line of this PR.

What does this PR do?

Summary

Adds a chunk CUDA-graph granularity (--cuda-graph-impl local --cuda-graph-granularity chunk) for THD (packed-sequence) training. It captures an entire pipeline model chunk (a whole TransformerBlock) as a single Megatron-owned graph, with forward and backward split so that replay fits the 1F1B/VPP schedule. Stacked on #4359 (layer-wise THD CUDA graphs), which provides the shared THD padding / RoPE / padding-mask foundation.
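For readers new to the THD padding mentioned above: a captured graph replays with fixed tensor shapes, so a packed batch whose token count varies has to be padded to a static size first. The following is a minimal pure-Python sketch of that idea, not the code from #4359. The helper name `pad_thd` and the parameters `max_tokens` and `max_seqs` are hypothetical and chosen only for illustration.

```python
from itertools import accumulate


def pad_thd(seq_lens, max_tokens, max_seqs):
    """Pad a packed (THD) batch description to fixed sizes.

    Hypothetical helper, not a Megatron API. Returns cu_seqlens of
    length max_seqs + 1 and a per-token padding mask of length
    max_tokens (True marks a padding token).
    """
    total = sum(seq_lens)
    if total > max_tokens or len(seq_lens) > max_seqs:
        raise ValueError("batch exceeds the captured static shape")
    cu_seqlens = [0, *accumulate(seq_lens)]
    # Repeat the last offset so unused sequence slots have length zero.
    cu_seqlens += [total] * (max_seqs + 1 - len(cu_seqlens))
    padding_mask = [False] * total + [True] * (max_tokens - total)
    return cu_seqlens, padding_mask


cu, mask = pad_thd([3, 5, 2], max_tokens=16, max_seqs=4)
print(cu)         # [0, 3, 8, 10, 10]
print(sum(mask))  # 6 padding tokens
```

Every batch then presents the same shapes to the graph, and the mask lets downstream code ignore the padded tokens.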

Why chunk-wise, not the per-layer graphs in #4359

It captures the overhead that per-layer graphs leave in eager mode. In THD + MoE/HybridEP training, much of the per-iteration cost lives around and between layers: the dynamic token-count device syncs (int(max_num_tokens.item())), MoE dispatch/combine, and per-layer launch overhead. Graphing the whole chunk folds those CPU-GPU syncs and launches into one replay, recovering GPU utilization otherwise lost to PP bubbles and dispatcher stalls.
Single capture, dynamic micro-batch replay. A Megatron-owned slot model captures once at the schedule's max in-flight count and replays for any runtime micro-batch count, so it is robust to micro-batch counts that change across iterations (4->6->8->...) with no re-capture.
Paged-stash compatible (local graph backend only). Because Megatron owns the capture timing, it captures after paged stash is initialized.
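To illustrate the "single capture, dynamic replay" point, here is a small pure-Python simulation of a slot pool sized once at the maximum in-flight count and reused for any micro-batch count. It is a sketch under stated assumptions, not this PR's implementation. `SlotPool`, `run_1f1b`, and the simplified single-stage 1F1B order are hypothetical, and no CUDA graphs are involved.

```python
class SlotPool:
    """Fixed pool of slots, sized once at 'capture' time (hypothetical)."""

    def __init__(self, max_in_flight):
        self.free = list(range(max_in_flight))
        self.in_use = {}  # micro-batch id -> slot id

    def forward(self, mb):
        if not self.free:
            raise RuntimeError("more micro-batches in flight than captured slots")
        self.in_use[mb] = self.free.pop(0)
        return self.in_use[mb]

    def backward(self, mb):
        # The backward pass frees the slot its forward pass occupied.
        slot = self.in_use.pop(mb)
        self.free.append(slot)
        return slot


def run_1f1b(pool, num_microbatches, warmup):
    """Simplified 1F1B order: warmup forwards, steady 1F1B, cooldown backwards."""
    trace = []
    warmup = min(warmup, num_microbatches)
    for mb in range(warmup):
        trace.append(("F", mb, pool.forward(mb)))
    for mb in range(warmup, num_microbatches):
        trace.append(("F", mb, pool.forward(mb)))
        done = mb - warmup
        trace.append(("B", done, pool.backward(done)))
    for mb in range(num_microbatches - warmup, num_microbatches):
        trace.append(("B", mb, pool.backward(mb)))
    return trace


# With warmup=2, at most 3 micro-batches are in flight at once.
pool = SlotPool(max_in_flight=3)
used = set()
for n in (4, 6, 8):  # micro-batch count changes across iterations
    used |= {slot for _, _, slot in run_1f1b(pool, n, warmup=2)}
print(sorted(used))
```

In this toy model the pool size depends only on the schedule's in-flight depth, not on the total number of micro-batches, which is why changing the count from 4 to 6 to 8 needs no new slots.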

Validation
GB200, 2-node 8-GPU, full 27-layer Moonlight (MLA) + HybridEP + paged stash: no-graph vs chunk-graph bit-exact across all iterations.
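A bit-exact check is stricter than a tolerance-based one. The sketch below shows one way to compare per-iteration scalars such as loss values. This is an assumption: the PR does not say which quantities were compared or how. The function names and the sample values are made up for illustration.

```python
import struct


def float_bits(x):
    """Return the raw IEEE-754 binary64 bit pattern of a Python float."""
    return struct.unpack("<Q", struct.pack("<d", x))[0]


def first_mismatch(baseline, candidate):
    """Index of the first iteration that differs in any bit, or None."""
    if len(baseline) != len(candidate):
        raise ValueError("runs have different iteration counts")
    for i, (a, b) in enumerate(zip(baseline, candidate)):
        if float_bits(a) != float_bits(b):
            return i
    return None


# Illustrative values only, not results from this PR.
no_graph = [2.5, 2.25, 2.125]
chunk_graph = [2.5, 2.25, 2.125]
print(first_mismatch(no_graph, chunk_graph))  # None

drift = [2.5, 2.25, 2.125 + 2**-40]
print(first_mismatch(no_graph, drift))  # 2
```

Comparing bit patterns rather than using == also distinguishes 0.0 from -0.0 and treats identical NaN payloads as equal.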

⚠️ For major changes (either in lines of code or in its impact), please make sure to first share a design doc with the team. If you're unsure what's the best way to do so, contact @NVIDIA/mcore-oncall.

Issue tracking

For PRs from open-source community contributors:

New features: a linked issue is required. Please open a feature request and reference it here before submitting the PR.
Small updates (bug fixes, minor improvements): a linked issue is recommended and will accelerate the PR review process.

Linked issue:

Contribution process

Pre-checks

I have added relevant unit tests
I have added relevant functional tests
I have added proper typing to my code (Typing guidelines)
I have added relevant documentation
I have run the autoformatter.sh on my PR

Code review

Feel free to message or comment @NVIDIA/mcore-oncall to help accelerate your merge into main. The less complex your PR is, the faster it will be approved and merged!

All PRs start as draft. If you open a non-draft PR, it will be automatically converted to draft.

Step 1: Mark PR as "Ready for Review"

When your PR is ready, click Ready for Review.
An oncall reviewer is auto-assigned and expert reviewers are notified based on your changes.
- Some PRs may jump straight to step 2. This is determined by .github/CODEOWNERS.

⚠️ Only mark as ready once merge-conflicts are resolved and the CI is passing.
Final Review might get declined if these requirements are not fulfilled.

Step 2: Final Review

For PRs that change megatron/core, once all expert reviewers have approved, the Final Review label is applied automatically and final reviewers are assigned.

For PRs outside megatron/core, this step is skipped.

Step 3: Approved

Once all required reviewers have approved, the Approved label is applied automatically.

Merge

Any member of mcore-engineers will be able to merge your PR.

Haochen Yuan and others added 15 commits May 20, 2026 06:30

add cuda graph support for thd format training

fa3569f

add unit test

22a990f

fix & refactor pad-thd logic

0c5849b

refactor

131f05f

refactor

34176af

Merge branch 'dev' into thd_cuda_graph_dev

5ecff2c

fix linting

c6cafc3

fix linting

66526e2

fix linting

d4dfac5

Removed redundant import of _flash_attn_forward.

fix CI

42bc1fa

change UT cp size to avoid OOM

f30eb02

shorten the UT seqlen

03c0cb1

fix UT

a1853a7

refactor padding

aec3f15

thd chunk-wise cuda graph

3b5dd92

HaochenYuan added the module: moe label Jun 10, 2026

HaochenYuan requested review from a team as code owners June 10, 2026 05:03

HaochenYuan added the Expert Review [deprecated] and dev branch labels Jun 10, 2026

Victarry mentioned this pull request Jun 10, 2026

[ROADMAP][2026 Q2] Megatron Core MoE Roadmap #4815

Open

71 tasks

hxbai mentioned this pull request Jun 10, 2026

DeepSeek-V4 training support #4468

Open

3 tasks

dingqingy-nv added and removed the deepseekv4 label Jun 11, 2026

HaochenYuan wants to merge 15 commits into NVIDIA:dev from HaochenYuan:thd_chunk_wise_graph
