Cudagraphs: MCore local path training with variable-sized sequences by mathemakitten · Pull Request #5046 · NVIDIA/Megatron-LM

mathemakitten · 2026-05-28T21:23:57Z

I, the PR author, have personally reviewed every line of this PR.

What does this PR do?

Adds CUDA graph capture/replay support for packed-sequence (THD / variable-length) training using the MCore-local cudagraph path.

- pad_cu_seqlens_for_cuda_graph() pads cu_seqlens from K+1 entries to target_num_seqs+1 entries by repeating the final cumulative value, so the captured graph's input signature stays stable across steps.
- disable_cuda_graphs_this_step() is a per-step bypass context manager that MegatronModule.__call__ consults at the single dispatch site. It is used to fall back to eager execution when a microbatch has more documents than the cap.
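The padding rule and the eager fallback described above can be sketched roughly as follows. This is a minimal illustration on plain Python lists, not the PR's implementation: the names `pad_cu_seqlens_sketch`, `bypass_cuda_graphs_sketch`, `dispatch_sketch`, and the module-level flag are hypothetical, and the real helpers presumably operate on tensors and integrate with MegatronModule.__call__. Because the final cumulative value is repeated, each padded entry describes a zero-length sequence.

```python
from contextlib import contextmanager
from typing import Iterator, List, Tuple

# Hypothetical module-level flag standing in for whatever state the PR's
# single dispatch site actually consults.
_bypass_this_step = False


def pad_cu_seqlens_sketch(cu_seqlens: List[int], target_num_seqs: int) -> List[int]:
    """Pad K+1 cumulative offsets to target_num_seqs+1 by repeating the last value."""
    num_seqs = len(cu_seqlens) - 1
    if num_seqs > target_num_seqs:
        raise ValueError("more sequences than the cap; the caller should run eagerly")
    # Repeated final offsets encode zero-length trailing sequences.
    return cu_seqlens + [cu_seqlens[-1]] * (target_num_seqs - num_seqs)


@contextmanager
def bypass_cuda_graphs_sketch() -> Iterator[None]:
    """Set the bypass flag for the duration of one step, then restore it."""
    global _bypass_this_step
    previous = _bypass_this_step
    _bypass_this_step = True
    try:
        yield
    finally:
        _bypass_this_step = previous


def dispatch_sketch(cu_seqlens: List[int], cap: int) -> Tuple[str, List[int]]:
    """Choose graph replay with padded offsets, or eager when over the cap."""
    if len(cu_seqlens) - 1 > cap:
        with bypass_cuda_graphs_sketch():
            mode = "eager" if _bypass_this_step else "graph"
            return mode, cu_seqlens
    return "graph", pad_cu_seqlens_sketch(cu_seqlens, cap)
```

For example, with a cap of 4, `dispatch_sketch([0, 3, 7], 4)` takes the graph path with offsets `[0, 3, 7, 7, 7]`, while a microbatch with five or more sequences takes the eager path with its offsets unchanged.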

New flag: --cuda-graph-max-packed-seqs N sets the cap on packed sequences per microbatch (default 0, which disables the feature). Passing the flag without a value enables it with a cap of 50.
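One common way to get this "off by default, 50 when passed bare, N when passed with a value" behavior is argparse's `nargs='?'` with a `const`. The sketch below is an assumption about how such a flag could be declared, not a copy of the PR's argument definition; `build_parser_sketch` is a hypothetical name.

```python
import argparse


def build_parser_sketch() -> argparse.ArgumentParser:
    """Declare the flag with an optional value (hypothetical wiring)."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--cuda-graph-max-packed-seqs",
        type=int,
        nargs="?",   # the value after the flag is optional
        const=50,    # used when the flag is passed without a value
        default=0,   # used when the flag is absent; 0 means off
    )
    return parser


parser = build_parser_sketch()
absent = parser.parse_args([]).cuda_graph_max_packed_seqs
bare = parser.parse_args(["--cuda-graph-max-packed-seqs"]).cuda_graph_max_packed_seqs
explicit = parser.parse_args(["--cuda-graph-max-packed-seqs", "8"]).cuda_graph_max_packed_seqs
```

Here `absent` is 0, `bare` is 50, and `explicit` is 8.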

Issue tracking

For PRs from open-source community contributors:

New features: a linked issue is required. Please open a feature request and reference it here before submitting the PR.
Small updates (bug fixes, minor improvements): a linked issue is recommended and will accelerate the PR review process.

Linked issue:

Contribution process

Pre-checks

I have added relevant unit tests
I have added relevant functional tests
I have added proper typing to my code (see the Typing guidelines)
I have added relevant documentation
I have run the autoformatter.sh on my PR

Code review

Feel free to message or comment @NVIDIA/mcore-oncall to help accelerate your merge into main. The less complex your PR is, the faster it will be approved and merged!

All PRs start as draft. If you open a non-draft PR, it will be automatically converted to draft.

Step 1: Mark PR as "Ready for Review"

When your PR is ready, click Ready for Review.
An oncall reviewer is auto-assigned and expert reviewers are notified based on your changes.
- Some PRs may jump straight to step 2. This is determined by .github/CODEOWNERS.

⚠️ Only mark as ready once merge-conflicts are resolved and the CI is passing.
Final Review might get declined if these requirements are not fulfilled.

Step 2: Final Review

For PRs that change megatron/core, once all expert reviewers have approved, the Final Review label is applied automatically and final reviewers are assigned.

For PRs outside megatron/core, this step is skipped.

Step 3: Approved

Once all required reviewers have approved, the Approved label is applied automatically.

Merge

Any member of mcore-engineers will be able to merge your PR.

copy-pr-bot · 2026-05-28T21:24:01Z

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

mathemakitten added 4 commits May 28, 2026 07:33

- bea33d5: draft functionality
- 05d2380: cleanup
- bf13d47: cleanup
- 3cf9911: add comment

Victarry mentioned this pull request Jun 10, 2026

[ROADMAP][2026 Q2] Megatron Core MoE Roadmap #4815

Open

71 tasks

mathemakitten wants to merge 4 commits into NVIDIA:main from mathemakitten:helenn-training-cg-nemorl
