[main] feat(moe): Support packed sequence for gated delta net (GDN) by yuzhongw-nvidia · Pull Request #2645 · NVIDIA/Megatron-LM

yuzhongw-nvidia · 2025-12-12T13:13:45Z

What does this PR do?

Support packed sequence for gated delta net (GDN).

PR for dev: #2644, #4230
Closes: #4043, #3798
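As background for readers, the following is a minimal sketch of what a packed (THD-style) layout looks like and why a recurrent layer needs the sequence boundaries. It is not code from this PR: the helper names (`build_cu_seqlens`, `pack`, `unpack`, `segmented_running_sum`) are hypothetical, and the running sum is only a toy stand-in for a recurrent state that should reset at each boundary.

```python
from itertools import accumulate
from typing import List, Sequence, Tuple


def build_cu_seqlens(seq_lens: Sequence[int]) -> List[int]:
    """Return cumulative boundaries [0, l0, l0 + l1, ...] for packed sequences."""
    return [0, *accumulate(seq_lens)]


def pack(sequences: Sequence[Sequence[int]]) -> Tuple[List[int], List[int]]:
    """Concatenate variable-length sequences into one flat list plus boundaries."""
    flat = [tok for seq in sequences for tok in seq]
    return flat, build_cu_seqlens([len(seq) for seq in sequences])


def unpack(flat: Sequence[int], cu_seqlens: Sequence[int]) -> List[List[int]]:
    """Recover the original sequences from the flat layout and its boundaries."""
    return [list(flat[s:e]) for s, e in zip(cu_seqlens[:-1], cu_seqlens[1:])]


def segmented_running_sum(flat: Sequence[int], cu_seqlens: Sequence[int]) -> List[int]:
    """Toy recurrent scan: the state is reset at every sequence boundary."""
    out: List[int] = []
    for s, e in zip(cu_seqlens[:-1], cu_seqlens[1:]):
        state = 0  # reset so nothing leaks from the previous sequence
        for x in flat[s:e]:
            state += x
            out.append(state)
    return out


flat, cu_seqlens = pack([[1, 2, 3], [4, 5]])
print(flat, cu_seqlens)
print(segmented_running_sum(flat, cu_seqlens))
```

Without the boundaries, the scan would carry state from the first sequence into the second, which is the kind of cross-sequence leakage a packed-sequence implementation has to avoid.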

Contribution process

flowchart LR
    A[Pre-checks] --> B[PR Tests]
    subgraph Code Review/Approval
        C1[Expert Review] --> C2[Final Review]
    end
    B --> C1
    C2 --> D[Merge]

Pre-checks

I want this PR in a versioned release and have added the appropriate Milestone (e.g., Core 0.8)
I have added relevant unit tests
I have added relevant functional tests
I have added proper typing to my code (see the Typing guidelines)
I have added relevant documentation
I have run the autoformatter.sh on my PR

Code review

The following process is enforced via the CODEOWNERS file for changes into megatron/core. For changes outside of megatron/core, it is up to the PR author whether or not to tag the Final Reviewer team.

For MRs into `main` branch

(Step 1): Add PR label `Expert Review`

(Step 2): Collect the expert reviewers' reviews

Attach the Expert Review label when your PR is ready for review.
GitHub auto-assigns expert reviewers based on your changes. They will get notified and pick up your PR soon.

⚠️ Only proceed to the next step once all reviewers have approved, merge conflicts are resolved, and the CI is passing.
Final Review might get declined if these requirements are not fulfilled.

(Step 3): Final Review

Add Final Review label
GitHub auto-assigns final reviewers based on your changes. They will get notified and pick up your PR soon.

(Optional Step 4): Cherry-pick into release branch

If this PR also needs to be merged into core_r* release branches, after this PR has been merged, select Cherry-pick to open a new PR into the release branch.

For MRs into `dev` branch

The proposed review process for `dev` branch is under active discussion.

MRs are mergeable after one approval by either eharper@nvidia.com or zijiey@nvidia.com.

Merging your PR

Any member of core-adlr and core-nemo will be able to merge your PR.

copy-pr-bot · 2025-12-12T13:13:51Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

yuzhongw-nvidia · 2026-04-29T08:11:49Z

/ok to test 9b4c8ad

svcnvidia-nemo-ci · 2026-05-14T07:52:17Z

🔄 Merge queue validation started!

You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/25848663624

…apping_main

Conflict resolution + porting upstream deltas to relocated code:
- gpt/fine_grained_callables.py: keep slim HEAD; PreProcess / PostProcess / TransformerLayerNode / _BackwardDWWrapper and build_mtp_layer_callables / build_layer_callables now live under common/.
- common/utils.py: port PR NVIDIA#4511 (remove dead manual_release_grads code path) into TransformerLayerNode.backward_impl / backward_dw.
- common/fine_grained_callables.py: port PR NVIDIA#2645 (packed sequence GDN) into build_mtp_layer_callables: unpack the new 5-tuple from _get_embeddings and forward packed_seq_params / padding_mask.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…VIDIA#2645)

Signed-off-by: yuzhongw <yuzhongw@nvidia.com>
Co-authored-by: kunlunl <kunlunl@nvidia.com>
Co-authored-by: Xuesong Ye <xuesongyey@gmail.com>
Co-authored-by: Claude <noreply@anthropic.com>
Co-authored-by: claude[bot] <209825114+claude[bot]@users.noreply.github.com>
Co-authored-by: Daisy Gao <daisyg@nvidia.com>

MultiTokenPredictionLayer.forward calls self._checkpointed_forward(padding_mask=padding_mask, ...) (multi_token_prediction.py:1305), but _checkpointed_forward and its inner custom_forward never accepted padding_mask. With recompute_granularity == 'full' and self.training, this raised:

TypeError: MultiTokenPredictionLayer._checkpointed_forward() got an unexpected keyword argument 'padding_mask'

at multi_token_prediction.py:1301. The kwarg was introduced in #2645 on the call site; the _checkpointed_forward refactor in #4593 dropped padding_mask from the recompute path.

Add padding_mask:
* to _checkpointed_forward's signature
* to custom_forward's signature so it flows into _proj_and_transformer_layer
* positionally to te_checkpoint and tensor_parallel.checkpoint, matching the other tensor / None args (padding_mask is a rolled tensor, not a non-tensor closure-captured arg like attention_bias)
* to the recompute_method == 'block' fallback that also calls _proj_and_transformer_layer directly

Also remove the @pytest.mark.flaky_in_dev markers from test_forward_backward, test_fp8_support, and test_packed_sequences_with_full_recompute, which were added in #4931 to mask this exact failure.

Closes #4933

Signed-off-by: oliver könig <okoenig@nvidia.com>
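The failure mode and the fix described in this commit message can be reproduced in miniature. The sketch below is illustrative only and is not the Megatron-LM code: `fake_checkpoint`, `BrokenLayer`, `FixedLayer`, and `_layer` are hypothetical stand-ins, and `fake_checkpoint` merely mimics a checkpoint wrapper that forwards positional arguments.

```python
from typing import Any, Callable, Optional, Tuple


def fake_checkpoint(fn: Callable[..., Any], *args: Any) -> Any:
    """Hypothetical stand-in for a checkpoint wrapper: positional args only."""
    return fn(*args)


class BrokenLayer:
    """Call site passes padding_mask, but the recompute path does not accept it."""

    def _layer(self, hidden: Any, padding_mask: Optional[Any] = None) -> Tuple[Any, Any]:
        return hidden, padding_mask

    def _checkpointed_forward(self, hidden: Any) -> Tuple[Any, Any]:
        def custom_forward(hidden: Any) -> Tuple[Any, Any]:
            return self._layer(hidden)

        return fake_checkpoint(custom_forward, hidden)

    def forward(self, hidden: Any, padding_mask: Optional[Any] = None) -> Tuple[Any, Any]:
        # Raises TypeError in BrokenLayer: unexpected keyword argument 'padding_mask'.
        return self._checkpointed_forward(hidden, padding_mask=padding_mask)


class FixedLayer(BrokenLayer):
    """Accept padding_mask and pass it positionally through the wrapper."""

    def _checkpointed_forward(
        self, hidden: Any, padding_mask: Optional[Any] = None
    ) -> Tuple[Any, Any]:
        def custom_forward(hidden: Any, padding_mask: Optional[Any]) -> Tuple[Any, Any]:
            return self._layer(hidden, padding_mask=padding_mask)

        return fake_checkpoint(custom_forward, hidden, padding_mask)


try:
    BrokenLayer().forward([1.0], padding_mask=[False])
except TypeError as exc:
    print("broken:", exc)

print("fixed:", FixedLayer().forward([1.0], padding_mask=[False]))
```

Passing the mask positionally through the wrapper, rather than capturing it in the closure, mirrors the commit's stated choice for tensor / None arguments.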

Megatron passes a PackedSeqParams object to every attention module even in BSHD mode (qkv_format="bshd"), but GDN.forward blanket-rejected any non-None packed_seq_params, crashing Qwen3.5/3.6 GDN training/inference in BSHD. Guard on qkv_format=="thd" specifically, matching upstream NVIDIA/Megatron-LM PR NVIDIA#2645. Genuine THD packing still raises (full support is a follow-up port). Fixes radixark/miles#1292
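The difference between the blanket rejection and the narrower guard can be sketched as follows. This is an assumption-laden illustration, not the actual GDN.forward code: `PackedSeqParamsStub` is a hypothetical stand-in that models only the `qkv_format` field mentioned above, and both check functions are invented names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PackedSeqParamsStub:
    """Hypothetical stand-in; only the qkv_format field from the text is modeled."""

    qkv_format: Optional[str] = None


def check_blanket(packed_seq_params: Optional[PackedSeqParamsStub]) -> None:
    """Old behavior as described: reject any non-None packed_seq_params."""
    if packed_seq_params is not None:
        raise NotImplementedError("packed_seq_params is not supported")


def check_thd_only(packed_seq_params: Optional[PackedSeqParamsStub]) -> None:
    """Narrower guard: reject only genuine THD packing, allow BSHD through."""
    if packed_seq_params is not None and packed_seq_params.qkv_format == "thd":
        raise NotImplementedError("THD packed sequences are not supported yet")


bshd = PackedSeqParamsStub(qkv_format="bshd")
check_thd_only(bshd)  # passes: BSHD mode is no longer rejected
try:
    check_blanket(bshd)
except NotImplementedError as exc:
    print("blanket guard rejects BSHD:", exc)
```

With the narrower guard, a params object in BSHD mode passes through, while a `qkv_format` of `"thd"` still raises.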

…ron-bridge to latest (#1762)

This PR makes the following bumps:
- megatron_core from `cefc2520158c7ceba3f9adbe4b547a6f7a118da1` (latest dev branch as of 6/8/26) to `71e418ea7d7b3a6c9a53238c543c3e0b43e11026` (latest main branch as of 6/8/26)
- megatron-bridge from `8382dc343b07b068a827ca20bae860633df3baa0` to `91a15142a4b4442a8d46ab539d1b923bd08570d0` (latest main 6/8)

Megatron-Bridge has upstreamed code to the main branch that isn't on the dev branch and that is needed to use Megatron-Bridge (NVIDIA-NeMo/Megatron-Bridge#3988). Since sequence packing with GDN is now supported on main, we can move back over to the latest commit on the main branch: NVIDIA/Megatron-LM#2645

yuzhongw-nvidia mentioned this pull request Dec 12, 2025

[dev] feat(moe): Support packed sequence for gated delta net (GDN) #2644

Merged

6 tasks

yuzhongw-nvidia force-pushed the gdn_thd branch 2 times, most recently from 2575c6d to 4f8888d Compare December 15, 2025 03:45

yuzhongw-nvidia force-pushed the gdn_thd branch from 4f8888d to 9ccf5a4 Compare January 6, 2026 09:42

yuzhongw-nvidia force-pushed the gdn_thd branch 2 times, most recently from 73d512d to ae8806c Compare January 21, 2026 11:04

copy-pr-bot Bot temporarily deployed to nemo-ci January 21, 2026 13:09 Inactive

copy-pr-bot Bot temporarily deployed to test January 21, 2026 13:10 Inactive

copy-pr-bot Bot temporarily deployed to nemo-ci January 21, 2026 13:54 Inactive

yuzhongw-nvidia force-pushed the gdn_thd branch from 83a8607 to 545a2a5 Compare January 22, 2026 03:59

copy-pr-bot Bot temporarily deployed to nemo-ci January 22, 2026 04:00 Inactive

yuzhongw-nvidia force-pushed the gdn_thd branch 2 times, most recently from 58fdd22 to e8ed23c Compare January 28, 2026 13:43

copy-pr-bot Bot temporarily deployed to nemo-ci January 29, 2026 06:15 Inactive

copy-pr-bot Bot temporarily deployed to nemo-ci January 29, 2026 06:16 Inactive

yuzhongw-nvidia force-pushed the gdn_thd branch from 80d2d1c to e94395d Compare February 26, 2026 09:07

yuzhongw-nvidia force-pushed the gdn_thd branch 3 times, most recently from 8ae8e94 to cebd475 Compare April 7, 2026 04:55

copy-pr-bot Bot temporarily deployed to test April 7, 2026 04:56 Inactive

yuzhongw-nvidia force-pushed the gdn_thd branch from cebd475 to ec781f4 Compare April 14, 2026 02:39

yuzhongw-nvidia force-pushed the gdn_thd branch from 3cdbc2b to 9b4c8ad Compare April 29, 2026 08:10

copy-pr-bot Bot temporarily deployed to test April 29, 2026 08:12 Inactive

asolergi-nv mentioned this pull request May 8, 2026

Support packed sequence for qwen3.5 sft----gated delta net (GDN) #3798

Closed

ericharper approved these changes May 12, 2026

jaredcasper approved these changes May 14, 2026

svcnvidia-nemo-ci added Approved All necessary approvals have been made and removed Final Review PR is in the "final review" stage labels May 14, 2026

asolergi-nv enabled auto-merge May 14, 2026 07:51

asolergi-nv added this pull request to the merge queue May 14, 2026

Merged via the queue into NVIDIA:main with commit 2d1fa8d May 14, 2026
69 of 72 checks passed

This was referenced May 14, 2026

[recipe] feat: enable THD packing by default for Qwen3.5-VL finetune NVIDIA-NeMo/Megatron-Bridge#3481

Merged

[model, perf] feat: real THD packing in qwen3_vl_step NVIDIA-NeMo/Megatron-Bridge#3838

Draft

xuantengh mentioned this pull request May 21, 2026

Fuse per-sequence AlltoAll into a unified one in GDN forward #4913

Merged

5 tasks

ko3n1g mentioned this pull request May 22, 2026

🐛 CI failure: MultiTokenPredictionLayer._checkpointed_forward() got unexpected kwarg 'padding_mask' #4933

Open

factnn mentioned this pull request May 25, 2026

Fix _checkpointed_forward missing padding_mask parameter #4966

Closed

ko3n1g mentioned this pull request May 26, 2026

fix: forward padding_mask through MTP recompute path #4983

Closed

sbhavani mentioned this pull request May 26, 2026

[ROADMAP][2026 Q2] Megatron Core Roadmap #4997

Open

dznvidia mentioned this pull request May 28, 2026

GDN packed sequence support landing in Nemo containers #5044

Open

Zhichenzzz mentioned this pull request Jun 8, 2026

fix(gdn): only reject THD packed_seq_params, allow BSHD radixark/Megatron-LM#52

Open

erictang000 mentioned this pull request Jun 8, 2026

[chore] move megatron-core from dev latest to main latest, bump megatron-bridge to latest NovaSky-AI/SkyRL#1762

Merged

asolergi-nv merged 6 commits into NVIDIA:main from yuzhongw-nvidia:gdn_thd

9 participants
