
[main] feat(moe): Support packed sequence for gated delta net (GDN) #2645

Merged
asolergi-nv merged 6 commits into NVIDIA:main from yuzhongw-nvidia:gdn_thd on May 14, 2026

@yuzhongw-nvidia yuzhongw-nvidia commented Dec 12, 2025

What does this PR do?

Support packed sequence for gated delta net (GDN).

PR for dev: #2644, #4230
Closes: #4043, #3798
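
For readers new to the term: a packed sequence (the THD layout that the branch name `gdn_thd` refers to) concatenates variable-length sequences along one token axis and records the boundaries as cumulative lengths, commonly named `cu_seqlens`. The following is a minimal sketch of that bookkeeping in plain Python. The helper names `build_cu_seqlens` and `split_packed` are hypothetical and are not taken from this PR.

```python
from itertools import accumulate


def build_cu_seqlens(seq_lens):
    """Return cumulative sequence boundaries, starting at 0."""
    return [0] + list(accumulate(seq_lens))


def split_packed(tokens, cu_seqlens):
    """Recover the individual sequences from one packed token list."""
    return [tokens[start:end] for start, end in zip(cu_seqlens[:-1], cu_seqlens[1:])]


# Three sequences of different lengths packed into a single token axis.
seq_lens = [3, 1, 4]
packed = list(range(sum(seq_lens)))
cu_seqlens = build_cu_seqlens(seq_lens)
sequences = split_packed(packed, cu_seqlens)
```

A recurrent layer must not carry state across these boundaries, which is presumably why packed input needs explicit support in a layer such as GDN.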

Contribution process

flowchart LR
    A[Pre-checks] --> B[PR Tests]
    subgraph Code Review/Approval
        C1[Expert Review] --> C2[Final Review]
    end
    B --> C1
    C2 --> D[Merge]

Pre-checks

  • I want this PR in a versioned release and have added the appropriate Milestone (e.g., Core 0.8)
  • I have added relevant unit tests
  • I have added relevant functional tests
  • I have added proper typing to my code (see Typing guidelines)
  • I have added relevant documentation
  • I have run the autoformatter.sh on my PR

Code review

The following process is enforced via the CODEOWNERS file for changes into megatron/core. For changes outside of megatron/core, it is up to the PR author whether or not to tag the Final Reviewer team.

For MRs into `main` branch

(Step 1): Add PR label Expert Review

(Step 2): Collect the expert reviewers' reviews

  1. Attach the Expert Review label when your PR is ready for review.
  2. GitHub auto-assigns expert reviewers based on your changes. They will get notified and pick up your PR soon.

⚠️ Only proceed to the next step once all reviewers have approved, merge conflicts are resolved, and CI is passing.
Final Review might get declined if these requirements are not fulfilled.

(Step 3): Final Review

  1. Add Final Review label
  2. GitHub auto-assigns final reviewers based on your changes. They will get notified and pick up your PR soon.

(Optional Step 4): Cherry-pick into release branch

If this PR also needs to be merged into core_r* release branches, after this PR has been merged, select Cherry-pick to open a new PR into the release branch.

For MRs into `dev` branch

The proposed review process for `dev` branch is under active discussion.

MRs are mergeable after one approval by either eharper@nvidia.com or zijiey@nvidia.com.

Merging your PR

Any member of core-adlr and core-nemo will be able to merge your PR.

@copy-pr-bot

copy-pr-bot Bot commented Dec 12, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@yuzhongw-nvidia yuzhongw-nvidia force-pushed the gdn_thd branch 2 times, most recently from 2575c6d to 4f8888d Compare December 15, 2025 03:45
@yuzhongw-nvidia yuzhongw-nvidia force-pushed the gdn_thd branch 2 times, most recently from 73d512d to ae8806c Compare January 21, 2026 11:04
@yuzhongw-nvidia yuzhongw-nvidia force-pushed the gdn_thd branch 2 times, most recently from 58fdd22 to e8ed23c Compare January 28, 2026 13:43
@yuzhongw-nvidia yuzhongw-nvidia force-pushed the gdn_thd branch 3 times, most recently from 8ae8e94 to cebd475 Compare April 7, 2026 04:55
@yuzhongw-nvidia

/ok to test 9b4c8ad

@svcnvidia-nemo-ci svcnvidia-nemo-ci added the Approved label (All necessary approvals have been made) and removed the Final Review label (PR is in the "final review" stage) May 14, 2026
@asolergi-nv asolergi-nv enabled auto-merge May 14, 2026 07:51
@asolergi-nv asolergi-nv added this pull request to the merge queue May 14, 2026
@svcnvidia-nemo-ci

🔄 Merge queue validation started!

You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/25848663624

Merged via the queue into NVIDIA:main with commit 2d1fa8d May 14, 2026
69 of 72 checks passed
Wohox added a commit to Wohox/Megatron-LM that referenced this pull request May 14, 2026
…apping_main

Conflict resolution + porting upstream deltas to relocated code:

- gpt/fine_grained_callables.py: keep slim HEAD; PreProcess / PostProcess /
  TransformerLayerNode / _BackwardDWWrapper and build_mtp_layer_callables /
  build_layer_callables now live under common/.
- common/utils.py: port PR NVIDIA#4511 (remove dead manual_release_grads code
  path) into TransformerLayerNode.backward_impl / backward_dw.
- common/fine_grained_callables.py: port PR NVIDIA#2645 (packed sequence GDN)
  into build_mtp_layer_callables — unpack the new 5-tuple from
  _get_embeddings and forward packed_seq_params / padding_mask.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
cspades pushed a commit to cspades/Megatron-LM that referenced this pull request May 14, 2026
…VIDIA#2645)

Signed-off-by: yuzhongw <yuzhongw@nvidia.com>
Co-authored-by: kunlunl <kunlunl@nvidia.com>
Co-authored-by: Xuesong Ye <xuesongyey@gmail.com>
Co-authored-by: Claude <noreply@anthropic.com>
Co-authored-by: claude[bot] <209825114+claude[bot]@users.noreply.github.com>
Co-authored-by: Daisy Gao <daisyg@nvidia.com>
copy-pr-bot Bot pushed a commit that referenced this pull request May 26, 2026
MultiTokenPredictionLayer.forward calls self._checkpointed_forward(
padding_mask=padding_mask, ...) (multi_token_prediction.py:1305), but
_checkpointed_forward and its inner custom_forward never accepted
padding_mask. With recompute_granularity == 'full' and self.training,
this raised:

    TypeError: MultiTokenPredictionLayer._checkpointed_forward() got
    an unexpected keyword argument 'padding_mask'

at multi_token_prediction.py:1301. The kwarg was introduced in #2645
on the call site; the _checkpointed_forward refactor in #4593 dropped
padding_mask from the recompute path.

Add padding_mask:
  * to _checkpointed_forward's signature
  * to custom_forward's signature so it flows into _proj_and_transformer_layer
  * positionally to te_checkpoint and tensor_parallel.checkpoint, matching the
    other tensor / None args (padding_mask is a rolled tensor, not a non-tensor
    closure-captured arg like attention_bias)
  * to the recompute_method == 'block' fallback that also calls
    _proj_and_transformer_layer directly

Also remove the @pytest.mark.flaky_in_dev markers from
test_forward_backward, test_fp8_support, and test_packed_sequences_with_full_recompute,
which were added in #4931 to mask this exact failure.

Closes #4933

Signed-off-by: oliver könig <okoenig@nvidia.com>
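
The failure mode described in the commit message above can be reproduced in miniature. This is a toy sketch in plain Python, not the Megatron code: `checkpoint`, `layer`, `BrokenLayer`, and `FixedLayer` are hypothetical stand-ins, and the convention that `True` marks a padded position is an assumption made only for the example.

```python
def checkpoint(fn, *args):
    """Stand-in for an activation-checkpoint wrapper: forwards positional args only."""
    return fn(*args)


def layer(hidden, padding_mask=None):
    """Stand-in for the wrapped layer: zero out positions flagged as padding."""
    if padding_mask is None:
        return list(hidden)
    # Assumption for this toy: True in padding_mask means "padded position".
    return [0 if is_pad else h for h, is_pad in zip(hidden, padding_mask)]


class BrokenLayer:
    def _checkpointed_forward(self, hidden):  # padding_mask missing from the signature
        def custom_forward(hidden):
            return layer(hidden)

        return checkpoint(custom_forward, hidden)

    def forward(self, hidden, padding_mask=None):
        # The call site passes a kwarg the recompute path does not accept.
        return self._checkpointed_forward(hidden, padding_mask=padding_mask)


class FixedLayer:
    def _checkpointed_forward(self, hidden, padding_mask=None):
        def custom_forward(hidden, padding_mask):
            return layer(hidden, padding_mask)

        # padding_mask travels positionally, like the other tensor / None args.
        return checkpoint(custom_forward, hidden, padding_mask)

    def forward(self, hidden, padding_mask=None):
        return self._checkpointed_forward(hidden, padding_mask=padding_mask)
```

Calling `BrokenLayer().forward(...)` raises a `TypeError` about the unexpected `padding_mask` keyword, while `FixedLayer` threads the mask through both the outer signature and the inner closure.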
BestJuly pushed a commit to BestJuly/Megatron-LM that referenced this pull request May 26, 2026
janEbert pushed a commit to janEbert/Megatron-LM that referenced this pull request Jun 2, 2026
Zhichenzzz added a commit to radixark/Megatron-LM that referenced this pull request Jun 8, 2026
Megatron passes a PackedSeqParams object to every attention module even in
BSHD mode (qkv_format="bshd"), but GDN.forward blanket-rejected any non-None
packed_seq_params, crashing Qwen3.5/3.6 GDN training/inference in BSHD. Guard
on qkv_format=="thd" specifically, matching upstream NVIDIA/Megatron-LM PR NVIDIA#2645.
Genuine THD packing still raises (full support is a follow-up port).

Fixes radixark/miles#1292
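
The guard described in the commit message above can be sketched as follows. This is an illustration under stated assumptions, not the actual patch: `PackedSeqParamsStub` is a hypothetical stand-in that models only the `qkv_format` field, and `check_packed_seq_params` is a hypothetical helper name.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PackedSeqParamsStub:
    """Hypothetical stand-in for PackedSeqParams; only qkv_format is modeled."""

    qkv_format: str = "bshd"


def check_packed_seq_params(packed_seq_params: Optional[PackedSeqParamsStub]) -> None:
    """Reject only genuine THD packing instead of any non-None params object."""
    if packed_seq_params is not None and packed_seq_params.qkv_format == "thd":
        raise NotImplementedError("THD packed sequences are not supported here")


# BSHD params and None both pass; only THD raises.
check_packed_seq_params(None)
check_packed_seq_params(PackedSeqParamsStub(qkv_format="bshd"))
```

Checking the format rather than the mere presence of the object keeps BSHD runs working even though a params object is always passed in.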
erictang000 added a commit to NovaSky-AI/SkyRL that referenced this pull request Jun 9, 2026
…ron-bridge to latest (#1762)

This PR makes the following bumps:

megatron_core from `cefc2520158c7ceba3f9adbe4b547a6f7a118da1` (latest
dev branch as of 6/8/26) to `71e418ea7d7b3a6c9a53238c543c3e0b43e11026`
(latest main branch as of 6/8/26).

megatron-bridge from `8382dc343b07b068a827ca20bae860633df3baa0` to
`91a15142a4b4442a8d46ab539d1b923bd08570d0` (latest main 6/8)

Megatron-Bridge has upstreamed code to the main branch that isn't on the
dev branch that is needed to use Megatron-Bridge
(NVIDIA-NeMo/Megatron-Bridge#3988)

Since sequence packing with GDN is now supported on main, we can move
back over to the latest commit on the main branch:
NVIDIA/Megatron-LM#2645

Labels

Approved (All necessary approvals have been made), complexity: medium

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Question] Enable GDN Packed Sequence Support for Context Parallelism in Qwen 3.5 series

9 participants