[main] feat(moe): Support packed sequence for gated delta net (GDN) #2645
Merged
/ok to test 9b4c8ad
ericharper approved these changes on May 12, 2026
jaredcasper approved these changes on May 14, 2026
🔄 Merge queue validation started! You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/25848663624
Wohox added a commit to Wohox/Megatron-LM that referenced this pull request on May 14, 2026

…apping_main

Conflict resolution + porting upstream deltas to relocated code:
- gpt/fine_grained_callables.py: keep slim HEAD; PreProcess / PostProcess / TransformerLayerNode / _BackwardDWWrapper and build_mtp_layer_callables / build_layer_callables now live under common/.
- common/utils.py: port PR NVIDIA#4511 (remove dead manual_release_grads code path) into TransformerLayerNode.backward_impl / backward_dw.
- common/fine_grained_callables.py: port PR NVIDIA#2645 (packed sequence GDN) into build_mtp_layer_callables; unpack the new 5-tuple from _get_embeddings and forward packed_seq_params / padding_mask.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
cspades pushed a commit to cspades/Megatron-LM that referenced this pull request on May 14, 2026

…VIDIA#2645)

Signed-off-by: yuzhongw <yuzhongw@nvidia.com>
Co-authored-by: kunlunl <kunlunl@nvidia.com>
Co-authored-by: Xuesong Ye <xuesongyey@gmail.com>
Co-authored-by: Claude <noreply@anthropic.com>
Co-authored-by: claude[bot] <209825114+claude[bot]@users.noreply.github.com>
Co-authored-by: Daisy Gao <daisyg@nvidia.com>
copy-pr-bot Bot pushed a commit that referenced this pull request on May 26, 2026

MultiTokenPredictionLayer.forward calls self._checkpointed_forward(
padding_mask=padding_mask, ...) (multi_token_prediction.py:1305), but
_checkpointed_forward and its inner custom_forward never accepted
padding_mask. With recompute_granularity == 'full' and self.training,
this raised:

    TypeError: MultiTokenPredictionLayer._checkpointed_forward() got
    an unexpected keyword argument 'padding_mask'

at multi_token_prediction.py:1301. The kwarg was introduced in #2645
on the call site; the _checkpointed_forward refactor in #4593 dropped
padding_mask from the recompute path.

Add padding_mask:
* to _checkpointed_forward's signature
* to custom_forward's signature so it flows into _proj_and_transformer_layer
* positionally to te_checkpoint and tensor_parallel.checkpoint, matching the
  other tensor / None args (padding_mask is a rolled tensor, not a non-tensor
  closure-captured arg like attention_bias)
* to the recompute_method == 'block' fallback that also calls
  _proj_and_transformer_layer directly

Also remove the @pytest.mark.flaky_in_dev markers from
test_forward_backward, test_fp8_support, and test_packed_sequences_with_full_recompute,
which were added in #4931 to mask this exact failure.

Closes #4933

Signed-off-by: oliver könig <okoenig@nvidia.com>
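
The failure mode described in this commit message can be reproduced in miniature. The sketch below is only an illustration under stated assumptions: `toy_checkpoint`, `proj_and_layer`, `call_site`, and both `checkpointed_forward_*` functions are hypothetical stand-ins written for this example, not Megatron-LM code, and `toy_checkpoint` merely forwards positional arguments the way an activation-checkpoint helper is assumed to.

```python
# Toy stand-ins only: none of these are Megatron-LM's real implementations.
def toy_checkpoint(fn, *args):
    """Stand-in for a checkpoint helper: forwards positional args only."""
    return fn(*args)


def proj_and_layer(hidden_states, attention_mask, padding_mask):
    """Hypothetical inner computation that zeroes out padded positions."""
    masked = padding_mask if padding_mask is not None else [False] * len(hidden_states)
    return [0.0 if m else h for h, m in zip(hidden_states, masked)]


def checkpointed_forward_broken(hidden_states, attention_mask=None):
    # padding_mask is missing from the signature, so a call site that
    # passes padding_mask=... fails before any computation happens.
    def custom_forward(hidden_states, attention_mask):
        return proj_and_layer(hidden_states, attention_mask, None)

    return toy_checkpoint(custom_forward, hidden_states, attention_mask)


def checkpointed_forward_fixed(hidden_states, attention_mask=None, padding_mask=None):
    # padding_mask is accepted by the wrapper and by custom_forward, and is
    # handed to the checkpoint helper positionally with the other args.
    def custom_forward(hidden_states, attention_mask, padding_mask):
        return proj_and_layer(hidden_states, attention_mask, padding_mask)

    return toy_checkpoint(custom_forward, hidden_states, attention_mask, padding_mask)


def call_site(forward_fn):
    """Mimics a caller that always passes padding_mask as a keyword."""
    return forward_fn([1.0, 2.0, 3.0], attention_mask=None,
                      padding_mask=[False, False, True])


try:
    call_site(checkpointed_forward_broken)
    broken_error = None
except TypeError as exc:
    broken_error = str(exc)

fixed_output = call_site(checkpointed_forward_fixed)
```

In this toy version, `broken_error` holds the "unexpected keyword argument" message, while the fixed variant masks the last position. The real fix additionally covers the `recompute_method == 'block'` fallback, which this sketch does not model.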
BestJuly pushed a commit to BestJuly/Megatron-LM that referenced this pull request on May 26, 2026
janEbert pushed a commit to janEbert/Megatron-LM that referenced this pull request on Jun 2, 2026

…VIDIA#2645)

Signed-off-by: yuzhongw <yuzhongw@nvidia.com>
Co-authored-by: kunlunl <kunlunl@nvidia.com>
Co-authored-by: Xuesong Ye <xuesongyey@gmail.com>
Co-authored-by: Claude <noreply@anthropic.com>
Co-authored-by: claude[bot] <209825114+claude[bot]@users.noreply.github.com>
Co-authored-by: Daisy Gao <daisyg@nvidia.com>
Zhichenzzz added a commit to radixark/Megatron-LM that referenced this pull request on Jun 8, 2026
Megatron passes a PackedSeqParams object to every attention module even in BSHD mode (qkv_format="bshd"), but GDN.forward blanket-rejected any non-None packed_seq_params, crashing Qwen3.5/3.6 GDN training/inference in BSHD. Guard on qkv_format=="thd" specifically, matching upstream NVIDIA/Megatron-LM PR NVIDIA#2645. Genuine THD packing still raises (full support is a follow-up port). Fixes radixark/miles#1292
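
A minimal sketch of the guard this commit describes, assuming only that the params object exposes a `qkv_format` string. `ToyPackedSeqParams`, `uses_thd_packing`, and `gdn_forward_guard` are hypothetical names invented for this illustration; the real `GDN.forward` and `PackedSeqParams` carry more state than is shown here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToyPackedSeqParams:
    """Hypothetical stand-in carrying only the field the guard inspects."""
    qkv_format: str = "bshd"


def uses_thd_packing(packed_seq_params: Optional[ToyPackedSeqParams]) -> bool:
    """Return True only for genuinely packed (THD) input."""
    return packed_seq_params is not None and packed_seq_params.qkv_format == "thd"


def gdn_forward_guard(packed_seq_params: Optional[ToyPackedSeqParams]) -> str:
    # A blanket `packed_seq_params is not None` check would also reject BSHD
    # callers; checking qkv_format lets them through and rejects only THD.
    if uses_thd_packing(packed_seq_params):
        raise NotImplementedError("THD packed sequences are not supported here yet")
    return "bshd path"
```

With this shape, `gdn_forward_guard(None)` and `gdn_forward_guard(ToyPackedSeqParams("bshd"))` both take the BSHD path, while `ToyPackedSeqParams("thd")` still raises, matching the behavior the commit message states for that fork.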
erictang000 added a commit to NovaSky-AI/SkyRL that referenced this pull request on Jun 9, 2026

…ron-bridge to latest (#1762)

This PR makes the following bumps:
- megatron_core from `cefc2520158c7ceba3f9adbe4b547a6f7a118da1` (latest dev branch as of 6/8/26) to `71e418ea7d7b3a6c9a53238c543c3e0b43e11026` (latest main branch as of 6/8/26)
- megatron-bridge from `8382dc343b07b068a827ca20bae860633df3baa0` to `91a15142a4b4442a8d46ab539d1b923bd08570d0` (latest main as of 6/8)

Megatron-Bridge has upstreamed code to the main branch that isn't on the dev branch and that is needed to use Megatron-Bridge (NVIDIA-NeMo/Megatron-Bridge#3988).

Since sequence packing with GDN is now supported on main, we can move back to the latest commit on the main branch: NVIDIA/Megatron-LM#2645
What does this PR do?
Support packed sequence for gated delta net (GDN).
PR for dev: #2644, #4230
Closes: #4043, #3798
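
For readers unfamiliar with why a recurrent layer needs explicit packed-sequence support, here is a toy sketch. It uses a simple scalar gated recurrence, s_t = g_t * s_{t-1} + x_t, as a deliberately simplified stand-in for the gated delta rule, and it assumes a `cu_seqlens` list of cumulative sequence lengths marks the boundaries in the packed token stream. All function names are hypothetical, and this is not the implementation in this PR.

```python
def gated_scan(values, gates, state=0.0):
    """Toy gated recurrence s_t = g_t * s_{t-1} + x_t (not the real delta rule)."""
    out = []
    for x, g in zip(values, gates):
        state = g * state + x
        out.append(state)
    return out


def packed_gated_scan(values, gates, cu_seqlens):
    """Scan a packed token stream, restarting the state at each boundary.

    cu_seqlens holds cumulative sequence lengths, e.g. [0, 3, 5] for two
    sequences of length 3 and 2 packed back to back.
    """
    out = []
    for start, end in zip(cu_seqlens[:-1], cu_seqlens[1:]):
        out.extend(gated_scan(values[start:end], gates[start:end]))
    return out


values = [1.0, 2.0, 3.0, 4.0, 5.0]
gates = [0.5, 0.5, 0.5, 0.5, 0.5]
cu_seqlens = [0, 3, 5]

# Boundary-aware scan over the packed stream.
packed = packed_gated_scan(values, gates, cu_seqlens)
# Reference: each sequence processed on its own, then concatenated.
separate = gated_scan(values[:3], gates[:3]) + gated_scan(values[3:], gates[3:])
# Naive scan that ignores boundaries, so state leaks into the second sequence.
leaky = gated_scan(values, gates)
```

Here `packed` equals `separate`, while `leaky` differs from the fourth token onward because the first sequence's state carries into the second. Preventing that kind of cross-sequence leakage is the general requirement any packed-sequence recurrent layer has to meet.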
Contribution process

flowchart LR
    A[Pre-checks] --> B[PR Tests]
    subgraph Code Review/Approval
        C1[Expert Review] --> C2[Final Review]
    end
    B --> C1
    C2 --> D[Merge]

Pre-checks

Core 0.8)

Code review

The following process is enforced via the CODEOWNERS file for changes into megatron/core. For changes outside of megatron/core, it is up to the PR author whether or not to tag the Final Reviewer team.

For MRs into `main` branch

(Step 1): Add PR label Expert Review

(Step 2): Collect the expert reviewers' reviews

Expert Review label when your PR is ready for review.

Final Review might get declined if these requirements are not fulfilled.

(Step 3): Final Review

Final Review label

(Optional Step 4): Cherry-pick into release branch

If this PR also needs to be merged into core_r* release branches, after this PR has been merged, select Cherry-pick to open a new PR into the release branch.

For MRs into `dev` branch

The proposed review process for `dev` branch is under active discussion.

MRs are mergeable after one approval by either eharper@nvidia.com or zijiey@nvidia.com.

Merging your PR

Any member of core-adlr and core-nemo will be able to merge your PR.