Summary
The radixark/Megatron-LM fork (miles-main) rejects any non-None packed_seq_params in GatedDeltaNet.forward, but Megatron passes a PackedSeqParams object to all attention modules uniformly, even in BSHD mode. This crashes GDN (Gated Delta Net) training and inference for Qwen3.5/3.6 with qkv_format="bshd". Upstream NVIDIA/Megatron-LM has already fixed this (PR #2645, merged 2026-05-14); our fork missed it because its last upstream sync predates that fix.
Symptom
NotImplementedError: GDN does not support packed sequence for now.
File ".../megatron/core/ssm/gated_delta_net.py", line 300, in forward
Root cause
The fork's megatron/core/ssm/gated_delta_net.py (at pinned commit 23924a0b) has a blanket rejection:
    if packed_seq_params is not None:
        # TODO: support packed sequence
        raise NotImplementedError("GDN does not support packed sequence for now.")
In BSHD mode (qkv_format="bshd"), Megatron still passes a PackedSeqParams object to every attention module uniformly. GDN doesn't actually need it in BSHD mode, but the blanket "is not None" check raises anyway.
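The failure mode can be reproduced without Megatron installed. The sketch below is a minimal stand-in, not the fork's code: FakePackedSeqParams, forward_with_blanket_guard and try_forward are hypothetical names, and only the guard condition and the error message are taken from the snippet above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FakePackedSeqParams:
    """Hypothetical stand-in for PackedSeqParams; only qkv_format is modelled."""

    qkv_format: str = "bshd"


def forward_with_blanket_guard(packed_seq_params: Optional[FakePackedSeqParams]) -> str:
    """Mimics the fork's guard: any non-None params object is rejected."""
    if packed_seq_params is not None:
        raise NotImplementedError("GDN does not support packed sequence for now.")
    return "ok"


def try_forward(packed_seq_params: Optional[FakePackedSeqParams]) -> str:
    """Return 'ok', or the name of the exception the guard raised."""
    try:
        return forward_with_blanket_guard(packed_seq_params)
    except NotImplementedError as exc:
        return type(exc).__name__
```

Calling try_forward(FakePackedSeqParams("bshd")) hits the raise even though nothing is actually packed, which matches the symptom above; only try_forward(None) gets through.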
Upstream already fixed this — it's a sync gap
NVIDIA/Megatron-LM main now guards on the THD format specifically and implements packed-sequence support for GDN:
    if packed_seq_params is not None and packed_seq_params.qkv_format == 'thd':
        ...  # actual cu_seqlens handling
In BSHD mode it simply proceeds. This landed in:
- NVIDIA/Megatron-LM PR #2645 "[main] feat(moe): Support packed sequence for gated delta net (GDN)", merged 2026-05-14.
The fork's last upstream sync was commit 038e8e5a "Upgrade Megatron from Dec 17 to Feb 13" (2026-03-04, syncing to upstream's ~Feb-13 state). PR #2645 (May 14) postdates that, so the fork never received it.
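To make the behavioural difference concrete, here is a small sketch of the format-specific condition quoted from upstream. It is an illustration under assumptions, not upstream's implementation: FakePackedSeqParams and takes_packed_path are hypothetical names, and the real code goes on to handle cu_seqlens where this sketch only returns a flag.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FakePackedSeqParams:
    """Hypothetical stand-in for PackedSeqParams; only qkv_format is modelled."""

    qkv_format: str = "bshd"


def takes_packed_path(packed_seq_params: Optional[FakePackedSeqParams]) -> bool:
    """True only when the params object is present and describes THD input."""
    return packed_seq_params is not None and packed_seq_params.qkv_format == "thd"


# A BSHD params object is present but ignored; only THD selects the packed path.
bshd_result = takes_packed_path(FakePackedSeqParams("bshd"))
thd_result = takes_packed_path(FakePackedSeqParams("thd"))
none_result = takes_packed_path(None)
```

With this condition a PackedSeqParams object in BSHD mode is treated the same as None, so the uniform call convention no longer trips GDN.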
Proposed fix
Re-sync radixark/Megatron-LM to an upstream commit that includes PR #2645 (≥ 2026-05-14). This brings in the proper guard (and real THD packed-seq support for GDN).
This is the same incomplete-sync class of bug as the MTP naming mismatch (see #1289). Both would be resolved by a proper re-sync of the fork.
Affected versions
- megatron-core 0.16.0rc0 @ radixark/Megatron-LM commit 23924a0b
References
- transformer_layer but megatron-bridge expects mtp_model_layer (breaks Qwen3.6-27B GDN weight conversion) #1289
- TE FP8 dist-checkpoint load crashes on BytesIO extra_state (len() TypeError) - not fixed upstream either #1293