GDN rejects packed_seq_params in BSHD mode — upstream fixed in Megatron-LM PR #2645, missed by fork sync #1292

@WindowsXp-Beta

Summary

The radixark/Megatron-LM fork (miles-main) rejects any non-None packed_seq_params in GatedDeltaNet.forward, but Megatron passes a PackedSeqParams object to all attention modules uniformly, even in BSHD mode. This crashes GDN (Gated Delta Net) training and inference for Qwen3.5/3.6 with qkv_format="bshd". Upstream NVIDIA/Megatron-LM has already fixed this (PR #2645, merged 2026-05-14); our fork missed it because its last upstream sync predates that fix.

Symptom

NotImplementedError: GDN does not support packed sequence for now.
  File ".../megatron/core/ssm/gated_delta_net.py", line 300, in forward

Root cause

The fork's megatron/core/ssm/gated_delta_net.py (at pinned commit 23924a0b) has a blanket rejection:

if packed_seq_params is not None:
    # TODO: support packed sequence
    raise NotImplementedError("GDN does not support packed sequence for now.")

In BSHD mode (qkv_format="bshd"), Megatron still passes a PackedSeqParams object to every attention module uniformly. GDN doesn't actually need it in BSHD mode, but the blanket "is not None" check raises anyway.
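The failure mode can be reproduced in isolation with a toy model. This is only a sketch: StandInPackedSeqParams, blanket_guard_forward and run_layers are hypothetical stand-ins written for illustration, not Megatron code, and the only thing taken from the fork is the check itself and its error message.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StandInPackedSeqParams:
    """Hypothetical stand-in for Megatron's PackedSeqParams (one field only)."""
    qkv_format: str = "bshd"


def blanket_guard_forward(hidden_states, packed_seq_params: Optional[StandInPackedSeqParams] = None):
    """Mimics the fork's check: any non-None params object is rejected."""
    if packed_seq_params is not None:
        raise NotImplementedError("GDN does not support packed sequence for now.")
    return hidden_states


def run_layers(layers, hidden_states, packed_seq_params):
    """Toy caller that hands the same params object to every layer."""
    for layer in layers:
        hidden_states = layer(hidden_states, packed_seq_params=packed_seq_params)
    return hidden_states


def reproduces_crash() -> bool:
    """True if a BSHD params object (nothing actually packed) trips the guard."""
    params = StandInPackedSeqParams(qkv_format="bshd")
    try:
        run_layers([blanket_guard_forward], [1.0, 2.0], params)
    except NotImplementedError:
        return True
    return False
```

Passing None instead of a params object goes through untouched, which is why the crash only shows up when the caller supplies the object unconditionally.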

Upstream already fixed this — it's a sync gap

NVIDIA/Megatron-LM main now guards on the THD format specifically and implements packed-sequence support for GDN:

if packed_seq_params is not None and packed_seq_params.qkv_format == 'thd':
    ... # actual cu_seqlens handling

In BSHD mode it simply proceeds. This landed in:

  • NVIDIA/Megatron-LM PR #2645 "[main] feat(moe): Support packed sequence for gated delta net (GDN)", merged 2026-05-14.
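The effect of the format-aware guard can be sketched as a small dispatch function. This is an illustration under assumptions, not the upstream implementation: StandInPackedSeqParams and select_gdn_path are hypothetical names, and the real THD branch does actual cu_seqlens handling rather than returning a label.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class StandInPackedSeqParams:
    """Hypothetical stand-in for Megatron's PackedSeqParams."""
    qkv_format: str = "bshd"
    cu_seqlens_q: Optional[Sequence[int]] = None


def select_gdn_path(packed_seq_params: Optional[StandInPackedSeqParams]) -> str:
    """Return which code path a format-aware guard would take."""
    # Same condition as the upstream guard quoted above.
    if packed_seq_params is not None and packed_seq_params.qkv_format == "thd":
        return "packed"  # THD: sequences are concatenated, cu_seqlens are needed
    return "dense"       # None or BSHD: proceed as a regular batched forward
```

With this condition, a BSHD params object and None are treated identically, so the uniform call convention no longer matters to GDN.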

The fork's last upstream sync was commit 038e8e5a "Upgrade Megatron from Dec 17 to Feb 13" (2026-03-04, syncing to upstream's ~Feb-13 state). PR #2645 (May 14) postdates that, so the fork never received it.

Proposed fix

Re-sync radixark/Megatron-LM to an upstream commit that includes PR #2645, i.e. upstream main as of 2026-05-14 or later. This brings in the proper guard (and real THD packed-seq support for GDN).
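To check whether a given checkout still carries the old guard, one option is to look for the error message in the module source. This is a rough sketch: has_blanket_rejection and installed_gdn_is_stale are hypothetical helpers, and the check assumes the fixed file no longer contains this exact message, which is not confirmed here.

```python
import importlib
import inspect
from typing import Optional

# Error message raised by the fork's blanket check (copied from the traceback).
STALE_MESSAGE = "GDN does not support packed sequence for now."


def has_blanket_rejection(source_text: str) -> bool:
    """True if the given source text still contains the old error message."""
    return STALE_MESSAGE in source_text


def installed_gdn_is_stale() -> Optional[bool]:
    """Inspect the installed module; None means it could not be imported."""
    try:
        module = importlib.import_module("megatron.core.ssm.gated_delta_net")
    except ImportError:
        return None
    return has_blanket_rejection(inspect.getsource(module))
```

Running installed_gdn_is_stale() before and after the re-sync gives a quick signal, but a string match is only a heuristic; reading the guard in forward is the reliable check.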

This is the same incomplete-sync class of bug as the MTP naming mismatch (see #1289). Both would be resolved by a proper re-sync of the fork.

Affected versions

  • megatron-core 0.16.0rc0 @ radixark/Megatron-LM commit 23924a0b
