[dev] [follow-up] Qwen3.5 support: MoE aux loss padding_mask by wplf · Pull Request #4776 · NVIDIA/Megatron-LM

wplf · 2026-05-13T07:14:45Z

Qwen3.5 support series

This is a follow-up to the 5-PR series adding Qwen3.5-VL support; it lands on top of #4751 (the example), not the core changes.

Dev PRs:

[1/5] MTP packed-seq CP+THD fix — [Dev] fix(mtp): use padded cu_seqlens in MTP roll for THD with CP #4494 (BestJuly)
[2/5] FSDP DTensor Bridge ckpt compat — [dev] [2/5] Qwen3.5 support: FSDP DTensor Bridge checkpoint compatibility #4748
[3/5] SharedExpertMLP meta init — [dev] [3/5] Qwen3.5 support: SharedExpertMLP meta init #4749
[4/5] Interleaved MRoPE layout — [dev] [4/5] Qwen3.5 support: Interleaved MRoPE layout #4750
[5/5] Qwen3.5-VL training example — [dev] [5/5] Qwen3.5 support: Qwen3.5-VL training example #4751
[follow-up] MoE aux loss padding_mask — this PR

Main mirror: opened separately against main as a sibling.

Why

In examples/multimodal_dev/models/base.py, MultimodalModel.forward() was calling self.language_model(...) without a padding_mask argument. GPTModel.forward() documents padding_mask as "Only used for MoE layers to exclude padding tokens from routing computations." With padding_mask=None the router skips three masking sites that affect MoE numerics:

| Router site (`megatron/core/transformer/moe/router.py`) | Effect when `padding_mask=None` |
| --- | --- |
| `apply_z_loss(logits, padding_mask=None)` (line 526) | z-loss averages over all tokens including padding |
| `_apply_aux_loss(..., with_padding_mask=padding_mask is not None)` (line 736) | aux load-balancing loss computed over all tokens |
| `_apply_expert_bias(routing_map, padding_mask=None)` (line 604) | expert-bias EMA accumulates padded-token routing |
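
To make the first row concrete, here is a minimal sketch of how a padding mask changes a z-loss average. It is not the code in `router.py`: the `z_loss` and `logsumexp` helpers below are hypothetical, plain Python lists stand in for tensors, and the only thing assumed about z-loss is its usual definition as the mean of the squared log-sum-exp of the router logits.

```python
import math


def logsumexp(row):
    """Numerically stable log(sum(exp(x))) for one token's router logits."""
    m = max(row)
    return m + math.log(sum(math.exp(x - m) for x in row))


def z_loss(logits, padding_mask=None, coeff=1.0):
    """Mean of logsumexp(logits)^2 over tokens (hypothetical helper).

    When padding_mask is given, positions marked True are skipped, so the
    average is taken over real tokens only.
    """
    if padding_mask is None:
        padding_mask = [False] * len(logits)
    terms = [logsumexp(row) ** 2 for row, pad in zip(logits, padding_mask) if not pad]
    return coeff * sum(terms) / max(len(terms), 1)


# Two real tokens followed by two padded slots whose logits carry no signal.
logits = [[0.0, 0.0], [0.0, 0.0], [5.0, 5.0], [5.0, 5.0]]
padding_mask = [False, False, True, True]

unmasked = z_loss(logits)             # padded slots pull the average up
masked = z_loss(logits, padding_mask)  # real tokens only
print(f"unmasked={unmasked:.4f} masked={masked:.4f}")
```

With the mask applied, the two padded slots no longer contribute, so the loss reflects only the real tokens.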

Both data paths in pack_or_pad_batch (forward_step.py) introduce padding:

- BSHD (lines 272-275): every sample is padded to target_seqlens with input_ids=0, labels=-100, loss_mask=0.
- THD packed (lines 224-227): each sample is padded so its length is a multiple of divisible_by, so cu_seqlens_q_padded differs from cu_seqlens_q.
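
A small sketch of the THD case, assuming nothing about the real `pack_or_pad_batch` beyond the description above: the helper names `pad_to_multiple` and `cumulative` are hypothetical, and the sample lengths are made up for illustration.

```python
def pad_to_multiple(length, divisible_by):
    """Round length up to the next multiple of divisible_by."""
    return -(-length // divisible_by) * divisible_by


def cumulative(lengths):
    """Cumulative sequence boundaries, starting at 0."""
    out = [0]
    for n in lengths:
        out.append(out[-1] + n)
    return out


real_lens = [5, 3, 8]  # illustrative sample lengths
divisible_by = 4

padded_lens = [pad_to_multiple(n, divisible_by) for n in real_lens]
cu_seqlens_q = cumulative(real_lens)
cu_seqlens_q_padded = cumulative(padded_lens)

# Slots that exist only because of the padder.
num_padded = cu_seqlens_q_padded[-1] - cu_seqlens_q[-1]
print(cu_seqlens_q, cu_seqlens_q_padded, num_padded)
```

In this example 4 of the 20 packed positions are padder-added, and without a mask the router treats them like any other token.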

For MoE variants (proxy, 35b_a3b, 35b_a3b_light, 122b_a10b, 397b_a17b) this means the load-balancing signal is diluted by padded positions whose router logits don't reflect any real token.

What this PR does

Build padding_mask at collate time in both branches of pack_or_pad_batch:
- BSHD: [B, target_seqlens] bool, True past each sample's real length.
- THD: [1, T] bool, True between cu_seqlens_padded[i] + real_len[i] and cu_seqlens_padded[i+1].
Thread it through forward_step → MultimodalModel.forward → _cp_split_for_forward → language_model.forward, mirroring how loss_mask is handled. The CP split uses the same BSHD zigzag / THD tex.thd_get_partitioned_indices index as the other tensors.
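
The two mask layouts described above can be sketched as follows. This is an illustration under stated assumptions, not the PR's implementation: `bshd_padding_mask` and `thd_padding_mask` are hypothetical helpers, and nested Python lists stand in for bool tensors of shape `[B, target_seqlens]` and `[1, T]`.

```python
def bshd_padding_mask(real_lens, target_seqlen):
    """[B, target_seqlen] mask: True past each sample's real length."""
    return [[pos >= n for pos in range(target_seqlen)] for n in real_lens]


def thd_padding_mask(real_lens, cu_seqlens_padded):
    """[1, T] mask: True between cu_seqlens_padded[i] + real_lens[i]
    and cu_seqlens_padded[i + 1]."""
    total = cu_seqlens_padded[-1]
    mask = [False] * total
    for i, n in enumerate(real_lens):
        for pos in range(cu_seqlens_padded[i] + n, cu_seqlens_padded[i + 1]):
            mask[pos] = True
    return [mask]


# BSHD: two samples of real length 2 and 4, padded to 4.
bshd = bshd_padding_mask([2, 4], target_seqlen=4)

# THD: two samples of real length 5 and 3, packed with padded boundaries 0, 8, 12.
thd = thd_padding_mask([5, 3], [0, 8, 12])
padded_positions = [i for i, is_pad in enumerate(thd[0]) if is_pad]
print(bshd, padded_positions)
```

Because the mask has the same sequence layout as `loss_mask`, it can be sliced by whatever index the context-parallel split already applies to the other per-token tensors.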

Why not derive from `loss_mask` or `labels == -100`

For SFT data, prompt tokens carry loss_mask=0 / label=-100 but are real tokens that should still participate in routing. Folding them into padding_mask would under-route real activations — a different bug. The collate-time mask only marks tokens added by the padder.
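
A toy example of the difference, with made-up values: one SFT sample with three prompt tokens, two response tokens, and three padder slots. The variable names are illustrative only.

```python
# 3 prompt tokens (loss_mask=0), 2 response tokens (loss_mask=1), 3 padder slots.
loss_mask = [0, 0, 0, 1, 1, 0, 0, 0]
padding_mask = [False] * 5 + [True] * 3  # collate-time mask: padder slots only

# What a mask derived from loss_mask would look like.
derived_from_loss_mask = [m == 0 for m in loss_mask]

# Real tokens that the derived mask would wrongly drop from routing statistics.
wrongly_excluded = [
    i
    for i, (derived, is_pad) in enumerate(zip(derived_from_loss_mask, padding_mask))
    if derived and not is_pad
]
print(wrongly_excluded)
```

The derived mask would exclude the three prompt positions even though they are real tokens, which is the under-routing problem described above.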

Dependency

Depends on #4751 (PR-5: the Qwen3.5-VL example) — MultimodalModel and pack_or_pad_batch are introduced there. The diff vs dev therefore shows PR-5 + this fix; reviewers should compare against #4751's tip for the isolated padding_mask delta.

Risk

Dense variants unaffected — padding_mask is only consumed inside MoE layers.
_cp_split_for_forward and MultimodalModel.forward gain one optional kwarg with None default; existing callers unaffected.

🤖 Generated with Claude Code

copy-pr-bot · 2026-05-13T07:14:49Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

…ecoder `MultimodalModel.forward()` was calling `self.language_model(...)` without a `padding_mask` argument. With `padding_mask=None`, the language decoder's MoE router skips three masking sites that document themselves as MoE-only:

- `apply_z_loss(logits, padding_mask=None)`: z-loss averages over all tokens including collate padding.
- `_apply_aux_loss(..., with_padding_mask=padding_mask is not None)`: aux load-balancing loss is computed over all tokens, so padded positions dilute the signal and bias balancing toward whatever the model emits for input_id=0 at those slots.
- `_apply_expert_bias(routing_map, padding_mask=None)`: expert-bias EMA accumulates routing statistics from padded tokens.

Both code paths in `pack_or_pad_batch` introduce padded positions: BSHD pads each sample to `target_seqlens` with input_id=0 / label=-100 / loss_mask=0; THD pads each sample's length to a multiple of `divisible_by` so `cu_seqlens_q_padded` differs from `cu_seqlens_q`.

Fix:

- Build `padding_mask` at collate time in both branches (BSHD: ``[B, target_seqlens]``; THD: ``[1, T]``). True marks collate-padded positions only, distinct from `loss_mask`, so SFT prompt tokens (which carry `loss_mask=0` but are real tokens) still participate in routing.
- Thread `padding_mask` through `forward_step` → `MultimodalModel.forward` → `_cp_split_for_forward` → `GPTModel.forward`, mirroring how `loss_mask` is handled. CP split uses the same BSHD zigzag / THD `tex.thd_get_partitioned_indices` index as the other tensors.

Dense Qwen3.5-VL variants are unaffected: `padding_mask` is only consumed inside MoE layers.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: BestJuly <19769279+BestJuly@users.noreply.github.com>

wplf · 2026-05-29T12:55:09Z

/ok to test 90fe01f

svcnvidia-nemo-ci · 2026-06-04T05:59:26Z

🔄 Merge queue validation started!

You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/26933866397

svcnvidia-nemo-ci · 2026-06-04T06:26:13Z

🔄 Merge queue validation started!

You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/26934866199

wplf added the Run tests label May 13, 2026

wplf mentioned this pull request May 13, 2026

[main] [follow-up] Qwen3.5 support: MoE aux loss padding_mask #4777

Open

wplf force-pushed the fix/moe-padding-mask branch from 5418b65 to af7d670 Compare May 13, 2026 10:24

wplf force-pushed the fix/moe-padding-mask branch from af7d670 to 90fe01f Compare May 29, 2026 10:28

wplf marked this pull request as ready for review May 29, 2026 10:32

wplf requested review from a team as code owners May 29, 2026 10:32

svcnvidia-nemo-ci added the complexity: low label May 29, 2026

copy-pr-bot Bot temporarily deployed to test May 29, 2026 12:56 Inactive

yaox12 approved these changes Jun 4, 2026

yaox12 added this pull request to the merge queue Jun 4, 2026

Merged via the queue into NVIDIA:dev with commit 05a93e0 Jun 4, 2026
183 of 184 checks passed

Victarry mentioned this pull request Jun 10, 2026

[ROADMAP][2026 Q2] Megatron Core MoE Roadmap #4815

Open

71 tasks

yaox12 merged 1 commit into NVIDIA:dev from wplf:fix/moe-padding-mask
