[dev] [3/5] Qwen3.5 support: SharedExpertMLP meta init by wplf · Pull Request #4749 · NVIDIA/Megatron-LM

wplf · 2026-05-12T06:51:48Z

Qwen3.5 support series

This is part of a 5-PR series adding Qwen3.5-VL support, split for review clarity.

Dev PRs (this series):

[1/5] MTP packed-seq CP+THD fix — [Dev] fix(mtp): use padded cu_seqlens in MTP roll for THD with CP #4494
[2/5] FSDP DTensor Bridge checkpoint compatibility — [dev] [2/5] Qwen3.5 support: FSDP DTensor Bridge checkpoint compatibility #4748
[3/5] SharedExpertMLP meta init — [dev] [3/5] Qwen3.5 support: SharedExpertMLP meta init #4749 ← this PR
[4/5] Interleaved MRoPE layout — [dev] [4/5] Qwen3.5 support: Interleaved MRoPE layout #4750
[5/5] Qwen3.5-VL training example — [dev] [5/5] Qwen3.5 support: Qwen3.5-VL training example #4751

Main PRs (corresponding mirrors):

Summary

Add _reset_parameters to SharedExpertMLP so the directly-owned gate_weight is materialized off the meta device when use_shared_expert_gate=True.

Why

Without this, meta-init leaves gate_weight on the meta device and the first forward fails with a meta-tensor error. Submodules already have their own _reset_parameters, so only the directly-owned gate_weight needs handling.

The implementation mirrors the standard per-parameter init pattern:

run init_method when config.perform_initialization
cast to config.params_dtype
set sequence_parallel attribute
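The three steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code merged in this PR: `ToyConfig`, `ToySharedExpert`, and `materialize` are hypothetical names, the normal initializer is an assumed placeholder for the config's `init_method`, and the `to_empty`-then-reset flow is only one plausible way a meta-init driver might call `_reset_parameters`.

```python
import torch


class ToyConfig:
    """Hypothetical stand-in holding only the fields the description names."""

    def __init__(self, perform_initialization=True, params_dtype=torch.float32,
                 sequence_parallel=False):
        self.perform_initialization = perform_initialization
        self.params_dtype = params_dtype
        self.sequence_parallel = sequence_parallel

    def init_method(self, weight):
        # Assumed initializer; a real config supplies its own callable.
        return torch.nn.init.normal_(weight, mean=0.0, std=0.02)


class ToySharedExpert(torch.nn.Module):
    """Hypothetical module with one directly owned gate weight."""

    def __init__(self, config, hidden_size, device=None):
        super().__init__()
        self.config = config
        self.gate_weight = torch.nn.Parameter(torch.empty(1, hidden_size, device=device))

    def _reset_parameters(self):
        if self.gate_weight is None:
            return
        # Step 1: run init_method when config.perform_initialization is set.
        if self.config.perform_initialization:
            self.config.init_method(self.gate_weight)
        # Step 2: cast to config.params_dtype.
        self.gate_weight.data = self.gate_weight.data.to(dtype=self.config.params_dtype)
        # Step 3: set the sequence_parallel attribute on the parameter.
        setattr(self.gate_weight, "sequence_parallel", self.config.sequence_parallel)


def materialize(module, device="cpu"):
    """Assumed driver: allocate real storage, then let each module re-initialize."""
    module.to_empty(device=device)
    for sub in module.modules():
        if hasattr(sub, "_reset_parameters"):
            sub._reset_parameters()
    return module


config = ToyConfig(params_dtype=torch.bfloat16)
expert = materialize(ToySharedExpert(config, hidden_size=8, device="meta"))
print(expert.gate_weight.device, expert.gate_weight.dtype)
```

Note that `to_empty` only allocates uninitialized storage, which is why the reset step is needed before the weight can be used.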

Risk

The new _reset_parameters only touches gate_weight when use_shared_expert_gate=True and gate_weight is not None, so existing paths are unaffected.

Test plan

Meta-init a config with moe_shared_expert_overlap and use_shared_expert_gate=True; verify forward succeeds.
Existing shared-expert tests still pass.

Notes

Extracted from a larger Qwen3.5-VL development branch to keep this fix reviewable on its own.

🤖 Generated with Claude Code

copy-pr-bot · 2026-05-12T06:51:52Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

wplf · 2026-05-13T07:04:26Z

/ok to test 794656e

Add `_reset_parameters` to `SharedExpertMLP` so the directly-owned `gate_weight` is materialized off the meta device when `use_shared_expert_gate=True`. Without this, meta-init leaves `gate_weight` on the meta device and forward fails. Mirrors the per-parameter init pattern already used in other Megatron modules (run `init_method` if `perform_initialization`, cast to `params_dtype`, set `sequence_parallel`).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: BestJuly <19769279+BestJuly@users.noreply.github.com>

wplf · 2026-05-13T10:48:23Z

/ok to test 3b6fba6

svcnvidia-nemo-ci · 2026-05-14T01:48:41Z

🔄 Merge queue validation started!

You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/25836851225

wplf · 2026-05-14T06:39:32Z

/ok to test ba24d0b

svcnvidia-nemo-ci · 2026-05-14T08:39:29Z

🔄 Merge queue validation started!

You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/25850553708

wplf · 2026-05-15T02:42:09Z

/ok to test d06da0e

svcnvidia-nemo-ci · 2026-05-15T08:44:36Z

🔄 Merge queue validation started!

You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/25908809464

svcnvidia-nemo-ci · 2026-05-15T09:22:40Z

🔄 Merge queue validation started!

You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/25910358898

Upstream dev tip: 77c0f8c

Pulled commits:
- 77c0f8c [Dev][feat] Support A2A Overlap for Megatron-FSDP (NVIDIA#3796)
- 8195337 [dev] [3/5] Qwen3.5 support: SharedExpertMLP meta init (NVIDIA#4749)
- 2672ff5 [DEV] fix(megatron-fsdp): preserve non-meta tensors during meta materialization (NVIDIA#4155)
- cfbd9df [dev] [4/5] Qwen3.5 support: Interleaved MRoPE layout (NVIDIA#4750)
- df12802 [dev] Fix GDN DTensor splitting for FSDP checkpointing (NVIDIA#4799)

Resolution: zero conflicts; git auto-merged 12 shared files in megatron/core/{distributed,models,pipeline_parallel,transformer} and tests/unit_tests/a2a_overlap. No ai-blaise custom files touched.

Gates:
- git diff --check: clean
- conflict markers: none
- py_compile (16 changed .py files): OK
- indexcache: 27/28 pass; the 1 fail (test_nvfp4_non_blackwell_cuda_uses_reference_fallback) reproduces identically at the pre-merge base SHA (sglang occupies all 8 H200s in EXCLUSIVE_PROCESS mode -> cudaErrorDevicesUnavailable). 1 Blackwell-only test auto-skips on H200.
- transformer gdn/mtp/moe suite: 53 failed / 7 passed / 55 skipped / 5 errors -- IDENTICAL numbers at pre-merge base; all failures are the same environmental cudaErrorDevicesUnavailable.
- 2-rank torchrun layer-wise optimizer smoke: blocked (no free GPUs).

Custom preserved: StreamBP, IndexCache config, NVFP4 indexer (7e78f28), HISA topk1024 backward test (c628c13), pyproject emerging_optimizers v0.2.0 pin, mHC/MTP/MoE composition.

wplf added the Run tests label May 12, 2026

This was referenced May 12, 2026

[main] [3/5] Qwen3.5 support: SharedExpertMLP meta init #4754

Open

[dev] [1/5] Qwen3.5 support: MTP packed-seq CP+THD fix #4747

Closed

[dev] [2/5] Qwen3.5 support: FSDP DTensor Bridge checkpoint compatibility #4748

Merged

wplf changed the title ~~fix(moe): initialize SharedExpertMLP gate_weight under meta init~~ [dev] [3/5] Qwen3.5 support: SharedExpertMLP meta init May 12, 2026

wplf marked this pull request as ready for review May 13, 2026 06:49

wplf requested review from a team as code owners May 13, 2026 06:49

svcnvidia-nemo-ci added the complexity: low label May 13, 2026

copy-pr-bot Bot temporarily deployed to test May 13, 2026 07:05 Inactive

wplf mentioned this pull request May 13, 2026

[dev] [follow-up] Qwen3.5 support: MoE aux loss padding_mask #4776

Merged

wplf force-pushed the fix/shared-experts-meta-init branch from 794656e to 3b6fba6 Compare May 13, 2026 10:24

copy-pr-bot Bot temporarily deployed to test May 13, 2026 10:49 Inactive

BestJuly approved these changes May 14, 2026

yaox12 approved these changes May 14, 2026

BestJuly added this pull request to the merge queue May 14, 2026

github-merge-queue Bot removed this pull request from the merge queue due to failed status checks May 14, 2026

Merge branch 'dev' into fix/shared-experts-meta-init

ba24d0b

BestJuly enabled auto-merge May 14, 2026 06:19

copy-pr-bot Bot temporarily deployed to test May 14, 2026 06:40 Inactive

BestJuly added this pull request to the merge queue May 14, 2026

github-merge-queue Bot removed this pull request from the merge queue due to failed status checks May 14, 2026

Merge branch 'dev' into fix/shared-experts-meta-init

d06da0e

wplf enabled auto-merge May 15, 2026 02:38

copy-pr-bot Bot temporarily deployed to test May 15, 2026 02:43 Inactive

wplf added this pull request to the merge queue May 15, 2026

Merged via the queue into NVIDIA:dev with commit 8195337 May 15, 2026
340 of 348 checks passed

wplf deleted the fix/shared-experts-meta-init branch May 15, 2026 14:58

Victarry mentioned this pull request Jun 10, 2026

[ROADMAP][2026 Q2] Megatron Core MoE Roadmap #4815

Open

71 tasks

wplf merged 3 commits into NVIDIA:dev from wplf:fix/shared-experts-meta-init

4 participants
