[dev] [3/5] Qwen3.5 support: SharedExpertMLP meta init #4749

Merged
wplf merged 3 commits into NVIDIA:dev from wplf:fix/shared-experts-meta-init
May 15, 2026

wplf commented May 12, 2026

Member

Qwen3.5 support series

This is part of a 5-PR series adding Qwen3.5-VL support, split for review clarity.

Dev PRs (this series):

Main PRs (corresponding mirrors):


Summary

Add _reset_parameters to SharedExpertMLP so the directly-owned gate_weight is materialized off the meta device when use_shared_expert_gate=True.

Why

Without this, meta-init leaves gate_weight on the meta device and the first forward fails with a meta-tensor error. Submodules already have their own _reset_parameters, so only the directly-owned gate_weight needs handling.

The implementation mirrors the standard per-parameter init pattern:

  • run init_method when config.perform_initialization is set
  • cast the parameter to config.params_dtype
  • set the sequence_parallel attribute on the parameter
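
The three steps above can be sketched on a toy module. This is a minimal illustration under stated assumptions, not the code merged in this PR: `ToyConfig`, `ToySharedExpertGate`, the `init_std` field, and the normal initializer are hypothetical stand-ins, and the sketch allocates real storage inside `_reset_parameters` for convenience, although the PR does not say where materialization happens in Megatron.

```python
from dataclasses import dataclass

import torch


@dataclass
class ToyConfig:
    """Hypothetical stand-in for the real config; only fields named in the PR."""

    hidden_size: int = 8
    perform_initialization: bool = True
    params_dtype: torch.dtype = torch.float32
    sequence_parallel: bool = False
    init_std: float = 0.02  # assumption: a normal init, not taken from the PR

    def init_method(self, tensor: torch.Tensor) -> torch.Tensor:
        # In-place normal init; torch.nn.init functions run under no_grad.
        return torch.nn.init.normal_(tensor, mean=0.0, std=self.init_std)


class ToySharedExpertGate(torch.nn.Module):
    """Toy module that directly owns a gate_weight created on the meta device."""

    def __init__(self, config: ToyConfig, use_shared_expert_gate: bool = True):
        super().__init__()
        self.config = config
        if use_shared_expert_gate:
            # Meta tensors carry shape and dtype but no storage.
            self.gate_weight = torch.nn.Parameter(
                torch.empty(1, config.hidden_size, device="meta")
            )
        else:
            self.register_parameter("gate_weight", None)

    def _reset_parameters(self, device: str = "cpu") -> None:
        if self.gate_weight is None:
            return  # nothing to do when the gate is disabled
        if self.gate_weight.is_meta:
            # Assumption: allocate storage here so the sketch is self-contained.
            self.gate_weight = torch.nn.Parameter(
                torch.empty_like(self.gate_weight, device=device)
            )
        # Step 1: run init_method when perform_initialization is set.
        if self.config.perform_initialization:
            self.config.init_method(self.gate_weight)
        # Step 2: cast to params_dtype.
        self.gate_weight.data = self.gate_weight.data.to(self.config.params_dtype)
        # Step 3: tag the parameter with the sequence_parallel attribute.
        setattr(self.gate_weight, "sequence_parallel", self.config.sequence_parallel)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # Sigmoid gate score, one scalar per token.
        logits = torch.nn.functional.linear(hidden_states, self.gate_weight)
        return torch.sigmoid(logits)


gate = ToySharedExpertGate(ToyConfig())
gate._reset_parameters()
scores = gate(torch.randn(4, 8))  # forward now runs on real storage
```

Before `_reset_parameters` is called, `gate.gate_weight.is_meta` is `True`; afterwards the parameter lives on the requested device with the configured dtype.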

Risk

The new code runs only when use_shared_expert_gate=True and gate_weight is not None, so existing paths are unaffected.

Test plan

  • Meta-init a config with moe_shared_expert_overlap and use_shared_expert_gate=True; verify forward succeeds.
  • Existing shared-expert tests still pass.
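
One way to check the first item is to assert that no parameter remains on the meta device before running forward. The helper below is a hypothetical sketch (`meta_parameter_names` and `Parent` are not Megatron names); it also shows why a directly-owned parameter can be missed when only submodules are materialized.

```python
from typing import List

import torch


def meta_parameter_names(module: torch.nn.Module, recurse: bool = True) -> List[str]:
    """Return the names of parameters that still live on the meta device."""
    return [
        name
        for name, param in module.named_parameters(recurse=recurse)
        if param.is_meta
    ]


class Parent(torch.nn.Module):
    """Toy parent: one submodule plus one directly-owned parameter."""

    def __init__(self):
        super().__init__()
        self.child = torch.nn.Linear(4, 4, device="meta")
        self.gate_weight = torch.nn.Parameter(torch.empty(1, 4, device="meta"))


parent = Parent()

# Materializing only the submodule leaves the parent's own parameter behind.
parent.child.to_empty(device="cpu")
leftover = meta_parameter_names(parent)

# Materialize the directly-owned parameter as well.
parent.gate_weight = torch.nn.Parameter(torch.empty(1, 4))
torch.nn.init.normal_(parent.gate_weight, std=0.02)
remaining = meta_parameter_names(parent)
```

Here `leftover` contains only `gate_weight`, and `remaining` is empty once the parent's parameter has real storage. A real test would then run forward on the module.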

Notes

Extracted from a larger Qwen3.5-VL development branch to keep this fix reviewable on its own.

🤖 Generated with Claude Code

copy-pr-bot Bot commented May 12, 2026


This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

wplf commented May 13, 2026

Member Author

/ok to test 794656e

Add `_reset_parameters` to `SharedExpertMLP` so the directly-owned
`gate_weight` is materialized off the meta device when
`use_shared_expert_gate=True`. Without this, meta-init leaves
`gate_weight` on the meta device and forward fails.

Mirrors the per-parameter init pattern already used in other Megatron
modules (run `init_method` if `perform_initialization`, cast to
`params_dtype`, set `sequence_parallel`).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Co-Authored-By: BestJuly <19769279+BestJuly@users.noreply.github.com>
@wplf wplf force-pushed the fix/shared-experts-meta-init branch from 794656e to 3b6fba6 Compare May 13, 2026 10:24

wplf commented May 13, 2026

Member Author

/ok to test 3b6fba6

@BestJuly BestJuly added this pull request to the merge queue May 14, 2026
@svcnvidia-nemo-ci


🔄 Merge queue validation started!

You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/25836851225

@github-merge-queue github-merge-queue Bot removed this pull request from the merge queue due to failed status checks May 14, 2026
@BestJuly BestJuly enabled auto-merge May 14, 2026 06:19

wplf commented May 14, 2026

Member Author

/ok to test ba24d0b

@BestJuly BestJuly added this pull request to the merge queue May 14, 2026
@svcnvidia-nemo-ci


🔄 Merge queue validation started!

You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/25850553708

@github-merge-queue github-merge-queue Bot removed this pull request from the merge queue due to failed status checks May 14, 2026
@wplf wplf enabled auto-merge May 15, 2026 02:38

wplf commented May 15, 2026

Member Author

/ok to test d06da0e

@wplf wplf added this pull request to the merge queue May 15, 2026
@svcnvidia-nemo-ci


🔄 Merge queue validation started!

You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/25908809464

@svcnvidia-nemo-ci


🔄 Merge queue validation started!

You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/25910358898

Merged via the queue into NVIDIA:dev with commit 8195337 May 15, 2026
340 of 348 checks passed
@wplf wplf deleted the fix/shared-experts-meta-init branch May 15, 2026 14:58
SpencerGarnets added a commit to ai-blaise/Megatron-LM that referenced this pull request May 16, 2026
Upstream dev tip: 77c0f8c

Pulled commits:

- 77c0f8c [Dev][feat] Support A2A Overlap for Megatron-FSDP (NVIDIA#3796)

- 8195337 [dev] [3/5] Qwen3.5 support: SharedExpertMLP meta init (NVIDIA#4749)

- 2672ff5 [DEV] fix(megatron-fsdp): preserve non-meta tensors during meta materialization (NVIDIA#4155)

- cfbd9df [dev] [4/5] Qwen3.5 support: Interleaved MRoPE layout (NVIDIA#4750)

- df12802 [dev] Fix GDN DTensor splitting for FSDP checkpointing (NVIDIA#4799)

Resolution: zero conflicts; git auto-merged 12 shared files in megatron/core/{distributed,models,pipeline_parallel,transformer} and tests/unit_tests/a2a_overlap. No ai-blaise custom files touched.

Gates:

- git diff --check: clean

- conflict markers: none

- py_compile (16 changed .py files): OK

- indexcache: 27/28 pass; the 1 fail (test_nvfp4_non_blackwell_cuda_uses_reference_fallback) reproduces identically at the pre-merge base SHA (sglang occupies all 8 H200s in EXCLUSIVE_PROCESS mode -> cudaErrorDevicesUnavailable). 1 Blackwell-only test auto-skips on H200.

- transformer gdn/mtp/moe suite: 53 failed / 7 passed / 55 skipped / 5 errors -- IDENTICAL numbers at pre-merge base; all failures are the same environmental cudaErrorDevicesUnavailable.

- 2-rank torchrun layer-wise optimizer smoke: blocked (no free GPUs).

Custom preserved: StreamBP, IndexCache config, NVFP4 indexer (7e78f28), HISA topk1024 backward test (c628c13), pyproject emerging_optimizers v0.2.0 pin, mHC/MTP/MoE composition.