[dev] [3/5] Qwen3.5 support: SharedExpertMLP meta init #4749
Merged
This was referenced May 12, 2026
/ok to test 794656e
Add `_reset_parameters` to `SharedExpertMLP` so the directly-owned `gate_weight` is materialized off the meta device when `use_shared_expert_gate=True`. Without this, meta-init leaves `gate_weight` on the meta device and forward fails.

Mirrors the per-parameter init pattern already used in other Megatron modules (run `init_method` if `perform_initialization`, cast to `params_dtype`, set `sequence_parallel`).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: BestJuly <19769279+BestJuly@users.noreply.github.com>
794656e to 3b6fba6
/ok to test 3b6fba6
BestJuly approved these changes May 14, 2026
yaox12 approved these changes May 14, 2026
🔄 Merge queue validation started! You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/25836851225
/ok to test ba24d0b
🔄 Merge queue validation started! You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/25850553708
/ok to test d06da0e
🔄 Merge queue validation started! You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/25908809464
🔄 Merge queue validation started! You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/25910358898
SpencerGarnets added a commit to ai-blaise/Megatron-LM that referenced this pull request May 16, 2026
Upstream dev tip: 77c0f8c

Pulled commits:
- 77c0f8c [Dev][feat] Support A2A Overlap for Megatron-FSDP (NVIDIA#3796)
- 8195337 [dev] [3/5] Qwen3.5 support: SharedExpertMLP meta init (NVIDIA#4749)
- 2672ff5 [DEV] fix(megatron-fsdp): preserve non-meta tensors during meta materialization (NVIDIA#4155)
- cfbd9df [dev] [4/5] Qwen3.5 support: Interleaved MRoPE layout (NVIDIA#4750)
- df12802 [dev] Fix GDN DTensor splitting for FSDP checkpointing (NVIDIA#4799)

Resolution: zero conflicts; git auto-merged 12 shared files in megatron/core/{distributed,models,pipeline_parallel,transformer} and tests/unit_tests/a2a_overlap. No ai-blaise custom files touched.

Gates:
- git diff --check: clean
- conflict markers: none
- py_compile (16 changed .py files): OK
- indexcache: 27/28 pass; the 1 fail (test_nvfp4_non_blackwell_cuda_uses_reference_fallback) reproduces identically at the pre-merge base SHA (sglang occupies all 8 H200s in EXCLUSIVE_PROCESS mode -> cudaErrorDevicesUnavailable). 1 Blackwell-only test auto-skips on H200.
- transformer gdn/mtp/moe suite: 53 failed / 7 passed / 55 skipped / 5 errors -- IDENTICAL numbers at pre-merge base; all failures are the same environmental cudaErrorDevicesUnavailable.
- 2-rank torchrun layer-wise optimizer smoke: blocked (no free GPUs).

Custom preserved: StreamBP, IndexCache config, NVFP4 indexer (7e78f28), HISA topk1024 backward test (c628c13), pyproject emerging_optimizers v0.2.0 pin, mHC/MTP/MoE composition.
Qwen3.5 support series
This is part of a 5-PR series adding Qwen3.5-VL support, split for review clarity.
Dev PRs (this series):
Main PRs (corresponding mirrors):
Summary
Add `_reset_parameters` to `SharedExpertMLP` so the directly-owned `gate_weight` is materialized off the meta device when `use_shared_expert_gate=True`.
Why
Without this, meta-init leaves `gate_weight` on the meta device and the first forward fails with a meta-tensor error. Submodules already have their own `_reset_parameters`, so only the directly-owned `gate_weight` needs handling.
The implementation mirrors the standard per-parameter init pattern:
- Run `init_method` when `config.perform_initialization`
- Cast to `config.params_dtype`
- Set the `sequence_parallel` attribute
Risk
Only fires when `use_shared_expert_gate=True` and `gate_weight is not None`; no effect on existing paths.
Test plan
- `moe_shared_expert_overlap` and `use_shared_expert_gate=True`; verify forward succeeds.
Notes
Extracted from a larger Qwen3.5-VL development branch to keep this fix reviewable on its own.
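For reviewers unfamiliar with meta init, the per-parameter pattern described in this PR can be sketched in plain PyTorch roughly as follows. This is a minimal illustration, not the code in this PR: `ToySharedExpertGate` and `materialize` are hypothetical names, the constructor arguments stand in for the corresponding `config` fields, and the gate shape `(1, hidden_size)` is an assumption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ToySharedExpertGate(nn.Module):
    """Hypothetical stand-in for a module that directly owns a gate weight."""

    def __init__(self, hidden_size, init_method, perform_initialization=True,
                 params_dtype=torch.float32, sequence_parallel=False, device=None):
        super().__init__()
        self.init_method = init_method
        self.perform_initialization = perform_initialization
        self.params_dtype = params_dtype
        self.sequence_parallel = sequence_parallel
        # With device="meta" this allocates shape/dtype metadata only, no storage.
        self.gate_weight = nn.Parameter(
            torch.empty(1, hidden_size, dtype=params_dtype, device=device)
        )
        if not self.gate_weight.is_meta:
            self._reset_parameters()

    def _reset_parameters(self):
        """Initialize the directly-owned parameter once it has real storage."""
        if self.gate_weight is None:
            return
        if self.perform_initialization:
            self.init_method(self.gate_weight)
        # Cast to the configured dtype (a no-op when it already matches).
        self.gate_weight.data = self.gate_weight.data.to(dtype=self.params_dtype)
        # Tag the parameter so downstream code can read the attribute.
        setattr(self.gate_weight, "sequence_parallel", self.sequence_parallel)

    def forward(self, x):
        return torch.sigmoid(F.linear(x, self.gate_weight))


def materialize(module, device="cpu"):
    """Hypothetical helper: allocate real storage, then re-run per-module init."""
    module.to_empty(device=device)  # uninitialized memory on the target device
    for m in module.modules():
        if hasattr(m, "_reset_parameters"):
            m._reset_parameters()
    return module


if __name__ == "__main__":
    gate = ToySharedExpertGate(
        8, init_method=lambda w: nn.init.normal_(w, mean=0.0, std=0.02), device="meta"
    )
    print(gate.gate_weight.is_meta)  # weight has no storage yet
    materialize(gate)
    print(gate.gate_weight.device, gate(torch.randn(2, 8)).shape)
```

The point of the sketch is that a module without its own `_reset_parameters` would be skipped by a helper like `materialize`, leaving any parameter it owns directly either on the meta device or uninitialized, depending on how materialization is done.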
🤖 Generated with Claude Code