Add expert parallelism (EP) config support for Qwen3 MoE #45436
This creates a tension between attention-layer TP and EP. If we include attention entries (like Llama4 does):
If we exclude attention entries (like gpt_oss does): pure EP works at any scale (EP=16, 32, 64), but TP+EP on a 2D mesh loses attention sharding. What's the preferred approach? Should we go expert-only in the EP plan for maximum flexibility, or include attention entries for combined TP+EP at the cost of constraining the EP size?
|
I think for now we should create a new mesh dim.
It would be nice to add a check for this in some way, or to document the EP plan / TP plan combination. The recommended approach is to go with the best defaults, i.e. what makes the most sense. Pure EP would probably be "faster" since it needs fewer comms; EP + FSDP is easier to get working and probably makes more sense? But we want to allow people to use EP + TP if they want, so error early if that's simple!
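For reference, the "error early" idea could look something like the sketch below. This is only an illustration, not code from this PR: `validate_parallel_sizes` and the `plan_shards_attention` flag are hypothetical names, and it assumes that attention is split over the EP ranks when the EP plan contains attention entries, and over a separate TP mesh dim otherwise.

```python
def validate_parallel_sizes(num_experts, num_kv_heads, ep_size, tp_size=1,
                            plan_shards_attention=False):
    """Raise ValueError early when a TP/EP combination cannot be sharded evenly.

    Hypothetical helper: the names and the rule for attention are assumptions.
    """
    if ep_size < 1 or tp_size < 1:
        raise ValueError("ep_size and tp_size must be >= 1")

    # Experts must split evenly across the EP ranks.
    if num_experts % ep_size != 0:
        raise ValueError(
            f"num_experts={num_experts} is not divisible by ep_size={ep_size}"
        )

    # Assumption: an EP plan with attention entries shards KV heads over the
    # EP ranks; an expert-only plan leaves attention to a separate TP dim.
    attn_shards = ep_size if plan_shards_attention else tp_size
    if attn_shards > 1 and num_kv_heads % attn_shards != 0:
        raise ValueError(
            f"num_kv_heads={num_kv_heads} is not divisible by {attn_shards} "
            "attention shards; use an expert-only EP plan or a smaller size"
        )
```

With the Qwen3-30B-A3B shape from this thread (128 experts, 4 KV heads), an expert-only plan would accept EP=16, while a plan that also shards attention would be rejected at EP=16 because 4 KV heads cannot be split 16 ways.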
Here are some numbers for Qwen3-30B-A3B (MoE, 128 experts, 8 active):
For comparison FSDP2 without EP:
I also ran EP=8 on a 2D mesh vs EP=16 on a flat mesh on 2 nodes, just to see if the layout matters. EP=8 on 2 nodes means the experts are sharded 8 ways within a node and replicated across the 2 nodes.
Flat and 2D meshes give identical performance, I think because the current EP implementation uses all-reduce (96% inter-node bandwidth), not all-to-all. So EP seems to be slower than pure FSDP2 or FSDP2+CP for long context. 🤔
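To make the two layouts concrete, here is a small pure-Python sketch of which ranks end up in which group. It assumes 16 ranks numbered contiguously per node (ranks 0-7 on node 0, 8-15 on node 1); `ep_groups` is a hypothetical helper and does not touch any real device mesh API.

```python
def ep_groups(world_size, ep_size):
    """Return (EP groups, replica groups) for contiguous rank numbering.

    Ranks in one EP group hold different expert shards; ranks in one
    replica group hold the same expert shard.
    """
    if world_size % ep_size != 0:
        raise ValueError("world_size must be divisible by ep_size")
    num_replicas = world_size // ep_size
    ep = [list(range(r * ep_size, (r + 1) * ep_size)) for r in range(num_replicas)]
    replicas = [[r * ep_size + i for r in range(num_replicas)] for i in range(ep_size)]
    return ep, replicas


# Flat mesh: one EP group spanning both nodes, no replication.
flat_ep, flat_replicas = ep_groups(world_size=16, ep_size=16)

# 2D mesh: one EP group per node, each expert shard replicated across nodes.
mesh_ep, mesh_replicas = ep_groups(world_size=16, ep_size=8)

# With 128 experts, each rank hosts 128 // ep_size experts in either layout.
experts_per_rank_flat = 128 // 16
experts_per_rank_2d = 128 // 8
```

In the 2D case rank 0 and rank 8 form a replica pair, so any synchronization of expert weights between them crosses the node boundary.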
Add base_model_ep_plan to Qwen3VLMoeTextConfig
Defines sharding strategy for MoE experts without affecting attention layers, allowing EP to scale beyond num_kv_heads constraints.
Remove duplicate base_model_ep_plan with attention entries from qwen3_moe and update qwen3_vl_moe to use the expert-only EP plan. Attention is left unsharded; FSDP2 handles attention weight distribution.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
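As a rough illustration of what "expert-only" means in these commits, the sketch below strips attention entries from a combined plan. The pattern keys and style strings are placeholders made up for this example, not the actual plan in this PR, and `expert_only_plan` is a hypothetical helper.

```python
def expert_only_plan(plan, attention_marker="self_attn"):
    """Drop every entry whose module pattern targets an attention layer."""
    return {
        pattern: style
        for pattern, style in plan.items()
        if attention_marker not in pattern
    }


# Illustrative combined plan; keys and values are placeholders only.
combined_plan = {
    "layers.*.self_attn.q_proj": "colwise",
    "layers.*.self_attn.k_proj": "colwise",
    "layers.*.self_attn.v_proj": "colwise",
    "layers.*.self_attn.o_proj": "rowwise",
    "layers.*.mlp.experts.gate_up_proj": "expert_sharded",
    "layers.*.mlp.experts.down_proj": "expert_sharded",
}

ep_plan = expert_only_plan(combined_plan)
```

With no attention patterns left in the plan, the EP size is no longer tied to the number of KV heads, which is the trade-off discussed above.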
|
Yeah, this could just be because we send 100% of the hidden states to all experts instead of sending only the ones allocated to each expert, and the same when we reduce? In any case, that's good to have; we can investigate perf issues / bottlenecks later on!
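A quick way to get a feel for the gap is to count token copies per rank under both schemes. The sketch below is a toy model only: it assumes uniform random top-k routing (real routers are not uniform), contiguous expert placement, and a hypothetical `dispatch_volume` helper; it says nothing about how the current EP code actually communicates.

```python
import random


def dispatch_volume(num_tokens, num_experts, top_k, ep_size, seed=0):
    """Count token copies received per EP rank under two schemes.

    replicated: every rank receives every token's hidden state.
    routed: a rank receives a token only if it hosts one of the token's experts.
    """
    rng = random.Random(seed)
    experts_per_rank = num_experts // ep_size
    routed = [0] * ep_size
    for _ in range(num_tokens):
        chosen = rng.sample(range(num_experts), top_k)
        # A token is sent at most once to each rank, even when several of
        # its chosen experts live on that rank.
        for rank in {expert // experts_per_rank for expert in chosen}:
            routed[rank] += 1
    replicated = [num_tokens] * ep_size
    return replicated, routed


replicated, routed = dispatch_volume(
    num_tokens=4096, num_experts=128, top_k=8, ep_size=16
)
saved_fraction = 1 - sum(routed) / sum(replicated)
```

`saved_fraction` is the share of per-rank token traffic that routed dispatch would avoid in this toy setting; multiplying the counts by hidden size and bytes per element turns them into bytes.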
|
run-slow: qwen3_moe, qwen3_omni_moe, qwen3_vl_moe |
is the fastest no? |
|
Happy to merge once green!
|
@bot /style |
…es and regenerate configs
… converter output
|
Finally, CI is 🟢!


Summary
Add base_model_ep_plan to Qwen3MoeConfig, enabling expert parallelism via DistributedConfig(enable_expert_parallel=True).
Depends on #45473
Test plan
Tested on 8×H100 with torchrun --nproc_per_node=8 using Qwen/Qwen3-30B-A3B (128 experts, 4 KV heads):

    # Example test command
    torchrun --nproc_per_node=8 scripts/test_qwen3_moe_tp_ep.py \
        --model_name_or_path Qwen/Qwen3-30B-A3B --tp_size 2 --cp_size 2 --seq_len 128

Test file: test_qwen3_moe_tp_ep.py
Before submitting
Did you read the contributor guideline, Pull Request section?
Who can review?
@3outeille @ArthurZucker (distributed / TP / EP implementation)