System Info
- transformers: main branch (post-Qwen3.5 MoE addition)
- Any platform with multi-GPU setup
Who can help?
@3outeille @ArthurZucker
Information
Tasks
Reproduction
Qwen3_5MoeForConditionalGeneration (the VL wrapper) is missing _tp_plan, while the text-only Qwen3_5MoeForCausalLM already has _tp_plan = {"lm_head": "colwise_gather_output"}.
import torch
from transformers import Qwen3_5MoeForConditionalGeneration

# With tp_plan="auto", lm_head is NOT sharded; it is replicated on every GPU
model = Qwen3_5MoeForConditionalGeneration.from_pretrained(
    "Qwen/Qwen3.5-35B-A3B", tp_plan="auto", torch_dtype=torch.bfloat16
)
The lm_head Linear layer is not included in any TP plan for this class, so under tp_plan="auto" it remains replicated instead of being sharded with colwise_gather_output. This wastes memory and may produce incorrect logits since the all-gather is not applied.
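To make the difference concrete, here is a minimal pure-Python sketch of what a column-wise sharded projection with gathered output is understood to compute. It is not the transformers implementation: `matvec`, `colwise_gather_output`, the toy weight matrix, and `world_size=2` are illustrative assumptions, and the all-gather is simulated by concatenating per-rank results in one process.

```python
def matvec(weight, x):
    """Compute weight @ x, where weight is a list of rows (out_features x in_features)."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


def colwise_gather_output(weight, x, world_size):
    """Split the output (vocabulary) dimension across ranks, then gather the pieces.

    Assumption: the number of output rows divides evenly by world_size.
    """
    rows_per_rank = len(weight) // world_size
    logits = []
    for rank in range(world_size):
        # Each rank holds only its own slice of the lm_head weight.
        shard = weight[rank * rows_per_rank:(rank + 1) * rows_per_rank]
        # The all-gather is simulated here by concatenating per-rank outputs.
        logits.extend(matvec(shard, x))
    return logits


# Toy lm_head: vocabulary of 4 tokens, hidden size 2 (made-up numbers).
weight = [[1.0, 0.0], [0.0, 1.0], [2.0, 1.0], [1.0, -1.0]]
hidden = [3.0, 4.0]

full = matvec(weight, hidden)  # replicated: every rank holds all 8 parameters
sharded = colwise_gather_output(weight, hidden, world_size=2)
params_per_rank = (len(weight) // 2) * len(weight[0])  # sharded: 4 parameters per rank
```

In this toy setup the gathered logits match the unsharded ones while each rank stores half of the weight, which is the memory saving the replicated `lm_head` gives up.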
Expected behavior
lm_head should be sharded with colwise_gather_output when using tp_plan="auto", consistent with Qwen3_5MoeForCausalLM.
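A check along the following lines could guard against the two classes drifting apart again. This is a sketch under stated assumptions: `tp_plan_gaps` is a hypothetical helper, and the two stand-in classes only mimic the `_tp_plan` class attribute described in this issue so that the snippet runs without transformers installed. In a real test, the stand-ins would be replaced by `Qwen3_5MoeForCausalLM` and `Qwen3_5MoeForConditionalGeneration`.

```python
def tp_plan_gaps(classes, expected):
    """Return {class name: missing or mismatched entries} for each class's _tp_plan."""
    gaps = {}
    for cls in classes:
        # A class without _tp_plan is treated as having an empty plan.
        plan = getattr(cls, "_tp_plan", None) or {}
        missing = {k: v for k, v in expected.items() if plan.get(k) != v}
        if missing:
            gaps[cls.__name__] = missing
    return gaps


# Hypothetical stand-ins mirroring the state described in this issue.
class CausalLMStandIn:
    _tp_plan = {"lm_head": "colwise_gather_output"}


class ConditionalGenerationStandIn:
    pass  # no _tp_plan declared


expected_plan = {"lm_head": "colwise_gather_output"}
gaps = tp_plan_gaps([CausalLMStandIn, ConditionalGenerationStandIn], expected_plan)
```

Here `gaps` lists only the stand-in without a plan; once both classes declare the same entry, it would be empty.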
Fix: #45124