Qwen3_5MoeForConditionalGeneration missing _tp_plan for tensor parallelism #45125

@danielquintas8

System Info

  • transformers main branch (post-Qwen3.5 MoE addition)
  • Any platform with multi-GPU setup

Who can help?

@3outeille @ArthurZucker

Information

  • My own modified scripts

Tasks

  • My own task or dataset (give details below)

Reproduction

Qwen3_5MoeForConditionalGeneration (the VL wrapper) is missing _tp_plan, while the text-only Qwen3_5MoeForCausalLM already has _tp_plan = {"lm_head": "colwise_gather_output"}.

import torch
from transformers import Qwen3_5MoeForConditionalGeneration

# lm_head is NOT sharded; it is replicated on every GPU
model = Qwen3_5MoeForConditionalGeneration.from_pretrained(
    "Qwen/Qwen3.5-35B-A3B", tp_plan="auto", torch_dtype=torch.bfloat16
)

The lm_head Linear layer is not included in any TP plan for this class, so under tp_plan="auto" it remains replicated instead of being sharded with colwise_gather_output. This wastes memory and may produce incorrect logits since the all-gather is not applied.
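To illustrate what colwise_gather_output is expected to do for lm_head, here is a toy sketch in plain Python (no torch, no transformers). It assumes that "colwise" splits the output (vocabulary) dimension of the projection across ranks and that "gather output" concatenates the per-rank logit slices. The helper names linear, shard_colwise and gather_output are hypothetical and exist only for this example; they are not transformers APIs.

```python
# Toy illustration in plain Python: column-wise sharding of an lm_head-like
# projection, followed by a gather of the per-rank outputs.
# All names here are hypothetical; none of them are transformers APIs.

def linear(weight, x):
    """Compute weight @ x, with weight stored as [out_features][in_features]."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]

def shard_colwise(weight, world_size):
    """Split the output dimension (rows of weight) evenly across ranks."""
    assert len(weight) % world_size == 0, "vocab size must divide evenly in this toy"
    step = len(weight) // world_size
    return [weight[r * step:(r + 1) * step] for r in range(world_size)]

def gather_output(partials):
    """Stand-in for the all-gather: concatenate per-rank logit slices in rank order."""
    return [value for part in partials for value in part]

hidden = [1.0, 2.0]                                           # hidden_size = 2
lm_head = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, -1.0]]  # vocab_size = 4

full_logits = linear(lm_head, hidden)                  # unsharded reference
shards = shard_colwise(lm_head, world_size=2)          # each rank holds 2 of 4 rows
partial_logits = [linear(shard, hidden) for shard in shards]
gathered_logits = gather_output(partial_logits)        # full vocab again on every rank
```

In this toy, each rank stores only half of the lm_head rows, and gathered_logits matches full_logits. That is the memory saving a plan entry for lm_head is meant to provide.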

Expected behavior

lm_head should be sharded with colwise_gather_output when using tp_plan="auto", consistent with Qwen3_5MoeForCausalLM.
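A quick way to check the expected behavior is to look up the lm_head entry in a class's _tp_plan. The sketch below uses stand-in classes rather than the real models so that it runs without transformers. ToyPreTrainedModel, ToyCausalLM, ToyConditionalGeneration and lm_head_plan are hypothetical names, and the None default for _tp_plan on the base class is an assumption.

```python
# Stand-in classes mirroring the situation described in this issue.
# These are NOT the real transformers classes; the None default is an assumption.

class ToyPreTrainedModel:
    _tp_plan = None

class ToyCausalLM(ToyPreTrainedModel):
    # Mirrors what the issue reports for Qwen3_5MoeForCausalLM.
    _tp_plan = {"lm_head": "colwise_gather_output"}

class ToyConditionalGeneration(ToyPreTrainedModel):
    # Mirrors the reported gap: no _tp_plan declared on the wrapper class.
    pass

def lm_head_plan(model_cls):
    """Return the TP style declared for lm_head, or None if nothing is declared."""
    plan = getattr(model_cls, "_tp_plan", None) or {}
    return plan.get("lm_head")

print(lm_head_plan(ToyCausalLM))
print(lm_head_plan(ToyConditionalGeneration))
```

The same helper could in principle be pointed at the real classes. Under the behavior requested here, both would report colwise_gather_output.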

Fix: #45124
