Qwen3_5MoeForConditionalGeneration missing _tp_plan for tensor parallelism #45125

### System Info

- transformers `main` branch (post-Qwen3.5 MoE addition)
- Any platform with multi-GPU setup

### Who can help?

@3outeille @ArthurZucker

### Information

- [x] My own modified scripts

### Tasks

- [x] My own task or dataset (give details below)

### Reproduction

`Qwen3_5MoeForConditionalGeneration` (the VL wrapper) is missing `_tp_plan`, while the text-only `Qwen3_5MoeForCausalLM` already has `_tp_plan = {"lm_head": "colwise_gather_output"}`.

```python
import torch
from transformers import Qwen3_5MoeForConditionalGeneration

# With tp_plan="auto", lm_head is NOT sharded; it stays replicated on every GPU.
model = Qwen3_5MoeForConditionalGeneration.from_pretrained(
    "Qwen/Qwen3.5-35B-A3B", tp_plan="auto", torch_dtype=torch.bfloat16
)
```

The `lm_head` Linear layer is not included in any TP plan for this class, so under `tp_plan="auto"` it remains replicated instead of being sharded with `colwise_gather_output`. This wastes memory and may produce incorrect logits since the all-gather is not applied.

### Expected behavior

`lm_head` should be sharded with `colwise_gather_output` when using `tp_plan="auto"`, consistent with `Qwen3_5MoeForCausalLM`.
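
A minimal sketch of the expected change, using stand-in classes rather than the real `transformers` ones. The class names, the `tp_style_for` helper, and the `"replicated"` fallback label are all hypothetical and only illustrate the idea; the actual plan resolution inside `transformers` may differ. The only detail taken from this report is the plan entry `{"lm_head": "colwise_gather_output"}`.

```python
# Hypothetical stand-ins for illustration only; not the real transformers classes.
class CausalLMStandIn:
    # Mirrors the plan the text-only class is reported to have.
    _tp_plan = {"lm_head": "colwise_gather_output"}


class ConditionalGenerationStandIn:
    # Mirrors the reported problem: no class-level plan for lm_head.
    _tp_plan = None


class PatchedConditionalGenerationStandIn(ConditionalGenerationStandIn):
    # The proposed change: declare the same plan as the text-only class.
    _tp_plan = {"lm_head": "colwise_gather_output"}


def tp_style_for(model_cls, module_name):
    """Simplified lookup (an assumption): modules absent from the plan are replicated."""
    plan = getattr(model_cls, "_tp_plan", None) or {}
    return plan.get(module_name, "replicated")


for cls in (CausalLMStandIn, ConditionalGenerationStandIn, PatchedConditionalGenerationStandIn):
    print(cls.__name__, "->", tp_style_for(cls, "lm_head"))
```

With the plan declared on the wrapper class, the lookup for `lm_head` returns the same style as the text-only class instead of falling back to replication.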

Fix: huggingface/transformers#45124
