
[Qwen3.5] Support Qwen3.5 Pipeline Parallelism#19670

Merged
BBuf merged 1 commit into sgl-project:main from antgroup:support_qwen35_pp
Mar 7, 2026

Conversation

@yuan-luo
Collaborator

@yuan-luo yuan-luo commented Mar 2, 2026

Motivation

To close #19500

Currently, running Qwen3.5 with pipeline parallelism crashes with an error. With this PR it works.

Server:

➜  /sgl-workspace python -m sglang.launch_server --model Qwen/Qwen3.5-35B-A3B --pp-size 2

GSM8K shows no accuracy drop:

➜  bench_script lm_eval --model local-completions --tasks gsm8k   --model_args base_url=http://localhost:30000/v1/completions,model=Qwen/Qwen3.5-35B-A3B,num_concurrent=109;
2026-03-02:08:38:36 INFO     [_cli.run:376] Selected Tasks: ['gsm8k']
2026-03-02:08:38:36 INFO     [evaluator:211] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234 | Setting fewshot manual seed to 1234
2026-03-02:08:38:36 INFO     [evaluator:236] Initializing local-completions model, with arguments: {'base_url': 'http://localhost:30000/v1/completions', 'model': 'Qwen/Qwen3.5-35B-A3B', 'num_concurrent': 109}
2026-03-02:08:38:36 INFO     [models.openai_completions:42] Remote tokenizer not supported. Using huggingface tokenizer backend.
2026-03-02:08:38:36 INFO     [models.api_models:172] Using max length 2048 - 1
2026-03-02:08:38:36 INFO     [models.api_models:193] Using tokenizer huggingface
2026-03-02:08:38:40 INFO     [tasks:700] Selected tasks:
2026-03-02:08:38:40 INFO     [tasks:691] Task: gsm8k (gsm8k/gsm8k.yaml)
2026-03-02:08:38:40 INFO     [evaluator:314] gsm8k: Using gen_kwargs: {'until': ['Question:', '</s>', '<|im_end|>'], 'do_sample': False, 'temperature': 0.0}
2026-03-02:08:38:40 INFO     [api.task:311] Building contexts for gsm8k on rank 0...
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1319/1319 [00:04<00:00, 294.79it/s]
2026-03-02:08:38:44 INFO     [evaluator:584] Running generate_until requests
Requesting API: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1319/1319 [02:02<00:00, 10.73it/s]
fatal: not a git repository (or any of the parent directories): .git
2026-03-02:08:40:56 INFO     [loggers.evaluation_tracker:316] Output path not provided, skipping saving results aggregated
local-completions ({'base_url': 'http://localhost:30000/v1/completions', 'model': 'Qwen/Qwen3.5-35B-A3B', 'num_concurrent': 109}), gen_kwargs: ({}), limit: None, num_fewshot: None, batch_size: 1
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.8529|±  |0.0098|
|     |       |strict-match    |     5|exact_match|↑  |0.8355|±  |0.0102|

Modifications

There are several modifications:

  1. embed_tokens was missing a PPMissingLayer placeholder, which caused the server launch to crash.
  2. make_layers returns a module list covering all layers, so we need to sort out which layers belong to the current PP rank; those that do not must use PPMissingLayer as a placeholder.
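The placeholder pattern described above can be sketched as follows. This is an illustrative sketch only: `PPMissingLayer`, `build_layers`, and the other names here are hypothetical stand-ins, not sglang's actual implementation.

```python
# Hypothetical sketch of the PPMissingLayer pattern: the layer list spans
# all layer indices, but only this rank's slice is materialized.

class PPMissingLayer:
    """Placeholder standing in for a layer owned by another PP rank."""

    def __call__(self, *args, **kwargs):
        raise RuntimeError("Layer is not materialized on this PP rank.")


def build_layers(num_layers, start_layer, end_layer, make_real_layer):
    """Materialize only [start_layer, end_layer); placeholders elsewhere."""
    return [
        make_real_layer(i) if start_layer <= i < end_layer else PPMissingLayer()
        for i in range(num_layers)
    ]


# A rank owning layers 2..4 of an 8-layer model:
layers = build_layers(8, 2, 5, lambda i: f"DecoderLayer{i}")
```

Keeping the list at full length means layer indices (and therefore checkpoint weight names like `layers.{i}.*`) stay aligned across ranks.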

Accuracy Tests

GSM8K shows no accuracy drop (see the results above).

Benchmarking and Profiling

Checklist

Review Process

  1. Ping Merge Oncalls to start the PR flow. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • /tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci
  4. After green CI and required approvals, ask Merge Oncalls to merge.

@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses and resolves issues preventing Qwen3.5 models from utilizing Pipeline Parallelism. It introduces mechanisms to correctly manage model layers and components across different pipeline stages, ensuring that each parallel rank only initializes and processes the parts of the model relevant to its assigned segment, thereby enabling stable and functional distributed execution.

Highlights

  • Pipeline Parallelism Support: Enabled Pipeline Parallelism for Qwen3.5 models, resolving a previous crash during execution.
  • PPMissingLayer Integration: Introduced PPMissingLayer to correctly handle model components (like embed_tokens and norm) that are not present on a specific pipeline parallel rank, preventing errors.
  • Layer Distribution Logic: Implemented logic to distribute decoder layers across pipeline parallel ranks, ensuring each rank processes only its assigned subset of layers.
  • Conditional Embedding/Head Handling: Modified embedding and language model head weight management to be conditional on the pipeline parallel rank, preventing errors when these components are not on the current rank.


Changelog
  • python/sglang/srt/models/qwen3_5.py
    • Imported PPMissingLayer for handling missing pipeline parallel components.
    • Modified the initialization of embed_tokens to use PPMissingLayer when not on the first pipeline parallel rank.
    • Added logic to calculate the start_layer and end_layer for the current pipeline parallel rank.
    • Updated the initialization of the final norm layer to use PPMissingLayer when not on the last pipeline parallel rank.
    • Adjusted the forward method's loop to iterate only over the layers assigned to the current pipeline parallel rank.
    • Updated get_embed_and_head and set_embed_and_head methods to conditionally access embed_tokens and lm_head weights based on the pipeline parallel rank.
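The start_layer/end_layer calculation mentioned in the changelog can be illustrated with a small sketch. This is not sglang's actual `get_pp_indices`; the even-split-with-remainder scheme below is an assumption for illustration.

```python
# Illustrative sketch: split num_layers contiguously across pp_size ranks,
# spreading any remainder over the earliest ranks.

def get_pp_indices(num_layers: int, pp_rank: int, pp_size: int):
    """Return the [start, end) layer indices owned by pp_rank."""
    per_rank = num_layers // pp_size
    remainder = num_layers % pp_size
    start = pp_rank * per_rank + min(pp_rank, remainder)
    end = start + per_rank + (1 if pp_rank < remainder else 0)
    return start, end


# With 48 layers over 2 ranks, each rank owns 24 contiguous layers.
first = get_pp_indices(48, 0, 2)
second = get_pp_indices(48, 1, 2)
```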
Activity
  • No human activity (comments, reviews, or progress updates) has been recorded for this pull request.

@yuan-luo
Collaborator Author

yuan-luo commented Mar 2, 2026

/tag-and-rerun-ci

github-actions bot added the run-ci label Mar 2, 2026
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request adds support for pipeline parallelism to Qwen3.5 models. The changes correctly handle pipeline stages by using PPMissingLayer for embeddings and the final normalization layer on ranks where they are not needed. The logic for accessing weights in get_embed_and_head and set_embed_and_head is also correctly updated to be pipeline-aware.

My review includes two main points. First, a high-severity issue regarding memory efficiency: all decoder layers are currently instantiated on all pipeline ranks, which can lead to unnecessary memory consumption. I've provided a suggestion to fix this by using PPMissingLayer for inactive layers. Second, a medium-severity suggestion to refactor duplicated code in Qwen3_5ForConditionalGeneration and Qwen3_5MoeForConditionalGeneration into a common base class to improve maintainability.
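The suggested base-class refactor can be sketched as follows. The class names (`PPAwareBase`, `DenseModel`, `MoeModel`) are hypothetical stand-ins for the two `ForConditionalGeneration` variants, not the actual sglang classes.

```python
# Illustrative sketch of the refactor suggestion: hoist shared
# pipeline-parallel bookkeeping into one base class so both model
# variants reuse it instead of duplicating the logic.

class PPAwareBase:
    def __init__(self, pp_rank: int, pp_size: int):
        self.pp_rank = pp_rank
        self.pp_size = pp_size

    @property
    def is_first_rank(self) -> bool:
        # Only the first rank owns embed_tokens.
        return self.pp_rank == 0

    @property
    def is_last_rank(self) -> bool:
        # Only the last rank owns the final norm and lm_head.
        return self.pp_rank == self.pp_size - 1


class DenseModel(PPAwareBase):
    pass


class MoeModel(PPAwareBase):
    pass
```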

Comment thread python/sglang/srt/models/qwen3_5.py
Comment thread python/sglang/srt/models/qwen3_5.py Outdated
Collaborator

@ShangmingCai ShangmingCai left a comment


You can contact the author of #19254; I see some similar effort, so maybe we can converge the plans a bit.

@yuan-luo yuan-luo force-pushed the support_qwen35_pp branch from 231f212 to 1942cc3 Compare March 3, 2026 02:16
@yuan-luo
Collaborator Author

yuan-luo commented Mar 3, 2026

You can contact the author of #19254; I see some similar effort, so maybe we can converge the plans a bit.

@ShangmingCai I couldn't contact @zhangxiaolei123456 directly, so I left a message in #19254.
Since this PR addresses #19500 independently and its scope is relatively small, shall we proceed with reviewing it now?

@yuan-luo
Collaborator Author

yuan-luo commented Mar 3, 2026

/rerun-failed-ci

1 similar comment
@yuan-luo
Collaborator Author

yuan-luo commented Mar 4, 2026

/rerun-failed-ci

Comment thread test/registered/distributed/test_pp_single_node.py
Collaborator

@BBuf BBuf left a comment


Looks good.

Comment on lines 703 to 707
self.layers = make_layers(
config.num_hidden_layers,
get_layer,
prefix=f"{prefix}.layers",
)
Collaborator


Should we change this block as well? If we pass the PP size and PP rank into make_layers, we can get start_layer and end_layer without calling get_pp_indices separately.

Collaborator Author


Refactored.

Collaborator Author


After double-checking, we can't use make_layers to generate start_layer and end_layer; doing so makes the result incorrect. The reason is that we need to loop over all layers and set the missing layers accordingly, rather than starting from start_layer inside make_layers. Changed back.
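The rank-local forward loop that results from this design can be sketched as below. This is a toy illustration with hypothetical names, not the actual Qwen3_5Model.forward; the point is that only the layers in [start_layer, end_layer) execute on a given rank, while the placeholder slots are never called.

```python
# Hedged sketch of a pipeline-aware forward pass: iterate only over the
# layers this rank owns. Upstream ranks' output would arrive as
# `hidden_states` via pipeline communication in the real model.

def forward_local_layers(layers, start_layer, end_layer, hidden_states):
    for i in range(start_layer, end_layer):
        hidden_states = layers[i](hidden_states)
    return hidden_states


# Toy "decoder layers" that each add their index to the hidden state:
layers = [lambda h, k=i: h + k for i in range(4)]
out = forward_local_layers(layers, 1, 3, 0)  # runs layers 1 and 2
```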

Comment thread python/sglang/srt/models/qwen3_5.py Outdated
Collaborator

@ShangmingCai ShangmingCai left a comment


Others LGTM, as long as the new test passes CI.

@yuan-luo yuan-luo force-pushed the support_qwen35_pp branch 2 times, most recently from 3f17199 to c1e4a2d Compare March 7, 2026 06:30
@yuan-luo yuan-luo force-pushed the support_qwen35_pp branch from c1e4a2d to 5a35581 Compare March 7, 2026 08:54
@yuan-luo
Collaborator Author

yuan-luo commented Mar 7, 2026

/rerun-failed-ci

2 similar comments
@yuan-luo
Collaborator Author

yuan-luo commented Mar 7, 2026

/rerun-failed-ci

@yuan-luo
Collaborator Author

yuan-luo commented Mar 7, 2026

/rerun-failed-ci

@BBuf BBuf merged commit 7da590d into sgl-project:main Mar 7, 2026
228 of 254 checks passed
@yuan-luo yuan-luo deleted the support_qwen35_pp branch March 8, 2026 01:23
@hlu1 hlu1 mentioned this pull request Mar 9, 2026
17 tasks
Wangzheee pushed a commit to Wangzheee/sglang that referenced this pull request Mar 21, 2026
Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com>
popsiclexu pushed a commit to popsiclexu/sglang that referenced this pull request Mar 25, 2026
Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com>
ShangmingCai added a commit that referenced this pull request Mar 25, 2026
Qwen3.5 PP support and its consistency check were introduced in #19670, but the test turned out to be flaky on H100 and AMD, which blocks CI.

The performance regression cannot be reproduced on H20, so we need some time to investigate before bringing this test back.
JustinTong0323 pushed a commit to JustinTong0323/sglang that referenced this pull request Apr 7, 2026
Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com>
yhyang201 pushed a commit to yhyang201/sglang that referenced this pull request Apr 22, 2026
Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com>


Development

Successfully merging this pull request may close these issues.

[Bug] Qwen3.5 does not work with pipeline parallelism

4 participants