
[Qwen3.5] Fix broken pipeline parallelism layer splitting #21070

Merged
hnyls2002 merged 14 commits into main from fix/qwen35-pp-test-oom-h100 on Mar 21, 2026

Conversation

@alisonshao
Collaborator

@alisonshao alisonshao commented Mar 21, 2026

Summary

  • Root cause: make_layers() in Qwen3_5ForCausalLM ([Qwen3.5] Support Qwen3.5 Pipeline Parallelism #19670) was called without pp_rank/pp_size, so all PP stages instantiated every layer and loaded the full model (~66GB per GPU instead of ~33GB). Pipeline parallelism gave zero memory savings. This was masked on H200 (141GB) but OOMs on H100 (80GB).
  • Model fix (qwen3_5.py):
    • Pass pp_rank/pp_size to make_layers(), matching the working pattern in qwen2_moe.py (see the sketch after this list)
    • Add guard in load_fused_expert_weights to skip params for PP-missing layers (otherwise KeyError on layer indices outside the rank's range)
  • Test fix: Use tp=2 for baseline since the full model (~66GB BF16) doesn't fit on a single H100 (80GB)
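
The make_layers() change follows the pattern below, a minimal sketch in the shape of qwen2_moe.py rather than the literal qwen3_5.py diff: the get_pp_group()/add_prefix helpers, the pp-group attribute names, and the (layers, start_layer, end_layer) return shape are assumptions carried over from that file, and layer_cls stands in for Qwen3.5's decoder-layer class.

from sglang.srt.distributed import get_pp_group
from sglang.srt.utils import add_prefix, make_layers

def build_pp_layers(config, layer_cls, quant_config=None, prefix=""):
    """Return (layers, start_layer, end_layer) with only this PP stage's slice built."""
    pp_group = get_pp_group()
    return make_layers(
        config.num_hidden_layers,
        lambda idx, p: layer_cls(
            config=config, layer_id=idx, quant_config=quant_config, prefix=p
        ),
        # Before the fix these two kwargs were omitted, so every PP stage
        # instantiated all layers; with them, out-of-range layers become
        # PPMissingLayer placeholders and each stage holds only its half.
        pp_rank=pp_group.rank_in_group,
        pp_size=pp_group.world_size,
        prefix=add_prefix("layers", prefix),
    )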

Server logs show both PP stages loading identical weights (should be ~33GB each):

PP0: Load weight end. avail mem=13.12 GB, mem usage=64.54 GB
PP1: Load weight end. avail mem=13.12 GB, mem usage=64.54 GB
RuntimeError: Not enough memory.
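
A rough sanity check of those numbers (not from the PR, just BF16 arithmetic at 2 bytes per parameter):

params = 35e9
weight_gib = params * 2 / 1024**3   # ~65 GiB of weights for the full model
per_stage_gib = weight_gib / 2      # ~33 GiB per stage once PP actually splits layers
headroom_gib = 80 - weight_gib      # ~15 GiB left on an 80GB H100 when one rank loads everything
print(f"{weight_gib:.0f} GiB total, {per_stage_gib:.0f} GiB per stage, {headroom_gib:.0f} GiB headroom unsplit")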

After make_layers fix, weight loading hits KeyError on missing layers:

KeyError: 'model.layers.13.mlp.experts.w2_weight'

Test plan

  • CI passes on H100 runner (the previously failing config)
  • PP accuracy consistency check passes (baseline vs pp=2 within 2%)

Add --tp-size 2 to TestQwen35PPAccuracy so the 35B model (~70GB BF16)
is split across 2 GPUs instead of loading on a single GPU. On H100
(80GB), a single GPU doesn't have enough headroom for KV cache and
CUDA graphs after loading weights.

Failure example: https://github.com/sgl-project/sglang/actions/runs/23367673750/job/67984876757
@gemini-code-assist
Contributor

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

@alisonshao
Collaborator Author

/rerun-ut test_pp_single_node.py

@github-actions
Contributor

✅ Triggered /rerun-ut on 4-gpu-h100 runner:

cd test/ && python3 registered/distributed/test_pp_single_node.py

@github-actions
Contributor

🔗 View workflow run

@alisonshao
Collaborator Author

/rerun-failed-ci

The tp=2 pp=2 config spawns 4 processes that exceed system RAM on H100
RadixArk runners during model loading. Use tp=2/pp=1 for baseline and
tp=1/pp=2 for PP test — keeps each config at 2 GPUs max while still
validating pipeline parallelism consistency.
@alisonshao
Collaborator Author

/rerun-ut test_pp_single_node.py

@github-actions
Contributor

✅ Triggered /rerun-ut on 4-gpu-h100 runner:

cd test/ && python3 registered/distributed/test_pp_single_node.py

@github-actions
Contributor

🔗 View workflow run

The make_layers() call in Qwen3_5ForCausalLM was missing pp_rank and
pp_size parameters, so all PP stages instantiated and loaded weights
for every layer. Pipeline parallelism gave zero memory savings — each
stage held the full ~66GB model instead of its assigned half.

Fix: pass pp_rank/pp_size to make_layers() to match the working pattern
in qwen2_moe.py. Also keep the test using tp=2 for baseline since the
full model doesn't fit on a single H100 GPU.
@alisonshao alisonshao changed the title from "[CI] Fix Qwen3.5-35B PP test OOM on H100 runners" to "[Qwen3.5] Fix broken pipeline parallelism layer splitting" on Mar 21, 2026
@alisonshao
Collaborator Author

/rerun-ut test_pp_single_node.py

@github-actions
Contributor

✅ Triggered /rerun-ut on 4-gpu-h100 runner:

cd test/ && python3 registered/distributed/test_pp_single_node.py

@github-actions
Contributor

🔗 View workflow run

@alisonshao
Collaborator Author

/rerun-stage stage-c-test-4-gpu-b200

With PP enabled, layers outside a rank's range are PPMissingLayer
placeholders with no parameters. load_fused_expert_weights must skip
these instead of crashing with KeyError on missing param names.
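
In code the guard is roughly the following (a sketch: the params_dict membership check and early return are what this commit describes, while the signature and the copy logic of load_fused_expert_weights are simplified assumptions):

def load_fused_expert_weights(self, name, params_dict, loaded_weight):
    # Layers owned by other PP ranks are PPMissingLayer placeholders and have
    # no entries in params_dict, so skip them instead of raising KeyError
    # (e.g. 'model.layers.13.mlp.experts.w2_weight' on the wrong rank).
    if name not in params_dict:
        return False
    param = params_dict[name]
    param.data.copy_(loaded_weight)  # simplified; the real loader shards per expert
    return True
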
@alisonshao
Collaborator Author

/rerun-ut test_pp_single_node.py

@github-actions
Contributor

✅ Triggered /rerun-ut on 4-gpu-h100 runner:

cd test/ && python3 registered/distributed/test_pp_single_node.py

@github-actions
Contributor

🔗 View workflow run

Baseline (tp=2/pp=1) vs PP (tp=1/pp=2) had a 3.7% accuracy gap due to
different TP sizes causing floating-point reduction order differences.
Use tp=2 for both configs so the only variable is PP.
@alisonshao
Collaborator Author

/rerun-ut test_pp_single_node.py

@github-actions
Contributor

✅ Triggered /rerun-ut on 4-gpu-h100 runner:

cd test/ && python3 registered/distributed/test_pp_single_node.py

@github-actions
Contributor

🔗 View workflow run

tp=2/pp=2 crashes in the Triton linear attention kernel (CPU tensor in
PP context). Revert to tp=2/pp=1 baseline vs tp=1/pp=2 PP test. Widen
the accuracy threshold from 2% to 5% to account for TP-induced
floating-point differences between the two configs.
@github-actions
Contributor

🔗 View workflow run

tp=2/pp=2 crashes in Qwen3.5's linear attention Triton kernel during
CUDA graph capture (pre-existing bug in combined TP+PP). Fall back to
tp=2/pp=1 vs tp=1/pp=2. The ~4% accuracy gap is from TP difference
(other models show <0.5% PP-only variance), so a 5% threshold is safe.
@alisonshao
Collaborator Author

/rerun-ut test_pp_single_node.py

@github-actions
Contributor

✅ Triggered /rerun-ut on 4-gpu-h100 runner:

cd test/ && python3 registered/distributed/test_pp_single_node.py

@github-actions
Contributor

🔗 View workflow run

@alisonshao
Collaborator Author

alisonshao commented Mar 21, 2026

5% threshold:

We're forced into mismatched TP sizes — tp=1 baseline OOMs on H100 (66GB model, 80GB GPU), and tp=2/pp=2 crashes in Qwen3.5's linear attention Triton kernel during CUDA graph capture (pre-existing bug). So the test uses tp=2/pp=1 vs tp=1/pp=2.

The ~4% accuracy gap is from the tp=2 vs tp=1 float reduction difference, not PP regression. Evidence: Qwen3-30B in the same file uses tp=1 for both and shows only 0.4% PP gap (92.6% → 92.2%).
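
For illustration, the check reduces to the following shape (hypothetical helper, not the actual TestQwen35PPAccuracy code; only the two configs and the 5% threshold come from the discussion above, and the flag spellings are assumed from the --tp-size usage earlier in this PR):

BASELINE_ARGS = ["--tp-size", "2"]   # tp=2/pp=1: full model split across two H100s
PP_ARGS = ["--pp-size", "2"]         # tp=1/pp=2: pipeline parallelism under test
THRESHOLD = 0.05                     # widened from 2% to absorb the tp=2 vs tp=1 reduction-order gap

def check_pp_consistency(launch_and_eval):
    # launch_and_eval stands in for "start a server with these args and return
    # eval accuracy"; the real test wires this up through the CI harness.
    baseline_acc = launch_and_eval(BASELINE_ARGS)
    pp_acc = launch_and_eval(PP_ARGS)
    assert abs(baseline_acc - pp_acc) <= THRESHOLD, (
        f"PP accuracy {pp_acc:.3f} vs baseline {baseline_acc:.3f} exceeds {THRESHOLD:.0%}"
    )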

@alisonshao
Collaborator Author

alisonshao commented Mar 21, 2026

Note on original author's concern: In #19670, @yuan-luo tried passing pp_rank/pp_size to make_layers but reverted it saying "it will make the result incorrect." The likely issue was the KeyError in load_fused_expert_weights when it tried to load expert weights for PP-missing layers — this PR fixes that with a guard (if name not in params_dict: return False).

CI confirms the fix works: weights split correctly (33GB/GPU instead of 64GB), and accuracy is reasonable (82.4% on tp=1/pp=2). Could you review, @yuan-luo?

Collaborator

@ShangmingCai ShangmingCai left a comment


LGTM

@hnyls2002 hnyls2002 merged commit 852e112 into main Mar 21, 2026
36 of 76 checks passed
@hnyls2002 hnyls2002 deleted the fix/qwen35-pp-test-oom-h100 branch March 21, 2026 08:02
he-yufeng added a commit to he-yufeng/sglang that referenced this pull request Mar 23, 2026
…rallelism

When running Qwen3.5-122B with pp>1, the non-fused expert weight loading
path in load_weights accesses params_dict[name_mapped] without checking
if the key exists. With pipeline parallelism, layers assigned to other
ranks won't have their parameters in the local params_dict, causing a
KeyError (e.g., 'model.layers.4.mlp.experts.w13_weight').

The fused expert path (load_fused_expert_weights) was already fixed in
sgl-project#21070 but the else branch for non-fused experts was missed. This adds
the same guard to both Qwen3_5MoeForCausalLM and
Qwen3_5MoeForConditionalGeneration.

Fixes sgl-project#21184
0-693 pushed a commit to 0-693/sglang that referenced this pull request Mar 25, 2026
…t#21070)

Co-authored-by: Alison Shao <alison.shao@Mac.attlocal.net>
dutsc pushed a commit to dutsc/sglang that referenced this pull request Mar 30, 2026
…t#21070)

Co-authored-by: Alison Shao <alison.shao@Mac.attlocal.net>
JustinTong0323 pushed a commit to JustinTong0323/sglang that referenced this pull request Apr 7, 2026
…t#21070)

Co-authored-by: Alison Shao <alison.shao@Mac.attlocal.net>
yhyang201 pushed a commit to yhyang201/sglang that referenced this pull request Apr 22, 2026
…t#21070)

Co-authored-by: Alison Shao <alison.shao@Mac.attlocal.net>