[Intel GPU] Enable pipeline parallelism on XPU #23472

mingfeima merged 5 commits into sgl-project:main
Conversation
Replace hard-coded torch.cuda.{Event,current_stream,synchronize} calls in
SchedulerPPMixin with device-agnostic torch.get_device_module() lookups so
the PP scheduler loop runs on XPU (and any other non-CUDA backend) in
addition to CUDA.
Reorder send/recv in _pp_send_recv_and_preprocess_output_tensors by
pp_rank parity (even ranks send-then-recv, odd ranks recv-then-send).
The original always-send-first ordering livelocks on backends where
point-to-point isend busy-polls for a matching recv rendezvous: with
XCCL on XPU, every PP rank entered isend simultaneously and none posted
a recv, pinning all ranks at 100% CPU inside torch.distributed. Parity
ordering guarantees each adjacent rank pair has one sender and one
receiver posted at the same time.
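A minimal sketch of the parity ordering (function and argument names are illustrative, not the PR's exact code):

```python
import torch.distributed as dist

def parity_send_recv(pp_rank, send_buf, recv_buf, next_rank, prev_rank):
    # Even ranks post send first, odd ranks post recv first: every
    # adjacent (even, odd) pair then has one receiver already waiting,
    # so the isend rendezvous completes instead of busy-polling.
    if pp_rank % 2 == 0:
        send_req = dist.isend(send_buf, dst=next_rank)
        recv_req = dist.irecv(recv_buf, src=prev_rank)
    else:
        recv_req = dist.irecv(recv_buf, src=prev_rank)
        send_req = dist.isend(send_buf, dst=next_rank)
    send_req.wait()
    recv_req.wait()
```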
Verified on 4x Intel XPU with Llama-3.1-8B:
- TP=1 PP=2, PP=3, PP=4
- TP=2 PP=2
- Full warmup + bench_one_batch_server cycle at PP=4
Code Review
This pull request refactors the pipeline parallelism scheduler to be device-agnostic by replacing CUDA-specific calls with a generic device module and updating device assignments. It also introduces rank-parity-based ordering of send and recv operations to prevent livelocks on specific hardware backends. The review feedback flags several type-hint issues: `torch.Event` requires PyTorch 2.4+, some return type hints do not match the values actually returned, and values that can be `None` should be marked `Optional`.
Address review feedback on type hints in `SchedulerPPMixin`.
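As a hedged illustration of the kind of cleanup requested (names and container shapes are assumed, not the PR's actual diff):

```python
from collections import deque
from typing import Any, Deque, Optional, Tuple

# torch.Event only exists as a public alias on PyTorch >= 2.4, and the
# concrete type is backend-specific (torch.cuda.Event / torch.xpu.Event),
# so the runtime annotation stays generic.
d2h_events: Deque[Tuple[Any, ...]] = deque()

def pop_ready_event(events: Deque[Tuple[Any, ...]]) -> Optional[Tuple[Any, ...]]:
    # Returns None when nothing is queued, hence the Optional marker.
    return events.popleft() if events else None
```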
Gate the parity-based send/recv ordering in _pp_send_recv_and_preprocess_output_tensors so it applies only on XPU. CUDA/NCCL keeps the original send-first behavior since isend there is eager stream-enqueue and reordering two non-blocking ops has no effect. On XPU, isend is effectively blocking and does not return until the peer posts a matching recv; if every PP rank sends first, all ranks block waiting for a receiver and the ring deadlocks. Parity ordering (even: send->recv, odd: recv->send) guarantees each adjacent pair has one sender and one receiver posted simultaneously, and generalizes across all PP sizes (PP=2, 3, 4, ...).
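A sketch of that gate, with `device`, `do_send`, and `do_recv` as hypothetical stand-ins for the real state and the `_do_send()`/`_do_recv()` helpers:

```python
import torch

def ordered_send_recv(pp_rank: int, device: torch.device, do_send, do_recv):
    if device.type != "xpu" or pp_rank % 2 == 0:
        # CUDA/NCCL (where isend is an eager stream enqueue) and even
        # XPU ranks keep the original send-first order.
        do_send()
        do_recv()
    else:
        # Odd XPU ranks receive first, pairing each even sender with an
        # already-posted receiver.
        do_recv()
        do_send()
```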
@mingfeima Could you please review and apply the label?

/tag-and-rerun-ci

@siju-samuel, could you resubmit this PR and let us run the related CI? The device stream sync modification might break the CUDA setup. Something like this: https://github.com/sgl-project/sglang/actions/runs/24879511440/job/72844149700
Motivation
Pipeline parallelism (PP) only ran on CUDA. On Intel XPU, launching any `--pp-size > 1` server crashed at startup with `RuntimeError: Tried to instantiate dummy base class Event` because `SchedulerPPMixin` hard-codes `torch.cuda.{Event, current_stream, synchronize}`. Even after fixing the hard-coded CUDA calls, `PP >= 2` livelocked during the first multi-rank communication: with XCCL on XPU, `torch.distributed.isend` busy-polls waiting for a matching `recv` rendezvous, so when every PP rank sent before receiving, all ranks spun at 100% CPU inside `torch.distributed` and none ever reached its `recv`.

This PR makes PP work on XPU (and generalizes to any non-CUDA backend `torch.get_device_module()` supports) without changing CUDA behavior.

Modifications
Device-agnostic event/stream/sync calls in `python/sglang/srt/managers/scheduler_pp_mixin.py`:

- `torch.cuda.Event()` → `get_device_module().Event()`
- `torch.cuda.current_stream()` → `get_device_module().current_stream()`
- `torch.cuda.synchronize()` → `get_device_module().synchronize()`
- `deque[Tuple[torch.cuda.Event, ...]]` type hints replaced with backend-agnostic forms
- `get_device_module` added to the existing `from sglang.srt.utils import ...` block
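For illustration, the device-agnostic pattern looks roughly like this (a minimal sketch, not the file's literal diff; `torch.get_device_module()` resolves to `torch.cuda` on CUDA and `torch.xpu` on XPU):

```python
import torch

device_module = torch.get_device_module()  # torch.cuda, torch.xpu, ...

# Formerly torch.cuda.Event() / torch.cuda.current_stream() /
# torch.cuda.synchronize(); now dispatched through the generic module.
copy_done = device_module.Event()
copy_done.record(device_module.current_stream())
device_module.synchronize()
```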
Parity-based send/recv ordering in `_pp_send_recv_and_preprocess_output_tensors`:

- Even `pp_rank` ranks: send → recv
- Odd `pp_rank` ranks: recv → send
- Each `isend` therefore always finds a matching `recv` already waiting, and the rendezvous completes instead of busy-spinning.
- `_do_send()` / `_do_recv()` keep the two branches symmetric and avoid duplicating the profiler/copy-stream/d2h-event logic.

No CUDA behavior change: on CUDA, `get_device_module()` returns `torch.cuda`, and parity ordering is a pure reordering of already-independent send and recv operations.
Accuracy Tests

Verified on 4× Intel XPU with `meta-llama/Llama-3.1-8B-Instruct`:

- `TestPPAccuracy.test_logprob` (TP=2 PP=2)
- CUDA behavior unchanged (no codepath difference for the `torch.cuda` backend)

Speed Tests and Profiling
`bench_one_batch_server` full warmup + bench cycle at PP=4 on Intel XPU, Llama-3.1-8B, `batch_size=8`, `input_len=1024`, `output_len=128`:

```
======== Warmup Begin ========
Warmup with batch_size=[8]
#Input tokens: 8192
#Output tokens: 128
batch size: 8
input_len: 1024
output_len: 16
latency: 4.04 s
input throughput: 6569.72 tok/s
output throughput: 45.89 tok/s
======== Warmup End ========
#Input tokens: 8192
#Output tokens: 1024
batch size: 8
input_len: 1024
output_len: 128
latency: 25.09 s
input throughput: 13051.32 tok/s
output throughput: 41.85 tok/s
last_ttft: 0.63 s
last generation throughput: 40.53 tok/s
```
Before this fix the same command hung indefinitely at the first large warmup batch (all 4 ranks at 100% CPU in `torch.distributed`).

Checklist
(Existing `test_pp_single_node` tests cover this path; no new tests added since it's a backend-portability fix.)
`/tag-and-rerun-ci`, `/tag-run-ci-label`, `/rerun-failed-ci`

@mingfeima @Kangyan-Zhou, @iforgetmyname, @Fridge003, @merrymercy, @ispobock, @JustinTong0323, @BBuf, @Edwardf0t1, @HaiShaw, @Ying1123, @ch-wan @hnyls2002 and @kushanam please review and merge this PR.