
Add pipeline parallelism for DeepSeekV2 #6434

Closed
zhjc1124 wants to merge 29 commits into sgl-project:main from zhjc1124:pp_deepseek

Conversation

@zhjc1124 (Contributor) commented May 19, 2025

Motivation

#5724 #5925

Modifications

Checklist

@zhjc1124 (Contributor, Author) commented May 19, 2025

Run test_pp_consistency:

python3 -m unittest test_pp_single_node.TestDeepSeekPPAccuracy.test_pp_consistency
[DS PP Comparison] Baseline: {'accuracy': np.float64(0.83), 'latency': 22.90764766279608, 'output_throughput': 1124.8645159602909} | PP: {'accuracy': np.float64(0.83), 'latency': 20.94851907994598, 'output_throughput': 1228.153641878648}      
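
For reference, a minimal sketch of the kind of check this test performs, reusing the numbers above (illustrative only, not the actual test code):

# Sketch of a PP-consistency check over result dicts shaped like the output
# above; the values are copied from this run, the code is not the real test.
baseline = {"accuracy": 0.83, "latency": 22.91, "output_throughput": 1124.86}
pp = {"accuracy": 0.83, "latency": 20.95, "output_throughput": 1228.15}

# PP must not change model outputs, so accuracy has to match the baseline.
assert abs(baseline["accuracy"] - pp["accuracy"]) < 1e-6, "PP changed accuracy"

# Latency and throughput may differ; report the relative change.
speedup = pp["output_throughput"] / baseline["output_throughput"]
print(f"PP output-throughput ratio vs baseline: {speedup:.2f}x")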

@zhjc1124 (Contributor, Author)

Launch DeepSeek-R1 across three nodes (tp_size=8, pp_size=3):

# node 1
python3 -m sglang.launch_server --model-path /data/modelscope/DeepSeek-R1/ --dist-init-addr 10.0.0.1:5000 --nnodes 3 --trust-remote-code --tp 8 --pp 3 --node-rank 0 --attention-backend=flashinfer
# node 2
python3 -m sglang.launch_server --model-path /data/modelscope/DeepSeek-R1/ --dist-init-addr 10.0.0.1:5000 --nnodes 3 --trust-remote-code  --tp 8 --pp 3 --node-rank 1 --attention-backend=flashinfer
# node 3
python3 -m sglang.launch_server --model-path /data/modelscope/DeepSeek-R1/ --dist-init-addr 10.0.0.1:5000 --nnodes 3 --trust-remote-code  --tp 8 --pp 3 --node-rank 2 --attention-backend=flashinfer
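
A quick sanity check on the topology implied by these commands: with tp=8 and pp=3 the world size is 24 GPUs, so each of the three nodes (assuming 8 GPUs per node) hosts exactly one pipeline stage, sharded TP=8 across its local GPUs:

# Sanity check for the 3-node launch above (assumes 8 GPUs per node).
tp, pp = 8, 3
nnodes, gpus_per_node = 3, 8
# The world size tp * pp must match the total GPU count across nodes.
assert tp * pp == nnodes * gpus_per_node
print(f"{pp} pipeline stages, each a TP-{tp} group on one node")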

Test with bench_serving:

# python3 -m sglang.bench_serving --dataset-path ~/ShareGPT_V3_unfiltered_cleaned_split.json --backend sglang --model /data/modelscope/DeepSeek-R1/ --dataset-name sharegpt --num-prompts 20 --max-concurrency 1
============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    inf
Max request concurrency:                 1
Successful requests:                     20
Benchmark duration (s):                  180.25
Total input tokens:                      8203
Total generated tokens:                  4559
Total generated tokens (retokenized):    4549
Request throughput (req/s):              0.11
Input token throughput (tok/s):          45.51
Output token throughput (tok/s):         25.29
Total token throughput (tok/s):          70.80
Concurrency:                             1.00
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   9011.89
Median E2E Latency (ms):                 8842.04
---------------Time to First Token----------------
Mean TTFT (ms):                          391.64
Median TTFT (ms):                        229.77
P99 TTFT (ms):                           1958.60
---------------Inter-Token Latency----------------
Mean ITL (ms):                           38.03
Median ITL (ms):                         37.42
P95 ITL (ms):                            38.84
P99 ITL (ms):                            40.86
Max ITL (ms):                            234.84
==================================================
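
As a sanity check, the headline numbers above are internally consistent; recomputing them from the raw counts (20 requests, as reported):

# Recompute the benchmark summary above from its raw counts.
duration_s = 180.25
num_requests = 20
input_tokens, output_tokens = 8203, 4559

print(input_tokens / duration_s)   # ~45.51 tok/s input throughput
print(output_tokens / duration_s)  # ~25.29 tok/s output throughput
print(num_requests / duration_s)   # ~0.11 req/s request throughput

# Mean ITL is roughly (mean E2E - mean TTFT) / (mean output length - 1).
mean_e2e_ms, mean_ttft_ms = 9011.89, 391.64
mean_output_len = output_tokens / num_requests
print((mean_e2e_ms - mean_ttft_ms) / (mean_output_len - 1))  # ~38 ms, matches ITL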

@HaiShaw (Collaborator) left a comment

@zhjc1124 does this support PP within 1 node? Any usage example?

@zhjc1124 (Contributor, Author)

> @zhjc1124 does this support PP within 1 node? Any usage example?

Yes.

python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct --attention-backend flashinfer --trust-remote-code --pp-size 2
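
Once the server is up, a quick smoke test can go through the OpenAI-compatible endpoint (assuming the default port 30000):

# Minimal smoke test for the single-node PP launch above, assuming the
# default port 30000 and the OpenAI-compatible chat completions route.
import requests

resp = requests.post(
    "http://localhost:30000/v1/chat/completions",
    json={
        "model": "deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct",
        "messages": [{"role": "user", "content": "Write hello world in C."}],
        "max_tokens": 64,
    },
)
print(resp.json()["choices"][0]["message"]["content"])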

@billishyahao (Contributor)

Tried this patch but hit the issue:

The command is:

python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct --trust-remote-code --pp-size 2

Meanwhile with the following command:

python3 -m sglang.launch_server --model Qwen/Qwen3-30B-A3B --pp 2

The server is up.

@zhjc1124 (Contributor, Author) commented May 23, 2025

> Tried this patch but hit the issue […]. Meanwhile with Qwen/Qwen3-30B-A3B and --pp 2, the server is up.

Sorry about that. I forgot to import Union in deepseek_v2.py when fixing conflicts, and I found other bugs after merging main.
I have fixed them now and the server launches successfully.

@MichoChan commented May 23, 2025

Did you test with tp=2, pp=8 on 8 nodes?
I encountered an error when capturing the CUDA graph:

fused_moe_kernel[grid](
  File "/usr/local/lib/python3.10/dist-packages/triton/runtime/jit.py", line 345, in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/triton/runtime/jit.py", line 691, in run
    kernel.run(grid_0, grid_1, grid_2, stream, kernel.function, kernel.packed_metadata, launch_metadata,
  File "/usr/local/lib/python3.10/dist-packages/triton/backends/nvidia/driver.py", line 365, in __call__
    self.launch(*args, **kwargs)
RuntimeError: Triton Error [CUDA]: an illegal memory access was encountered
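
(A standard first step for narrowing down an illegal memory access like this, not specific to this PR, is to rerun with synchronous kernel launches so the traceback points at the failing kernel, and with CUDA graphs disabled to rule out capture itself:)

CUDA_LAUNCH_BLOCKING=1 python3 -m sglang.launch_server <usual flags> --disable-cuda-graph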

@zhjc1124 zhjc1124 requested review from BBuf and ch-wan as code owners May 23, 2025 15:14
@zhjc1124 (Contributor, Author) commented May 23, 2025

> Did you test with tp=2, pp=8 on 8 nodes? I encountered an error when capturing the CUDA graph: […]

I only have 3 nodes.
I also fail to launch DeepSeek-R1 with tp=2, pp=12 on 3 nodes, but that looks like an OOM problem, because the launch succeeds with --cuda-graph-max-bs 1.
In CudaGraphRunner, cuda-graph-max-bs bounds the size of pp_proxy_tensors:

            # pipeline parallelism
            if self.pp_size > 1:
                self.pp_proxy_tensors = {
                    "hidden_states": torch.zeros(
                        (self.max_bs, self.model_runner.model_config.hidden_size),
                        dtype=torch.bfloat16,
                    ),
                    "residual": torch.zeros(
                        (self.max_bs, self.model_runner.model_config.hidden_size),
                        dtype=torch.bfloat16,
                    ),
                }
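
For scale, a rough estimate of how these two proxy buffers grow with max_bs (assuming hidden_size = 7168 as in DeepSeek-R1/V3, and bf16, i.e. 2 bytes per element):

# Rough size of the pp_proxy_tensors above as a function of max_bs, assuming
# hidden_size = 7168 (DeepSeek-R1/V3) and bf16 (2 bytes per element).
def proxy_tensor_bytes(max_bs: int, hidden_size: int = 7168) -> int:
    return 2 * max_bs * hidden_size * 2  # two buffers: hidden_states, residual

print(proxy_tensor_bytes(1))    # ~28 KiB at --cuda-graph-max-bs 1
print(proxy_tensor_bytes(160))  # ~4.4 MiB at a larger capture batch size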

I also succeeded in launching DeepSeek-Coder-V2-Lite-Instruct with tp=2, pp=12 on 3 nodes.
So could you please test DeepSeek-Coder-V2-Lite-Instruct with tp=2, pp=8 on 8 nodes?
Or test whether DeepSeek-R1 or other large DeepSeek models work with configurations like tp=4, pp=4 or tp=8, pp=2?

@MichoChan

> I only have 3 nodes. […] Could you please test DeepSeek-Coder-V2-Lite-Instruct with tp=2, pp=8 on 8 nodes? Or test whether DeepSeek-R1 or other large DeepSeek models work with tp=4, pp=4 or tp=8, pp=2?

DeepSeek-Coder-V2-Lite-Chat with tp=2, pp=8 on 8 nodes works, but DeepSeek-V3 still errors.

@xiaobochen-amd (Contributor)

I encountered an error while testing DeepSeek-V3 on MI300X with PP=8. The issue can be reproduced as follows:

python3 -m sglang.bench_offline_throughput \
    --model-path /PATH/TO/DeepSeek-V3-0324 \
    --disable-radix-cache \
    --trust-remote-code \
    --pp-size 8 \
    --dataset-name random \
    --random-input-len 16384 \
    --random-output-len 10 \
    --random-range-ratio 1.0 \
    --num-prompts 64

@zhjc1124 (Contributor, Author) commented May 27, 2025

New test cases:

DeepSeek-R1 with tp=4, pp=8 on 4 nodes: success
DeepSeek-R1 with tp=8, pp=4 on 4 nodes: success
DeepSeek-V3-0324 with tp=8, pp=4 on 4 nodes: success
DeepSeek-V3-0324 with tp=4, pp=8 on 4 nodes: success
DeepSeek-R1 with tp=2, pp=16 on 4 nodes: fail
DeepSeek-V3-0324 with tp=2, pp=16 on 4 nodes: fail

@zhjc1124 (Contributor, Author)

I found a bug: the PP partition is unbalanced, which may cause OOM. See #6666.
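
For illustration, a balanced partition spreads the remainder layers across stages instead of piling extra layers onto a few of them; a minimal sketch (not the PR's actual partition code), assuming 61 hidden layers as in DeepSeek-R1/V3:

# Minimal sketch of a balanced PP layer partition (illustrative, not the
# PR's actual code), assuming 61 hidden layers as in DeepSeek-R1/V3.
def partition(num_layers: int, pp_size: int) -> list[int]:
    base, rem = divmod(num_layers, pp_size)
    # Give one extra layer to each of the first `rem` stages.
    return [base + 1 if i < rem else base for i in range(pp_size)]

print(partition(61, 16))  # 13 stages of 4 layers, then 3 stages of 3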

@MichoChan commented Jun 3, 2025

@zhjc1124 tp=2 errors in the fused MoE Triton kernel, so I used --enable-ep-moe and it runs successfully. However, the current pipeline-parallelism implementation does not send hidden states asynchronously, so it is much slower than vLLM's pipeline parallelism, which uses Ray.
Could we use Ray for pipeline parallelism?
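
For context, the overlap being asked about can in principle be expressed with async point-to-point ops, without Ray; a minimal sketch using torch.distributed.isend (an illustration of the idea, not this PR's implementation or vLLM's design):

# Sketch of overlapping the hidden-state send to the next stage with compute
# on the next microbatch, via async P2P (illustrative only).
import torch
import torch.distributed as dist

def stage_step(hidden_states: torch.Tensor, next_rank: int):
    # Start the send to the next pipeline stage without blocking.
    work = dist.isend(hidden_states, dst=next_rank)
    # ... run the next microbatch's forward pass here, overlapped ...
    return work  # caller must work.wait() before reusing the buffer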

@fzyzcjy (Collaborator) commented Oct 16, 2025

Hi, could you please rebase the code?

@zhjc1124 (Contributor, Author) commented Oct 16, 2025

> Hi, could you please rebase the code?

This PR has been included in #8846
Closed.

@zhjc1124 zhjc1124 closed this Oct 16, 2025