
[PD] Support decode pp for PD disaggregation#14265

Merged
ShangmingCai merged 1 commit into main from support_decode_pp on Dec 3, 2025

Conversation

@ShangmingCai
Collaborator

Motivation

  • Support decode PP for PD disaggregation; the decode PP size must either equal the prefill PP size or be 1
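As an illustration of the constraint above, a startup check along these lines would reject unsupported combinations. This is a hedged sketch: the function and argument names are hypothetical, not the actual sglang server arguments.

```python
def check_pp_sizes(prefill_pp_size: int, decode_pp_size: int) -> None:
    """Enforce the constraint from this PR: the decode PP size must be 1
    or exactly equal to the prefill PP size."""
    if decode_pp_size != 1 and decode_pp_size != prefill_pp_size:
        raise ValueError(
            f"Unsupported config: decode pp size ({decode_pp_size}) must be "
            f"1 or equal to prefill pp size ({prefill_pp_size})"
        )

# e.g. check_pp_sizes(2, 2) and check_pp_sizes(2, 1) pass;
# check_pp_sizes(2, 4) raises ValueError.
```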

Signed-off-by: Shangming Cai <csmthu@gmail.com>
@ShangmingCai
Collaborator Author

/tag-and-rerun-ci

@github-actions bot added the run-ci label on Dec 2, 2025
@ShangmingCai
Collaborator Author

/tag-and-rerun-ci

@ShangmingCai
Collaborator Author

Disaggregation tests have passed:
[screenshot: disaggregation CI tests passing]

The remaining failed tests are unrelated to this change.

@ShangmingCai merged commit 93452a8 into main on Dec 3, 2025
151 of 164 checks passed
@ShangmingCai deleted the support_decode_pp branch on December 3, 2025 06:35
tom-jerr pushed a commit to tom-jerr/sglang that referenced this pull request Dec 4, 2025
Signed-off-by: Shangming Cai <csmthu@gmail.com>
yingluosanqian pushed a commit to yingluosanqian/sglang that referenced this pull request Dec 4, 2025
Signed-off-by: Shangming Cai <csmthu@gmail.com>
@nihao1997

nihao1997 commented Dec 5, 2025

LGTM, but I tried it with this command:
prefill

python -m sglang.launch_server \
  --model-path $model_path \
  --model-loader-extra-config '{"enable_multithread_load": true, "num_threads": 32}' \
  --served-model-name $model_name \
  --enable-metrics \
  --trust-remote-code \
  --host $local_ip \
  --port 30000 \
  --dist-init-addr ${master}:5757 \
  --watchdog-timeout 1800 \
  --disaggregation-mode prefill \
  --disaggregation-ib-device $ib_device \
  --load-balance-method round_robin \
  --nnodes $node_num \
  --node-rank $node_rank \
  --tp-size 8 \
  --pp-size 2 \
  --context-length $MML \
  --chunked-prefill-size 16384 \
  --max-prefill-tokens 16384 \
  --page-size 16 \
  --mem-fraction-static 0.80 \
  --max-running-requests 128 \
  --disable-custom-all-reduce \
  --tokenizer-worker-num 4 \
  --disable-cuda-graph \
  --disable-radix-cache \
  --tool-call-parser kimi_k2

decode

python -m sglang.launch_server \
  --model-path $model_path \
  --model-loader-extra-config '{"enable_multithread_load": true, "num_threads": 32}' \
  --served-model-name $model_name \
  --enable-metrics \
  --enable-metrics-for-all-schedulers \
  --collect-tokens-histogram \
  --trust-remote-code \
  --host $local_ip \
  --port 30000 \
  --dist-init-addr ${master}:5757 \
  --watchdog-timeout 1800 \
  --disaggregation-mode decode \
  --disaggregation-ib-device $ib_device \
  --prefill-round-robin-balance \
  --decode-log-interval 10 \
  --nnodes $node_num \
  --node-rank $node_rank \
  --pp-size 2 \
  --tp-size 8 \
  --load-balance-method shortest_queue \
  --context-length $MML \
  --page-size 16 \
  --mem-fraction-static 0.92 \
  --max-running-requests $((node_num * 8 * max_bs)) \
  --tokenizer-worker-num 4  \
  --cuda-graph-max-bs $max_bs \
  --moe-dense-tp-size 1 \
  --enable-dp-lm-head \
  --tool-call-parser kimi_k2 

and got the following errors:
[2025-12-05 03:44:56 PP0 TP0] Scheduler hit an exception: Traceback (most recent call last):
File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 2693, in run_scheduler_process
scheduler.event_loop_normal_disagg_decode()
File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 120, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/sgl-workspace/sglang/python/sglang/srt/disaggregation/decode.py", line 803, in event_loop_normal_disagg_decode
self.process_batch_result(batch, result)
File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 2100, in process_batch_result
self.process_batch_result_decode(batch, result)
File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler_output_processor_mixin.py", line 329, in process_batch_result_decode
next_token_ids = next_token_ids.tolist()
^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'NoneType' object has no attribute 'tolist'

[2025-12-05 03:44:56 PP0 TP3] Scheduler hit an exception: Traceback (most recent call last):
File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 2693, in run_scheduler_process
scheduler.event_loop_normal_disagg_decode()
File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 120, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/sgl-workspace/sglang/python/sglang/srt/disaggregation/decode.py", line 803, in event_loop_normal_disagg_decode
self.process_batch_result(batch, result)
File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 2100, in process_batch_result
self.process_batch_result_decode(batch, result)
File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler_output_processor_mixin.py", line 329, in process_batch_result_decode
next_token_ids = next_token_ids.tolist()
^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'NoneType' object has no attribute 'tolist'

The same errors occur with PP2 DP8.
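The AttributeError above is consistent with a pipeline-parallel scheduler in which only the last PP stage actually samples tokens: on earlier PP ranks the batch result carries next_token_ids as None, so calling .tolist() on it unconditionally crashes. A minimal sketch of the guard such a loop would need (hypothetical names, not the actual sglang code):

```python
from typing import List, Optional

def postprocess_decode_tokens(
    next_token_ids: Optional[List[int]], is_last_pp_rank: bool
) -> List[int]:
    """Only the last PP rank holds sampled token ids; earlier ranks must
    skip token post-processing instead of dereferencing None."""
    if not is_last_pp_rank or next_token_ids is None:
        return []  # nothing to emit on non-last pipeline stages
    return list(next_token_ids)
```

The point of the sketch is only that a PP-aware decode event loop has to branch on the pipeline rank before touching the sampled ids, which matches the maintainer's reply below that the disaggregated-decode scheduler loop had not yet landed on main.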

tonyluj pushed a commit to openanolis/sglang that referenced this pull request Dec 5, 2025
Signed-off-by: Shangming Cai <csmthu@gmail.com>
tonyluj pushed a commit to openanolis/sglang that referenced this pull request Dec 5, 2025
Signed-off-by: Shangming Cai <csmthu@gmail.com>
@maoqiuli

maoqiuli commented Dec 5, 2025

(quoted nihao1997's prefill/decode commands and traceback above)

I also encountered the same problem with a PP2 TP4 decode instance.

@ShangmingCai
Collaborator Author

@maoqiuli @nihao1997 Sorry for the confusion: the scheduler loop for disaggregated decode hasn't been merged into main yet. You can check this branch for a preview: openanolis#13

@maoqiuli

maoqiuli commented Dec 5, 2025

@ShangmingCai Thank you very much! I will try it out.

yuchengz816-bot pushed a commit to yuchengz816-bot/sglang that referenced this pull request Dec 8, 2025
Signed-off-by: Shangming Cai <csmthu@gmail.com>
Kevin-XiongC pushed a commit to novitalabs/sglang that referenced this pull request Dec 9, 2025
Signed-off-by: Shangming Cai <csmthu@gmail.com>
