[Bug] Pipeline-Parallelism bugs with chunked prefill #13084

@hnyls2002

Description

Describe the bug

When chunked prefill is enabled together with pipeline parallelism, the prefill event loop appears to handle the chunked-prefill logic incorrectly. With --chunked-prefill-size 16, a single prefill request with input length 19 should be split into two chunks (16 + 3). In the logs below, however, the 16-token chunk is scheduled twice on each PP rank before the final 3-token chunk runs, and the scheduler then fails the KV-cache memory-leak check.
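For reference, a minimal sketch of the expected chunk schedule (illustrative only; the chunk sizes follow from the --chunked-prefill-size argument, this is not SGLang's actual scheduler code):

```python
def expected_chunks(input_len: int, chunk_size: int) -> list[int]:
    """Split a prefill of input_len tokens into chunks of at most chunk_size."""
    chunks = []
    remaining = input_len
    while remaining > 0:
        step = min(chunk_size, remaining)
        chunks.append(step)
        remaining -= step
    return chunks

# For the request in the logs below:
print(expected_chunks(19, 16))  # [16, 3] -- but the logs show 16, 16, 3 per rank
```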

[2025-11-11 15:01:30 PP0] Prefill batch, #new-seq: 1, #new-token: 16, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0, #prealloc-req: 0, #inflight-req: 0, input throughput (token/s): 0.05,
[Get New Batch Prefill]
[2025-11-11 15:01:30 PP1] Prefill batch, #new-seq: 1, #new-token: 16, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0, #prealloc-req: 0, #inflight-req: 0, input throughput (token/s): 0.05,
[Get New Batch Prefill]
[2025-11-11 15:01:30 PP0] Prefill batch, #new-seq: 1, #new-token: 16, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0, #prealloc-req: 0, #inflight-req: 0, input throughput (token/s): 53.46,
[Get New Batch Prefill]
[2025-11-11 15:01:30 PP1] Prefill batch, #new-seq: 1, #new-token: 16, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0, #prealloc-req: 0, #inflight-req: 0, input throughput (token/s): 917.19,
[Get New Batch Prefill]
[2025-11-11 15:01:30 PP0] Prefill batch, #new-seq: 1, #new-token: 3, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0, #prealloc-req: 0, #inflight-req: 0, input throughput (token/s): 425.50,
[Get New Batch Prefill]
[2025-11-11 15:01:30 PP1] Prefill batch, #new-seq: 1, #new-token: 3, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0, #prealloc-req: 0, #inflight-req: 0, input throughput (token/s): 167.84,
[Release] req.kv_committed_len=19, req.kv_allocated_len=19
[Cache finished]: committed_kv_len=19
[2025-11-11 15:01:30] INFO:     127.0.0.1:52690 - "POST /generate HTTP/1.1" 200 OK
[2025-11-11 15:01:30 PP0] Scheduler hit an exception: Traceback (most recent call last):
  File "/host_home/common_sync/sglang/python/sglang/srt/managers/scheduler.py", line 2711, in run_scheduler_process
    scheduler.event_loop_pp_disagg_prefill()
  File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 120, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/host_home/common_sync/sglang/python/sglang/srt/managers/scheduler_pp_mixin.py", line 347, in event_loop_pp_disagg_prefill
    self.check_memory()
  File "/host_home/common_sync/sglang/python/sglang/srt/managers/scheduler_runtime_checker_mixin.py", line 153, in check_memory
    raise ValueError(msg)
ValueError: token_to_kv_pool_allocator memory leak detected! self.max_total_num_tokens=1965607, available_size=1965568, evictable_size=23, protected_size=0
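Assuming check_memory verifies that available_size + evictable_size + protected_size adds back up to max_total_num_tokens (the field names below are taken from the error message, not from the source), the deficit is exactly one chunk's worth of KV slots, which is consistent with the duplicated 16-token chunk above:

```python
# Numbers copied from the ValueError above; the invariant itself is an
# assumption reconstructed from the error message.
max_total_num_tokens = 1965607
available_size = 1965568
evictable_size = 23
protected_size = 0

leaked = max_total_num_tokens - (available_size + evictable_size + protected_size)
print(leaked)  # 16 == --chunked-prefill-size, i.e. one leaked chunk
```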

Reproduction

Prefill

python3 -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct --trust-remote-code --disaggregation-mode prefill --pp-size 2 --disable-overlap-schedule --chunked-prefill-size 16 --disaggregation-transfer-backend nixl --host 127.0.0.1 --port 21100

Decode

python3 -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct --trust-remote-code --disaggregation-mode decode --tp 2 --base-gpu-id 4 --disaggregation-transfer-backend nixl --host 127.0.0.1 --port 21200

Router

python3 -m sglang_router.launch_router --pd-disaggregation --mini-lb --prefill http://127.0.0.1:21100 --decode http://127.0.0.1:21200 --host 127.0.0.1 --port 21000

Client

python -m sglang.test.send_one --port 21000

The bug is not deterministic; run the client command several times to reproduce it.
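Since the failure is intermittent, a small driver loop makes reproduction easier (an illustrative sketch that just re-runs the client above; the crash surfaces in the prefill server's log, not in the client output):

```python
import subprocess
import sys

# Re-run the one-shot client repeatedly; watch the prefill server log
# for the "memory leak detected" ValueError.
for attempt in range(20):
    subprocess.run(
        [sys.executable, "-m", "sglang.test.send_one", "--port", "21000"],
        check=False,
    )
```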
