When chunked prefill is enabled together with pipeline parallelism, the prefill event loop appears to handle chunked-prefill logic incorrectly. In the logs below, a single prefill request (input length = 19) is processed multiple times as if it were separate chunked-prefill batches: the 16-token chunk is prefilled twice on each PP stage before the final 3-token chunk runs.
[2025-11-11 15:01:30 PP0] Prefill batch, #new-seq: 1, #new-token: 16, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0, #prealloc-req: 0, #inflight-req: 0, input throughput (token/s): 0.05,
[Get New Batch Prefill]
[2025-11-11 15:01:30 PP1] Prefill batch, #new-seq: 1, #new-token: 16, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0, #prealloc-req: 0, #inflight-req: 0, input throughput (token/s): 0.05,
[Get New Batch Prefill]
[2025-11-11 15:01:30 PP0] Prefill batch, #new-seq: 1, #new-token: 16, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0, #prealloc-req: 0, #inflight-req: 0, input throughput (token/s): 53.46,
[Get New Batch Prefill]
[2025-11-11 15:01:30 PP1] Prefill batch, #new-seq: 1, #new-token: 16, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0, #prealloc-req: 0, #inflight-req: 0, input throughput (token/s): 917.19,
[Get New Batch Prefill]
[2025-11-11 15:01:30 PP0] Prefill batch, #new-seq: 1, #new-token: 3, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0, #prealloc-req: 0, #inflight-req: 0, input throughput (token/s): 425.50,
[Get New Batch Prefill]
[2025-11-11 15:01:30 PP1] Prefill batch, #new-seq: 1, #new-token: 3, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0, #prealloc-req: 0, #inflight-req: 0, input throughput (token/s): 167.84,
[Release] req.kv_committed_len=19, req.kv_allocated_len=19
[Cache finished]: committed_kv_len=19
[2025-11-11 15:01:30] INFO: 127.0.0.1:52690 - "POST /generate HTTP/1.1" 200 OK
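For reference, this is how the 19-token request should be chunked with a chunk size of 16 (the chunk size is inferred from the logs; the helper below is illustrative, not SGLang code). Each chunk should be prefilled exactly once per PP stage, yet the logs above show the 16-token chunk scheduled twice on both PP0 and PP1.

```python
def split_into_chunks(input_len: int, chunk_size: int) -> list[int]:
    """Split a prefill request into chunked-prefill batch sizes."""
    chunks = []
    remaining = input_len
    while remaining > 0:
        chunks.append(min(chunk_size, remaining))
        remaining -= chunks[-1]
    return chunks

# A 19-token request with chunk size 16 should yield two batches: [16, 3].
print(split_into_chunks(19, 16))  # [16, 3]
```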
[2025-11-11 15:01:30 PP0] Scheduler hit an exception: Traceback (most recent call last):
File "/host_home/common_sync/sglang/python/sglang/srt/managers/scheduler.py", line 2711, in run_scheduler_process
scheduler.event_loop_pp_disagg_prefill()
File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 120, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/host_home/common_sync/sglang/python/sglang/srt/managers/scheduler_pp_mixin.py", line 347, in event_loop_pp_disagg_prefill
self.check_memory()
File "/host_home/common_sync/sglang/python/sglang/srt/managers/scheduler_runtime_checker_mixin.py", line 153, in check_memory
raise ValueError(msg)
ValueError: token_to_kv_pool_allocator memory leak detected! self.max_total_num_tokens=1965607, available_size=1965568, evictable_size=23, protected_size=0
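The numbers in the error message account for the leak exactly. A minimal sketch of the bookkeeping, with variable names taken from the message (this is not the actual check_memory implementation, just the invariant it appears to enforce):

```python
# Values copied from the ValueError above.
max_total_num_tokens = 1965607
available_size = 1965568
evictable_size = 23
protected_size = 0

# When the scheduler is idle, every KV slot should be free, evictable
# (radix-cached), or protected; any remainder is unaccounted for.
leaked = max_total_num_tokens - (available_size + evictable_size + protected_size)
print(leaked)  # 16
```

The 16 leaked tokens match the size of the 16-token chunk that was scheduled twice, which supports the hypothesis that the duplicated chunked-prefill batch allocates KV slots that are never released.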
Reproduction
Prefill
Decode
Router
Client
The bug does not happen deterministically; you may need to run the above commands multiple times to reproduce it.