[fix] Limit proxy in-flight requests to prevent PD buffer deadlock #2957
deng451e merged 8 commits into LMCache:dev from
Conversation
Signed-off-by: deng451e <838677410@qq.com>
Code Review
This pull request introduces a WeightedSemaphore to the disaggregated proxy server to manage decoder PD buffer usage, alongside updates to launch scripts for buffer configuration. The review identifies several necessary improvements to the semaphore implementation: adding capacity checks to prevent indefinite hangs, using try...finally blocks to ensure slots are released during exceptions or client disconnects, and properly initializing async primitives within the event loop to avoid runtime errors.
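To make the reviewed fixes concrete, here is a minimal sketch of what such a WeightedSemaphore could look like with those points applied: an up-front capacity check, and an asyncio.Condition created lazily inside the running event loop. The names and structure are illustrative, not the PR's exact code.

```python
import asyncio


class WeightedSemaphore:
    """Sketch of a weighted semaphore capping in-flight PD buffer slots.

    Illustrative only: the condition is bound lazily inside the running
    event loop, and acquire() rejects requests larger than total capacity.
    """

    def __init__(self, capacity: int):
        self._capacity = capacity
        self._available = capacity
        self._cond: asyncio.Condition | None = None  # created inside the loop

    def _condition(self) -> asyncio.Condition:
        # Lazily create the condition so it is bound to the running loop,
        # avoiding "attached to a different loop" runtime errors.
        if self._cond is None:
            self._cond = asyncio.Condition()
        return self._cond

    @property
    def available(self) -> int:
        """Number of slots currently available."""
        return self._available

    async def acquire(self, slots: int) -> None:
        if slots > self._capacity:
            raise ValueError(
                f"Requested {slots} slots exceeds total capacity {self._capacity}"
            )
        cond = self._condition()
        async with cond:
            await cond.wait_for(lambda: self._available >= slots)
            self._available -= slots

    async def release(self, slots: int) -> None:
        cond = self._condition()
        async with cond:
            self._available += slots
            cond.notify_all()
```

Callers would pair `acquire` with `release` in a `try`/`finally` so slots are returned even when the request fails or is cancelled.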
sammshen
left a comment
either use a paged memory allocator and use the number of pages as the slots, or use // 2 if not paged (since we can't fragment for more than 50%)
KuntaiDu
left a comment
Can we use in-flight tokens as the metric? I feel like it is in general hard for users to find KV_BYTES_PER_TOKEN for a given model on the internet.
Signed-off-by: deng451e <838677410@qq.com>
Signed-off-by: deng451e <838677410@qq.com>
```python
@property
def available(self) -> int:
    """Number of slots currently available."""
    return self._available
```
New feature has zero corresponding tests
Medium Severity
The new WeightedSemaphore class and compute_kv_bytes_per_token function implement non-trivial concurrency and model-config logic but have zero tests. Per project rules, new features and bug fixes must include corresponding tests. The WeightedSemaphore acquire/release logic and the model-config parsing in compute_kv_bytes_per_token are both independently testable and critical to correctness.
Additional Locations (1)
Triggered by project rule: LMCache Code Review Style Guide
Reviewed by Cursor Bugbot for commit b466121.
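As an illustration of the kind of test the review asks for, here is a hedged sketch of a concurrency test for the acquire/release logic. `_StubWeightedSemaphore` is a minimal stand-in defined inline so the sketch is runnable; the real class lives in the PR's disagg_proxy_server.py and may differ.

```python
import asyncio


class _StubWeightedSemaphore:
    """Minimal stand-in for the PR's WeightedSemaphore, for illustration."""

    def __init__(self, capacity: int):
        self._available = capacity
        self._cond = asyncio.Condition()

    async def acquire(self, slots: int) -> None:
        async with self._cond:
            await self._cond.wait_for(lambda: self._available >= slots)
            self._available -= slots

    async def release(self, slots: int) -> None:
        async with self._cond:
            self._available += slots
            self._cond.notify_all()


async def test_waiter_unblocks_on_release():
    sem = _StubWeightedSemaphore(4)
    await sem.acquire(3)
    order = []

    async def waiter():
        await sem.acquire(2)  # needs 2 slots, only 1 free -> must wait
        order.append("acquired")

    task = asyncio.create_task(waiter())
    await asyncio.sleep(0.01)
    assert order == []        # waiter is still blocked
    await sem.release(3)      # free enough slots
    await task
    assert order == ["acquired"]


asyncio.run(test_waiter_unblocks_on_release())
```

A companion test for compute_kv_bytes_per_token could compare its output against hand-computed values for a known model config.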
```python
except Exception as e:
    if pd_buffer_semaphore is not None and acquired:
        await pd_buffer_semaphore.release(slots)
```
Task cancellation bypasses semaphore release in outer handler
Medium Severity
The semaphore release in the outer error handler uses except Exception, which does not catch asyncio.CancelledError (a BaseException since Python 3.8). If the task is cancelled during the await send_request_to_service call to the prefill service (between acquire and return StreamingResponse), the CancelledError propagates without releasing the acquired slots. This is a separate leak path from the generator closure issue — it affects the outer handler function, not the inner generator. Permanent slot leaks eventually exhaust the semaphore and block all requests.
Additional Locations (1)
Reviewed by Cursor Bugbot for commit b466121.
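The cancellation-leak pattern the review describes can be reproduced in isolation. This is a minimal sketch, not the PR's code: the dict stands in for semaphore state, and `asyncio.sleep` stands in for the `await send_request_to_service(...)` call that may be cancelled.

```python
import asyncio


async def handler_leaky(state):
    # Reviewed pattern: `except Exception` does NOT catch
    # asyncio.CancelledError (a BaseException since Python 3.8),
    # so cancellation during the await leaks the acquired slots.
    state["available"] -= 2  # stand-in for semaphore acquire
    try:
        await asyncio.sleep(10)  # stand-in for send_request_to_service(...)
    except Exception:
        state["available"] += 2  # never runs on cancellation


async def handler_safe(state):
    # try/finally (or `except BaseException`) releases on cancellation too.
    state["available"] -= 2
    try:
        await asyncio.sleep(10)
    finally:
        state["available"] += 2


async def main():
    for handler, expected in [(handler_leaky, 2), (handler_safe, 4)]:
        state = {"available": 4}
        task = asyncio.create_task(handler(state))
        await asyncio.sleep(0.01)
        task.cancel()
        try:
            await task
        except asyncio.CancelledError:
            pass
        assert state["available"] == expected


asyncio.run(main())
```

The leaky variant ends with 2 of 4 slots permanently gone after a single cancelled request, which is the exhaustion path the report warns about.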
```python
global pd_buffer_semaphore
kv_bytes_per_token = compute_kv_bytes_per_token(global_args.model)
capacity_slots = global_args.pd_buffer_size // (
```
directly extract from PagedTensorMemoryAllocator capacity?
pd_buffer_size is set in the LMCache config file; we can also pass it as an input parameter to the proxy script
Cursor Bugbot has reviewed your changes and found 2 potential issues.
There are 4 total unresolved issues (including 2 from previous reviews).
Reviewed by Cursor Bugbot for commit 1888a96.
```python
    await wait_decode_kv_ready(req_id, num_tp_rank)
finally:
    if pd_buffer_semaphore is not None:
        await pd_buffer_semaphore.release(slots)
```
Semaphore slots leak if generator closes before try/finally
High Severity
The try/finally block that releases semaphore slots is placed after the initial yield statements in generate_stream(). If the async generator is closed at any yield before the try block is entered (e.g., client disconnect, cancellation), the finally clause never executes and slots are permanently leaked. The outer except handler also cannot release because return StreamingResponse(...) already succeeded. This is especially impactful in the chat completions handler, which has two unprotected yields before the try/finally. Over time under high concurrency, leaked slots silently reduce semaphore capacity until the proxy deadlocks — re-creating the exact problem this fix aims to prevent.
Additional Locations (1)
Reviewed by Cursor Bugbot for commit 1888a96.
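The generator-closure leak is easy to demonstrate in isolation. This sketch is illustrative, not the PR's code: the list append stands in for `pd_buffer_semaphore.release`, and closing the generator after its first yield stands in for a client disconnect.

```python
import asyncio

released = []


async def generate_stream():
    # Two unprotected yields before the try/finally, mirroring the
    # chat-completions handler described in the report.
    yield "header-1"
    yield "header-2"
    try:
        yield "body"
    finally:
        released.append("slots")  # stand-in for pd_buffer_semaphore.release


async def main():
    gen = generate_stream()
    await gen.__anext__()   # consume only the first yield...
    await gen.aclose()      # ...then the client disconnects
    # GeneratorExit was raised at a yield OUTSIDE the try block,
    # so the finally clause never ran: the slots are leaked.
    assert released == []


asyncio.run(main())
```

One way to avoid this is to wrap the entire body of the generator, from its very first yield, in the try/finally that releases the slots, so closure at any suspension point still triggers the release.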
```python
if slots > self._capacity:
    raise ValueError(
        f"Requested {slots} slots exceeds total capacity {self._capacity}"
    )
```
Long prompts always crash instead of waiting for capacity
High Severity
WeightedSemaphore.acquire raises ValueError when slots > capacity, causing any request whose prompt needs more chunks than capacity_slots to unconditionally fail. For Llama-3.1-8B with default 2GB buffer and chunk_size 256, any prompt longer than ~16K tokens will crash — well within the model's 128K context window. The existing WeightedSemaphore in storage_manager.py handles this correctly by waiting for exclusive access when a request exceeds the concurrent budget but fits the total capacity. The proxy version lacks this oversized-request path, turning a previously functional (if deadlock-prone) scenario into a hard failure.
Additional Locations (1)
Reviewed by Cursor Bugbot for commit 1888a96.
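One possible oversized-request path, in the spirit of what the review attributes to the storage_manager.py semaphore, is to clamp the request to full capacity and wait for exclusive access instead of raising. The sketch below is illustrative; the actual storage_manager.py implementation may differ.

```python
import asyncio


class WeightedSemaphore:
    """Sketch with an oversized-request path: a request needing more than
    `capacity` slots waits for the whole buffer instead of raising."""

    def __init__(self, capacity: int):
        self._capacity = capacity
        self._available = capacity
        self._cond = asyncio.Condition()

    @property
    def available(self) -> int:
        return self._available

    async def acquire(self, slots: int) -> int:
        # Clamp oversized requests to full capacity: they wait until they
        # have the buffer to themselves, then stream their chunks through it.
        granted = min(slots, self._capacity)
        async with self._cond:
            await self._cond.wait_for(lambda: self._available >= granted)
            self._available -= granted
        return granted  # caller must release exactly this many

    async def release(self, slots: int) -> None:
        async with self._cond:
            self._available += slots
            self._cond.notify_all()


async def demo():
    sem = WeightedSemaphore(4)
    granted = await sem.acquire(10)  # oversized: waits for exclusive access
    assert granted == 4
    await sem.release(granted)
    assert sem.available == 4


asyncio.run(demo())
```

With this shape, a 128K-token prompt on a 64-slot semaphore queues behind in-flight requests rather than failing with a ValueError.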
…MCache#2957)

* add concurrency limit for pd
Signed-off-by: deng451e <838677410@qq.com>
* fix
Signed-off-by: deng451e <838677410@qq.com>
* change input type
Signed-off-by: deng451e <838677410@qq.com>
* update input flag
---------
Signed-off-by: deng451e <838677410@qq.com>


Problem
Under high concurrency, the decoder PD buffer can fill up with partial KV chunks from multiple in-flight requests. Since the proxy only dispatches a decode request after receiving the full KV completion signal, a deadlock forms: the prefiller cannot send its remaining chunks (the decoder buffer is full), and the decoder never starts (the proxy never signals ready).
Temporary fix for in-process mode PD
Add a WeightedSemaphore to the proxy that caps total in-flight PD buffer usage. Each request acquires ⌈L / chunk_size⌉ slots before prefill and releases them after wait_decode_kv_ready, i.e., once the proxy forwards the request to the decoder after the prefiller has sent all KV.
capacity_slots = pd_buffer_size_bytes // (chunk_size × kv_bytes_per_token)
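The capacity formula can be worked through for a concrete model. The per-token KV footprint below uses the standard formula (K and V per layer, per KV head, per head dim); the specific config values are illustrative of what compute_kv_bytes_per_token would derive from a Llama-3.1-8B-style HuggingFace config, not a guaranteed match for its exact output.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int,
                       head_dim: int, dtype_bytes: int = 2) -> int:
    # Standard KV-cache footprint: K and V (factor 2) per layer, per token.
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


# Llama-3.1-8B-style config: 32 layers, 8 KV heads, head_dim 128, fp16.
per_token = kv_bytes_per_token(32, 8, 128)       # 131072 B = 128 KiB

pd_buffer_size = 2 * 1024**3                     # default 2 GiB buffer
chunk_size = 256
capacity_slots = pd_buffer_size // (chunk_size * per_token)
print(capacity_slots)                            # 64
# 64 slots * 256 tokens/chunk = 16384 tokens of in-flight KV,
# matching the ~16K-token oversized-request threshold noted in the review.
```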
New proxy args:
Note
Medium Risk
Introduces new concurrency limiting in the disaggregated prefill proxy that can affect request scheduling/throughput and relies on model config-derived sizing; misconfiguration could cause unexpected blocking or reduced concurrency.
Overview
Prevents PD-buffer deadlocks under high concurrency by adding a weighted in-flight limiter in disagg_proxy_server.py: each request acquires ceil(prompt_tokens / chunk_size) slots before prefill and releases them once the decoder signals KV readiness.
Adds proxy CLI flags --model, --pd-buffer-size, and --chunk-size, and derives semaphore capacity from the HuggingFace model config (compute_kv_bytes_per_token) to approximate KV bytes per token.
Updates the 1p1d example scripts to accept a model and PD buffer size, pass them through to the proxy/launchers, and removes --disable-log-requests from the vLLM launcher invocations.
Reviewed by Cursor Bugbot for commit 1888a96.