
[fix] Limit proxy in-flight requests to prevent PD buffer deadlock#2957

Merged
deng451e merged 8 commits into LMCache:dev from deng451e:fix_proxy_inflight
Apr 11, 2026

Conversation

Collaborator

@deng451e deng451e commented Apr 5, 2026

Problem

Under high concurrency, the decoder PD buffer can fill up with partial KV chunks from multiple in-flight requests. Since the proxy only dispatches a decode request after receiving the full KV completion signal, a deadlock forms: the prefiller cannot send remaining chunks (decoder buffer full), and the decoder never starts (proxy never signals ready).

Temporary fix for in-process mode PD

Add a WeightedSemaphore to the proxy that caps total in-flight PD buffer usage. Each request acquires ⌈L / chunk_size⌉ slots before prefill and releases them after wait_decode_kv_ready, i.e. when the proxy forwards the request to the decoder once the prefiller has sent all KV.

capacity_slots = pd_buffer_size_bytes // (chunk_size × kv_bytes_per_token)
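For concreteness, the slot math can be sketched as plain functions (a sketch only; the function names are illustrative, not the PR's actual helpers). With Llama-3.1-8B-style dimensions (32 layers, 8 KV heads, head dim 128, fp16), KV bytes per token comes to 128 KiB, so a 2 GiB buffer with chunk_size 256 yields 64 slots.

```python
import math

def capacity_slots(pd_buffer_size_bytes: int, chunk_size: int,
                   kv_bytes_per_token: int) -> int:
    """Total chunk-sized slots the decoder PD buffer can hold."""
    return pd_buffer_size_bytes // (chunk_size * kv_bytes_per_token)

def slots_for_prompt(prompt_tokens: int, chunk_size: int) -> int:
    """Slots one request reserves: one per KV chunk, rounded up."""
    return math.ceil(prompt_tokens / chunk_size)

# Example: 32 layers * 8 KV heads * 128 head_dim * 2 (K and V)
# * 2 bytes (fp16) = 131072 KV bytes per token; a 2 GiB buffer
# with chunk_size 256 then holds 64 slots.
```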

New proxy args:

  • --pd-buffer-size: decoder PD buffer bytes (matches pd_buffer_size in the LMCache config)

Note

Medium Risk
Introduces new concurrency limiting in the disaggregated prefill proxy that can affect request scheduling/throughput and relies on model-config-derived sizing; misconfiguration could cause unexpected blocking or reduced concurrency.

Overview
Prevents PD-buffer deadlocks under high concurrency by adding a weighted in-flight limiter in disagg_proxy_server.py: each request acquires ceil(prompt_tokens / chunk_size) slots before prefill and releases them once the decoder signals KV readiness.

Adds proxy CLI flags --model, --pd-buffer-size, and --chunk-size and derives semaphore capacity from HuggingFace model config (compute_kv_bytes_per_token) to approximate KV bytes per token.
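The per-token KV sizing such a compute_kv_bytes_per_token helper needs can be sketched as follows (a hedged approximation; field names follow common HuggingFace config conventions, GQA models report the KV head count separately, and fp16/bf16 KV is assumed):

```python
def kv_bytes_per_token(num_hidden_layers: int, num_kv_heads: int,
                       head_dim: int, dtype_bytes: int = 2) -> int:
    # Each layer caches K and V: num_kv_heads * head_dim values apiece.
    return num_hidden_layers * num_kv_heads * head_dim * 2 * dtype_bytes
```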

Updates the 1p1d example scripts to accept a model and PD buffer size, pass them through to the proxy/launchers, and removes --disable-log-requests from the vLLM launcher invocations.
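The acquire/release flow described above might look roughly like this (a self-contained sketch, not the PR's code; SlotSemaphore and the service helpers are illustrative stand-ins):

```python
import asyncio
import math

class SlotSemaphore:
    """Minimal weighted-semaphore stand-in (illustrative only)."""
    def __init__(self, capacity: int):
        self._available = capacity
        self._cond = asyncio.Condition()

    async def acquire(self, n: int) -> None:
        async with self._cond:
            await self._cond.wait_for(lambda: self._available >= n)
            self._available -= n

    async def release(self, n: int) -> None:
        async with self._cond:
            self._available += n
            self._cond.notify_all()

# Hypothetical stand-ins for the proxy's prefill/decode helpers.
async def send_request_to_service(service: str, req: dict) -> str:
    return f"{service}:{req['id']}"

async def wait_decode_kv_ready(req_id: str) -> None:
    pass  # returns once the decoder reports all KV chunks received

async def handle_request(req: dict, sem: SlotSemaphore, chunk_size: int) -> str:
    slots = math.ceil(req["prompt_tokens"] / chunk_size)
    await sem.acquire(slots)            # block until the PD buffer has room
    try:
        await send_request_to_service("prefill", req)
        await wait_decode_kv_ready(req["id"])
    finally:
        await sem.release(slots)        # release even on errors/cancellation
    return await send_request_to_service("decode", req)
```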

Reviewed by Cursor Bugbot for commit 1888a96.

Signed-off-by: deng451e <838677410@qq.com>
@deng451e deng451e requested a review from sammshen April 5, 2026 05:50
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces a WeightedSemaphore to the disaggregated proxy server to manage decoder PD buffer usage, alongside updates to launch scripts for buffer configuration. The review identifies several necessary improvements to the semaphore implementation: adding capacity checks to prevent indefinite hangs, using try...finally blocks to ensure slots are released during exceptions or client disconnects, and properly initializing async primitives within the event loop to avoid runtime errors.

Contributor

@ApostaC ApostaC left a comment


LGTM!

Contributor

@sammshen sammshen left a comment


LGTM!

Contributor

@sammshen sammshen left a comment


Either use the paged memory allocator and use the number of pages as the slots, or use // 2 if not paged (since we can't fragment more than 50%).

Contributor

@KuntaiDu KuntaiDu left a comment


Can we use in-flight tokens as the metric? I feel like it is in general hard for users to find KV_BYTES_PER_TOKEN for a given model on the internet.

@sammshen
Contributor

sammshen commented Apr 8, 2026

  1. the weighted semaphore should be by slots
  2. the user should never have to pass it in

@property
def available(self) -> int:
    """Number of slots currently available."""
    return self._available


New feature has zero corresponding tests

Medium Severity

The new WeightedSemaphore class and compute_kv_bytes_per_token function implement non-trivial concurrency and model-config logic but have zero tests. Per project rules, new features and bug fixes must include corresponding tests. The WeightedSemaphore acquire/release logic and the model-config parsing in compute_kv_bytes_per_token are both independently testable and critical to correctness.


Triggered by project rule: LMCache Code Review Style Guide

Reviewed by Cursor Bugbot for commit b466121.


except Exception as e:
    if pd_buffer_semaphore is not None and acquired:
        await pd_buffer_semaphore.release(slots)


Task cancellation bypasses semaphore release in outer handler

Medium Severity

The semaphore release in the outer error handler uses except Exception, which does not catch asyncio.CancelledError (a BaseException since Python 3.8). If the task is cancelled during the await send_request_to_service call to the prefill service (between acquire and return StreamingResponse), the CancelledError propagates without releasing the acquired slots. This is a separate leak path from the generator closure issue — it affects the outer handler function, not the inner generator. Permanent slot leaks eventually exhaust the semaphore and block all requests.


Reviewed by Cursor Bugbot for commit b466121.
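The underlying Python behavior: asyncio.CancelledError subclasses BaseException (since Python 3.8), so an `except Exception` cleanup path never fires on cancellation, while `finally` always does. A self-contained illustration (not the PR's code):

```python
import asyncio

async def release_in_except(log: list) -> None:
    try:
        await asyncio.sleep(10)
    except Exception:            # CancelledError is NOT an Exception
        log.append("released")
        raise

async def release_in_finally(log: list) -> None:
    try:
        await asyncio.sleep(10)
    finally:                     # runs on cancellation too
        log.append("released")

async def cancel_and_collect(factory) -> list:
    log: list = []
    task = asyncio.ensure_future(factory(log))
    await asyncio.sleep(0)       # let the task reach its await
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        pass
    return log
```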


global pd_buffer_semaphore
kv_bytes_per_token = compute_kv_bytes_per_token(global_args.model)
capacity_slots = global_args.pd_buffer_size // (
Contributor


directly extract from PagedTensorMemoryAllocator capacity?

Collaborator Author


pd_buffer_size is set in the LMCache config file; we can also pass it as an input parameter to the proxy script.

Contributor

@KuntaiDu KuntaiDu left a comment


LGTM!

@deng451e deng451e added the full Run comprehensive tests on this PR label Apr 9, 2026
@deng451e deng451e enabled auto-merge (squash) April 9, 2026 07:28
Contributor

@sammshen sammshen left a comment


LGTM!


@cursor cursor Bot left a comment


Cursor Bugbot has reviewed your changes and found 2 potential issues.

There are 4 total unresolved issues (including 2 from previous reviews).


Reviewed by Cursor Bugbot for commit 1888a96.

    await wait_decode_kv_ready(req_id, num_tp_rank)
finally:
    if pd_buffer_semaphore is not None:
        await pd_buffer_semaphore.release(slots)


Semaphore slots leak if generator closes before try/finally

High Severity

The try/finally block that releases semaphore slots is placed after the initial yield statements in generate_stream(). If the async generator is closed at any yield before the try block is entered (e.g., client disconnect, cancellation), the finally clause never executes and slots are permanently leaked. The outer except handler also cannot release because return StreamingResponse(...) already succeeded. This is especially impactful in the chat completions handler, which has two unprotected yields before the try/finally. Over time under high concurrency, leaked slots silently reduce semaphore capacity until the proxy deadlocks — re-creating the exact problem this fix aims to prevent.


Reviewed by Cursor Bugbot for commit 1888a96.
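The leak mechanism is easy to reproduce in isolation: closing an async generator at a yield that precedes the try block skips the finally entirely, because the block was never entered. A minimal sketch (names hypothetical):

```python
import asyncio

released: list = []

async def generate_stream(slots: int):
    yield b"role-chunk"          # unprotected yield: a disconnect here...
    try:
        yield b"content-chunk"
    finally:
        released.append(slots)   # ...means this release never runs

async def simulate_disconnect() -> None:
    gen = generate_stream(4)
    await gen.__anext__()        # consume only the first, unprotected yield
    await gen.aclose()           # like StreamingResponse closing the generator

asyncio.run(simulate_disconnect())
```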

if slots > self._capacity:
    raise ValueError(
        f"Requested {slots} slots exceeds total capacity {self._capacity}"
    )


Long prompts always crash instead of waiting for capacity

High Severity

WeightedSemaphore.acquire raises ValueError when slots > capacity, causing any request whose prompt needs more chunks than capacity_slots to unconditionally fail. For Llama-3.1-8B with default 2GB buffer and chunk_size 256, any prompt longer than ~16K tokens will crash — well within the model's 128K context window. The existing WeightedSemaphore in storage_manager.py handles this correctly by waiting for exclusive access when a request exceeds the concurrent budget but fits the total capacity. The proxy version lacks this oversized-request path, turning a previously functional (if deadlock-prone) scenario into a hard failure.


Reviewed by Cursor Bugbot for commit 1888a96.
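A sketch of the waiting path the review describes (modeled on its description of the storage_manager.py semaphore, not copied from it): oversized requests are capped at total capacity and wait for the buffer to drain, instead of raising.

```python
import asyncio

class WeightedSemaphore:
    def __init__(self, capacity: int):
        self._capacity = capacity
        self._available = capacity
        self._cond = asyncio.Condition()

    async def acquire(self, slots: int) -> int:
        # Cap oversized requests: they wait for exclusive use of the
        # whole buffer rather than failing with ValueError.
        slots = min(slots, self._capacity)
        async with self._cond:
            await self._cond.wait_for(lambda: self._available >= slots)
            self._available -= slots
        return slots  # caller releases exactly what was granted

    async def release(self, slots: int) -> None:
        async with self._cond:
            self._available += slots
            self._cond.notify_all()
```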

@deng451e deng451e merged commit 5ce42e6 into LMCache:dev Apr 11, 2026
34 of 35 checks passed
Oasis-Git pushed a commit to Oasis-Git/LMCache that referenced this pull request Apr 13, 2026
…MCache#2957)

* add  concurrency limit for pd

Signed-off-by: deng451e <838677410@qq.com>

* fix

Signed-off-by: deng451e <838677410@qq.com>

* change input type

Signed-off-by: deng451e <838677410@qq.com>

* update input flag

Signed-off-by: deng451e <838677410@qq.com>

---------

Signed-off-by: deng451e <838677410@qq.com>
ftian1 pushed a commit to ftian1/LMCache that referenced this pull request Apr 20, 2026
…MCache#2957)
