[fix] Limit proxy in-flight requests to prevent PD buffer deadlock #2957
deng451e merged 8 commits into LMCache:dev from
Conversation
Signed-off-by: deng451e <838677410@qq.com>
Code Review
This pull request introduces a WeightedSemaphore to the disaggregated proxy server to manage decoder PD buffer usage, alongside updates to launch scripts for buffer configuration. The review identifies several necessary improvements to the semaphore implementation: adding capacity checks to prevent indefinite hangs, using try...finally blocks to ensure slots are released during exceptions or client disconnects, and properly initializing async primitives within the event loop to avoid runtime errors.
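To make the reviewed fixes concrete, here is a minimal sketch of what such a WeightedSemaphore could look like with those points applied: an up-front capacity check, and an asyncio.Condition created lazily inside the running event loop. The names and structure are illustrative, not the PR's exact code.

```python
import asyncio


class WeightedSemaphore:
    """Sketch of a weighted semaphore capping in-flight PD buffer slots.

    Illustrative only: the condition is bound lazily inside the running
    event loop, and acquire() rejects requests larger than total capacity.
    """

    def __init__(self, capacity: int):
        self._capacity = capacity
        self._available = capacity
        self._cond: asyncio.Condition | None = None  # created inside the loop

    def _condition(self) -> asyncio.Condition:
        # Lazily create the condition so it is bound to the running loop,
        # avoiding "attached to a different loop" runtime errors.
        if self._cond is None:
            self._cond = asyncio.Condition()
        return self._cond

    @property
    def available(self) -> int:
        """Number of slots currently available."""
        return self._available

    async def acquire(self, slots: int) -> None:
        if slots > self._capacity:
            raise ValueError(
                f"Requested {slots} slots exceeds total capacity {self._capacity}"
            )
        cond = self._condition()
        async with cond:
            await cond.wait_for(lambda: self._available >= slots)
            self._available -= slots

    async def release(self, slots: int) -> None:
        cond = self._condition()
        async with cond:
            self._available += slots
            cond.notify_all()
```

Callers would pair `acquire` with `release` in a `try`/`finally` so slots are returned even when the request fails or is cancelled.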
sammshen
left a comment
either use a paged memory allocator and use the number of pages as the slots, or use // 2 if not paged (since we can't fragment for more than 50%)
KuntaiDu
left a comment
Can we use in-flight tokens as the metric? I feel like it is in general hard for users to find KV_BYTES_PER_TOKEN for a given model on the internet.
Signed-off-by: deng451e <838677410@qq.com>
Signed-off-by: deng451e <838677410@qq.com>
```python
@property
def available(self) -> int:
    """Number of slots currently available."""
    return self._available
```
New feature has zero corresponding tests
Medium Severity
The new WeightedSemaphore class and compute_kv_bytes_per_token function implement non-trivial concurrency and model-config logic but have zero tests. Per project rules, new features and bug fixes must include corresponding tests. The WeightedSemaphore acquire/release logic and the model-config parsing in compute_kv_bytes_per_token are both independently testable and critical to correctness.
Additional Locations (1)
Triggered by project rule: LMCache Code Review Style Guide
Reviewed by Cursor Bugbot for commit b466121.
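As an illustration of the kind of test the review asks for, here is a hedged sketch of a concurrency test for the acquire/release logic. `_StubWeightedSemaphore` is a minimal stand-in defined inline so the sketch is runnable; the real class lives in the PR's disagg_proxy_server.py and may differ.

```python
import asyncio


class _StubWeightedSemaphore:
    """Minimal stand-in for the PR's WeightedSemaphore, for illustration."""

    def __init__(self, capacity: int):
        self._available = capacity
        self._cond = asyncio.Condition()

    async def acquire(self, slots: int) -> None:
        async with self._cond:
            await self._cond.wait_for(lambda: self._available >= slots)
            self._available -= slots

    async def release(self, slots: int) -> None:
        async with self._cond:
            self._available += slots
            self._cond.notify_all()


async def test_waiter_unblocks_on_release():
    sem = _StubWeightedSemaphore(4)
    await sem.acquire(3)
    order = []

    async def waiter():
        await sem.acquire(2)  # needs 2 slots, only 1 free -> must wait
        order.append("acquired")

    task = asyncio.create_task(waiter())
    await asyncio.sleep(0.01)
    assert order == []        # waiter is still blocked
    await sem.release(3)      # free enough slots
    await task
    assert order == ["acquired"]


asyncio.run(test_waiter_unblocks_on_release())
```

A companion test for compute_kv_bytes_per_token could compare its output against hand-computed values for a known model config.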
```python
except Exception as e:
    if pd_buffer_semaphore is not None and acquired:
        await pd_buffer_semaphore.release(slots)
```
Task cancellation bypasses semaphore release in outer handler
Medium Severity
The semaphore release in the outer error handler uses except Exception, which does not catch asyncio.CancelledError (a BaseException since Python 3.8). If the task is cancelled during the await send_request_to_service call to the prefill service (between acquire and return StreamingResponse), the CancelledError propagates without releasing the acquired slots. This is a separate leak path from the generator closure issue — it affects the outer handler function, not the inner generator. Permanent slot leaks eventually exhaust the semaphore and block all requests.
Additional Locations (1)
Reviewed by Cursor Bugbot for commit b466121.
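The cancellation-leak pattern the review describes can be reproduced in isolation. This is a minimal sketch, not the PR's code: the dict stands in for semaphore state, and `asyncio.sleep` stands in for the `await send_request_to_service(...)` call that may be cancelled.

```python
import asyncio


async def handler_leaky(state):
    # Reviewed pattern: `except Exception` does NOT catch
    # asyncio.CancelledError (a BaseException since Python 3.8),
    # so cancellation during the await leaks the acquired slots.
    state["available"] -= 2  # stand-in for semaphore acquire
    try:
        await asyncio.sleep(10)  # stand-in for send_request_to_service(...)
    except Exception:
        state["available"] += 2  # never runs on cancellation


async def handler_safe(state):
    # try/finally (or `except BaseException`) releases on cancellation too.
    state["available"] -= 2
    try:
        await asyncio.sleep(10)
    finally:
        state["available"] += 2


async def main():
    for handler, expected in [(handler_leaky, 2), (handler_safe, 4)]:
        state = {"available": 4}
        task = asyncio.create_task(handler(state))
        await asyncio.sleep(0.01)
        task.cancel()
        try:
            await task
        except asyncio.CancelledError:
            pass
        assert state["available"] == expected


asyncio.run(main())
```

The leaky variant ends with 2 of 4 slots permanently gone after a single cancelled request, which is the exhaustion path the report warns about.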
```python
global pd_buffer_semaphore
kv_bytes_per_token = compute_kv_bytes_per_token(global_args.model)
capacity_slots = global_args.pd_buffer_size // (
```
directly extract from PagedTensorMemoryAllocator capacity?
pd_buffer_size is set in the LMCache config file; we can also pass it as an input parameter to the proxy script
Cursor Bugbot has reviewed your changes and found 2 potential issues.
There are 4 total unresolved issues (including 2 from previous reviews).
Reviewed by Cursor Bugbot for commit 1888a96.
```python
    await wait_decode_kv_ready(req_id, num_tp_rank)
finally:
    if pd_buffer_semaphore is not None:
        await pd_buffer_semaphore.release(slots)
```
Semaphore slots leak if generator closes before try/finally
High Severity
The try/finally block that releases semaphore slots is placed after the initial yield statements in generate_stream(). If the async generator is closed at any yield before the try block is entered (e.g., client disconnect, cancellation), the finally clause never executes and slots are permanently leaked. The outer except handler also cannot release because return StreamingResponse(...) already succeeded. This is especially impactful in the chat completions handler, which has two unprotected yields before the try/finally. Over time under high concurrency, leaked slots silently reduce semaphore capacity until the proxy deadlocks — re-creating the exact problem this fix aims to prevent.
Additional Locations (1)
Reviewed by Cursor Bugbot for commit 1888a96.
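The generator-closure leak is easy to demonstrate in isolation. This sketch is illustrative, not the PR's code: the list append stands in for `pd_buffer_semaphore.release`, and closing the generator after its first yield stands in for a client disconnect.

```python
import asyncio

released = []


async def generate_stream():
    # Two unprotected yields before the try/finally, mirroring the
    # chat-completions handler described in the report.
    yield "header-1"
    yield "header-2"
    try:
        yield "body"
    finally:
        released.append("slots")  # stand-in for pd_buffer_semaphore.release


async def main():
    gen = generate_stream()
    await gen.__anext__()   # consume only the first yield...
    await gen.aclose()      # ...then the client disconnects
    # GeneratorExit was raised at a yield OUTSIDE the try block,
    # so the finally clause never ran: the slots are leaked.
    assert released == []


asyncio.run(main())
```

One way to avoid this is to wrap the entire body of the generator, from its very first yield, in the try/finally that releases the slots, so closure at any suspension point still triggers the release.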
```python
if slots > self._capacity:
    raise ValueError(
        f"Requested {slots} slots exceeds total capacity {self._capacity}"
    )
```
Long prompts always crash instead of waiting for capacity
High Severity
WeightedSemaphore.acquire raises ValueError when slots > capacity, causing any request whose prompt needs more chunks than capacity_slots to unconditionally fail. For Llama-3.1-8B with default 2GB buffer and chunk_size 256, any prompt longer than ~16K tokens will crash — well within the model's 128K context window. The existing WeightedSemaphore in storage_manager.py handles this correctly by waiting for exclusive access when a request exceeds the concurrent budget but fits the total capacity. The proxy version lacks this oversized-request path, turning a previously functional (if deadlock-prone) scenario into a hard failure.
Additional Locations (1)
Reviewed by Cursor Bugbot for commit 1888a96.
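One possible oversized-request path, in the spirit of what the review attributes to the storage_manager.py semaphore, is to clamp the request to full capacity and wait for exclusive access instead of raising. The sketch below is illustrative; the actual storage_manager.py implementation may differ.

```python
import asyncio


class WeightedSemaphore:
    """Sketch with an oversized-request path: a request needing more than
    `capacity` slots waits for the whole buffer instead of raising."""

    def __init__(self, capacity: int):
        self._capacity = capacity
        self._available = capacity
        self._cond = asyncio.Condition()

    @property
    def available(self) -> int:
        return self._available

    async def acquire(self, slots: int) -> int:
        # Clamp oversized requests to full capacity: they wait until they
        # have the buffer to themselves, then stream their chunks through it.
        granted = min(slots, self._capacity)
        async with self._cond:
            await self._cond.wait_for(lambda: self._available >= granted)
            self._available -= granted
        return granted  # caller must release exactly this many

    async def release(self, slots: int) -> None:
        async with self._cond:
            self._available += slots
            self._cond.notify_all()


async def demo():
    sem = WeightedSemaphore(4)
    granted = await sem.acquire(10)  # oversized: waits for exclusive access
    assert granted == 4
    await sem.release(granted)
    assert sem.available == 4


asyncio.run(demo())
```

With this shape, a 128K-token prompt on a 64-slot semaphore queues behind in-flight requests rather than failing with a ValueError.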
…MCache#2957)

* add concurrency limit for pd
Signed-off-by: deng451e <838677410@qq.com>
* fix
Signed-off-by: deng451e <838677410@qq.com>
* change input type
Signed-off-by: deng451e <838677410@qq.com>
* update input flag
---------
Signed-off-by: deng451e <838677410@qq.com>


Problem
Under high concurrency, the decoder PD buffer can fill up with partial KV chunks from multiple in-flight requests. Since the proxy only dispatches a decode request after receiving the full KV completion signal, a deadlock forms: the prefiller cannot send its remaining chunks (the decoder buffer is full), and the decoder never starts (the proxy never signals ready).
Temporary fix for in-process mode PD
Add a WeightedSemaphore to the proxy that caps total in-flight PD buffer usage. Each request acquires ⌈L / chunk_size⌉ slots before prefill and releases them after wait_decode_kv_ready, i.e., once the proxy forwards the request to the decoder after the prefiller has sent all KV.
capacity_slots = pd_buffer_size_bytes // (chunk_size × kv_bytes_per_token)
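The capacity formula can be worked through for a concrete model. The per-token KV footprint below uses the standard formula (K and V per layer, per KV head, per head dim); the specific config values are illustrative of what compute_kv_bytes_per_token would derive from a Llama-3.1-8B-style HuggingFace config, not a guaranteed match for its exact output.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int,
                       head_dim: int, dtype_bytes: int = 2) -> int:
    # Standard KV-cache footprint: K and V (factor 2) per layer, per token.
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


# Llama-3.1-8B-style config: 32 layers, 8 KV heads, head_dim 128, fp16.
per_token = kv_bytes_per_token(32, 8, 128)       # 131072 B = 128 KiB

pd_buffer_size = 2 * 1024**3                     # default 2 GiB buffer
chunk_size = 256
capacity_slots = pd_buffer_size // (chunk_size * per_token)
print(capacity_slots)                            # 64
# 64 slots * 256 tokens/chunk = 16384 tokens of in-flight KV,
# matching the ~16K-token oversized-request threshold noted in the review.
```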
New proxy args:
Note
Medium Risk
Introduces new concurrency limiting in the disaggregated prefill proxy that can affect request scheduling/throughput and relies on model config-derived sizing; misconfiguration could cause unexpected blocking or reduced concurrency.
Overview
Prevents PD-buffer deadlocks under high concurrency by adding a weighted in-flight limiter in disagg_proxy_server.py: each request acquires ceil(prompt_tokens / chunk_size) slots before prefill and releases them once the decoder signals KV readiness.
Adds proxy CLI flags --model, --pd-buffer-size, and --chunk-size, and derives semaphore capacity from the HuggingFace model config (compute_kv_bytes_per_token) to approximate KV bytes per token.
Updates the 1p1d example scripts to accept a model and PD buffer size, pass them through to the proxy/launchers, and removes --disable-log-requests from the vLLM launcher invocations.
Reviewed by Cursor Bugbot for commit 1888a96.