[Bugfix]: Fix pin count balancing in PD Disaggregation mode #2786
sammshen merged 6 commits into LMCache:dev
Conversation
Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request addresses a critical memory management flaw within the PD Disaggregation mode of vLLM when using LMCache. By ensuring that KV cache objects are correctly unpinned on the Decode side, it prevents indefinite memory pinning, which previously caused resource exhaustion, reduced cache efficiency, and performance degradation. The fix specifically targets the kv_consumer path in wait_for_save().
Code Review
This pull request addresses a critical memory leak in the Prefill-Decode Disaggregation mode by correctly balancing the pin count for kv_consumer roles. The change introduces the missing lookup_unpin() call, which is a direct and effective fix for the described issue. The implementation is sound. I've added one minor suggestion to enhance code robustness by asserting a non-null engine instance before use.
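For reference, a minimal sketch of that suggestion (the helper function and the lmcache_engine attribute follow the snippets quoted later in this thread; treat it as illustrative, not the actual diff):

```python
def unpin_lookup(connector, request_id):
    """Illustrative helper: release the pin taken during lookup for one request."""
    # Assert the engine exists so a misconfigured consumer fails with a clear
    # message instead of an AttributeError on None.
    assert connector.lmcache_engine is not None, (
        "lmcache_engine must be initialized before unpinning lookup results"
    )
    connector.lmcache_engine.lookup_unpin(request_id)
```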
In LMCache PD in-process mode, the prefiller writes KV cache directly to the decoder GPU buffer without going through local_cpu_backend, so it shouldn’t require pinning memory_obj on the decoder host side.
Thanks for the question! I'm using MooncakeStore as the remote storage backend, not the in-process mode. In MooncakeStore mode, the decoder side still pins entries in local_cpu_backend during lookup.

Config:

```yaml
chunk_size: 256
remote_url: "mooncakestore://80.11.128.72:50011/"
local_cpu: True
max_local_cpu_size: 10
use_layerwise: False
save_decode_cache: False
enable_async_loading: False
pre_caching_hash_algorithm: sha256_cbor_64bit
extra_config:
  local_hostname: "80.11.128.72"
  protocol: "tcp"
  metadata_server: "http://80.11.128.72:8082/metadata"
  master_server_address: "80.11.128.72:50011"
  global_segment_size: 53687091200
  local_buffer_size: 0
  save_chunk_meta: False
  use_exists_sync: True
```

Launch commands:

```bash
export LMCACHE_CONFIG_FILE="$LMCACHE_CONFIG_FILE"
KV_TRANSFER_CONFIG=$(cat <<EOF
{"kv_connector":"LMCacheConnectorV1","kv_role":"$KV_ROLE_VALUE","kv_connector_extra_config":{"discard_partial_chunks":false,"lmcache_rpc_port":"$LMCACHE_RPC_PORT_NAME"}}
EOF
)
# --- Start the service ---
echo "--- Starting $ROLE service (TP=$TENSOR_PARALLEL_SIZE) ---"
python3 -m vllm.entrypoints.openai.api_server \
    --model "$MODEL_PATH" \
    --port "$PORT" \
    --tensor-parallel-size "$TENSOR_PARALLEL_SIZE" \
    "${COMMON_ARGS[@]}" \
    --kv-transfer-config "$KV_TRANSFER_CONFIG"
```
According to your description, the KV cache flow seems to be prefiller → mooncakestore → decoder — hope I got that right.
Thanks for the suggestion! My setup uses MooncakeStore: it is required in my case as I rely on its specific features (RDMA, distributed memory pooling).

The pin count leak occurs in this call path for the kv_consumer role:

1. Lookup phase:

```python
# get_num_new_matched_tokens() for kv_consumer
num_external_hit_tokens = self.lookup_client.lookup(
    token_ids,
    lookup_id=req_id,
    ...
)
```

2. Contains with pin:

```python
def contains(self, key: CacheEngineKey, pin: bool = False) -> bool:
    with self.cpu_lock:
        if key not in self.hot_cache:
            return False
        if pin:
            self.hot_cache[key].pin()  # ← Pin happens here
            self.keys_in_request.append(key)
        return True
```

3. wait_for_save() returns early:

```python
def wait_for_save(self):
    if self.kv_role == "kv_consumer":
        return  # ← Returns without calling lookup_unpin()!
```

The fix ensures that lookup_unpin() is also called for the kv_consumer role, releasing the pin taken during lookup.
Makes sense to me. PTAL @sammshen
sammshen left a comment
LGTM! Thanks for this fix!
Quick question: do we also have to add lookup_unpin to request_finished for aborted requests?
Good catch! For aborted requests, the pin taken during lookup also needs to be balanced in request_finished(). The fix should be:

```python
def request_finished(self, request: "Request", block_ids: list[int]):
    # ... existing code ...
    if request.status == RequestStatus.FINISHED_ABORTED:
        if self.async_loading:
            self.lookup_client.cancel_lookup(lookup_id)
        # Unpin to balance pin count from lookup
        if self.lmcache_engine is not None:
            self.lmcache_engine.lookup_unpin(request.request_id)
```

I'll verify this and update the PR when I get a chance. Thanks!
Skip lookup_unpin for FINISHED_ABORTED requests in wait_for_save() since they are already unpinned in request_finished(). This prevents pin count from becoming negative and fixes k3-multiprocess-test failures.
…2786)

* fix: balance pin count in LocalCPUBackend.touch_cache()
* Update lmcache/integration/vllm/vllm_v1_adapter.py
* Update vllm_v1_adapter.py
* Add lookup_unpin for aborted requests in request_finished
* Fix double unpin for aborted requests in kv_consumer role: skip lookup_unpin for FINISHED_ABORTED requests in wait_for_save() since they are already unpinned in request_finished(). This prevents pin count from becoming negative and fixes k3-multiprocess-test failures.
* Refactor request unpinning and cancellation logic

Signed-off-by: lisiG9 <2442737518@qq.com>
Signed-off-by: lisiG9 <86335240+lisiG9@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
In layerwise retrieval with LocalCPU backend, the unpin loop at cache_engine.py:1040-1042 was designed for disk-loaded staging objects (added in LMCache#2611). However, LocalCPUBackend.batched_get_non_blocking() returns the same Python object from hot_cache that lookup(pin=True) had already pinned, causing retrieve_layer() to unpin it once, and then wait_for_save() to unpin it again via lookup_unpin() (LMCache#2786). This double unpin drives pin_count to -1 and, more critically, triggers a premature free() of the memory object (unpin() calls free() when both pin_count <= 0 and ref_count <= 0).

Fix: guard the unpin with location != 'LocalCPUBackend' so that only disk-loaded staging objects (LocalDisk, Maru, etc.) are unpinned here. LocalCPU objects retain their pin until lookup_unpin() releases them in wait_for_save(), preserving the correct single-free lifecycle.

Fixes LMCache#2954

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Yiyang Liu <37043548+ianliuy@users.noreply.github.com>
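A minimal sketch of what that guard could look like (the helper name, the (location, memory_obj) pair shape, and the iteration order are assumptions for illustration, not the exact code in cache_engine.py):

```python
def unpin_staging_objects(staging_objs):
    """Release the extra pin on disk-loaded staging objects only.

    staging_objs: iterable of (location, memory_obj) pairs, where location
    names the backend the object was retrieved from (assumed shape).
    """
    for location, memory_obj in staging_objs:
        # LocalCPU objects are the same Python objects that lookup(pin=True)
        # already pinned; unpinning them here would double-unpin and free
        # them prematurely. They keep their pin until lookup_unpin() runs
        # in wait_for_save().
        if location != "LocalCPUBackend":
            memory_obj.unpin()
```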
Summary
This PR fixes a pin count balancing issue in PD (Prefill-Decode) Disaggregation mode. When using LMCache with vLLM's PD Disaggregation (via kv_producer/kv_consumer roles), memory objects in LocalCPUBackend were being pinned indefinitely on the Decoder (D) side, causing pin timeout warnings and potential memory leaks. The Prefill (P) side previously handled pin count balancing correctly, but the Decode (D) side lacked the necessary unpin operation.

Problem Description
In PD Disaggregation mode:

- Prefill side (kv_producer): performs lookup with pin=True, saves the KV cache, and calls lookup_unpin() in wait_for_save() to balance the pin count.
- Decode side (kv_consumer): performs lookup with pin=True, but wait_for_save() was returning early without calling lookup_unpin(), leaving the pin count unbalanced.

This caused memory objects to remain pinned indefinitely on the D-side, leading to:

- Pin timeout warnings
- Potential memory leaks and resource exhaustion
- Reduced cache efficiency and degraded performance
Solution
Modified LMCacheConnectorV1Impl.wait_for_save() in vllm_v1_adapter.py to call lookup_unpin() for both the P-side and the D-side. This ensures that both sides properly balance the pin count incremented during contains(pin=True) calls.
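A minimal sketch of the intended shape of this change, assuming the adapter tracks which request IDs pinned entries during lookup (the stub class and the pinned_lookup_ids attribute are illustrative; calling lookup_unpin() per request ID follows the snippets quoted in the review thread, and the skip for already-unpinned aborted requests reflects the follow-up commit):

```python
class ConnectorSketch:
    """Illustrative stub, not the real LMCacheConnectorV1Impl."""

    def __init__(self, kv_role, lmcache_engine, pinned_lookup_ids):
        self.kv_role = kv_role                      # "kv_producer" or "kv_consumer"
        self.lmcache_engine = lmcache_engine        # exposes lookup_unpin(req_id)
        self.pinned_lookup_ids = pinned_lookup_ids  # request IDs pinned during lookup

    def wait_for_save(self):
        # D-side (kv_consumer): balance the pin taken by contains(pin=True)
        # during lookup. Previously this returned immediately and leaked it.
        # Requests already unpinned in request_finished() (e.g. aborted ones)
        # must be excluded from pinned_lookup_ids to avoid a double unpin.
        if self.kv_role == "kv_consumer":
            if self.lmcache_engine is not None:
                for req_id in self.pinned_lookup_ids:
                    self.lmcache_engine.lookup_unpin(req_id)
            return
        # P-side (kv_producer): the existing save path already ends with its
        # own lookup_unpin() call and is unchanged in this sketch.
```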
Impact Analysis

- Only affects PD Disaggregation deployments using the kv_producer/kv_consumer roles
- Adds the missing lookup_unpin() call on the D-side
- Overhead is a single additional lookup_unpin() call for each request