[Bugfix]: Fix pin count balancing in PD Disaggregation mode #2786
sammshen merged 6 commits into LMCache:dev
Conversation
Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request addresses a critical memory management flaw within the PD Disaggregation mode of vLLM when using LMCache. By ensuring that KV cache objects are correctly unpinned on the Decode side, it prevents indefinite memory pinning, which previously caused resource exhaustion, reduced cache efficiency, and performance degradation. The fix specifically targets the kv_consumer path in wait_for_save().
Code Review
This pull request addresses a critical memory leak in the Prefill-Decode Disaggregation mode by correctly balancing the pin count for kv_consumer roles. The change introduces the missing lookup_unpin() call, which is a direct and effective fix for the described issue. The implementation is sound. I've added one minor suggestion to enhance code robustness by asserting a non-null engine instance before use.
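For reference, a minimal sketch of that suggestion (the helper function and the lmcache_engine attribute follow the snippets quoted later in this thread; treat it as illustrative, not the actual diff):

```python
def unpin_lookup(connector, request_id):
    """Illustrative helper: release the pin taken during lookup for one request."""
    # Assert the engine exists so a misconfigured consumer fails with a clear
    # message instead of an AttributeError on None.
    assert connector.lmcache_engine is not None, (
        "lmcache_engine must be initialized before unpinning lookup results"
    )
    connector.lmcache_engine.lookup_unpin(request_id)
```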
In LMCache PD in-process mode, the prefiller writes KV cache directly to the decoder GPU buffer without going through local_cpu_backend, so it shouldn’t require pinning memory_obj on the decoder host side.
Thanks for the question! I'm using MooncakeStore as the remote storage backend, not the in-process mode. In MooncakeStore mode, the decoder side still pins entries in local_cpu_backend during lookup.

Config:

```yaml
chunk_size: 256
remote_url: "mooncakestore://80.11.128.72:50011/"
local_cpu: True
max_local_cpu_size: 10
use_layerwise: False
save_decode_cache: False
enable_async_loading: False
pre_caching_hash_algorithm: sha256_cbor_64bit
extra_config:
  local_hostname: "80.11.128.72"
  protocol: "tcp"
  metadata_server: "http://80.11.128.72:8082/metadata"
  master_server_address: "80.11.128.72:50011"
  global_segment_size: 53687091200
  local_buffer_size: 0
  save_chunk_meta: False
  use_exists_sync: True
```

Launch commands:

```bash
export LMCACHE_CONFIG_FILE="$LMCACHE_CONFIG_FILE"
KV_TRANSFER_CONFIG=$(cat <<EOF
{"kv_connector":"LMCacheConnectorV1","kv_role":"$KV_ROLE_VALUE","kv_connector_extra_config":{"discard_partial_chunks":false,"lmcache_rpc_port":"$LMCACHE_RPC_PORT_NAME"}}
EOF
)
# --- Start the service ---
echo "--- Starting $ROLE service (TP=$TENSOR_PARALLEL_SIZE) ---"
python3 -m vllm.entrypoints.openai.api_server \
    --model "$MODEL_PATH" \
    --port "$PORT" \
    --tensor-parallel-size "$TENSOR_PARALLEL_SIZE" \
    "${COMMON_ARGS[@]}" \
    --kv-transfer-config "$KV_TRANSFER_CONFIG"
```
According to your description, the KV cache flow seems to be prefiller → mooncakestore → decoder — hope I got that right.
Thanks for the suggestion! My setup uses MooncakeStore: it is required in my case as I rely on its specific features (RDMA, distributed memory pooling).

The pin count leak occurs in this call path for the kv_consumer role:

1. Lookup phase:

```python
# get_num_new_matched_tokens() for kv_consumer
num_external_hit_tokens = self.lookup_client.lookup(
    token_ids,
    lookup_id=req_id,
    ...
)
```

2. Contains with pin:

```python
def contains(self, key: CacheEngineKey, pin: bool = False) -> bool:
    with self.cpu_lock:
        if key not in self.hot_cache:
            return False
        if pin:
            self.hot_cache[key].pin()  # ← Pin happens here
            self.keys_in_request.append(key)
        return True
```

3. wait_for_save() returns early:

```python
def wait_for_save(self):
    if self.kv_role == "kv_consumer":
        return  # ← Returns without calling lookup_unpin()!
```

The fix ensures that lookup_unpin() is also called for the kv_consumer role, releasing the pin taken during lookup.
Makes sense to me. PTAL @sammshen
sammshen left a comment
LGTM! Thanks for this fix!
Quick question: do we also have to add lookup_unpin to request_finished for aborted requests?
Good catch! For aborted requests, the pin taken during lookup also needs to be balanced in request_finished(). The fix should be:

```python
def request_finished(self, request: "Request", block_ids: list[int]):
    # ... existing code ...
    if request.status == RequestStatus.FINISHED_ABORTED:
        if self.async_loading:
            self.lookup_client.cancel_lookup(lookup_id)
        # Unpin to balance pin count from lookup
        if self.lmcache_engine is not None:
            self.lmcache_engine.lookup_unpin(request.request_id)
```

I'll verify this and update the PR when I get a chance. Thanks!
Skip lookup_unpin for FINISHED_ABORTED requests in wait_for_save() since they are already unpinned in request_finished(). This prevents pin count from becoming negative and fixes k3-multiprocess-test failures.
…2786)

* fix: balance pin count in LocalCPUBackend.touch_cache()
* Update lmcache/integration/vllm/vllm_v1_adapter.py
* Update vllm_v1_adapter.py
* Add lookup_unpin for aborted requests in request_finished
* Fix double unpin for aborted requests in kv_consumer role: skip lookup_unpin for FINISHED_ABORTED requests in wait_for_save() since they are already unpinned in request_finished(). This prevents pin count from becoming negative and fixes k3-multiprocess-test failures.
* Refactor request unpinning and cancellation logic

Signed-off-by: lisiG9 <2442737518@qq.com>
Signed-off-by: lisiG9 <86335240+lisiG9@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
In layerwise retrieval with LocalCPU backend, the unpin loop at cache_engine.py:1040-1042 was designed for disk-loaded staging objects (added in LMCache#2611). However, LocalCPUBackend.batched_get_non_blocking() returns the same Python object from hot_cache that lookup(pin=True) had already pinned, causing retrieve_layer() to unpin it once, and then wait_for_save() to unpin it again via lookup_unpin() (LMCache#2786). This double unpin drives pin_count to -1 and, more critically, triggers a premature free() of the memory object (unpin() calls free() when both pin_count <= 0 and ref_count <= 0).

Fix: guard the unpin with location != 'LocalCPUBackend' so that only disk-loaded staging objects (LocalDisk, Maru, etc.) are unpinned here. LocalCPU objects retain their pin until lookup_unpin() releases them in wait_for_save(), preserving the correct single-free lifecycle.

Fixes LMCache#2954

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Yiyang Liu <37043548+ianliuy@users.noreply.github.com>
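A minimal sketch of what that guard could look like (the helper name, the (location, memory_obj) pair shape, and the iteration order are assumptions for illustration, not the exact code in cache_engine.py):

```python
def unpin_staging_objects(staging_objs):
    """Release the extra pin on disk-loaded staging objects only.

    staging_objs: iterable of (location, memory_obj) pairs, where location
    names the backend the object was retrieved from (assumed shape).
    """
    for location, memory_obj in staging_objs:
        # LocalCPU objects are the same Python objects that lookup(pin=True)
        # already pinned; unpinning them here would double-unpin and free
        # them prematurely. They keep their pin until lookup_unpin() runs
        # in wait_for_save().
        if location != "LocalCPUBackend":
            memory_obj.unpin()
```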
Summary
This PR fixes a pin count balancing issue in PD (Prefill-Decode) Disaggregation mode. When using LMCache with vLLM's PD Disaggregation (via kv_producer/kv_consumer roles), memory objects in LocalCPUBackend were being pinned indefinitely on the Decoder (D) side, causing pin timeout warnings and potential memory leaks. The Prefill (P) side previously handled pin count balancing correctly, but the Decode (D) side lacked the necessary unpin operation.

Problem Description
In PD Disaggregation mode:

- Prefill side (kv_producer): performs lookup with pin=True, saves the KV cache, and calls lookup_unpin() in wait_for_save() to balance the pin count.
- Decode side (kv_consumer): performs lookup with pin=True, but wait_for_save() was returning early without calling lookup_unpin(), leaving the pin count unbalanced.

This caused memory objects to remain pinned indefinitely on the D-side, leading to:

- Pin timeout warnings
- Potential memory leaks and resource exhaustion
- Reduced cache efficiency and degraded performance
Solution
Modified LMCacheConnectorV1Impl.wait_for_save() in vllm_v1_adapter.py to call lookup_unpin() for both the P-side and the D-side. This ensures that both sides properly balance the pin count incremented during contains(pin=True) calls.
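A minimal sketch of the intended shape of this change, assuming the adapter tracks which request IDs pinned entries during lookup (the stub class and the pinned_lookup_ids attribute are illustrative; calling lookup_unpin() per request ID follows the snippets quoted in the review thread, and the skip for already-unpinned aborted requests reflects the follow-up commit):

```python
class ConnectorSketch:
    """Illustrative stub, not the real LMCacheConnectorV1Impl."""

    def __init__(self, kv_role, lmcache_engine, pinned_lookup_ids):
        self.kv_role = kv_role                      # "kv_producer" or "kv_consumer"
        self.lmcache_engine = lmcache_engine        # exposes lookup_unpin(req_id)
        self.pinned_lookup_ids = pinned_lookup_ids  # request IDs pinned during lookup

    def wait_for_save(self):
        # D-side (kv_consumer): balance the pin taken by contains(pin=True)
        # during lookup. Previously this returned immediately and leaked it.
        # Requests already unpinned in request_finished() (e.g. aborted ones)
        # must be excluded from pinned_lookup_ids to avoid a double unpin.
        if self.kv_role == "kv_consumer":
            if self.lmcache_engine is not None:
                for req_id in self.pinned_lookup_ids:
                    self.lmcache_engine.lookup_unpin(req_id)
            return
        # P-side (kv_producer): the existing save path already ends with its
        # own lookup_unpin() call and is unchanged in this sketch.
```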
Impact Analysis

- Only affects PD Disaggregation deployments using the kv_producer/kv_consumer roles
- Adds the missing lookup_unpin() call on the D-side
- Overhead is a single additional lookup_unpin() call for each request