
[Bugfix]: Fix pin count balancing in PD Disaggregation mode #2786

Merged
sammshen merged 6 commits into LMCache:dev from lisiG9:dev
Mar 25, 2026

Conversation

@lisiG9
Contributor

@lisiG9 lisiG9 commented Mar 16, 2026

Summary

This PR fixes a pin count balancing issue in PD (Prefill-Decode) Disaggregation mode. When using LMCache with vLLM's PD Disaggregation (via the kv_producer/kv_consumer roles), memory objects in LocalCPUBackend were pinned indefinitely on the Decode (D) side, causing pin timeout warnings and potential memory leaks. The Prefill (P) side already handled pin count balancing correctly, but the Decode (D) side lacked the matching unpin operation.

Problem Description

In PD Disaggregation mode:

  • Prefill (P) side (kv_producer): Performs lookup with pin=True, saves KV cache, and calls lookup_unpin() in wait_for_save() to balance the pin count.
  • Decode (D) side (kv_consumer): Performs lookup with pin=True, but wait_for_save() was returning early without calling lookup_unpin(), leaving the pin count unbalanced.

This caused memory objects to remain pinned indefinitely on the D-side, leading to:

  • Pin timeout warnings from the pin monitor
  • Reduced effectiveness of the cache eviction policy
  • Increased memory pressure and potential out-of-memory situations
  • Performance degradation due to suboptimal cache management
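
The imbalance is easy to picture with a toy model of the pin counter (illustrative Python, not actual LMCache code):

class MemObj:
    def __init__(self):
        self.pin_count = 0

    def pin(self):
        self.pin_count += 1

    def unpin(self):
        self.pin_count -= 1

    def evictable(self):
        return self.pin_count <= 0

obj = MemObj()
obj.pin()  # D-side lookup pins via contains(pin=True)
# Before this fix, wait_for_save() returned early for kv_consumer,
# so the matching unpin below never ran and the object could never be evicted.
obj.unpin()  # the balancing call this PR adds
assert obj.evictable()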

Solution

Modified LMCacheConnectorV1Impl.wait_for_save() in vllm_v1_adapter.py to call lookup_unpin() on both the P-side and the D-side:

if self.kv_role == "kv_consumer":
    # Don't do save if the role is kv_consumer
    # But still need to unpin the kv caches according to req_id
    # to balance the pin count from contains()
    for request in connector_metadata.requests:
        self.lmcache_engine.lookup_unpin(request.req_id)
    return

This ensures that both P-side and D-side properly balance the pin count incremented during contains(pin=True) calls.
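
For context, the resulting wait_for_save() control flow is symmetric between the two roles (a sketch; metadata retrieval and the P-side save logic are elided):

def wait_for_save(self):
    # connector_metadata is obtained as in the snippet above (elided)
    if self.kv_role == "kv_consumer":
        # D-side: skip saving, but balance the pins from contains(pin=True)
        for request in connector_metadata.requests:
            self.lmcache_engine.lookup_unpin(request.req_id)
        return
    # P-side: save the KV cache (elided), then balance pins the same way
    for request in connector_metadata.requests:
        self.lmcache_engine.lookup_unpin(request.req_id)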

Impact Analysis

  • Scope: This fix affects PD Disaggregation mode using vLLM's kv_producer/kv_consumer roles
  • Backward Compatibility: No breaking changes - the fix only adds the missing lookup_unpin() call on D-side
  • Performance: Minimal overhead - just an additional loop to call lookup_unpin() for each request

@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses a critical memory management flaw within the PD Disaggregation mode of vLLM when using LMCache. By ensuring that KV cache objects are correctly unpinned on the Decode side, it prevents indefinite memory pinning, which previously caused resource exhaustion, reduced cache efficiency, and performance degradation. The fix specifically targets the kv_consumer role to maintain balanced pin counts.

Highlights

  • Pin Count Balancing Fix: Resolved an issue in PD Disaggregation mode where memory objects in LocalCPUBackend were indefinitely pinned on the Decode (D) side, leading to pin timeout warnings and potential memory leaks.
  • D-side lookup_unpin() Implementation: Modified LMCacheConnectorV1Impl.wait_for_save() to explicitly call lookup_unpin() for kv_consumer (D-side) requests, ensuring proper balancing of pin counts incremented during contains(pin=True) calls.


Changelog
  • lmcache/integration/vllm/vllm_v1_adapter.py
    • Added a loop to call self.lmcache_engine.lookup_unpin() for each request when the kv_role is kv_consumer within the wait_for_save method.
    • Introduced comments explaining the necessity of unpinning for kv_consumer to balance pin counts.

@lisiG9 lisiG9 changed the title fix: balance pin count in LocalCPUBackend.touch_cache() [Bugfix]: Fix pin count balancing in PD Disaggregation mode Mar 16, 2026
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request addresses a critical memory leak in the Prefill-Decode Disaggregation mode by correctly balancing the pin count for kv_consumer roles. The change introduces the missing lookup_unpin() call, which is a direct and effective fix for the described issue. The implementation is sound. I've added one minor suggestion to enhance code robustness by asserting a non-null engine instance before use.

Comment thread on lmcache/integration/vllm/vllm_v1_adapter.py
@lisiG9
Contributor Author

lisiG9 commented Mar 16, 2026

@sammshen @YaoJiayi please take a look, thanks!

@lisiG9 lisiG9 force-pushed the dev branch 2 times, most recently from bdf3c19 to 6188576 on March 16, 2026 07:51
@deng451e
Collaborator

In LMCache PD in-process mode, the prefiller writes KV cache directly to the decoder GPU buffer without going through local_cpu_backend, so it shouldn’t require pinning memory_obj on the decoder host side.
Could you share the config and script you used to launch PD? That would help us better reproduce the memory leak you observed.

@lisiG9
Contributor Author

lisiG9 commented Mar 17, 2026

In LMCache PD in-process mode, the prefiller writes KV cache directly to the decoder GPU buffer without going through local_cpu_backend, so it shouldn’t require pinning memory_obj on the decoder host side. Could you share the config and script you used to launch PD? That would help us better reproduce the memory leak you observed.

Thanks for the question! I'm using MooncakeStore as the remote storage backend, not the in-process mode.

In MooncakeStore mode, the decoder side calls contains() with pin=True which triggers local_cpu_backend.pin(), but wait_for_save() returns early for kv_consumer without calling lookup_unpin().

Config (LMCACHE_CONFIG_FILE):

chunk_size: 256
remote_url: "mooncakestore://80.11.128.72:50011/"
local_cpu: True
max_local_cpu_size: 10
use_layerwise: False
save_decode_cache: False
enable_async_loading: False
pre_caching_hash_algorithm: sha256_cbor_64bit
extra_config:
  local_hostname: "80.11.128.72"
  protocol: "tcp"
  metadata_server: "http://80.11.128.72:8082/metadata"
  master_server_address: "80.11.128.72:50011"
  global_segment_size: 53687091200
  local_buffer_size: 0
  save_chunk_meta: False
  use_exists_sync: True

Launch commands:

export LMCACHE_CONFIG_FILE="$LMCACHE_CONFIG_FILE"
KV_TRANSFER_CONFIG=$(cat <<EOF
{"kv_connector":"LMCacheConnectorV1","kv_role":"$KV_ROLE_VALUE","kv_connector_extra_config":{"discard_partial_chunks":false,"lmcache_rpc_port":"$LMCACHE_RPC_PORT_NAME"}}
EOF
)
# --- Start the service ---
echo "--- Starting $ROLE service (TP=$TENSOR_PARALLEL_SIZE) ---"
python3 -m vllm.entrypoints.openai.api_server \
--model "$MODEL_PATH" \
--port "$PORT" \
--tensor-parallel-size "$TENSOR_PARALLEL_SIZE" \
"${COMMON_ARGS[@]}" \
--kv-transfer-config "$KV_TRANSFER_CONFIG"

@deng451e
Collaborator

According to your description, the KV cache flow seems to be prefiller → mooncakestore → decoder — hope I got that right.
A similar setup exists in LMCache multi-process mode:
https://github.com/LMCache/LMCache/tree/dev/examples/disagg_prefill_mp — feel free to try it. It should also help address the memory leak you observed.

@lisiG9
Contributor Author

lisiG9 commented Mar 17, 2026

According to your description, the KV cache flow seems to be prefiller → mooncakestore → decoder — hope I got that right. A similar setup exists in LMCache multi-process mode: https://github.com/LMCache/LMCache/tree/dev/examples/disagg_prefill_mp — feel free to try it. It should also help address the memory leak you observed.

Thanks for the suggestion! My setup uses LMCacheConnectorV1 with MooncakeStore, which is different from the LMCacheMPConnector example.

MooncakeStore is required in my case as I rely on its specific features (RDMA, distributed memory pooling).

The pin count leak occurs in this specific call path for kv_consumer:

1. Lookup phase (vllm_v1_adapter.py):

# get_num_new_matched_tokens() for kv_consumer
num_external_hit_tokens = self.lookup_client.lookup(
    token_ids,
    lookup_id=req_id,
    ...
)

2. Contains with pin (local_cpu_backend.py):

def contains(self, key: CacheEngineKey, pin: bool = False) -> bool:
    with self.cpu_lock:
        if key not in self.hot_cache:
            return False
        if pin:
            self.hot_cache[key].pin()  # ← Pin happens here
            self.keys_in_request.append(key)
        return True

3. wait_for_save() returns early (vllm_v1_adapter.py):

def wait_for_save(self):
    if self.kv_role == "kv_consumer":
        return  # ← Returns without calling lookup_unpin()!

The fix ensures lookup_unpin() is called for kv_consumer to balance the pin count.
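
For reference, a lookup_unpin-style helper on the CPU backend might look roughly like this (a sketch with illustrative names — keys_by_request is assumed here for clarity; the snippet above tracks keys in a flat keys_in_request list, and the real call is routed through the cache engine):

def lookup_unpin(self, req_id: str) -> None:
    # Hypothetical sketch: release the pins taken in contains(pin=True)
    # for every key recorded under this request id.
    with self.cpu_lock:
        for key in self.keys_by_request.pop(req_id, []):
            if key in self.hot_cache:
                self.hot_cache[key].unpin()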

@deng451e
Collaborator

Makes sense to me. PTAL @sammshen

@ApostaC
Contributor

ApostaC commented Mar 22, 2026

@sammshen

Contributor

@sammshen sammshen left a comment


LGTM! Thanks for this fix!

Quick question: do we also have to add lookup_unpin to request_finished for aborted requests?

@sammshen sammshen requested a review from deng451e March 22, 2026 18:12
@lisiG9
Contributor Author

lisiG9 commented Mar 23, 2026

LGTM! Thanks for this fix!

Quick question: do we also have to add lookup_unpin to request_finished for aborted requests?

Good catch! For aborted requests, wait_for_save() may not be called, leaving the pin count unbalanced.

The fix should be:

def request_finished(self, request: "Request", block_ids: list[int]):
    # ... existing code ...
    
    if request.status == RequestStatus.FINISHED_ABORTED:
        if self.async_loading:
            self.lookup_client.cancel_lookup(lookup_id)
        
        # Unpin to balance pin count from lookup
        if self.lmcache_engine is not None:
            self.lmcache_engine.lookup_unpin(request.request_id)

I'll verify this and update the PR when I get a chance. Thanks!

@lisiG9
Contributor Author

lisiG9 commented Mar 23, 2026

@sammshen @deng451e I've updated the PR to add lookup_unpin() for aborted requests in request_finished(). Could you please trigger the auto-merge?

@sammshen sammshen enabled auto-merge (squash) March 23, 2026 23:35
@github-actions github-actions Bot added the full Run comprehensive tests on this PR label Mar 23, 2026
auto-merge was automatically disabled March 24, 2026 01:20

Head branch was pushed to by a user without write access

@github-actions github-actions Bot removed the full Run comprehensive tests on this PR label Mar 24, 2026
@lisiG9
Contributor Author

lisiG9 commented Mar 24, 2026

buildkite/k3-comprehensive-test — Build #842 failed
Rebased to the latest dev branch. Please re-enable auto-merge when CI passes. Thanks! @sammshen @deng451e

@lisiG9
Contributor Author

lisiG9 commented Mar 24, 2026

Hi @sammshen @ApostaC , could you help check why buildkite/k3-multiprocess-test Build #827 failed? The logs show "Page not found" for me. Thanks!

@lisiG9
Contributor Author

lisiG9 commented Mar 25, 2026

Skip lookup_unpin for FINISHED_ABORTED requests in wait_for_save() since
they are already unpinned in request_finished(). This prevents pin count
from becoming negative and fixes k3-multiprocess-test failures.
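
A rough sketch of the adjusted kv_consumer path (the aborted-request tracking shown here is illustrative; the actual bookkeeping may differ):

if self.kv_role == "kv_consumer":
    for request in connector_metadata.requests:
        # Aborted requests were already unpinned in request_finished();
        # unpinning again would drive the pin count negative.
        if request.req_id in self.aborted_req_ids:
            continue
        self.lmcache_engine.lookup_unpin(request.req_id)
    return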

@sammshen sammshen enabled auto-merge (squash) March 25, 2026 07:08
@github-actions github-actions Bot added the full Run comprehensive tests on this PR label Mar 25, 2026
auto-merge was automatically disabled March 25, 2026 09:33

Head branch was pushed to by a user without write access

@github-actions github-actions Bot removed the full Run comprehensive tests on this PR label Mar 25, 2026
lisiG9 and others added 6 commits March 25, 2026 17:59
  • fix: balance pin count in LocalCPUBackend.touch_cache()
  • Update lmcache/integration/vllm/vllm_v1_adapter.py
  • Update vllm_v1_adapter.py
  • Add lookup_unpin for aborted requests in request_finished
  • Fix double unpin for aborted requests in kv_consumer role
  • Refactor request unpinning and cancellation logic
@sammshen sammshen enabled auto-merge (squash) March 25, 2026 19:56
@github-actions github-actions Bot added the full Run comprehensive tests on this PR label Mar 25, 2026
@sammshen sammshen merged commit 62cf62a into LMCache:dev Mar 25, 2026
34 of 35 checks passed
deng451e pushed a commit to deng451e/LMCache that referenced this pull request Mar 27, 2026
…2786)

* fix: balance pin count in LocalCPUBackend.touch_cache()

Signed-off-by: lisiG9 <2442737518@qq.com>

* Update lmcache/integration/vllm/vllm_v1_adapter.py

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: lisiG9 <86335240+lisiG9@users.noreply.github.com>
Signed-off-by: lisiG9 <2442737518@qq.com>

* Update vllm_v1_adapter.py

Signed-off-by: lisiG9 <2442737518@qq.com>

* Add lookup_unpin for aborted requests in request_finished

Signed-off-by: lisiG9 <2442737518@qq.com>

* Fix double unpin for aborted requests in kv_consumer role

Skip lookup_unpin for FINISHED_ABORTED requests in wait_for_save() since
they are already unpinned in request_finished(). This prevents pin count
from becoming negative and fixes k3-multiprocess-test failures.

Signed-off-by: lisiG9 <86335240+lisiG9@users.noreply.github.com>

* Refactor request unpinning and cancellation logic

Signed-off-by: lisiG9 <86335240+lisiG9@users.noreply.github.com>

---------

Signed-off-by: lisiG9 <2442737518@qq.com>
Signed-off-by: lisiG9 <86335240+lisiG9@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
jooho-XCENA pushed a commit to xcena-dev/LMCache that referenced this pull request Apr 2, 2026
…2786)

* fix: balance pin count in LocalCPUBackend.touch_cache()

Signed-off-by: lisiG9 <2442737518@qq.com>

* Update lmcache/integration/vllm/vllm_v1_adapter.py

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: lisiG9 <86335240+lisiG9@users.noreply.github.com>
Signed-off-by: lisiG9 <2442737518@qq.com>

* Update vllm_v1_adapter.py

Signed-off-by: lisiG9 <2442737518@qq.com>

* Add lookup_unpin for aborted requests in request_finished

Signed-off-by: lisiG9 <2442737518@qq.com>

* Fix double unpin for aborted requests in kv_consumer role

Skip lookup_unpin for FINISHED_ABORTED requests in wait_for_save() since
they are already unpinned in request_finished(). This prevents pin count
from becoming negative and fixes k3-multiprocess-test failures.

Signed-off-by: lisiG9 <86335240+lisiG9@users.noreply.github.com>

* Refactor request unpinning and cancellation logic

Signed-off-by: lisiG9 <86335240+lisiG9@users.noreply.github.com>

---------

Signed-off-by: lisiG9 <2442737518@qq.com>
Signed-off-by: lisiG9 <86335240+lisiG9@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
ianliuy added a commit to ianliuy/LMCache that referenced this pull request Apr 12, 2026
In layerwise retrieval with LocalCPU backend, the unpin loop at
cache_engine.py:1040-1042 was designed for disk-loaded staging objects
(added in LMCache#2611). However, LocalCPUBackend.batched_get_non_blocking()
returns the same Python object from hot_cache that lookup(pin=True) had
already pinned, causing retrieve_layer() to unpin it once, and then
wait_for_save() to unpin it again via lookup_unpin() (LMCache#2786).

This double unpin drives pin_count to -1 and, more critically, triggers
a premature free() of the memory object (unpin() calls free() when both
pin_count <= 0 and ref_count <= 0).

Fix: guard the unpin with location != 'LocalCPUBackend' so that only
disk-loaded staging objects (LocalDisk, Maru, etc.) are unpinned here.
LocalCPU objects retain their pin until lookup_unpin() releases them in
wait_for_save(), preserving the correct single-free lifecycle.

Fixes LMCache#2954

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
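
The guard described above might look roughly like this (a sketch; the location check and loop shape follow the commit message, not the actual cache_engine.py code):

for memory_obj in retrieved_memory_objs:
    # Only disk-loaded staging objects (LocalDisk, Maru, ...) take the extra
    # unpin here; LocalCPU objects keep their pin until lookup_unpin() runs
    # in wait_for_save(), preserving the single-free lifecycle.
    if memory_obj.location != "LocalCPUBackend":
        memory_obj.unpin()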
