fix: use pin=False in _allocate_and_put to prevent pd_buffer leak #2847

Merged

sammshen merged 2 commits into LMCache:dev from ningziwen:upstream/gap9-buffer-leak-fix on Apr 8, 2026

Conversation

@ningziwen
Contributor

@ningziwen ningziwen commented Mar 23, 2026

What this PR does / why we need it:

_allocate_and_put() on the decoder side leaks GPU buffer slots when it receives an AllocRequest containing keys that already exist in self.data (e.g., from a previous request with a shared prefix, or a multi-round conversation reusing cached KV).

When contains(key, pin=True) finds an existing key, it increments ref_count from 1 to 2. Nothing on the already_sent_indexes path calls ref_count_down() to undo it. Later, remove() sees ref_count == 2 and skips the delete. ref_count_down() brings it to 1, but nobody calls remove() again — the buffer slot is permanently leaked.

The fix changes contains(key, pin=True) to contains(key, pin=False). Pinning is unnecessary here because PDBackend has no eviction mechanism. Unlike LocalCPUBackend, which evicts existing entries when the buffer is full and uses pin to protect entries being read from concurrent eviction, PDBackend's paged GPU allocator only frees blocks explicitly via remove() after the decode forward pass consumes them. There is no concurrent eviction thread that could remove an entry between the contains() check and the NIXL write completing. The upstream storage_manager already enforces this — pin_in_backend = pin if backend_name != "PDBackend" else False — PDBackend is never pinned on the lookup/retrieve path either.
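For reference, the change itself is one line in lmcache/v1/storage_backend/pd_backend.py. A minimal sketch of the affected path, reconstructed from the description above (surrounding lines and names like idx are illustrative, not the exact source):

# Inside PDBackend._allocate_and_put(), for each key in the AllocRequest
-        if self.contains(key, pin=True):   # bumps ref_count 1 -> 2; nothing undoes it
+        if self.contains(key, pin=False):  # pure lookup; ref_count stays at 1
             already_sent_indexes.append(idx)  # key already cached, skip allocation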

Special notes for your reviewers:

The ref_count lifecycle on the decoder side is:

  1. allocate() creates MemoryObj with ref_count=1
  2. put() stores it in self.data
  3. get_blocking() returns it (no ref_count change)
  4. batched_to_gpu() copies data to vLLM's paged KV buffer
  5. remove() checks ref_count == 1 → del self.data[key]
  6. ref_count_down() → ref_count 1→0 → triggers parent_allocator.free() → block returns to free_blocks

With pin=True, the repeated contains() check for an already-stored key pushes ref_count to 2, so step 5 skips the delete and the chain is broken.
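A self-contained toy model makes the break concrete. Everything below is invented for illustration (the real PDBackend, allocator, and MemoryObj differ); it only mirrors the lifecycle steps above, with steps 5 and 6 combined in remove():

# Toy model of the decoder-side ref_count lifecycle; all names are illustrative.
class ToySlot:
    def __init__(self):
        self.ref_count = 1                # step 1: allocate() -> ref_count = 1

class ToyBackend:
    def __init__(self):
        self.free_blocks = 10             # stand-in for the paged GPU allocator
        self.data = {}

    def allocate(self):
        self.free_blocks -= 1
        return ToySlot()

    def put(self, key, slot):
        self.data[key] = slot             # step 2: stored in self.data

    def contains(self, key, pin):
        slot = self.data.get(key)
        if slot is not None and pin:
            slot.ref_count += 1           # the leak: nothing ever undoes this
        return slot is not None

    def remove(self, key):                # steps 5-6 combined
        slot = self.data[key]
        if slot.ref_count == 1:
            del self.data[key]            # step 5: delete only at ref_count == 1
        slot.ref_count -= 1               # step 6: 1 -> 0 frees the block
        if slot.ref_count == 0:
            self.free_blocks += 1

backend = ToyBackend()
backend.put("chunk-0", backend.allocate())
backend.contains("chunk-0", pin=True)     # repeat request: ref_count 1 -> 2
backend.remove("chunk-0")                 # skips the delete, ref_count 2 -> 1
print(backend.free_blocks, "chunk-0" in backend.data)
# pin=True  -> 9 True:   the slot is stranded forever
# pin=False -> 10 False: the slot is reclaimed as expected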

How to reproduce:

Use the existing examples/disagg_prefill/1p1d/ setup:

# Start the 1P1D DPD setup (prefiller on GPU 0, decoder on GPU 1)
cd examples/disagg_prefill/1p1d
bash disagg_example_1p1d.sh

Then send multiple requests that share a common prefix (e.g., same system prompt). The key is that different requests produce overlapping chunk keys in the decoder's _allocate_and_put:

# Send requests with shared prefix — repeat 10+ times
for i in $(seq 1 20); do
  curl -s http://localhost:9100/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
      "model": "meta-llama/Llama-3.1-8B-Instruct",
      "prompt": "You are a helpful assistant. Explain the concept of '"request_$i"' in detail.",
      "max_tokens": 256
    }' &
done
wait

Before (pin=True) — decoder log shows:

WARNING ... PDBackend.remove SKIPPED for key ... (ref_count=2)
WARNING ... PDBackend.remove SKIPPED for key ... (ref_count=2)
# After enough requests, allocate() blocks indefinitely — no new completions

After (pin=False) — decoder log is clean:

# No "remove SKIPPED" warnings. Buffer slots freed after each decode completes.
# Requests keep completing; the buffer is never exhausted.

The leak is cumulative. With a 2GB pd_buffer and 256-token chunks, the buffer exhausts after ~30-50 requests depending on prefix overlap. Once exhausted, all subsequent allocate() calls block and no new requests complete.
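As a sanity check on those numbers, here is a back-of-envelope estimate. It assumes Llama-3.1-8B's KV layout (32 layers, 8 KV heads, head dim 128, bf16); the actual pd_buffer accounting may differ:

# Rough estimate of how fast leaked 256-token chunks exhaust a 2 GiB pd_buffer.
layers, kv_heads, head_dim = 32, 8, 128   # assumed Llama-3.1-8B KV config
kv_planes, bytes_per_elem = 2, 2          # K and V planes, bf16
bytes_per_token = layers * kv_heads * head_dim * kv_planes * bytes_per_elem
chunk_bytes = 256 * bytes_per_token       # one 256-token chunk
slots = (2 * 1024**3) // chunk_bytes      # chunks that fit in 2 GiB
print(bytes_per_token, chunk_bytes // 2**20, slots)  # 131072 B/token, 32 MiB, 64
# Leaking one or two shared-prefix chunks per request strands ~64 slots
# within roughly 30-60 requests, consistent with the ~30-50 observed above.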

If applicable:

  • this PR contains user-facing changes - docs added
  • this PR contains unit tests

Note

Low Risk
One-line change limited to the PD disaggregation allocation flow; the main risk is an unintended change in lifetime/ref-count behavior for reused cache keys.

Overview
Prevents PD decoder buffer slots from being leaked when an AllocRequest includes keys that already exist by changing _allocate_and_put() to call contains(key, pin=False) instead of pinning/ref-counting existing entries.

This ensures the "already sent" path does not increment MemoryObj ref counts, allowing subsequent remove()/free to reclaim paged GPU blocks normally.

Reviewed by Cursor Bugbot for commit 339764e.

@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request resolves a critical GPU buffer leak within the PDBackend's allocation mechanism. By adjusting a single parameter in a key lookup function, it fixes an issue where existing data entries were incorrectly pinned, leading to ref_count mismanagement and preventing the release of GPU memory. This ensures the stable operation of the distributed prefill setup by allowing buffer slots to be properly recycled, preventing service exhaustion and indefinite blocking of new requests.

Highlights

  • Fixes GPU buffer leak in PDBackend: The _allocate_and_put method on the decoder side was leaking GPU buffer slots when contains(key, pin=True) was called for existing keys. This incorrectly incremented the ref_count, preventing proper deallocation of buffer slots.
  • Corrects pin usage in _allocate_and_put: The fix changes self.contains(key, pin=True) to self.contains(key, pin=False). Pinning is unnecessary for PDBackend as it lacks an eviction mechanism, ensuring ref_count is managed correctly and buffer slots are freed after use.

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request correctly fixes a memory leak in the PDBackend by changing pin=True to pin=False during allocation checks for existing keys. The reasoning is sound, as pinning is unnecessary in a backend without an eviction policy and was causing a reference count leak. The change is correct and no issues were found in the implementation.

@ningziwen ningziwen force-pushed the upstream/gap9-buffer-leak-fix branch 2 times, most recently from 1ffb041 to 2175bbb on March 26, 2026 01:11
@ningziwen
Contributor Author

@hlin99 Could you please look into this one-line change?

@ningziwen ningziwen force-pushed the upstream/gap9-buffer-leak-fix branch from 2175bbb to 54bee70 on March 27, 2026 09:03
@sammshen
Contributor

@cursor review


@cursor cursor Bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.


Comment thread lmcache/v1/storage_backend/pd_backend.py
Signed-off-by: Ziwen Ning <ningziwe@amazon.com>
@ningziwen ningziwen force-pushed the upstream/gap9-buffer-leak-fix branch from 1fce322 to eec8164 on April 6, 2026 22:44
Collaborator

@DongDongJu DongDongJu left a comment


LGTM.

@ningziwen
Contributor Author

@sammshen Could you take a look?

@hlin99
Contributor

hlin99 commented Apr 7, 2026

@hlin99 Could you please look into this one-line change?

LGTM :) but you need approval from maintainers... but it would be nice if a unit test is added.

Contributor

@sammshen sammshen left a comment


LGTM!

@sammshen sammshen enabled auto-merge (squash) April 7, 2026 22:40
@github-actions github-actions Bot added the full Run comprehensive tests on this PR label Apr 7, 2026
@sammshen sammshen merged commit b4d95dc into LMCache:dev Apr 8, 2026
35 checks passed
Oasis-Git pushed a commit to Oasis-Git/LMCache that referenced this pull request Apr 13, 2026