fix: use pin=False in _allocate_and_put to prevent pd_buffer leak (#2847)
sammshen merged 2 commits into LMCache:dev
Conversation
Code Review
This pull request correctly fixes a memory leak in the PDBackend by changing pin=True to pin=False during allocation checks for existing keys. The reasoning is sound, as pinning is unnecessary in a backend without an eviction policy and was causing a reference count leak. The change is correct and no issues were found in the implementation.
@hlin99 Could you please look into this one-line change?
@cursor review
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Signed-off-by: Ziwen Ning <ningziwe@amazon.com>
@sammshen Could you take a look?
LGTM :) But you'll need approval from the maintainers. It would also be nice if a unit test were added.
…Cache#2847) Signed-off-by: Ziwen Ning <ningziwe@amazon.com>

What this PR does / why we need it:
`_allocate_and_put()` on the decoder side leaks GPU buffer slots when it receives an `AllocRequest` containing keys that already exist in `self.data` (e.g., from a previous request with a shared prefix, or a multi-round conversation reusing cached KV).

When `contains(key, pin=True)` finds an existing key, it increments `ref_count` from 1 to 2. Nothing on the `already_sent_indexes` path calls `ref_count_down()` to undo it. Later, `remove()` sees `ref_count == 2` and skips the delete. `ref_count_down()` brings it to 1, but nobody calls `remove()` again, so the buffer slot is permanently leaked.

The fix changes `contains(key, pin=True)` to `contains(key, pin=False)`. Pinning is unnecessary here because PDBackend has no eviction mechanism. Unlike `LocalCPUBackend`, which evicts existing entries when the buffer is full and uses `pin` to protect entries being read from concurrent eviction, PDBackend's paged GPU allocator only frees blocks explicitly via `remove()` after the decode forward pass consumes them. There is no concurrent eviction thread that could remove an entry between the `contains()` check and the NIXL write completing. The upstream `storage_manager` already enforces this (`pin_in_backend = pin if backend_name != "PDBackend" else False`), so PDBackend is never pinned on the lookup/retrieve path either.

Special notes for your reviewers:
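For illustration, the `storage_manager` guard quoted above behaves like this standalone sketch (the function form here is hypothetical; upstream it is an inline assignment):

```python
def pin_in_backend(backend_name: str, pin: bool) -> bool:
    # Mirrors the guard quoted from storage_manager: PDBackend is never
    # pinned on the lookup/retrieve path, regardless of the caller's pin.
    return pin if backend_name != "PDBackend" else False

print(pin_in_backend("LocalCPUBackend", True))  # True
print(pin_in_backend("PDBackend", True))        # False
```

The same reasoning is what makes `pin=False` safe on the allocation path in this PR: the one backend with no eviction never needs pin protection.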
The ref_count lifecycle on the decoder side is:
1. `allocate()` creates a MemoryObj with `ref_count=1`
2. `put()` stores it in `self.data`
3. `get_blocking()` returns it (no ref_count change)
4. `batched_to_gpu()` copies the data to vLLM's paged KV buffer
5. `remove()` checks `ref_count == 1` → `del self.data[key]`
6. `ref_count_down()` drops ref_count 1→0, triggering `parent_allocator.free()`, which returns the block to `free_blocks`

With `pin=True` on a repeated key (step 2 re-entered via `contains()`), `ref_count` goes to 2, step 5 skips the delete, and the chain is broken.

How to reproduce:
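The lifecycle above, and how `pin=True` breaks it, can be simulated with a small self-contained mock (hypothetical `MiniPDBackend`; the names follow the PR text, but this is not the real LMCache code):

```python
class MiniPDBackend:
    """Toy model of the decoder-side buffer map (hypothetical names,
    simplified shapes; the real LMCache PDBackend differs)."""

    def __init__(self, num_blocks):
        self.free_blocks = list(range(num_blocks))
        self.data = {}       # key -> ref_count (stands in for MemoryObj)
        self.block_of = {}   # key -> allocated block id

    def contains(self, key, pin=False):
        if key not in self.data:
            return False
        if pin:
            self.data[key] += 1  # nothing on the already-sent path undoes this
        return True

    def allocate_and_put(self, key, pin):
        # Steps 1+2: allocate() (ref_count=1) then put() into self.data,
        # unless contains() reports the key was already sent.
        if self.contains(key, pin=pin):
            return
        self.block_of[key] = self.free_blocks.pop()
        self.data[key] = 1

    def consume(self, key):
        # Steps 5+6: remove() deletes only when unpinned, then the final
        # ref_count_down() returns the block to free_blocks.
        if self.data[key] == 1:
            del self.data[key]
            self.free_blocks.append(self.block_of.pop(key))
        else:
            self.data[key] -= 1  # ref_count 2 -> 1; block is never freed

buggy = MiniPDBackend(num_blocks=4)
buggy.allocate_and_put("chunk0", pin=True)   # first request: fresh allocate
buggy.allocate_and_put("chunk0", pin=True)   # shared prefix: ref_count 1 -> 2
buggy.consume("chunk0")                      # remove() skips delete: leaked

fixed = MiniPDBackend(num_blocks=4)
fixed.allocate_and_put("chunk0", pin=False)
fixed.allocate_and_put("chunk0", pin=False)  # already sent, no pin taken
fixed.consume("chunk0")                      # block returns to free_blocks

print(len(buggy.free_blocks), len(fixed.free_blocks))  # 3 4
```

With `pin=True` the mock's block count never recovers; with `pin=False` every block comes back after `consume()`, which is exactly the behavior the one-line fix restores.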
Use the existing `examples/disagg_prefill/1p1d/` setup:

Then send multiple requests that share a common prefix (e.g., the same system prompt). The key is that different requests produce overlapping chunk keys in the decoder's `_allocate_and_put`:
_allocate_and_put:Before (pin=True) — decoder log shows:
After (pin=False) — decoder log is clean:
The leak is cumulative. With a 2 GB pd_buffer and 256-token chunks, the buffer exhausts after ~30-50 requests, depending on prefix overlap. Once exhausted, all subsequent `allocate()` calls block and no new requests complete.
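A back-of-the-envelope model of that cumulative exhaustion (illustrative numbers only; the slot count and per-request leak rate below are assumptions, not measured values from this PR):

```python
def requests_until_exhaustion(total_slots: int, leaked_per_request: int) -> int:
    # Each affected request permanently strands `leaked_per_request`
    # buffer slots; allocate() blocks once too few slots remain.
    served = 0
    free = total_slots
    while free >= leaked_per_request:
        free -= leaked_per_request
        served += 1
    return served

# If a 2 GB pd_buffer holds on the order of 40 chunk-sized slots and a
# typical shared-prefix request strands one slot, service stalls after
# roughly 40 requests, in the same ballpark as the ~30-50 observed above.
print(requests_until_exhaustion(40, 1))  # 40
```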
Note
Low Risk
Low risk: one-line change limited to PD disaggregation allocation flow; main risk is unintended change in lifetime/ref-count behavior for reused cache keys.
Overview
Prevents PD decoder buffer slots from being leaked when an `AllocRequest` includes keys that already exist, by changing `_allocate_and_put()` to call `contains(key, pin=False)` instead of pinning/ref-counting existing entries.

This ensures the "already sent" path does not increment `MemoryObj` ref counts, allowing subsequent `remove()`/free to reclaim paged GPU blocks normally.

Reviewed by Cursor Bugbot for commit 339764e.