fix: use pin=False in _allocate_and_put to prevent pd_buffer leak (#2847)
sammshen merged 2 commits into LMCache:dev
Conversation
Code Review
This pull request correctly fixes a memory leak in the PDBackend by changing pin=True to pin=False during allocation checks for existing keys. The reasoning is sound, as pinning is unnecessary in a backend without an eviction policy and was causing a reference count leak. The change is correct and no issues were found in the implementation.
@hlin99 Could you please look into this one-line change?
@cursor review
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Signed-off-by: Ziwen Ning <ningziwe@amazon.com>
@sammshen Could you take a look?
LGTM :) But you'll need approval from the maintainers. It would also be nice if a unit test were added.
…Cache#2847) Signed-off-by: Ziwen Ning <ningziwe@amazon.com>

What this PR does / why we need it:
`_allocate_and_put()` on the decoder side leaks GPU buffer slots when it receives an `AllocRequest` containing keys that already exist in `self.data` (e.g., from a previous request with a shared prefix, or a multi-round conversation reusing cached KV).

When `contains(key, pin=True)` finds an existing key, it increments `ref_count` from 1 to 2. Nothing on the `already_sent_indexes` path calls `ref_count_down()` to undo it. Later, `remove()` sees `ref_count == 2` and skips the delete. `ref_count_down()` brings it to 1, but nobody calls `remove()` again, so the buffer slot is permanently leaked.

The fix changes `contains(key, pin=True)` to `contains(key, pin=False)`. Pinning is unnecessary here because PDBackend has no eviction mechanism. Unlike `LocalCPUBackend`, which evicts existing entries when the buffer is full and uses `pin` to protect entries being read from concurrent eviction, PDBackend's paged GPU allocator only frees blocks explicitly via `remove()` after the decode forward pass consumes them. There is no concurrent eviction thread that could remove an entry between the `contains()` check and the NIXL write completing. The upstream `storage_manager` already enforces this (`pin_in_backend = pin if backend_name != "PDBackend" else False`), so PDBackend is never pinned on the lookup/retrieve path either.

Special notes for your reviewers:
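For illustration, the `storage_manager` guard quoted above behaves like this standalone sketch (the function form here is hypothetical; upstream it is an inline assignment):

```python
def pin_in_backend(backend_name: str, pin: bool) -> bool:
    # Mirrors the guard quoted from storage_manager: PDBackend is never
    # pinned on the lookup/retrieve path, regardless of the caller's pin.
    return pin if backend_name != "PDBackend" else False

print(pin_in_backend("LocalCPUBackend", True))  # True
print(pin_in_backend("PDBackend", True))        # False
```

The same reasoning is what makes `pin=False` safe on the allocation path in this PR: the one backend with no eviction never needs pin protection.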
The ref_count lifecycle on the decoder side is:
1. `allocate()` creates a MemoryObj with `ref_count=1`
2. `put()` stores it in `self.data`
3. `get_blocking()` returns it (no ref_count change)
4. `batched_to_gpu()` copies the data to vLLM's paged KV buffer
5. `remove()` checks `ref_count == 1` → `del self.data[key]`
6. `ref_count_down()` drops ref_count 1→0, triggering `parent_allocator.free()`, which returns the block to `free_blocks`

With `pin=True` on a repeated key (step 2 re-entered via `contains()`), `ref_count` goes to 2, step 5 skips the delete, and the chain is broken.

How to reproduce:
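The lifecycle above, and how `pin=True` breaks it, can be simulated with a small self-contained mock (hypothetical `MiniPDBackend`; the names follow the PR text, but this is not the real LMCache code):

```python
class MiniPDBackend:
    """Toy model of the decoder-side buffer map (hypothetical names,
    simplified shapes; the real LMCache PDBackend differs)."""

    def __init__(self, num_blocks):
        self.free_blocks = list(range(num_blocks))
        self.data = {}       # key -> ref_count (stands in for MemoryObj)
        self.block_of = {}   # key -> allocated block id

    def contains(self, key, pin=False):
        if key not in self.data:
            return False
        if pin:
            self.data[key] += 1  # nothing on the already-sent path undoes this
        return True

    def allocate_and_put(self, key, pin):
        # Steps 1+2: allocate() (ref_count=1) then put() into self.data,
        # unless contains() reports the key was already sent.
        if self.contains(key, pin=pin):
            return
        self.block_of[key] = self.free_blocks.pop()
        self.data[key] = 1

    def consume(self, key):
        # Steps 5+6: remove() deletes only when unpinned, then the final
        # ref_count_down() returns the block to free_blocks.
        if self.data[key] == 1:
            del self.data[key]
            self.free_blocks.append(self.block_of.pop(key))
        else:
            self.data[key] -= 1  # ref_count 2 -> 1; block is never freed

buggy = MiniPDBackend(num_blocks=4)
buggy.allocate_and_put("chunk0", pin=True)   # first request: fresh allocate
buggy.allocate_and_put("chunk0", pin=True)   # shared prefix: ref_count 1 -> 2
buggy.consume("chunk0")                      # remove() skips delete: leaked

fixed = MiniPDBackend(num_blocks=4)
fixed.allocate_and_put("chunk0", pin=False)
fixed.allocate_and_put("chunk0", pin=False)  # already sent, no pin taken
fixed.consume("chunk0")                      # block returns to free_blocks

print(len(buggy.free_blocks), len(fixed.free_blocks))  # 3 4
```

With `pin=True` the mock's block count never recovers; with `pin=False` every block comes back after `consume()`, which is exactly the behavior the one-line fix restores.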
Use the existing `examples/disagg_prefill/1p1d/` setup:

Then send multiple requests that share a common prefix (e.g., the same system prompt). The key is that different requests produce overlapping chunk keys in the decoder's `_allocate_and_put`:
_allocate_and_put:Before (pin=True) — decoder log shows:
After (pin=False) — decoder log is clean:
The leak is cumulative. With a 2 GB pd_buffer and 256-token chunks, the buffer exhausts after ~30-50 requests, depending on prefix overlap. Once exhausted, all subsequent `allocate()` calls block and no new requests complete.
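A back-of-the-envelope model of that cumulative exhaustion (illustrative numbers only; the slot count and per-request leak rate below are assumptions, not measured values from this PR):

```python
def requests_until_exhaustion(total_slots: int, leaked_per_request: int) -> int:
    # Each affected request permanently strands `leaked_per_request`
    # buffer slots; allocate() blocks once too few slots remain.
    served = 0
    free = total_slots
    while free >= leaked_per_request:
        free -= leaked_per_request
        served += 1
    return served

# If a 2 GB pd_buffer holds on the order of 40 chunk-sized slots and a
# typical shared-prefix request strands one slot, service stalls after
# roughly 40 requests, in the same ballpark as the ~30-50 observed above.
print(requests_until_exhaustion(40, 1))  # 40
```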
Note
Low Risk
Low risk: one-line change limited to PD disaggregation allocation flow; main risk is unintended change in lifetime/ref-count behavior for reused cache keys.
Overview
Prevents PD decoder buffer slots from being leaked when an `AllocRequest` includes keys that already exist, by changing `_allocate_and_put()` to call `contains(key, pin=False)` instead of pinning/ref-counting existing entries.

This ensures the "already sent" path does not increment `MemoryObj` ref counts, allowing subsequent `remove()`/free to reclaim paged GPU blocks normally.

Reviewed by Cursor Bugbot for commit 339764e.