
[Feat] staging buffer mode when local_cpu is false for disk backend #2370

Closed
DongDongJu wants to merge 5 commits into LMCache:dev from DongDongJu:feat/staging-buffer-mode

Conversation

@DongDongJu
Collaborator

What this PR does / why we need it:

This PR introduces a new use_only_staging_buffer configuration flag that enables disk-only caching mode in LMCache.
When enabled with local_cpu=false, CPU memory is used only as a temporary staging buffer for GPU-to-disk transfers, and all cache lookups go directly to disk.

Problem

Currently, even with local_cpu=false, CPU memory still participates in cache lookups. This means:

  • Data retrieved from disk is written back to CPU cache
  • Subsequent lookups hit CPU cache instead of disk
  • CPU memory usage grows over time with cached data

This behavior is problematic when users want pure disk-only caching for scenarios like:

  • Limited CPU memory environments
  • Testing disk backend performance in isolation
  • Ensuring data persistence on disk without CPU cache interference

Solution

Add use_only_staging_buffer flag that, when combined with local_cpu=false:

  1. Skips CPU backend in lookups - Cache lookups go directly to disk
  2. Disables write-back - Data retrieved from disk is not cached in CPU
  3. Releases staging buffer after disk write - CPU memory is freed after GPU→Disk transfer completes
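The lookup-path change above can be sketched as follows. This is a minimal illustration, not the actual LMCache code: the `Config` class and the list-of-names signature are assumptions, with only the backend names and flag semantics taken from this PR's description.

```python
# Hypothetical sketch of lookup-path backend selection; `Config` and the
# backend names mirror this PR's description, not the exact LMCache code.
class Config:
    def __init__(self, local_cpu: bool, use_only_staging_buffer: bool):
        self.local_cpu = local_cpu
        self.use_only_staging_buffer = use_only_staging_buffer

def get_active_storage_backends(config: Config, backends: list) -> list:
    """Return the backends that should participate in cache lookups."""
    staging_only = (not config.local_cpu) and config.use_only_staging_buffer
    if staging_only:
        # CPU serves only as a staging buffer, so skip it in lookups.
        return [b for b in backends if b != "LocalCPUBackend"]
    return list(backends)

backends = ["LocalCPUBackend", "LocalDiskBackend"]
cfg = Config(local_cpu=False, use_only_staging_buffer=True)
print(get_active_storage_backends(cfg, backends))  # ['LocalDiskBackend']
```

With the flag off, the same call returns both backends, preserving the legacy behavior.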

Data Flow Comparison

┌─────────────────────────────────────────────────────────────────────────────┐
│                    BEFORE (local_cpu=false only)                            │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  PUT (Store KV Cache):                                                      │
│  ┌─────┐    ┌─────────────┐    ┌──────┐                                     │
│  │ GPU │───►│ CPU (stage) │───►│ Disk │                                     │
│  └─────┘    └─────────────┘    └──────┘                                     │
│                   │                                                         │
│                   └── CPU keeps data (memory grows)                         │
│                                                                             │
│  GET (Lookup & Retrieve):                                                   │
│  ┌──────┐    ┌─────────────┐    ┌─────┐                                     │
│  │ Disk │───►│ CPU (cache) │───►│ GPU │                                     │
│  └──────┘    └─────────────┘    └─────┘                                     │
│                   │                                                         │
│                   └── Write-back to CPU (unexpected CPU hits later)         │
│                                                                             │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────────────────┐
│           AFTER (local_cpu=false + use_only_staging_buffer=true)            │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  PUT (Store KV Cache):                                                      │
│  ┌─────┐    ┌─────────────┐    ┌──────┐                                     │
│  │ GPU │───►│ CPU (stage) │───►│ Disk │                                     │
│  └─────┘    └─────────────┘    └──────┘                                     │
│                   │                                                         │
│                   └──────────────────── CPU buffer released after write     │
│                                                                             │
│  GET (Lookup & Retrieve):                                                   │
│  ┌──────┐    ┌─────────────┐    ┌─────┐                                     │
│  │ Disk │───►│ CPU (stage) │───►│ GPU │                                     │
│  └──────┘    └─────────────┘    └─────┘                                     │
│                   │                                                         │
│                   └── No write-back, staging only                           │
│                                                                             │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘

Configuration Comparison

| Configuration | CPU in Lookups | Behavior |
| --- | --- | --- |
| `local_cpu: true` | Yes | Normal CPU+Disk tiered caching |
| `local_cpu: false` | Yes | CPU still participates in lookups (legacy) |
| `local_cpu: false` + `use_only_staging_buffer: true` | No | Disk-only caching (CPU is staging buffer only) |

Usage

Environment Variables:

```bash
export LMCACHE_LOCAL_CPU=false
export LMCACHE_USE_ONLY_STAGING_BUFFER=true
```

Configuration File:

```yaml
local_cpu: false
use_only_staging_buffer: true
local_disk: "file:///path/to/disk/cache/"
max_local_disk_size: 100.0
```

Special notes for your reviewers:

  1. Default value is false, so existing behavior is unchanged
  2. Used completion callback pattern to release staging buffer after async disk write completes, avoiding race conditions
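The completion-callback pattern from note 2 can be illustrated as below. Everything here is a hedged sketch under assumed names: the real `async_save_bytes_to_disk` takes `CacheEngineKey`/`MemoryObj` objects, while this toy version uses plain strings and bytes.

```python
# Illustrative sketch of the completion-callback pattern (hypothetical
# names and types; not the actual LMCache implementation).
import threading
from concurrent.futures import ThreadPoolExecutor

executor = ThreadPoolExecutor(max_workers=1)
staging_buffers = {"key1": bytearray(b"kv-cache-bytes")}
released = threading.Event()

def async_save_bytes_to_disk(key, data, on_complete_callback=None):
    def _write():
        # ... write `data` for `key` to disk here ...
        if on_complete_callback is not None:
            # Fired only after the write finishes, so the staging buffer
            # is never freed while the disk write is still in flight.
            on_complete_callback(key)
    return executor.submit(_write)

def release_staging_buffer(key):
    staging_buffers.pop(key, None)  # free the CPU staging memory
    released.set()

future = async_save_bytes_to_disk(
    "key1", staging_buffers["key1"],
    on_complete_callback=release_staging_buffer,
)
released.wait(timeout=5)
print("key1" in staging_buffers)  # False: buffer released after write
```

Invoking the callback from the writer thread, after the write, is what avoids the race between the asynchronous disk write and the buffer release.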

If applicable:

  • this PR contains user facing changes - docs added
  • this PR contains unit tests

Dongjoo Seo and others added 5 commits January 8, 2026 16:25
Add a new configuration option `use_only_staging_buffer` that, when enabled
with `local_cpu=False`, makes CPU memory serve only as a staging buffer for
disk/remote backends. This prevents CPU cache hits when the intent is to use
disk as the persistent storage tier.

Signed-off-by: Dongjoo Seo <dongjoo.seo1@samsung.com>
Co-authored-by: DongDongJu <commisori28@gmail.com>
Add optional on_complete_callback parameter to submit_put_task,
batched_submit_put_task, and async_save_bytes_to_disk methods.
This callback is invoked after the disk write completes, enabling
callers to perform cleanup actions such as releasing staging buffers.

Signed-off-by: Dongjoo Seo <dongjoo.seo1@samsung.com>
Co-authored-by: DongDongJu <commisori28@gmail.com>
When use_only_staging_buffer is enabled with local_cpu=False:
- Skip LocalCPUBackend in get_active_storage_backends for lookups/gets
- Disable write-back to CPU cache after disk retrieval
- Release CPU staging buffer entries after disk write completes via callback

This ensures CPU memory is used only as a temporary staging buffer for
GPU-to-disk transfers, with all cache lookups going directly to disk.

Signed-off-by: Dongjoo Seo <dongjoo.seo1@samsung.com>
Co-authored-by: DongDongJu <commisori28@gmail.com>
Document the new use_only_staging_buffer configuration option:
- Explain the difference between local_cpu=false and staging buffer mode
- Add configuration comparison table
- Update disk-offload.yaml example with the new flag
- Clarify CPU and Disk interaction behavior

Signed-off-by: Dongjoo Seo <dongjoo.seo1@samsung.com>
Co-authored-by: DongDongJu <commisori28@gmail.com>
@gemini-code-assist
Contributor

Summary of Changes

Hello @DongDongJu, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request enhances LMCache's disk backend functionality by introducing a "staging buffer mode." This mode, activated by a new use_only_staging_buffer flag in conjunction with local_cpu=false, allows users to configure LMCache for pure disk-only caching. This is particularly useful for environments with limited CPU memory or when isolating disk backend performance, as it prevents CPU memory from accumulating cached data and ensures all cache interactions are directed to disk.

Highlights

  • Disk-Only Caching Mode: Introduces a new use_only_staging_buffer flag that, when combined with local_cpu=false, enables a pure disk-only caching mode in LMCache.
  • CPU as Staging Buffer: CPU memory is now used exclusively as a temporary staging buffer for GPU-to-disk transfers, rather than participating in cache lookups or persistent storage.
  • Optimized Cache Lookups: Cache lookups bypass the CPU backend entirely, going directly to disk, and data retrieved from disk is no longer written back to CPU cache.
  • Memory Management: CPU memory used for staging is released immediately after the asynchronous disk write completes, minimizing CPU memory footprint.
  • Backward Compatibility: The new use_only_staging_buffer flag defaults to false, ensuring existing behavior remains unchanged for current configurations.


@DongDongJu DongDongJu requested review from ApostaC and YaoJiayi January 8, 2026 16:49
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces a use_only_staging_buffer flag to enable a true disk-only caching mode, which is a valuable addition. The implementation is well-executed: it correctly modifies the storage manager to bypass the CPU cache for lookups and write-backs and adds a callback mechanism to the disk backend for releasing the CPU staging buffer post-write. The documentation updates are also clear and comprehensive. My feedback primarily focuses on improving code readability by refactoring a repeated complex conditional check.

Comment thread lmcache/v1/storage_backend/storage_manager.py
Comment thread lmcache/v1/storage_backend/storage_manager.py
Comment thread lmcache/v1/storage_backend/storage_manager.py
Contributor

Nice documentation 👍

Contributor

@ApostaC ApostaC left a comment


Did a quick review. Please see the details, thanks!

```python
    self,
    key: CacheEngineKey,
    memory_obj: MemoryObj,
    on_complete_callback: Optional[Callable[[CacheEngineKey], None]] = None,
```
Contributor

I think this function is inherited from the base class. Do we want to add the argument into the base class as well?

Contributor

Additionally, we probably need some clarification in the doc string about when the callback will be triggered (e.g., after each object has finished putting, or after all the objects have finished putting).

Collaborator Author

I do agree. will do

Comment on lines +433 to +446
```python
# Pass callback to disk backend for staging buffer release
if (
    backend_name == "LocalDiskBackend"
    and staging_buffer_callback is not None
):
    disk_backend = cast("LocalDiskBackend", backend)
    disk_backend.batched_submit_put_task(
        ks,
        objs,
        transfer_spec=transfer_spec,
        on_complete_callback=staging_buffer_callback,
    )
else:
    backend.batched_submit_put_task(ks, objs, transfer_spec=transfer_spec)
```
Contributor

Will the same callback abstraction work for other L2 storage backends?

Collaborator Author

lemme check

Comment on lines +467 to +474
```python
# Write-back to CPU cache (skip if use_only_staging_buffer is enabled)
if (
    backend_name not in ["LocalCPUBackend", "PDBackend"]
    and "LocalCPUBackend" in self.storage_backends
    and not (
        not self.config.local_cpu
        and self.config.use_only_staging_buffer
    )
```
Contributor

This part of the code seems to appear multiple times. Maybe consider having a helper function like self._should_write_back_to_cpu()
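One possible shape for such a helper, sketched under stated assumptions: the `StorageManager` wiring and `SimpleNamespace` config stand-in are hypothetical, and only the conditional itself mirrors the diff excerpt.

```python
# Sketch of the suggested _should_write_back_to_cpu helper; the class
# wiring here is hypothetical, only the conditional mirrors the diff.
from types import SimpleNamespace

class StorageManager:
    def __init__(self, config, storage_backends):
        self.config = config
        self.storage_backends = storage_backends

    def _should_write_back_to_cpu(self, backend_name: str) -> bool:
        """Whether data fetched from `backend_name` should be cached in CPU."""
        staging_only = (
            not self.config.local_cpu and self.config.use_only_staging_buffer
        )
        return (
            backend_name not in ("LocalCPUBackend", "PDBackend")
            and "LocalCPUBackend" in self.storage_backends
            and not staging_only
        )

cfg = SimpleNamespace(local_cpu=False, use_only_staging_buffer=True)
mgr = StorageManager(cfg, {"LocalCPUBackend": None, "LocalDiskBackend": None})
print(mgr._should_write_back_to_cpu("LocalDiskBackend"))  # False in staging-only mode
```

Centralizing the check keeps the staging-buffer semantics in one place if more call sites are added later.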

Collaborator Author

will do

@ApostaC
Contributor

ApostaC commented Jan 8, 2026

Actually, after a second thought, I started to think about whether we need to fully skip the CPU buffer.

At a high level, we should recommend people to use async loading when the KV cache is from L2 storage, like disk or remote. The async loading will first load the KV cache from L2 to L1 (CPU buffer), and then vLLM will load the KV cache to the GPU.

In general, I feel like use_only_staging_buffer is not compatible with the async mode (I could be wrong as well). That said, I'm still okay to proceed if the goal of this mode is to help with debugging and development.

@DongDongJu
Collaborator Author

DongDongJu commented Jan 8, 2026

> Actually, after a second thought, I started to think about whether we need to fully skip the CPU buffer.
>
> At a high level, we should recommend people to use async loading when the KV cache is from L2 storage, like disk or remote. The async loading will first load the KV cache from L2 to L1 (CPU buffer), and then vLLM will load the KV cache to the GPU.
>
> In general, I feel like use_only_staging_buffer is not compatible with the async mode (I could be wrong as well). That said, I'm still okay to proceed if the goal of this mode is to help with debugging and development.

Hello @ApostaC, Thanks for the quick review.
I didn't think about async mode. Let me make this compatible with that behavior as well.

@DongDongJu
Collaborator Author

After talking with @YaoJiayi and @ApostaC, I will start by building an abstraction for put completion for every backend first, and then revisit this.
