
[Feat] staging buffer mode when local_cpu is false for disk backend #2370

Closed
DongDongJu wants to merge 5 commits into LMCache:dev from DongDongJu:feat/staging-buffer-mode

Conversation

@DongDongJu
Collaborator

What this PR does / why we need it:

This PR introduces a new use_only_staging_buffer configuration flag that enables disk-only caching mode in LMCache.
When enabled with local_cpu=false, CPU memory is used only as a temporary staging buffer for GPU-to-disk transfers, and all cache lookups go directly to disk.

Problem

Currently, even with local_cpu=false, CPU memory still participates in cache lookups. This means:

  • Data retrieved from disk is written back to CPU cache
  • Subsequent lookups hit CPU cache instead of disk
  • CPU memory usage grows over time with cached data

This behavior is problematic when users want pure disk-only caching for scenarios like:

  • Limited CPU memory environments
  • Testing disk backend performance in isolation
  • Ensuring data persistence on disk without CPU cache interference

Solution

Add use_only_staging_buffer flag that, when combined with local_cpu=false:

  1. Skips CPU backend in lookups - Cache lookups go directly to disk
  2. Disables write-back - Data retrieved from disk is not cached in CPU
  3. Releases staging buffer after disk write - CPU memory is freed after GPU→Disk transfer completes
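The lookup-path change above can be sketched as follows. This is a minimal illustration, not the actual LMCache code: the `Config` class and the list-of-names signature are assumptions, with only the backend names and flag semantics taken from this PR's description.

```python
# Hypothetical sketch of lookup-path backend selection; `Config` and the
# backend names mirror this PR's description, not the exact LMCache code.
class Config:
    def __init__(self, local_cpu: bool, use_only_staging_buffer: bool):
        self.local_cpu = local_cpu
        self.use_only_staging_buffer = use_only_staging_buffer

def get_active_storage_backends(config: Config, backends: list) -> list:
    """Return the backends that should participate in cache lookups."""
    staging_only = (not config.local_cpu) and config.use_only_staging_buffer
    if staging_only:
        # CPU serves only as a staging buffer, so skip it in lookups.
        return [b for b in backends if b != "LocalCPUBackend"]
    return list(backends)

backends = ["LocalCPUBackend", "LocalDiskBackend"]
cfg = Config(local_cpu=False, use_only_staging_buffer=True)
print(get_active_storage_backends(cfg, backends))  # ['LocalDiskBackend']
```

With the flag off, the same call returns both backends, preserving the legacy behavior.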

Data Flow Comparison

┌─────────────────────────────────────────────────────────────────────────────┐
│                    BEFORE (local_cpu=false only)                            │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  PUT (Store KV Cache):                                                      │
│  ┌─────┐    ┌─────────────┐    ┌──────┐                                     │
│  │ GPU │───►│ CPU (stage) │───►│ Disk │                                     │
│  └─────┘    └─────────────┘    └──────┘                                     │
│                   │                                                         │
│                   └── CPU keeps data (memory grows)                         │
│                                                                             │
│  GET (Lookup & Retrieve):                                                   │
│  ┌──────┐    ┌─────────────┐    ┌─────┐                                     │
│  │ Disk │───►│ CPU (cache) │───►│ GPU │                                     │
│  └──────┘    └─────────────┘    └─────┘                                     │
│                   │                                                         │
│                   └── Write-back to CPU (unexpected CPU hits later)         │
│                                                                             │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────────────────┐
│           AFTER (local_cpu=false + use_only_staging_buffer=true)            │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  PUT (Store KV Cache):                                                      │
│  ┌─────┐    ┌─────────────┐    ┌──────┐                                     │
│  │ GPU │───►│ CPU (stage) │───►│ Disk │                                     │
│  └─────┘    └─────────────┘    └──────┘                                     │
│                   │                                                         │
│                   └──────────────────── CPU buffer released after write     │
│                                                                             │
│  GET (Lookup & Retrieve):                                                   │
│  ┌──────┐    ┌─────────────┐    ┌─────┐                                     │
│  │ Disk │───►│ CPU (stage) │───►│ GPU │                                     │
│  └──────┘    └─────────────┘    └─────┘                                     │
│                   │                                                         │
│                   └── No write-back, staging only                           │
│                                                                             │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘

Configuration Comparison

| Configuration | CPU in Lookups | Behavior |
| --- | --- | --- |
| `local_cpu: true` | Yes | Normal CPU+Disk tiered caching |
| `local_cpu: false` | Yes | CPU still participates in lookups (legacy) |
| `local_cpu: false` + `use_only_staging_buffer: true` | No | Disk-only caching (CPU is staging buffer only) |

Usage

Environment Variables:

```bash
export LMCACHE_LOCAL_CPU=false
export LMCACHE_USE_ONLY_STAGING_BUFFER=true
```

Configuration File:

```yaml
local_cpu: false
use_only_staging_buffer: true
local_disk: "file:///path/to/disk/cache/"
max_local_disk_size: 100.0
```

Special notes for your reviewers:

  1. Default value is false, so existing behavior is unchanged
  2. Used completion callback pattern to release staging buffer after async disk write completes, avoiding race conditions
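The completion-callback pattern from note 2 can be illustrated as below. Everything here is a hedged sketch under assumed names: the real `async_save_bytes_to_disk` takes `CacheEngineKey`/`MemoryObj` objects, while this toy version uses plain strings and bytes.

```python
# Illustrative sketch of the completion-callback pattern (hypothetical
# names and types; not the actual LMCache implementation).
import threading
from concurrent.futures import ThreadPoolExecutor

executor = ThreadPoolExecutor(max_workers=1)
staging_buffers = {"key1": bytearray(b"kv-cache-bytes")}
released = threading.Event()

def async_save_bytes_to_disk(key, data, on_complete_callback=None):
    def _write():
        # ... write `data` for `key` to disk here ...
        if on_complete_callback is not None:
            # Fired only after the write finishes, so the staging buffer
            # is never freed while the disk write is still in flight.
            on_complete_callback(key)
    return executor.submit(_write)

def release_staging_buffer(key):
    staging_buffers.pop(key, None)  # free the CPU staging memory
    released.set()

future = async_save_bytes_to_disk(
    "key1", staging_buffers["key1"],
    on_complete_callback=release_staging_buffer,
)
released.wait(timeout=5)
print("key1" in staging_buffers)  # False: buffer released after write
```

Invoking the callback from the writer thread, after the write, is what avoids the race between the asynchronous disk write and the buffer release.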

If applicable:

  • this PR contains user facing changes - docs added
  • this PR contains unit tests

Dongjoo Seo and others added 5 commits January 8, 2026 16:25
Add a new configuration option `use_only_staging_buffer` that, when enabled
with `local_cpu=False`, makes CPU memory serve only as a staging buffer for
disk/remote backends. This prevents CPU cache hits when the intent is to use
disk as the persistent storage tier.

Signed-off-by: Dongjoo Seo <dongjoo.seo1@samsung.com>
Co-authored-by: DongDongJu <commisori28@gmail.com>
Add optional on_complete_callback parameter to submit_put_task,
batched_submit_put_task, and async_save_bytes_to_disk methods.
This callback is invoked after the disk write completes, enabling
callers to perform cleanup actions such as releasing staging buffers.

Signed-off-by: Dongjoo Seo <dongjoo.seo1@samsung.com>
Co-authored-by: DongDongJu <commisori28@gmail.com>
When use_only_staging_buffer is enabled with local_cpu=False:
- Skip LocalCPUBackend in get_active_storage_backends for lookups/gets
- Disable write-back to CPU cache after disk retrieval
- Release CPU staging buffer entries after disk write completes via callback

This ensures CPU memory is used only as a temporary staging buffer for
GPU-to-disk transfers, with all cache lookups going directly to disk.

Signed-off-by: Dongjoo Seo <dongjoo.seo1@samsung.com>
Co-authored-by: DongDongJu <commisori28@gmail.com>
Document the new use_only_staging_buffer configuration option:
- Explain the difference between local_cpu=false and staging buffer mode
- Add configuration comparison table
- Update disk-offload.yaml example with the new flag
- Clarify CPU and Disk interaction behavior

Signed-off-by: Dongjoo Seo <dongjoo.seo1@samsung.com>
Co-authored-by: DongDongJu <commisori28@gmail.com>
@gemini-code-assist
Contributor

Summary of Changes

Hello @DongDongJu, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request enhances LMCache's disk backend functionality by introducing a "staging buffer mode." This mode, activated by a new use_only_staging_buffer flag in conjunction with local_cpu=false, allows users to configure LMCache for pure disk-only caching. This is particularly useful for environments with limited CPU memory or when isolating disk backend performance, as it prevents CPU memory from accumulating cached data and ensures all cache interactions are directed to disk.

Highlights

  • Disk-Only Caching Mode: Introduces a new use_only_staging_buffer flag that, when combined with local_cpu=false, enables a pure disk-only caching mode in LMCache.
  • CPU as Staging Buffer: CPU memory is now used exclusively as a temporary staging buffer for GPU-to-disk transfers, rather than participating in cache lookups or persistent storage.
  • Optimized Cache Lookups: Cache lookups bypass the CPU backend entirely, going directly to disk, and data retrieved from disk is no longer written back to CPU cache.
  • Memory Management: CPU memory used for staging is released immediately after the asynchronous disk write completes, minimizing CPU memory footprint.
  • Backward Compatibility: The new use_only_staging_buffer flag defaults to false, ensuring existing behavior remains unchanged for current configurations.


@DongDongJu DongDongJu requested review from ApostaC and YaoJiayi January 8, 2026 16:49
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces a use_only_staging_buffer flag to enable a true disk-only caching mode, which is a valuable addition. The implementation is well-executed: it correctly modifies the storage manager to bypass the CPU cache for lookups and write-backs and adds a callback mechanism to the disk backend for releasing the CPU staging buffer post-write. The documentation updates are also clear and comprehensive. My feedback primarily focuses on improving code readability by refactoring a repeated complex conditional check.

Comment thread lmcache/v1/storage_backend/storage_manager.py
Comment thread lmcache/v1/storage_backend/storage_manager.py
Comment thread lmcache/v1/storage_backend/storage_manager.py
Contributor

Nice documentation 👍

Contributor

@ApostaC ApostaC left a comment


Did a quick review. Please see the details, thanks!

```python
    self,
    key: CacheEngineKey,
    memory_obj: MemoryObj,
    on_complete_callback: Optional[Callable[[CacheEngineKey], None]] = None,
```
Contributor

I think this function is inherited from the base class. Do we want to add the argument into the base class as well?

Contributor

Additionally, we probably need some clarification in the doc string about when the callback will be triggered (e.g., after each object has finished putting, or after all the objects have finished putting).

Collaborator Author

I do agree. will do

Comment on lines +433 to +446
```python
# Pass callback to disk backend for staging buffer release
if (
    backend_name == "LocalDiskBackend"
    and staging_buffer_callback is not None
):
    disk_backend = cast("LocalDiskBackend", backend)
    disk_backend.batched_submit_put_task(
        ks,
        objs,
        transfer_spec=transfer_spec,
        on_complete_callback=staging_buffer_callback,
    )
else:
    backend.batched_submit_put_task(ks, objs, transfer_spec=transfer_spec)
```
Contributor

Will the same callback abstraction work for other L2 storage backends?

Collaborator Author

lemme check

Comment on lines +467 to +474
```python
# Write-back to CPU cache (skip if use_only_staging_buffer is enabled)
if (
    backend_name not in ["LocalCPUBackend", "PDBackend"]
    and "LocalCPUBackend" in self.storage_backends
    and not (
        not self.config.local_cpu
        and self.config.use_only_staging_buffer
    )
```
Contributor

This part of the code seems to appear multiple times. Maybe consider having a helper function like self._should_write_back_to_cpu()
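One possible shape for such a helper, sketched under stated assumptions: the `StorageManager` wiring and `SimpleNamespace` config stand-in are hypothetical, and only the conditional itself mirrors the diff excerpt.

```python
# Sketch of the suggested _should_write_back_to_cpu helper; the class
# wiring here is hypothetical, only the conditional mirrors the diff.
from types import SimpleNamespace

class StorageManager:
    def __init__(self, config, storage_backends):
        self.config = config
        self.storage_backends = storage_backends

    def _should_write_back_to_cpu(self, backend_name: str) -> bool:
        """Whether data fetched from `backend_name` should be cached in CPU."""
        staging_only = (
            not self.config.local_cpu and self.config.use_only_staging_buffer
        )
        return (
            backend_name not in ("LocalCPUBackend", "PDBackend")
            and "LocalCPUBackend" in self.storage_backends
            and not staging_only
        )

cfg = SimpleNamespace(local_cpu=False, use_only_staging_buffer=True)
mgr = StorageManager(cfg, {"LocalCPUBackend": None, "LocalDiskBackend": None})
print(mgr._should_write_back_to_cpu("LocalDiskBackend"))  # False in staging-only mode
```

Centralizing the check keeps the staging-buffer semantics in one place if more call sites are added later.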

Collaborator Author

will do

@ApostaC
Contributor

ApostaC commented Jan 8, 2026

Actually, after a second thought, I started to think about whether we need to fully skip the CPU buffer.

At a high level, we should recommend people to use async loading when the KV cache is from L2 storage, like disk or remote. The async loading will first load the KV cache from L2 to L1 (CPU buffer), and then vLLM will load the KV cache to the GPU.

In general, I feel like use_only_staging_buffer is not compatible with the async mode (I could be wrong as well). That said, I'm still okay to proceed if the goal of this mode is to help with debugging and development.

@DongDongJu
Collaborator Author

DongDongJu commented Jan 8, 2026

> Actually, after a second thought, I started to think about whether we need to fully skip the CPU buffer.
>
> At a high level, we should recommend people to use async loading when the KV cache is from L2 storage, like disk or remote. The async loading will first load the KV cache from L2 to L1 (CPU buffer), and then vLLM will load the KV cache to the GPU.
>
> In general, I feel like use_only_staging_buffer is not compatible with the async mode (I could be wrong as well). That said, I'm still okay to proceed if the goal of this mode is to help with debugging and development.

Hello @ApostaC, Thanks for the quick review.
I didn't think about async mode. Let me make this compatible with that behavior as well.

@DongDongJu
Collaborator Author

After talking with @YaoJiayi and @ApostaC, I will start by building an abstraction for put completion for every backend first, and then revisit this.
