[Feat] staging buffer mode when local_cpu is false for diskbackend#2370
[Feat] staging buffer mode when local_cpu is false for diskbackend#2370DongDongJu wants to merge 5 commits intoLMCache:devfrom
Conversation
Add a new configuration option `use_only_staging_buffer` that, when enabled with `local_cpu=False`, makes CPU memory serve only as a staging buffer for disk/remote backends. This prevents CPU cache hits when the intent is to use disk as the persistent storage tier. Signed-off-by: Dongjoo Seo <dongjoo.seo1@samsung.com> Co-authored-by: DongDongJu <commisori28@gmail.com>
Add optional on_complete_callback parameter to submit_put_task, batched_submit_put_task, and async_save_bytes_to_disk methods. This callback is invoked after the disk write completes, enabling callers to perform cleanup actions such as releasing staging buffers. Signed-off-by: Dongjoo Seo <dongjoo.seo1@samsung.com> Co-authored-by: DongDongJu <commisori28@gmail.com>
When use_only_staging_buffer is enabled with local_cpu=False: - Skip LocalCPUBackend in get_active_storage_backends for lookups/gets - Disable write-back to CPU cache after disk retrieval - Release CPU staging buffer entries after disk write completes via callback This ensures CPU memory is used only as a temporary staging buffer for GPU-to-disk transfers, with all cache lookups going directly to disk. Signed-off-by: Dongjoo Seo <dongjoo.seo1@samsung.com> Co-authored-by: DongDongJu <commisori28@gmail.com>
Document the new use_only_staging_buffer configuration option: - Explain the difference between local_cpu=false and staging buffer mode - Add configuration comparison table - Update disk-offload.yaml example with the new flag - Clarify CPU and Disk interaction behavior Signed-off-by: Dongjoo Seo <dongjoo.seo1@samsung.com> Co-authored-by: DongDongJu <commisori28@gmail.com>
Summary of ChangesHello @DongDongJu, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request enhances LMCache's disk backend functionality by introducing a "staging buffer mode." This mode, activated by a new Highlights
🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console. Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Code Review
This pull request introduces a use_only_staging_buffer flag to enable a true disk-only caching mode, which is a valuable addition. The implementation is well-executed: it correctly modifies the storage manager to bypass the CPU cache for lookups and write-backs and adds a callback mechanism to the disk backend for releasing the CPU staging buffer post-write. The documentation updates are also clear and comprehensive. My feedback primarily focuses on improving code readability by refactoring a repeated complex conditional check.
ApostaC
left a comment
There was a problem hiding this comment.
Did a quick review. Please see the details, thanks!
| self, | ||
| key: CacheEngineKey, | ||
| memory_obj: MemoryObj, | ||
| on_complete_callback: Optional[Callable[[CacheEngineKey], None]] = None, |
There was a problem hiding this comment.
I think this function is inherited from the base class. Do we want to add the argument into the base class as well?
There was a problem hiding this comment.
Additionally, we probably need some clarification in the doc string about when the callback will be triggered (e.g., after each object has finished putting, or after all the objects have finished putting).
There was a problem hiding this comment.
I do agree. will do
| # Pass callback to disk backend for staging buffer release | ||
| if ( | ||
| backend_name == "LocalDiskBackend" | ||
| and staging_buffer_callback is not None | ||
| ): | ||
| disk_backend = cast("LocalDiskBackend", backend) | ||
| disk_backend.batched_submit_put_task( | ||
| ks, | ||
| objs, | ||
| transfer_spec=transfer_spec, | ||
| on_complete_callback=staging_buffer_callback, | ||
| ) | ||
| else: | ||
| backend.batched_submit_put_task(ks, objs, transfer_spec=transfer_spec) |
There was a problem hiding this comment.
Will the same callback abstraction work for other L2 storage backends?
| # Write-back to CPU cache (skip if use_only_staging_buffer is enabled) | ||
| if ( | ||
| backend_name not in ["LocalCPUBackend", "PDBackend"] | ||
| and "LocalCPUBackend" in self.storage_backends | ||
| and not ( | ||
| not self.config.local_cpu | ||
| and self.config.use_only_staging_buffer | ||
| ) |
There was a problem hiding this comment.
This part of the code seems to appear multiple times. Maybe consider having a helper function like self._should_write_back_to_cpu()
|
Actually, after a second thought, I started to think about whether we need to fully skip the CPU buffer. At a high level, we should recommend people to use async loading when the KV cache is from L2 storage, like disk or remote. The async loading will first load the KV cache from L2 to L1 (CPU buffer), and then vLLM will load the KV cache to the GPU. In general, I feel like |
Hello @ApostaC, Thanks for the quick review. |
What this PR does / why we need it:
This PR introduces a new
use_only_staging_bufferconfiguration flag that enables disk-only caching mode in LMCache.When enabled with
local_cpu=false, CPU memory is used only as a temporary staging buffer for GPU-to-disk transfers, and all cache lookups go directly to disk.Problem
Currently, even with
local_cpu=false, CPU memory still participates in cache lookups. This means:This behavior is problematic when users want pure disk-only caching for scenarios like:
Solution
Add
use_only_staging_bufferflag that, when combined withlocal_cpu=false:Data Flow Comparison
Configuration Comparison
local_cpu: truelocal_cpu: falselocal_cpu: false+use_only_staging_buffer: trueUsage
Environment Variables:
Configuration File:
Special notes for your reviewers:
false, so existing behavior is unchangedIf applicable: