[Connector] Maru: zero-copy KV cache sharing via CXL shared memory #2705
DongDongJu merged 5 commits into LMCache:dev
Conversation
Summary of Changes (Gemini Code Assist): This pull request enhances LMCache's shared storage capabilities by integrating Maru, a CXL shared-memory-based KV cache engine. This addition enables zero-copy data transfer between vLLM instances, which is crucial for high-performance LLM inference. By utilizing CXL memory, the system can achieve higher throughput and lower latency than traditional network-based solutions, while also freeing up valuable CPU and NIC resources.
Activity
Code Review
This pull request introduces the Maru connector, a new storage backend for LMCache that leverages CXL shared memory for zero-copy KV cache sharing, aiming for performance improvements by eliminating data-copy overhead. However, a critical security concern was identified: raw URLs, which may contain sensitive credentials, are logged in plain text in two instances. These must be sanitized before logging to prevent credential leakage in system logs. The review also focuses on improving the robustness of configuration parsing, ensuring correctness under different LMCache configurations, and enhancing code maintainability.
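For reference, a minimal sketch of the kind of sanitization being requested, using only the standard library; the PR's actual logging sites and helper names may differ:

```python
from urllib.parse import urlsplit, urlunsplit

def sanitize_url(url: str) -> str:
    """Mask userinfo (user:password) so credentials never reach the logs."""
    parts = urlsplit(url)
    if parts.username or parts.password:
        host = parts.hostname or ""
        if parts.port:
            host = f"{host}:{parts.port}"
        parts = parts._replace(netloc=f"***@{host}")
    return urlunsplit(parts)

# e.g. logger.info("connecting to %s", sanitize_url(raw_url))
# "maru://user:secret@10.0.0.1:9000" -> "maru://***@10.0.0.1:9000"
```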
Before reviewing this: LMCache already supports MP mode, which serves the same purpose of sharing the KV cache between different instances. Could you share the performance difference versus the DRAM-based MP mode as well? |
IMO, this is essentially a new LMCache for CXL. It should probably live as an external plugin/backend package first, with proof that it is stable, testable, and worth the permanent maintenance cost. |
Thanks for the hard work — overall LGTM. It appears this adds an additional remote storage backend without modifying the existing in-memory code path.
Thank you for the review and the great questions!
CPU overhead:
Aggregate bandwidth:
```python
self,
url: str,
loop: asyncio.AbstractEventLoop,
local_cpu_backend: LocalCPUBackend,
```
Can you add a comment that this is unused and that all memory management is done entirely by the MaruHandler?
Or is it used in the Store path?
```python
logger.debug("maru decode data=%d bytes", len(mv))

# memoryview -> torch tensor (zero-copy)
raw_data = torch.frombuffer(mv, dtype=torch.uint8)
```
Does this do a transfer from CXL to CPU DRAM?
Can you clarify with a comment?
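To the question above: torch.frombuffer itself does not copy; the resulting tensor aliases the buffer it is given, wherever that buffer physically lives. A minimal self-contained illustration:

```python
import torch

buf = bytearray(4)                           # stand-in for the CXL-backed memoryview
mv = memoryview(buf)

t = torch.frombuffer(mv, dtype=torch.uint8)  # no copy: t aliases buf

mv[0] = 42
assert t[0].item() == 42                     # writes through mv are visible in t
```

Whether subsequent reads traverse the CXL link or DRAM depends on where the underlying buffer resides; frombuffer only creates a view.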
This probably shouldn't be a remote connector but a storage backend instead. |
Specifically, a new memory allocator is probably needed for GPU DMA (please see the GDS or PD backends as examples). This allocator should handle:
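(The reviewer's list of responsibilities was not captured in this thread.) As a rough, hypothetical sketch of the kind of allocator interface being suggested, where the class and method names are illustrative and not LMCache's actual API:

```python
from abc import ABC, abstractmethod

class DMAPinnedAllocator(ABC):
    """Hypothetical allocator for DMA-capable shared memory (illustrative only)."""

    @abstractmethod
    def allocate(self, num_bytes: int) -> memoryview:
        """Return a buffer carved from the shared pool, registered for GPU DMA."""

    @abstractmethod
    def free(self, buf: memoryview) -> None:
        """Return the buffer to the pool once no reader or writer holds it."""
```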
@sammshen |
@sammshen, @deng451e If you'd like to test it, the current work is in sync with the feat/maru_backend branch of Maru — you can install from there to try it out. |
@jooho-XCENA I cannot find the CxlMemoryAdapter in the Maru code (https://github.com/search?q=repo%3Axcena-dev%2Fmaru+CxlMemoryAdapter&type=code). Could you check it once? |
Hello, @DongDongJu https://github.com/xcena-dev/maru/tree/feat/maru_backend/maru_lmcache |
Integration code looks generally fine. A few questions on semantics: |
@deng451e |
@DongDongJu I'll defer the final opinion to you on whether this is a great addition on top of Dev DAX. |
Good to have it; why not, if it increases the market size. Let me take a look now.
The same file is in the Maru repo; I don't think we need this file here.
Makes sense, will remove. How about adding a diagram that shows the architecture from LMCache's side when using the Maru backend instead?
Updated the diagram to show the architecture from LMCache's side. Would appreciate your feedback.
```rst
`Maru <https://github.com/xcena-dev/maru>`_ is a high-performance KV cache storage engine built on CXL shared memory,
designed for LLM inference scenarios where multiple instances need to share a KV cache with minimal latency.

.. image:: ../../assets/maru-kvcache.gif
```
Then this needs to be removed as well.
| if "LocalCPUBackend" in self.storage_backends: | ||
| allocator_backend = self.storage_backends["LocalCPUBackend"] | ||
| else: | ||
| allocator_backend = self.storage_backends["MaruBackend"] |
What is this case? Just using LocalCPUBackend as a staging buffer?
Thanks for the hard work, LGTM! 👍 |
Please sign the DCO |
Force-pushed from 8ef30e8 to e137286
Force-pushed from 8ef30e8 to 670d500
Thanks for the review! Let me know if anything else is needed. |
Hope to see multi-node setup results soon. |
@jooho-XCENA Could you check pre-commit? |
Signed-off-by: jooho-xcena <jooho.lee@xcena.com>
Co-authored-by: youngrok-XCENA <yr.song@xcena.com>
Co-authored-by: hyunyul-XCENA <hyunyul.cho@xcena.com>
Co-authored-by: seohui-XCENA <seohui.son@xcena.com>
Co-authored-by: kihwan-XCENA <kihwan.kim@xcena.com>
Force-pushed from 93eb7a0 to 7f790ce
- _async_store now uses the handler.store() return value instead of unconditionally setting success=True, preventing a CXL memory leak on server-side rejection
- Fix batched_async_contains docstring to reflect actual batch_pin RPC support

Signed-off-by: youngrok-XCENA <yr.song@xcena.com>
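A minimal sketch of the pattern described in the first bullet; handler.store and the cleanup call are stand-ins, not necessarily the PR's actual method names:

```python
class MaruConnector:
    def __init__(self, handler):
        self.handler = handler

    async def _async_store(self, key, memory_obj) -> bool:
        # Propagate the handler's result instead of assuming success, so a
        # server-side rejection does not leave the CXL allocation marked live.
        success = await self.handler.store(key, memory_obj)
        if not success:
            memory_obj.ref_count_down()  # hypothetical cleanup; releases the CXL slot
        return success
```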
Cursor Bugbot has reviewed your changes and found 1 potential issue.
```python
memory_objs: List[MemoryObj],
transfer_spec: Any = None,
on_complete_callback: Optional[Callable[[CacheEngineKey], None]] = None,
) -> Union[List[Future], None]:
```
Parameter name mismatch with abstract interface signature
Medium Severity
batched_submit_put_task uses parameter name memory_objs instead of objs as defined in AllocatorBackendInterface. Any caller using the keyword argument objs= would get a TypeError at runtime. This violates the interface contract and is a maintainability risk.
Triggered by project rule: LMCache Code Review Style Guide
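For illustration, a minimal reproduction of the failure mode Bugbot describes, with class names simplified:

```python
from abc import ABC, abstractmethod

class AllocatorIface(ABC):
    @abstractmethod
    def batched_submit_put_task(self, keys, objs): ...

class Backend(AllocatorIface):
    def batched_submit_put_task(self, keys, memory_objs):  # renamed parameter
        return len(memory_objs)

Backend().batched_submit_put_task(keys=[], objs=[])
# TypeError: batched_submit_put_task() got an unexpected keyword argument 'objs'
```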
Signed-off-by: jooho-xcena <jooho.lee@xcena.com>
@DongDongJu |
Sounds great. |



What this PR does / why we need it:
This PR adds Maru, a CXL shared-memory-based KV cache storage engine, as a new remote backend for LMCache.
With CXL memory, instances share KV cache via direct memory access rather than network transfer, eliminating the data-copy overhead of network-based backends. Beyond raw throughput, this also reduces CPU and NIC utilization on the host, freeing system resources for inference itself.
It uses the existing plugin architecture, so no changes to core logic are required. We hope this addition serves as a meaningful step for LMCache's shared storage capabilities.
For details, see the documentation / GitHub.
P2P KV Sharing Benchmark:
Setup
Hardware Configuration
Software
Data
For details
Special notes for your reviewers:
If applicable:
Change History:
Note
Medium Risk
Adds a new storage backend with async RPC registration and CXL-backed memory allocation, plus changes in backend selection/write-back paths; issues would mainly surface as cache misses, leaks, or deadlocks under load.
Overview
Adds Maru as a new KV-cache storage backend that enables zero-copy sharing via CXL shared memory, including new engine config knobs (maru_path, maru_pool_size) and docs for setup/usage. Wires MaruBackend into backend creation (with optional dependency gating) and adjusts StorageManager to select Maru as an allocator when appropriate and to avoid writing back Maru hits into LocalCPUBackend. Includes a comprehensive unit test suite for allocation, put/get, prefix batched operations, pin/unpin/remove, and lifecycle/drain behavior.

Written by Cursor Bugbot for commit 58cbd84.
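For reference, a sketch of how the two new knobs from the overview might be set; only the key names (maru_path, maru_pool_size) come from this PR, while the values and surrounding structure are illustrative assumptions:

```python
# Hypothetical config values for the Maru backend (illustrative only).
maru_config = {
    "maru_path": "/dev/dax0.0",      # assumed: CXL shared-memory device/region
    "maru_pool_size": 16 * 1024**3,  # assumed: shared pool size in bytes (16 GiB)
}
```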