
[Connector] Maru: zero-copy KV cache sharing via CXL shared memory #2705

Merged
DongDongJu merged 5 commits into LMCache:dev from xcena-dev:feat/maru-connector
Apr 3, 2026

Conversation

@jooho-XCENA
Contributor

@jooho-XCENA jooho-XCENA commented Mar 6, 2026

What this PR does / why we need it:

This PR adds Maru, a CXL shared-memory-based KV cache storage engine, as a new remote backend for LMCache.
With CXL memory, instances share KV cache via direct memory access rather than network transfer, eliminating the data-copy overhead of network-based backends. Beyond raw throughput, this also reduces CPU and NIC utilization on the host, freeing system resources for inference itself.
It uses the existing plugin architecture, so no changes to core logic are required. We hope this addition serves as a meaningful step for LMCache's shared storage capabilities.

For details, see documentation / github.

P2P KV Sharing Benchmark:

setup

Hardware Configuration

  • GPU: NVIDIA RTX PRO 6000 (96GB) × 2
  • CPU: AMD EPYC 9555 64-Core × 2 (128 threads)
  • DRAM: 756 GB DDR5
  • CXL Memory: 6× CXL Type-3 Memory Expander (229 GB each, 1,374 GB total)
  • Topology:
    • Single-node
    • 2 GPUs on separate NUMA nodes (GPU0→NUMA0, GPU1→NUMA1)
    • PCIe Gen5 x16
  • Transfer: Cross-Node NIXL TCP (UCX_TLS=cuda_ipc,cuda_copy,tcp)

Software

  • Model: Meta-Llama/Llama-3.1-8B-Instruct (TP=1)
  • vLLM Version: 0.13.0+cu128
  • LMCache Version: 0.3.13.dev97
  • NIXL Version: 0.10.1
  • Transfer Methods:
    • Maru:
      • CXL shared memory pool (200 GB, /dev/dax)
      • Zero-copy mmap between vLLM instances via maru_meta_server
    • LMCache P2P backend (NIXL):
      • UCX transport (cuda_ipc,cuda_copy,tcp) over TCP loopback
      • CPU DRAM storage (200 GB)
  • OS: Ubuntu 24.04.3 LTS, Kernel 6.17.0-14-generic
  • CUDA: 12.8.0, Driver 580.126.09

Data

  • Dataset: LV-Eval: A Balanced Long-Context Benchmark with 5 Length Levels Up to 256K
[figure: P2P KV sharing benchmark results]

For further detail, see the documentation links above.

Special notes for your reviewers:

If applicable:

  • this PR contains user-facing changes (docs added)
  • this PR contains unit tests

Change History:

  • 2026-03-20: Reworked as a storage backend, DRAM bypass on store path

Note

Medium Risk
Adds a new storage backend with async RPC registration and CXL-backed memory allocation, plus changes in backend selection/write-back paths; issues would mainly surface as cache misses, leaks, or deadlocks under load.

Overview
Adds Maru as a new KV-cache storage backend that enables zero-copy sharing via CXL shared memory, including new engine config knobs (maru_path, maru_pool_size) and docs for setup/usage.

Wires MaruBackend into backend creation (with optional dependency gating) and adjusts StorageManager to select Maru as an allocator when appropriate and to avoid writing back Maru hits into LocalCPUBackend. Includes a comprehensive unit test suite for allocation, put/get, prefix batched operations, pin/unpin/remove, and lifecycle/drain behavior.

Written by Cursor Bugbot for commit 58cbd84. This will update automatically on new commits.
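For orientation, here is a hedged configuration sketch using the two knobs named in the overview above. Only maru_path and maru_pool_size come from this PR; the other keys, the device path, and the size unit are illustrative assumptions rather than the shipped documentation.

# Hypothetical LMCache configuration enabling the Maru backend (sketch only).
maru_config = {
    "chunk_size": 256,           # token chunk granularity (assumed default)
    "local_cpu": True,           # optional DRAM staging buffer alongside Maru
    "maru_path": "/dev/dax0.0",  # CXL devdax device backing the shared pool
    "maru_pool_size": 200,       # pool size, assumed to be in GB
}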

@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances LMCache's shared storage capabilities by integrating Maru, a CXL shared-memory-based KV cache engine. This addition facilitates zero-copy data transfer between vLLM instances, which is crucial for high-performance LLM inference. By utilizing CXL memory, the system can achieve higher throughput and lower latency compared to traditional network-based solutions, while also freeing up valuable CPU and NIC resources.

Highlights

  • New KV Cache Backend: Maru: Introduced Maru, a CXL shared-memory-based KV cache storage engine, as a new remote backend for LMCache. This enables zero-copy KV cache sharing between instances.
  • CXL Shared Memory Integration: Leverages CXL memory for direct memory access, aiming to eliminate data-copy overhead associated with network-based backends and reduce CPU/NIC utilization.
  • Comprehensive Documentation: Added detailed documentation for Maru, covering its overview, quick start guide, installation steps, deployment instructions, and configuration parameters.
  • Connector Implementation: Implemented MaruConnectorAdapter and MaruConnector to seamlessly integrate Maru into the existing LMCache plugin architecture, supporting asynchronous and batched operations.


Changelog
  • docs/source/kv_cache/storage_backends/index.rst
    • Added 'maru' to the list of supported KV cache storage backends.
  • docs/source/kv_cache/storage_backends/maru.rst
    • Added new documentation for the Maru KV cache storage engine.
    • Included sections for overview, quick start, installation, deployment, and configuration.
    • Provided details on LMCache and Maru-specific parameters.
  • lmcache/v1/storage_backend/connector/maru_adapter.py
    • Added a new file defining MaruConnectorAdapter to handle the 'maru://' scheme.
    • Implemented logic for parsing Maru URLs and creating MaruConnector instances.
  • lmcache/v1/storage_backend/connector/maru_connector.py
    • Added a new file implementing the MaruConnector class.
    • Included methods for exists, get, put, remove_sync, close, and various batched operations.
    • Defined MaruConnectorConfig for managing Maru-specific settings, including CXL memory pool size and RPC parameters.
    • Implemented cache_key_to_int for stable hashing of cache keys (a hedged sketch follows after this changelog).
    • Added support for health checks and error monitoring.
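As a hedged illustration of the stable-hashing idea referenced above, the sketch below derives a deterministic 64-bit integer from a key string; the actual cache_key_to_int in this PR may hash different fields or truncate differently.

# Sketch: stable 64-bit key hashing, independent of Python's per-process
# hash randomization (illustrative, not the PR's exact implementation).
import hashlib

def cache_key_to_int(key: str) -> int:
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")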
Activity
  • No human activity (comments, reviews) has been recorded on this pull request yet.

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces the Maru connector, a new storage backend for LMCache that leverages CXL shared memory for zero-copy KV cache sharing, aiming for performance improvements by eliminating data-copy overhead. However, a critical security concern was identified: raw URLs, which may contain sensitive credentials, are logged in plain text in two instances. These must be sanitized before logging to prevent credential leakage in system logs. The review also focuses on improving the robustness of configuration parsing, ensuring correctness under different LMCache configurations, and enhancing code maintainability.

Comment thread lmcache/v1/storage_backend/connector/maru_connector.py Outdated
Comment thread lmcache/v1/storage_backend/connector/maru_adapter.py Outdated
Comment thread lmcache/v1/storage_backend/connector/maru_connector.py Outdated
Comment thread lmcache/v1/storage_backend/connector/maru_adapter.py Outdated
Comment thread lmcache/v1/storage_backend/connector/maru_connector.py Outdated
Comment thread lmcache/v1/storage_backend/connector/maru_connector.py Outdated
@DongDongJu DongDongJu self-requested a review March 6, 2026 16:54
@DongDongJu
Collaborator

Before reviewing this, note that LMCache already supports MP mode, which serves the same purpose of sharing the KV cache between different instances. Could you share the performance difference against the DRAM-based MP mode as well?
Otherwise this makes sense to me. I will take a look over the weekend.

@DongDongJu
Collaborator

IMO, this is essentially a new LMCache for CXL. It should probably live as an external plugin/backend package first, with proof that it is stable, testable, and worth the permanent maintenance cost.

@youngrok-XCENA
Contributor

Thanks for the review.

Maru provides an abstraction over CXL shared memory as a remote storage backend. The connector is built on top of the existing LMCache connector architecture, following the same approach as other connectors.
To help guide our next steps, we would appreciate any guidance on the requirements for in-tree inclusion.

Regarding mp mode — we've added the comparison as requested.
Maru targets multi-instance KV cache sharing over a CXL memory pool. In this single-node benchmark, MP mode tends to be slightly faster as it uses host DRAM directly, while CXL memory has inherently higher access latency than host DRAM.
We plan to share multi-node results which better reflect its target environment.

[figure: MP mode vs. Maru single-node comparison]

@deng451e
Collaborator

Thanks for the hard work — overall LGTM. It appears this adds an additional remote storage backend without modifying the existing in-memory code path.
I do have a couple of questions regarding potential impacts on sharing efficiency:

  • Does Maru’s retrieval/offloading consume local CPU cores (or use DSA)? If it uses CPU, is the overhead significant?
  • What aggregate bandwidth is provisioned for the CXL shared memory, given that multiple nodes will share it?

@jooho-XCENA
Contributor Author

@deng451e

Thank you for the review and the great questions!

CPU overhead:
Currently, on the store path, KV data is first staged in CPU DRAM (as with other remote backends) and then written to the CXL shared memory pool via memcpy. The CPU cost on the store side is bounded by a single memcpy per chunk into the mmap'd region — inherent to the current CPU-staged offloading path. If the existing in-memory code path could be extended to support direct device-to-CXL transfers, we believe the CPU overhead during offloading could be further reduced.
For retrieval, Maru leverages a GPU-CXL DMA path, which reduces CPU involvement and results in even lower CPU utilization compared to the offloading path. The remaining CPU cost is limited to lightweight RPC communication overhead, which is likely comparable to other remote backends.

Aggregate bandwidth:
The aggregate bandwidth is primarily determined by the bandwidth between the host and the CXL switch. In the current generation (PCIe Gen5), a single x16 CXL HBA provides ~64 GB/s of bandwidth, which will increase to ~128 GB/s with Gen6. Current-generation CXL switches typically have x256 lanes, providing a total switching capacity of approximately 1 TB/s. The actual bandwidth available to each host depends on the ratio of host ports to device (memory) ports on the switch.
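To make the store path above concrete, here is a hedged sketch of the single memcpy per chunk into an mmap'd CXL pool; the device path, pool size, and offset management are illustrative and not Maru's actual implementation.

# Sketch of the CPU-staged store path: the KV chunk is staged in DRAM,
# then written into the mmap'd CXL /dev/dax pool with a single copy.
import mmap
import os

def open_cxl_pool(path: str = "/dev/dax0.0", size: int = 1 << 30) -> mmap.mmap:
    fd = os.open(path, os.O_RDWR)
    try:
        # Shared mapping by default (MAP_SHARED); alignment to the devdax
        # page size is the caller's responsibility.
        return mmap.mmap(fd, size)
    finally:
        os.close(fd)

def store_chunk(pool: mmap.mmap, offset: int, staged_chunk: bytes) -> None:
    # The single memcpy per chunk: DRAM staging buffer -> CXL shared memory.
    pool[offset : offset + len(staged_chunk)] = staged_chunk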

self,
url: str,
loop: asyncio.AbstractEventLoop,
local_cpu_backend: LocalCPUBackend,
Contributor


can you add a comment that this is unused and all the memory management is done completely by the MaruHandler?

Copy link
Copy Markdown
Contributor


or it is used in the Store path

logger.debug("maru decode data=%d bytes", len(mv))

# memoryview -> torch tensor (zero-copy)
raw_data = torch.frombuffer(mv, dtype=torch.uint8)
Contributor


does this do a transfer from CXL to CPU DRAM?

Contributor


can you clarify with a comment

@sammshen
Contributor

this probably shouldn't be a remote connector but should be a storage backend instead

@sammshen
Contributor

specifically, a new memory allocator is probably needed for GPU DMA (please see the GDS or PD backends as examples).

this allocator should handle (a rough sketch follows after this list):

  • abstracting / wrapping the CXL memory (so we can avoid any DRAM copies if possible)
  • managing any GPU side buffers for DMA
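A hypothetical sketch of what such an allocator could look like; names are illustrative, and a real implementation should follow the GDS/PD backend interfaces mentioned above.

# Sketch: wrap the mmap'd CXL pool so KV buffers are allocated directly in
# shared memory instead of being staged in LocalCPUBackend.
from dataclasses import dataclass

@dataclass
class CxlRegion:
    offset: int
    size: int

class CxlMemoryAllocator:
    def __init__(self, pool: memoryview):
        self._pool = pool    # memoryview over the mmap'd /dev/dax region
        self._cursor = 0     # bump allocation; real code needs a free list

    def allocate(self, size: int) -> CxlRegion:
        if self._cursor + size > len(self._pool):
            raise MemoryError("CXL pool exhausted")
        region = CxlRegion(self._cursor, size)
        self._cursor += size
        return region

    def view(self, region: CxlRegion) -> memoryview:
        # Zero-copy view; other instances mapping the same pool see the data.
        return self._pool[region.offset : region.offset + region.size]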

@youngrok-XCENA
Contributor

@sammshen
Thanks for the suggestion!
We initially went with a remote connector for a lighter integration footprint, but we agree that integrating it as a storage backend is the right approach.
Will rework the implementation and push an update sometime next week.

@jooho-XCENA
Contributor Author

@sammshen , @deng451e
Thank you for the feedback. We've migrated from the remote backend approach to a storage backend implementation. With this change, the store path also bypasses DRAM.

If you'd like to test it, current work is in sync with the feat/maru_backend branch of Maru — you can install from there to try it out.

@DongDongJu
Collaborator

@jooho-XCENA I cannot find CxlMemoryAdapter in the Maru code (https://github.com/search?q=repo%3Axcena-dev%2Fmaru+CxlMemoryAdapter&type=code). Could you check?

@youngrok-XCENA
Contributor

Hello, @DongDongJu
CxlMemoryAdapter source code is available in the branch below. In Maru, the storage backend-related logic is maintained in a separate branch because it is not part of LMCache’s native codebase. Please feel free to reach out if you have any questions.

https://github.com/xcena-dev/maru/tree/feat/maru_backend/maru_lmcache

@deng451e deng451e self-requested a review March 27, 2026 01:10
@deng451e
Collaborator

Integration code looks generally fine. A few curious questions on semantics:
• With direct GPU ↔ CXL access (no host staging), how is consistency handled across multiple nodes sharing CXL—specifically, how are pinned-memory semantics preserved, and how do you ensure regions are not evicted or modified by other nodes?
• Do you have a safe fallback mechanism if these guarantees cannot be enforced?

@jooho-XCENA
Contributor Author

@deng451e
Thanks for the questions! MaruServer acts as a centralized metadata authority — all concurrent operations (store, retrieve, pin, eviction) are serialized through the server, so consistency is guaranteed as long as the server is operational. If the server becomes unavailable, the storage backend becomes unusable and errors are surfaced through failure logs. We're planning to add automatic reconnection on server restart in a future update.
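To illustrate the serialization model described above, a hedged sketch of a centralized metadata server follows; the data structures and method names are hypothetical, not MaruServer's actual RPC interface.

# Sketch: a single lock serializes store/pin/evict metadata updates so that
# every instance mapping the shared CXL pool observes a consistent view.
import threading

class MetaServerSketch:
    def __init__(self):
        self._lock = threading.Lock()
        self._index = {}      # key -> (offset, size) in the CXL pool
        self._pinned = set()  # keys currently pinned by some instance

    def register(self, key: int, offset: int, size: int) -> bool:
        with self._lock:
            if key in self._index:
                return False              # another instance stored it first
            self._index[key] = (offset, size)
            return True

    def pin(self, key: int) -> bool:
        with self._lock:
            if key not in self._index:
                return False
            self._pinned.add(key)
            return True

    def evict(self, key: int) -> bool:
        with self._lock:
            if key in self._pinned:
                return False              # pinned regions are never evicted
            return self._index.pop(key, None) is not None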

@sammshen
Contributor

sammshen commented Mar 31, 2026

@DongDongJu defer final opinion to you if you think this is a great addition on top of Dev DAX

@DongDongJu
Collaborator

@DongDongJu defer final opinion to you if you think this is a great addition on top of Dev DAX

Good to have it. Why not, if it helps increase the market size. Let me take a look now.

Collaborator

@DongDongJu DongDongJu left a comment


LGTM!

Comment thread docs/source/assets/maru-kvcache.gif Outdated
Collaborator


The same file is in the Maru repository. I think we don't need this file here.

Contributor Author


Makes sense, will remove. How about adding a diagram that shows the architecture from LMCache's side when using the Maru backend instead?

Contributor Author


Updated the diagram to show the architecture from LMCache's side.

Would appreciate your feedback

`Maru <https://github.com/xcena-dev/maru>`_ is a high-performance KV cache storage engine built on CXL shared memory,
designed for LLM inference scenarios where multiple instances need to share a KV cache with minimal latency.

.. image:: ../../assets/maru-kvcache.gif
Collaborator


Then this needs to be removed.

if "LocalCPUBackend" in self.storage_backends:
allocator_backend = self.storage_backends["LocalCPUBackend"]
else:
allocator_backend = self.storage_backends["MaruBackend"]
Collaborator


What is this case? Just using LocalCPUBackend as a staging buffer?

Contributor Author


Yes!

Comment thread lmcache/v1/storage_backend/maru_backend.py
@deng451e
Collaborator

deng451e commented Apr 2, 2026

Thanks for the hard work, LGTM! 👍

@deng451e
Collaborator

deng451e commented Apr 2, 2026

Please sign the DCO

@deng451e deng451e added the full Run comprehensive tests on this PR label Apr 2, 2026
@jooho-XCENA jooho-XCENA force-pushed the feat/maru-connector branch from 8ef30e8 to e137286 on April 2, 2026 06:10
@jooho-XCENA jooho-XCENA force-pushed the feat/maru-connector branch 4 times, most recently from 8ef30e8 to 670d500 on April 2, 2026 06:42
@jooho-XCENA
Contributor Author

jooho-XCENA commented Apr 2, 2026

@deng451e @DongDongJu

Thanks for the review! Let me know if anything else is needed.

@DongDongJu DongDongJu enabled auto-merge (squash) April 2, 2026 14:04
@DongDongJu
Collaborator

Hope to see multi-node setup results soon.

@DongDongJu
Collaborator

@jooho-XCENA Could you check pre-commit?

auto-merge was automatically disabled April 3, 2026 01:11

Head branch was pushed to by a user without write access

Signed-off-by: jooho-xcena <jooho.lee@xcena.com>
Co-authored-by: youngrok-XCENA <yr.song@xcena.com>
Co-authored-by: hyunyul-XCENA <hyunyul.cho@xcena.com>
Co-authored-by: seohui-XCENA <seohui.son@xcena.com>
Co-authored-by: kihwan-XCENA <kihwan.kim@xcena.com>
@github-actions github-actions Bot removed the full Run comprehensive tests on this PR label Apr 3, 2026
@jooho-XCENA jooho-XCENA force-pushed the feat/maru-connector branch from 93eb7a0 to 7f790ce on April 3, 2026 01:11
Comment thread lmcache/v1/storage_backend/maru_backend.py Outdated
Comment thread lmcache/v1/storage_backend/maru_backend.py Outdated
youngrok-XCENA and others added 2 commits April 3, 2026 10:32
- _async_store now uses handler.store() return value instead of
  unconditionally setting success=True, preventing CXL memory leak
  on server-side rejection
- Fix batched_async_contains docstring to reflect actual batch_pin
  RPC support

Signed-off-by: youngrok-XCENA <yr.song@xcena.com>

@cursor cursor Bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.


memory_objs: List[MemoryObj],
transfer_spec: Any = None,
on_complete_callback: Optional[Callable[[CacheEngineKey], None]] = None,
) -> Union[List[Future], None]:

Parameter name mismatch with abstract interface signature

Medium Severity

batched_submit_put_task uses parameter name memory_objs instead of objs as defined in AllocatorBackendInterface. Any caller using the keyword argument objs= would get a TypeError at runtime. This violates the interface contract and is a maintainability risk.
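For clarity, an abbreviated illustration of the flagged mismatch; the signatures are simplified, and only the keyword name objs versus memory_objs matters here.

# Simplified illustration of the interface mismatch reported above.
from typing import Any, List, Optional

class AllocatorBackendInterface:
    def batched_submit_put_task(
        self,
        keys: List[Any],
        objs: List[Any],
        transfer_spec: Any = None,
        on_complete_callback: Optional[Any] = None,
    ):
        raise NotImplementedError

class MaruBackendSketch(AllocatorBackendInterface):
    # Renaming the override's parameter from memory_objs to objs keeps
    # keyword calls like backend.batched_submit_put_task(keys, objs=objs)
    # working across all backends.
    def batched_submit_put_task(
        self,
        keys: List[Any],
        objs: List[Any],
        transfer_spec: Any = None,
        on_complete_callback: Optional[Any] = None,
    ):
        ...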


@deng451e deng451e enabled auto-merge (squash) April 3, 2026 01:58
@deng451e deng451e added the full Run comprehensive tests on this PR label Apr 3, 2026
Signed-off-by: jooho-xcena <jooho.lee@xcena.com>
auto-merge was automatically disabled April 3, 2026 01:59

Head branch was pushed to by a user without write access

@github-actions github-actions Bot removed the full Run comprehensive tests on this PR label Apr 3, 2026
@jooho-XCENA
Contributor Author

@DongDongJu
Done. Everything looks good.
Multi-node setup is in progress — I'll share the results with the community once it's ready!

@DongDongJu
Collaborator

@DongDongJu Done. Everything looks good. Multi-node setup is in progress — I'll share the results with the community once it's ready!

Sounds great.

@DongDongJu DongDongJu enabled auto-merge (squash) April 3, 2026 02:46
@github-actions github-actions Bot added the full Run comprehensive tests on this PR label Apr 3, 2026
@DongDongJu DongDongJu merged commit 5502419 into LMCache:dev Apr 3, 2026
37 checks passed