
[Connector] Maru: zero-copy KV cache sharing via CXL shared memory #2705

Merged
DongDongJu merged 5 commits into LMCache:dev from xcena-dev:feat/maru-connector
Apr 3, 2026

Conversation

@jooho-XCENA
Contributor

@jooho-XCENA jooho-XCENA commented Mar 6, 2026

What this PR does / why we need it:

This PR adds Maru, a CXL shared-memory-based KV cache storage engine, as a new remote backend for LMCache.
With CXL memory, instances share KV cache via direct memory access rather than network transfer, eliminating the data-copy overhead of network-based backends. Beyond raw throughput, this also reduces CPU and NIC utilization on the host, freeing system resources for inference itself.
It uses the existing plugin architecture, so no changes to core logic are required. We hope this addition serves as a meaningful step for LMCache's shared storage capabilities.

For details, see documentation / github.

P2P KV Sharing Benchmark:

setup

Hardware Configuration

  • GPU: NVIDIA RTX PRO 6000 (96GB) × 2
  • CPU: AMD EPYC 9555 64-Core × 2 (128 threads)
  • DRAM: 756 GB DDR5
  • CXL Memory: 6× CXL Type-3 Memory Expander (229 GB each, 1,374 GB total)
  • Topology:
    • Single-node
    • 2 GPUs on separate NUMA nodes (GPU0→NUMA0, GPU1→NUMA1)
    • PCIe Gen5 x16
  • Transfer: Cross-Node NIXL TCP (UCX_TLS=cuda_ipc,cuda_copy,tcp)

Software

  • Model: Meta-Llama/Llama-3.1-8B-Instruct (TP=1)
  • vLLM Version: 0.13.0+cu128
  • LMCache Version: 0.3.13.dev97
  • NIXL Version: 0.10.1
  • Transfer Methods:
    • Maru:
      • CXL shared memory pool (200 GB, /dev/dax)
      • Zero-copy mmap between vLLM instances via maru_meta_server
    • LMCache P2P backend (NIXL):
      • UCX transport (cuda_ipc,cuda_copy,tcp) over TCP loopback
      • CPU DRAM storage (200 GB)
  • OS: Ubuntu 24.04.3 LTS, Kernel 6.17.0-14-generic
  • CUDA: 12.8.0, Driver 580.126.09

Data

  • Dataset: LV-Eval: A Balanced Long-Context Benchmark with 5 Length Levels Up to 256K
[figure: P2P KV sharing benchmark results]

For further detail, see the documentation links above.

Special notes for your reviewers:

If applicable:

  • this PR contains user-facing changes (docs added)
  • this PR contains unit tests

Change History:

  • 2026-03-20: Reworked as a storage backend, DRAM bypass on store path

Note

Medium Risk
Adds a new storage backend with async RPC registration and CXL-backed memory allocation, plus changes in backend selection/write-back paths; issues would mainly surface as cache misses, leaks, or deadlocks under load.

Overview
Adds Maru as a new KV-cache storage backend that enables zero-copy sharing via CXL shared memory, including new engine config knobs (maru_path, maru_pool_size) and docs for setup/usage.

Wires MaruBackend into backend creation (with optional dependency gating) and adjusts StorageManager to select Maru as an allocator when appropriate and to avoid writing back Maru hits into LocalCPUBackend. Includes a comprehensive unit test suite for allocation, put/get, prefix batched operations, pin/unpin/remove, and lifecycle/drain behavior.

Written by Cursor Bugbot for commit 58cbd84. This will update automatically on new commits.
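For orientation, here is a hedged configuration sketch using the two knobs named in the overview above. Only maru_path and maru_pool_size come from this PR; the other keys, the device path, and the size unit are illustrative assumptions rather than the shipped documentation.

# Hypothetical LMCache configuration enabling the Maru backend (sketch only).
maru_config = {
    "chunk_size": 256,           # token chunk granularity (assumed default)
    "local_cpu": True,           # optional DRAM staging buffer alongside Maru
    "maru_path": "/dev/dax0.0",  # CXL devdax device backing the shared pool
    "maru_pool_size": 200,       # pool size, assumed to be in GB
}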

@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances LMCache's shared storage capabilities by integrating Maru, a CXL shared-memory-based KV cache engine. This addition facilitates zero-copy data transfer between vLLM instances, which is crucial for high-performance LLM inference. By utilizing CXL memory, the system can achieve higher throughput and lower latency compared to traditional network-based solutions, while also freeing up valuable CPU and NIC resources.

Highlights

  • New KV Cache Backend: Maru: Introduced Maru, a CXL shared-memory-based KV cache storage engine, as a new remote backend for LMCache. This enables zero-copy KV cache sharing between instances.
  • CXL Shared Memory Integration: Leverages CXL memory for direct memory access, aiming to eliminate data-copy overhead associated with network-based backends and reduce CPU/NIC utilization.
  • Comprehensive Documentation: Added detailed documentation for Maru, covering its overview, quick start guide, installation steps, deployment instructions, and configuration parameters.
  • Connector Implementation: Implemented MaruConnectorAdapter and MaruConnector to seamlessly integrate Maru into the existing LMCache plugin architecture, supporting asynchronous and batched operations.


Changelog
  • docs/source/kv_cache/storage_backends/index.rst
    • Added 'maru' to the list of supported KV cache storage backends.
  • docs/source/kv_cache/storage_backends/maru.rst
    • Added new documentation for the Maru KV cache storage engine.
    • Included sections for overview, quick start, installation, deployment, and configuration.
    • Provided details on LMCache and Maru-specific parameters.
  • lmcache/v1/storage_backend/connector/maru_adapter.py
    • Added a new file defining MaruConnectorAdapter to handle the 'maru://' scheme.
    • Implemented logic for parsing Maru URLs and creating MaruConnector instances.
  • lmcache/v1/storage_backend/connector/maru_connector.py
    • Added a new file implementing the MaruConnector class.
    • Included methods for exists, get, put, remove_sync, close, and various batched operations.
    • Defined MaruConnectorConfig for managing Maru-specific settings, including CXL memory pool size and RPC parameters.
    • Implemented cache_key_to_int for stable hashing of cache keys (a hedged sketch follows after this changelog).
    • Added support for health checks and error monitoring.
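As a hedged illustration of the stable-hashing idea referenced above, the sketch below derives a deterministic 64-bit integer from a key string; the actual cache_key_to_int in this PR may hash different fields or truncate differently.

# Sketch: stable 64-bit key hashing, independent of Python's per-process
# hash randomization (illustrative, not the PR's exact implementation).
import hashlib

def cache_key_to_int(key: str) -> int:
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")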
Activity
  • No human activity (comments, reviews) has been recorded on this pull request yet.

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces the Maru connector, a new storage backend for LMCache that leverages CXL shared memory for zero-copy KV cache sharing, aiming for performance improvements by eliminating data-copy overhead. However, a critical security concern was identified: raw URLs, which may contain sensitive credentials, are logged in plain text in two instances. These must be sanitized before logging to prevent credential leakage in system logs. The review also focuses on improving the robustness of configuration parsing, ensuring correctness under different LMCache configurations, and enhancing code maintainability.

Comment thread lmcache/v1/storage_backend/connector/maru_connector.py Outdated
Comment thread lmcache/v1/storage_backend/connector/maru_adapter.py Outdated
Comment thread lmcache/v1/storage_backend/connector/maru_connector.py Outdated
Comment thread lmcache/v1/storage_backend/connector/maru_adapter.py Outdated
Comment thread lmcache/v1/storage_backend/connector/maru_connector.py Outdated
Comment thread lmcache/v1/storage_backend/connector/maru_connector.py Outdated
@DongDongJu DongDongJu self-requested a review March 6, 2026 16:54
@DongDongJu
Collaborator

Before reviewing this, note that LMCache already supports MP mode, which serves the same purpose of sharing the KV cache between different instances. Could you share the performance difference against the DRAM-based MP mode as well?
Otherwise this makes sense to me. I will take a look over the weekend.

@DongDongJu
Collaborator

IMO, this is essentially a new LMCache for CXL. It should probably live as an external plugin/backend package first, with proof that it is stable, testable, and worth the permanent maintenance cost.

@youngrok-XCENA
Contributor

Thanks for the review.

Maru provides an abstraction over CXL shared memory as a remote storage backend. The connector is built on top of the existing LMCache connector architecture, following the same approach as other connectors.
To help guide our next steps, we would appreciate any guidance on the requirements for in-tree inclusion.

Regarding mp mode — we've added the comparison as requested.
Maru targets multi-instance KV cache sharing over a CXL memory pool. In this single-node benchmark, MP mode tends to be slightly faster as it uses host DRAM directly, while CXL memory has inherently higher access latency than host DRAM.
We plan to share multi-node results which better reflect its target environment.

[figure: MP mode vs. Maru single-node comparison]

@deng451e
Collaborator

Thanks for the hard work — overall LGTM. It appears this adds an additional remote storage backend without modifying the existing in-memory code path.
I do have a couple of questions regarding potential impacts on sharing efficiency:

  • Does Maru’s retrieval/offloading consume local CPU cores (or use DSA)? If it uses CPU, is the overhead significant?
  • What aggregate bandwidth is provisioned for the CXL shared memory, given that multiple nodes will share it?

@jooho-XCENA
Contributor Author

@deng451e

Thank you for the review and the great questions!

CPU overhead:
Currently, on the store path, KV data is first staged in CPU DRAM (as with other remote backends) and then written to the CXL shared memory pool via memcpy. The CPU cost on the store side is bounded by a single memcpy per chunk into the mmap'd region — inherent to the current CPU-staged offloading path. If the existing in-memory code path could be extended to support direct device-to-CXL transfers, we believe the CPU overhead during offloading could be further reduced.
For retrieval, Maru leverages a GPU-CXL DMA path, which reduces CPU involvement and results in even lower CPU utilization compared to the offloading path. The remaining CPU cost is limited to lightweight RPC communication overhead, which is likely comparable to other remote backends.

Aggregate bandwidth:
The aggregate bandwidth is primarily determined by the bandwidth between the host and the CXL switch. In the current generation (PCIe Gen5), a single x16 CXL HBA provides ~64 GB/s of bandwidth, which will increase to ~128 GB/s with Gen6. Current-generation CXL switches typically have x256 lanes, providing a total switching capacity of approximately 1 TB/s. The actual bandwidth available to each host depends on the ratio of host ports to device (memory) ports on the switch.
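To make the store path above concrete, here is a hedged sketch of the single memcpy per chunk into an mmap'd CXL pool; the device path, pool size, and offset management are illustrative and not Maru's actual implementation.

# Sketch of the CPU-staged store path: the KV chunk is staged in DRAM,
# then written into the mmap'd CXL /dev/dax pool with a single copy.
import mmap
import os

def open_cxl_pool(path: str = "/dev/dax0.0", size: int = 1 << 30) -> mmap.mmap:
    fd = os.open(path, os.O_RDWR)
    try:
        # Shared mapping by default (MAP_SHARED); alignment to the devdax
        # page size is the caller's responsibility.
        return mmap.mmap(fd, size)
    finally:
        os.close(fd)

def store_chunk(pool: mmap.mmap, offset: int, staged_chunk: bytes) -> None:
    # The single memcpy per chunk: DRAM staging buffer -> CXL shared memory.
    pool[offset : offset + len(staged_chunk)] = staged_chunk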

self,
url: str,
loop: asyncio.AbstractEventLoop,
local_cpu_backend: LocalCPUBackend,
Contributor


can you add a comment that this is unused and all the memory management is done completely by the MaruHandler?

Copy link
Copy Markdown
Contributor


or it is used in the Store path

logger.debug("maru decode data=%d bytes", len(mv))

# memoryview -> torch tensor (zero-copy)
raw_data = torch.frombuffer(mv, dtype=torch.uint8)
Contributor


does this do a transfer from CXL to CPU DRAM?

Contributor


can you clarify with a comment

@sammshen
Contributor

this probably shouldn't be a remote connector but should be a storage backend instead

@sammshen
Contributor

specifically, a new memory allocator is probably needed for GPU DMA (please see the GDS or PD backends as examples).

this allocator should handle (a rough sketch follows after this list):

  • abstracting / wrapping the CXL memory (so we can avoid any DRAM copies if possible)
  • managing any GPU side buffers for DMA
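A hypothetical sketch of what such an allocator could look like; names are illustrative, and a real implementation should follow the GDS/PD backend interfaces mentioned above.

# Sketch: wrap the mmap'd CXL pool so KV buffers are allocated directly in
# shared memory instead of being staged in LocalCPUBackend.
from dataclasses import dataclass

@dataclass
class CxlRegion:
    offset: int
    size: int

class CxlMemoryAllocator:
    def __init__(self, pool: memoryview):
        self._pool = pool    # memoryview over the mmap'd /dev/dax region
        self._cursor = 0     # bump allocation; real code needs a free list

    def allocate(self, size: int) -> CxlRegion:
        if self._cursor + size > len(self._pool):
            raise MemoryError("CXL pool exhausted")
        region = CxlRegion(self._cursor, size)
        self._cursor += size
        return region

    def view(self, region: CxlRegion) -> memoryview:
        # Zero-copy view; other instances mapping the same pool see the data.
        return self._pool[region.offset : region.offset + region.size]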

@youngrok-XCENA
Contributor

@sammshen
Thanks for the suggestion!
We initially went with a remote connector for a lighter integration footprint, but we agree that integrating it as a storage backend is the right approach.
Will rework the implementation and push an update sometime next week.

@jooho-XCENA
Contributor Author

@sammshen , @deng451e
Thank you for the feedback. We've migrated from the remote backend approach to a storage backend implementation. With this change, the store path also bypasses DRAM.

If you'd like to test it, current work is in sync with the feat/maru_backend branch of Maru — you can install from there to try it out.

@DongDongJu
Collaborator

@jooho-XCENA I cannot find CxlMemoryAdapter in the Maru code (https://github.com/search?q=repo%3Axcena-dev%2Fmaru+CxlMemoryAdapter&type=code). Could you check?

@youngrok-XCENA
Contributor

Hello, @DongDongJu
CxlMemoryAdapter source code is available in the branch below. In Maru, the storage backend-related logic is maintained in a separate branch because it is not part of LMCache’s native codebase. Please feel free to reach out if you have any questions.

https://github.com/xcena-dev/maru/tree/feat/maru_backend/maru_lmcache

@deng451e deng451e self-requested a review March 27, 2026 01:10
@deng451e
Collaborator

Integration code looks generally fine. A few curious questions on semantics:
• With direct GPU ↔ CXL access (no host staging), how is consistency handled across multiple nodes sharing CXL—specifically, how are pinned-memory semantics preserved, and how do you ensure regions are not evicted or modified by other nodes?
• Do you have a safe fallback mechanism if these guarantees cannot be enforced?

@jooho-XCENA
Contributor Author

@deng451e
Thanks for the questions! MaruServer acts as a centralized metadata authority — all concurrent operations (store, retrieve, pin, eviction) are serialized through the server, so consistency is guaranteed as long as the server is operational. If the server becomes unavailable, the storage backend becomes unusable and errors are surfaced through failure logs. We're planning to add automatic reconnection on server restart in a future update.
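To illustrate the serialization model described above, a hedged sketch of a centralized metadata server follows; the data structures and method names are hypothetical, not MaruServer's actual RPC interface.

# Sketch: a single lock serializes store/pin/evict metadata updates so that
# every instance mapping the shared CXL pool observes a consistent view.
import threading

class MetaServerSketch:
    def __init__(self):
        self._lock = threading.Lock()
        self._index = {}      # key -> (offset, size) in the CXL pool
        self._pinned = set()  # keys currently pinned by some instance

    def register(self, key: int, offset: int, size: int) -> bool:
        with self._lock:
            if key in self._index:
                return False              # another instance stored it first
            self._index[key] = (offset, size)
            return True

    def pin(self, key: int) -> bool:
        with self._lock:
            if key not in self._index:
                return False
            self._pinned.add(key)
            return True

    def evict(self, key: int) -> bool:
        with self._lock:
            if key in self._pinned:
                return False              # pinned regions are never evicted
            return self._index.pop(key, None) is not None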

@sammshen
Contributor

sammshen commented Mar 31, 2026

@DongDongJu defer final opinion to you if you think this is a great addition on top of Dev DAX

@DongDongJu
Collaborator

@DongDongJu defer final opinion to you if you think this is a great addition on top of Dev DAX

Good to have it. Why not, if it helps increase the market size. Let me take a look now.

Collaborator

@DongDongJu DongDongJu left a comment


LGTM!

Comment thread docs/source/assets/maru-kvcache.gif Outdated
Collaborator


The same file is in the Maru repository. I think we don't need this file here.

Contributor Author


Makes sense, will remove. How about adding a diagram that shows the architecture from LMCache's side when using the Maru backend instead?

Contributor Author


Updated the diagram to show the architecture from LMCache's side.

Would appreciate your feedback

`Maru <https://github.com/xcena-dev/maru>`_ is a high-performance KV cache storage engine built on CXL shared memory,
designed for LLM inference scenarios where multiple instances need to share a KV cache with minimal latency.

.. image:: ../../assets/maru-kvcache.gif
Collaborator


Then this needs to be removed.

if "LocalCPUBackend" in self.storage_backends:
allocator_backend = self.storage_backends["LocalCPUBackend"]
else:
allocator_backend = self.storage_backends["MaruBackend"]
Collaborator


What is this case? Just using LocalCPUBackend as a staging buffer?

Contributor Author


Yes!

Comment thread lmcache/v1/storage_backend/maru_backend.py
@deng451e
Collaborator

deng451e commented Apr 2, 2026

Thanks for the hard work, LGTM! 👍

@deng451e
Collaborator

deng451e commented Apr 2, 2026

Please sign the DCO

@deng451e deng451e added the full Run comprehensive tests on this PR label Apr 2, 2026
@jooho-XCENA jooho-XCENA force-pushed the feat/maru-connector branch from 8ef30e8 to e137286 on April 2, 2026 06:10
@jooho-XCENA jooho-XCENA force-pushed the feat/maru-connector branch 4 times, most recently from 8ef30e8 to 670d500 on April 2, 2026 06:42
@jooho-XCENA
Contributor Author

jooho-XCENA commented Apr 2, 2026

@deng451e @DongDongJu

Thanks for the review! Let me know if anything else is needed.

@DongDongJu DongDongJu enabled auto-merge (squash) April 2, 2026 14:04
@DongDongJu
Collaborator

Hope to see multi-node setup results soon.

@DongDongJu
Collaborator

@jooho-XCENA Could you check pre-commit?

auto-merge was automatically disabled April 3, 2026 01:11

Head branch was pushed to by a user without write access

Signed-off-by: jooho-xcena <jooho.lee@xcena.com>
Co-authored-by: youngrok-XCENA <yr.song@xcena.com>
Co-authored-by: hyunyul-XCENA <hyunyul.cho@xcena.com>
Co-authored-by: seohui-XCENA <seohui.son@xcena.com>
Co-authored-by: kihwan-XCENA <kihwan.kim@xcena.com>
@github-actions github-actions Bot removed the full Run comprehensive tests on this PR label Apr 3, 2026
@jooho-XCENA jooho-XCENA force-pushed the feat/maru-connector branch from 93eb7a0 to 7f790ce on April 3, 2026 01:11
Comment thread lmcache/v1/storage_backend/maru_backend.py Outdated
Comment thread lmcache/v1/storage_backend/maru_backend.py Outdated
youngrok-XCENA and others added 2 commits April 3, 2026 10:32
- _async_store now uses handler.store() return value instead of
  unconditionally setting success=True, preventing CXL memory leak
  on server-side rejection
- Fix batched_async_contains docstring to reflect actual batch_pin
  RPC support

Signed-off-by: youngrok-XCENA <yr.song@xcena.com>

@cursor cursor Bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.


memory_objs: List[MemoryObj],
transfer_spec: Any = None,
on_complete_callback: Optional[Callable[[CacheEngineKey], None]] = None,
) -> Union[List[Future], None]:

Parameter name mismatch with abstract interface signature

Medium Severity

batched_submit_put_task uses parameter name memory_objs instead of objs as defined in AllocatorBackendInterface. Any caller using the keyword argument objs= would get a TypeError at runtime. This violates the interface contract and is a maintainability risk.
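For clarity, an abbreviated illustration of the flagged mismatch; the signatures are simplified, and only the keyword name objs versus memory_objs matters here.

# Simplified illustration of the interface mismatch reported above.
from typing import Any, List, Optional

class AllocatorBackendInterface:
    def batched_submit_put_task(
        self,
        keys: List[Any],
        objs: List[Any],
        transfer_spec: Any = None,
        on_complete_callback: Optional[Any] = None,
    ):
        raise NotImplementedError

class MaruBackendSketch(AllocatorBackendInterface):
    # Renaming the override's parameter from memory_objs to objs keeps
    # keyword calls like backend.batched_submit_put_task(keys, objs=objs)
    # working across all backends.
    def batched_submit_put_task(
        self,
        keys: List[Any],
        objs: List[Any],
        transfer_spec: Any = None,
        on_complete_callback: Optional[Any] = None,
    ):
        ...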


@deng451e deng451e enabled auto-merge (squash) April 3, 2026 01:58
@deng451e deng451e added the full Run comprehensive tests on this PR label Apr 3, 2026
Signed-off-by: jooho-xcena <jooho.lee@xcena.com>
auto-merge was automatically disabled April 3, 2026 01:59

Head branch was pushed to by a user without write access

@github-actions github-actions Bot removed the full Run comprehensive tests on this PR label Apr 3, 2026
@jooho-XCENA
Contributor Author

@DongDongJu
Done. Everything looks good.
Multi-node setup is in progress — I'll share the results with the community once it's ready!

@DongDongJu
Collaborator

@DongDongJu Done. Everything looks good. Multi-node setup is in progress — I'll share the results with the community once it's ready!

Sounds great.

@DongDongJu DongDongJu enabled auto-merge (squash) April 3, 2026 02:46
@github-actions github-actions Bot added the full Run comprehensive tests on this PR label Apr 3, 2026
@DongDongJu DongDongJu merged commit 5502419 into LMCache:dev Apr 3, 2026
37 checks passed