Add /dev/dax (Device-DAX) backend for KV cache storage #2714

jayhpark530 wants to merge 4 commits into LMCache:dev
Conversation
Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request enhances LMCache's capabilities by integrating a new backend for /dev/dax devices. This allows the KV cache to leverage emerging byte-addressable memory technologies, offering flexible configurations for memory tiering and potentially improving performance by reducing data movement between CPU and GPU memory. The implementation includes memory management and error handling to ensure reliable operation within the LMCache ecosystem.

Highlights
Activity
Code Review
This pull request introduces a new /dev/dax storage backend for the KV cache, supporting both a tiered mode (as a backing store for CPU memory) and a primary mode (as a zero-copy primary allocator for direct GPU access). While the implementation is generally well-structured with good resource management and testing, a security audit identified two medium-severity vulnerabilities. First, a resource leak may occur due to the use of WeakKeyDictionary in the cleanup path, potentially leading to resource exhaustion and Denial of Service. Second, the device_path configuration is used in a file open operation without validation, posing a risk of path traversal or arbitrary file access. It is recommended to use weakref.finalize for robust resource cleanup and to implement strict path validation for the device_path. Additionally, a critical correctness issue was found where ctypes.memmove is used for potential GPU-to-host transfers, which is unsafe and should be replaced with a device-aware copy mechanism like torch.Tensor.copy_. A minor style improvement also suggests moving a local import to the top level. Addressing these issues will enhance the backend's robustness, stability, and security, particularly in primary mode.
```python
def _do_write(self, offset: int, memory_obj: MemoryObj, size: int) -> None:
    ctypes.memmove(
        ctypes.c_void_p(self._base_ptr + offset),
        ctypes.c_void_p(memory_obj.data_ptr),
        ctypes.c_size_t(size),
    )

def _do_read(self, offset: int, memory_obj: MemoryObj, size: int) -> None:
    ctypes.memmove(
        ctypes.c_void_p(memory_obj.data_ptr),
        ctypes.c_void_p(self._base_ptr + offset),
        ctypes.c_size_t(size),
    )
```
The _do_write and _do_read methods use ctypes.memmove for data transfers. This is only safe for host-to-host memory copies. If the source MemoryObj in _do_write or the destination MemoryObj in _do_read is backed by a GPU tensor, memmove will receive a GPU pointer and likely cause a segmentation fault, as it's not designed for device memory.
Given that this backend is intended to work with GPU-based models and the "primary" mode aims for direct GPU-DAX transfers, using a device-agnostic copy mechanism is crucial. I suggest using torch.Tensor.copy_ which correctly handles copies between different devices (CPU, CUDA). This would make the implementation more robust and prevent crashes when operating with GPU tensors.
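To make the host-vs-device distinction concrete, here is a standalone host-only sketch of the semantics `ctypes.memmove` can safely provide (illustrative, not LMCache code):

```python
import ctypes

def host_copy(dst: bytearray, src: bytes, size: int) -> None:
    # ctypes.memmove is a plain CPU memcpy: both arguments must reference
    # host memory. Passing a CUDA device pointer here would segfault,
    # which is exactly the hazard described above.
    dst_buf = (ctypes.c_char * size).from_buffer(dst)
    ctypes.memmove(dst_buf, src, size)
```

With CPU-backed buffers this mirrors what the original `_do_write` does; a device-aware copy such as `torch.Tensor.copy_` is needed once either side may live on the GPU.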
```python
def _do_write(self, offset: int, memory_obj: MemoryObj, size: int) -> None:
    dst_view = self._arena_tensor.narrow(0, offset, size)
    src_view = memory_obj.raw_tensor.narrow(0, 0, size)
    dst_view.copy_(src_view, non_blocking=True)

def _do_read(self, offset: int, memory_obj: MemoryObj, size: int) -> None:
    dst_view = memory_obj.raw_tensor.narrow(0, 0, size)
    src_view = self._arena_tensor.narrow(0, offset, size)
    dst_view.copy_(src_view, non_blocking=True)
```

```python
def _release_memory_obj(self, memory_obj: MemoryObj) -> None:
    with self._state_lock:
        state = self._mark_memory_obj_released_locked(memory_obj)
        if state is None:
            return
        self._release_arena_handle_locked(state.handle)
        state.lease.release()
The DaxBackend uses a WeakKeyDictionary (self._memory_obj_states) to store the state (including the arena lease and slot handle) of MemoryObj instances. When a MemoryObj is garbage collected, its __del__ method calls the allocator's free method, which in turn calls _release_memory_obj. However, because the object is already being finalized, it may have been removed from the WeakKeyDictionary before the lookup occurs, causing _mark_memory_obj_released_locked to return None. This results in leaked DAX arena leases and un-reclaimable memory slots. Over time, this can lead to resource exhaustion (file descriptors and memory mappings) and a Denial of Service (DoS) as the backend runs out of available slots.
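The `weakref.finalize` recommendation can be sketched in isolation. In this sketch, `ArenaLease` and `register_cleanup` are hypothetical names, not LMCache's API; the point is that the finalizer captures the handle and lease directly instead of looking them up through a weak dictionary during `__del__`:

```python
import gc
import weakref

class ArenaLease:
    """Stand-in for the backend's lease object (hypothetical name)."""
    def __init__(self) -> None:
        self.released = False

    def release(self) -> None:
        self.released = True

def register_cleanup(memory_obj, handle, lease, release_handle):
    """Attach a finalizer that releases the slot handle and lease.

    Unlike a WeakKeyDictionary lookup in __del__, weakref.finalize keeps
    `handle` and `lease` alive on its own and is guaranteed to run at
    most once, even during interpreter shutdown.
    """
    def _cleanup() -> None:
        release_handle(handle)
        lease.release()

    return weakref.finalize(memory_obj, _cleanup)
```

The finalizer object must be kept referenced (e.g., stored on the backend) so it survives until the tracked object dies.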
```python
def _release_memory_obj(self, memory_obj: MemoryObj) -> None:
    # Note: Relying on WeakKeyDictionary during __del__ is unreliable.
    # Consider using weakref.finalize for robust cleanup or storing state
    # directly on the object.
    with self._state_lock:
        state = self._mark_memory_obj_released_locked(memory_obj)
        if state is None:
            # If the object is already being finalized and removed from the
            # weak dict, we may need an alternative way to retrieve its
            # handle and lease.
            return
        self._release_arena_handle_locked(state.handle)
        state.lease.release()
```

```python
self.device_path = str(extra.get("dax.device_path", "")).strip()
if not self.device_path:
    raise ValueError("extra_config['dax.device_path'] is required")
```
The device_path parameter, obtained from the extra_config dictionary, is passed directly to os.open without any validation or sanitization. An attacker who can control the configuration (e.g., through environment variables or a configuration file in a multi-tenant environment) could point this to arbitrary files on the system. Since the file is opened with os.O_RDWR and subsequently mapped with PROT_WRITE, this could allow an attacker to read from or overwrite sensitive system files, leading to information disclosure or system compromise.
```python
self.device_path = str(extra.get("dax.device_path", "")).strip()
if not self.device_path:
    raise ValueError("extra_config['dax.device_path'] is required")
if not os.path.abspath(self.device_path).startswith("/dev/"):
    raise ValueError(
        f"Invalid device path: {self.device_path}. Path must be within /dev/"
    )
```

```python
    arena_tensor = torch.frombuffer(arena_view, dtype=torch.uint8)
else:
    # Third Party
    import numpy as np
```
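Returning to the `device_path` check above: a prefix test on the raw string can be defeated by a symlink inside `/dev` that points at an arbitrary file. A hedged sketch of a stricter variant (`validate_dax_path` is an illustrative helper, not part of this PR; the `/dev/dax` prefix reflects the kernel's Device-DAX naming convention, `/dev/daxX.Y`):

```python
import os

def validate_dax_path(device_path: str) -> str:
    """Resolve symlinks, then require a Device-DAX node under /dev.

    os.path.realpath() closes the symlink hole that a plain
    startswith() check on the unresolved path leaves open.
    """
    resolved = os.path.realpath(device_path)
    if not resolved.startswith("/dev/dax"):
        raise ValueError(
            f"not a Device-DAX node: {device_path!r} resolves to {resolved!r}"
        )
    return resolved
```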
DongDongJu left a comment
Hello @jayhpark530,

Thanks for the terrific work! IMO, this change set seems too big to merge at once. Could you break it down into a smaller series? Also, it seems hard for the primary allocator to show an advantage over the DRAM-based one.
Thanks for the feedback, @DongDongJu! I agree that the change set is quite large to merge at once. I believe the primary allocator could have advantages in some scenarios, but the benefits over the DRAM-based allocator are not yet clear, and it significantly increases the PR size.
A new PR for the basic tiered Device-DAX backend is available here: #2788

Thanks. I will take a look. Could you close this one for now?

Sure, I’ll close this for now. Appreciate it!
…2714) (#2788)

* feat(kv_cache): add Device-DAX backend
* cleanup: remove dead code
* Fix issues raised in Gemini code review
* fix(dax): use refcounted pinning and tighten DAX validation/docs
* docs(dax): add missing docstrings and reorganize helper methods
* fix: resolve pre-commit lint, format, and type errors in DAX backend

  Fix ruff E501 line-length violations, ruff format inconsistencies, isort import ordering, and mypy type errors across dax_backend.py and test_dax_backend.py so that CI code quality checks pass cleanly.

* Add DAX arena field comments

Signed-off-by: JaeHyeong Park <tino.park@samsung.com>
Signed-off-by: jayhpark530 <jayhpark530@gmail.com>
Fixes #2572
What this PR does / why we need it
This PR implements the `/dev/dax` (Device-DAX) backend discussed in RFC issue #2572. The goal is to enable KV cache storage on byte-addressable memory devices exposed through `/dev/dax`. Since LMCache already supports multiple storage backends, adding a DAX backend enables experimentation with KV cache tiering on emerging heterogeneous memory systems.
The backend maps a `/dev/dax` device into userspace and uses the mapped region as a KV cache arena managed by LMCache. Two usage modes are supported:
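The mapping step can be sketched standalone (an illustrative helper, not the PR's code; a regular file stands in for the device when experimenting without DAX hardware):

```python
import mmap
import os

def map_dax_arena(device_path: str, size: int) -> memoryview:
    """Map `size` bytes of a DAX device (or any file) read-write.

    Real Device-DAX nodes require the length to be a multiple of the
    device's alignment (commonly 2 MiB); a regular file works for testing.
    """
    fd = os.open(device_path, os.O_RDWR)
    try:
        # MAP_SHARED so stores reach the backing device, not a private copy.
        buf = mmap.mmap(fd, size, flags=mmap.MAP_SHARED)
    finally:
        # The mapping remains valid after the fd is closed.
        os.close(fd)
    return memoryview(buf)
```

The returned `memoryview` is the kind of byte-addressable arena the backend can carve into KV cache slots.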
1. Tiered backend mode

`/dev/dax` is used as a storage tier after `LocalCPUBackend`, similar to other LMCache storage backends. This is the recommended configuration.
2. Primary backend mode

Unlike traditional storage backends, `/dev/dax` exposes byte-addressable memory that can be directly mapped into userspace. Because of this property, the mapped DAX arena can also serve as an allocator backend in LMCache:

- `/dev/dax` replaces the DRAM tier and acts as both the allocator and storage backend
- MemoryObjs are backed by the mapped region

When supported by the platform, the backend registers the mapped address range using `cudaHostRegister`, allowing GPU ↔ DAX transfers to bypass an additional DRAM staging copy. As discussed in the RFC thread, this mode is platform-dependent, so it is implemented as an opt-in fast path.
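The opt-in fast path can be probed at runtime. Below is a hedged ctypes sketch: the helper name `try_host_register` and the fallback policy are assumptions, though `cudaHostRegister(ptr, size, flags)` is the CUDA Runtime's real signature:

```python
import ctypes
import ctypes.util

def try_host_register(ptr: int, size: int) -> bool:
    """Try to pin [ptr, ptr+size) with cudaHostRegister.

    Returns False when the CUDA runtime is unavailable or registration
    fails, signaling the caller to fall back to staged copies via DRAM.
    """
    libname = ctypes.util.find_library("cudart")
    if libname is None:
        return False  # no CUDA runtime installed on this machine
    try:
        cudart = ctypes.CDLL(libname)
    except OSError:
        return False
    # cudaError_t cudaHostRegister(void *ptr, size_t size, unsigned int flags);
    # flags=0 is cudaHostRegisterDefault.
    err = cudart.cudaHostRegister(
        ctypes.c_void_p(ptr), ctypes.c_size_t(size), ctypes.c_uint(0)
    )
    return err == 0  # 0 == cudaSuccess
```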
Potential benefits include:

- Using the `/dev/dax` mmap arena directly as KV cache storage

Key features

- `/dev/dax` storage backend implemented as an LMCache storage plugin
- Primary allocator mode via the `AllocatorBackendInterface`

Documentation
Documentation for the DAX backend configuration and usage is included: