
[Core]: Support HND KV Format #2826

Merged
sammshen merged 28 commits into LMCache:dev from sammshen:permute-contiguous-registration
Mar 26, 2026

Conversation


@sammshen sammshen commented Mar 19, 2026

Test plan

  • E2E test with VLLM_KV_CACHE_LAYOUT=HND (Qwen2.5-3B): basic completion, KV cache
    reuse, and deterministic output after a cache reset (sketched below)
  • NHD regression test: deterministic output matches the baseline
  • E2E test with the FlashInfer backend (NL_X_NB_TWO_NH_BS_HS)
  • Multi-process (tensor-parallel) HND test
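
A minimal sketch of the HND determinism check, assuming greedy sampling and the public vLLM API (the model path and prompt are illustrative, and the real test exercises the LMCache connector rather than bare vLLM):

```python
# Sketch only: exercise VLLM_KV_CACHE_LAYOUT=HND end to end and check
# that a repeated (cache-hit) decode matches the first decode.
import os

os.environ["VLLM_KV_CACHE_LAYOUT"] = "HND"  # must be set before importing vLLM

from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-3B-Instruct")  # hypothetical model path
params = SamplingParams(temperature=0.0, max_tokens=64)  # greedy => deterministic

prompt = "Explain KV cache reuse in one paragraph."
first = llm.generate([prompt], params)[0].outputs[0].text
second = llm.generate([prompt], params)[0].outputs[0].text  # should hit the cache
assert first == second, "HND cached decode diverged from the baseline"
```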

Note

Medium Risk
Touches CUDA KV-cache transfer kernels and the multi-process registration protocol; incorrect layout detection or offset math could corrupt KV data or crash, but changes are localized and guarded by new checks/tests.

Overview
Adds end-to-end support for vLLM’s HND KV cache layout (heads before block size) alongside the existing NHD/MLA formats.

Updates the CUDA transfer path (multi_layer_kv_transfer / single_layer_kv_transfer) to compute correct offsets for the new HND GPUKVFormat values, adds the required head_size plumbing and validation, and exposes the new formats and head_size argument via pybind.

Extends the vLLM integration and multi-process server protocol to pass layout_hints (e.g. {"kv_layout": "HND"}), auto-detect layout at runtime, and permute non-contiguous vLLM HND tensors back to a contiguous physical view before pointer capture/format discovery. Adds unit tests covering HND round-trips and updates existing registration tests for the new payload shape.
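
A minimal sketch of the permute-to-contiguous step, assuming the KV caches arrive as a dict of per-layer tensors; the PR's actual helper is ensure_contiguous_kv_caches in lmcache/v1/gpu_connector/utils.py and may differ in detail:

```python
import torch

def ensure_contiguous_kv_caches(
    kv_caches: dict[str, torch.Tensor],
) -> dict[str, torch.Tensor]:
    """Sketch: restore the physical (contiguous) view of permuted KV tensors."""
    out = {}
    for name, t in kv_caches.items():
        if not t.is_contiguous():
            # A permuted view of a contiguous allocation becomes contiguous
            # again once its dims are reordered by descending stride.
            order = sorted(range(t.dim()), key=t.stride, reverse=True)
            t = t.permute(order)
            assert t.is_contiguous() and t.storage_offset() == 0
        out[name] = t
    return out
```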

Written by Cursor Bugbot for commit 9325b08.

@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request adds comprehensive support for the Head-Number-Dimension (HND) Key-Value (KV) cache format alongside the existing Number-Head-Dimension (NHD) format. This broadens compatibility with vLLM configurations and ensures KV cache data is correctly structured and efficiently transferred, particularly for inter-process communication, by dynamically adapting to the detected memory layout and enforcing data contiguity.

Highlights

  • New KV cache formats: introduced two new GPUKVFormat enum values (NL_X_TWO_NB_NH_BS_HS and NL_X_NB_TWO_NH_BS_HS) to explicitly support HND (Head-Number-Dimension) layouts for vLLM flash attention and FlashInfer.
  • HND format handling in CUDA kernels: modified the page_buffer_offset CUDA kernel to correctly calculate memory offsets for the new HND formats, adding a head_size parameter to the relevant kernel functions and their call sites (see the offset sketch after this list).
  • Dynamic KV layout detection and permutation: implemented logic to detect the vLLM KV cache layout ("NHD" or "HND") at runtime and, if HND is detected, permute the KV cache tensors to a contiguous physical shape before inter-process communication (IPC) wrapping.
  • Updated GPU connector and utility functions: extended Python utility functions and GPU connector classes to correctly interpret, manage, and transfer KV caches in the new HND formats.
  • IPC contiguity enforcement: CudaIPCWrapper now strictly asserts that tensors are contiguous and have a zero storage offset, relying on the new permutation logic to meet this condition for HND formats before IPC.
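
To make the offset change concrete, here is an illustrative Python rendition of the per-block offset math (element offsets, not bytes; the argument names are assumptions, and the real device code lives in csrc/mem_kernels.cu):

```python
def page_buffer_offset(kv_layout: str, block_idx: int, token_in_block: int,
                       head_idx: int, num_heads: int, block_size: int,
                       head_size: int) -> int:
    if kv_layout == "NHD":
        # Per block: [block_size, num_heads, head_size]
        return ((block_idx * block_size + token_in_block) * num_heads
                + head_idx) * head_size
    # "HND" per block: [num_heads, block_size, head_size] -- head_size
    # appears explicitly here, matching the new kernel parameter.
    return ((block_idx * num_heads + head_idx) * block_size
            + token_in_block) * head_size
```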

Comment thread lmcache/v1/multiprocess/custom_types.py Outdated
# blocks to have coalesced memory accesses
# do NOT blindly call .contiguous() nor .permute()
# we WANT to fail here when our assumptions fail
assert tensor.storage_offset() == 0
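
For context, a minimal sketch (the class shape is an assumption) of the guard CudaIPCWrapper enforces before capturing an IPC handle:

```python
import torch

class CudaIPCWrapper:
    """Sketch: wraps a CUDA tensor for IPC pointer capture, which assumes
    the tensor owns its storage from offset 0 and is physically contiguous."""

    def __init__(self, tensor: torch.Tensor):
        # Fail loudly rather than silently copying: a hidden .contiguous()
        # would break the coalesced-access assumptions noted above.
        assert tensor.is_contiguous()
        assert tensor.storage_offset() == 0
        self.tensor = tensor
```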
Contributor Author

@ApostaC changed this back :)

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request adds support for the HND KV-cache layout from vLLM. The changes are comprehensive, touching CUDA kernels, Python GPU connectors, and multiprocessing components to handle the new memory layout. Key changes include introducing new GPUKVFormat variants, updating offset calculation logic in CUDA kernels to handle HND, and adding logic to permute HND tensors to a contiguous format for IPC. The CudaIPCWrapper has also been simplified to enforce contiguity. Overall, the changes are well-implemented and consistent. I've provided a couple of suggestions to improve code maintainability by reducing duplication.

Comment thread csrc/mem_kernels.cu
Comment thread lmcache/v1/gpu_connector/utils.py
Comment thread csrc/mem_kernels.cu Outdated
Comment thread lmcache/v1/multiprocess/gpu_context.py
Comment thread lmcache/v1/gpu_connector/gpu_connectors.py Outdated
Comment thread lmcache/v1/gpu_connector/utils.py Outdated
Comment thread lmcache/v1/gpu_connector/gpu_connectors.py Outdated
Comment thread lmcache/v1/gpu_connector/utils.py Outdated
Comment thread lmcache/v1/gpu_connector/gpu_connectors.py Outdated
Comment thread lmcache/v1/gpu_connector/gpu_connectors.py
detected_format = None

if serving_engine == EngineType.VLLM:
kv_layout = layout_hints.get("kv_layout")
Contributor Author

need some really good documentation to ensure that whoever passes layout_hints knows exactly what needs to be passed

Contributor Author

@ApostaC introducing a new TypedDict! :)
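
For reference, a minimal sketch of what that TypedDict might look like, based on the kv_layout field quoted later in this thread (the exact field set and total=False are assumptions):

```python
from typing import Literal, TypedDict

class LayoutHints(TypedDict, total=False):
    """Hints the serving-engine integration passes to KV format discovery.

    kv_layout: "NHD" (vLLM default) or "HND" (VLLM_KV_CACHE_LAYOUT=HND).
    """

    kv_layout: Literal["NHD", "HND"]
```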

Comment thread csrc/pybind.cpp
Comment thread lmcache/v1/gpu_connector/utils.py Outdated

# Permute HND tensors to contiguous physical shape before IPC
# wrapping — CudaIPCWrapper asserts contiguity.
if kv_layout == "HND":
Contributor Author

blanket permute_kv_caches_to_contiguous

add a warning for now when we detect a non-contiguous tensor in a case we haven't accounted for

Contributor Author

i.e., is it non-contiguous because it's HND or for some unknown reason?

Contributor Author

done

self.gpu_kv_format = discover_gpu_kv_format(
kv_caches, EngineType.VLLM, layout_hints=layout_hints
)
if is_hnd(self.gpu_kv_format):
Contributor Author

also do a blanket permutation here.

also: double-check whether we should discover first or permute first

Contributor Author

permute-first semantics enforced
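
A sketch of the enforced ordering, reusing helper names that appear elsewhere in this PR (the wrapper name and import locations are assumptions):

```python
from lmcache.v1.gpu_connector.utils import (  # assumed import locations
    discover_gpu_kv_format,
    ensure_contiguous_kv_caches,
)

def permute_then_discover(kv_caches, layout_hints):
    """Hypothetical wrapper showing the enforced order of operations."""
    kv_caches = ensure_contiguous_kv_caches(kv_caches)  # permute first...
    gpu_kv_format = discover_gpu_kv_format(             # ...then discover
        kv_caches, EngineType.VLLM, layout_hints=layout_hints
    )
    return kv_caches, gpu_kv_format
```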

kv_caches, EngineType.VLLM, layout_hints=layout_hints
)
assert_is_vllm_flash_attn_or_flash_infer(self.gpu_kv_format)
if is_hnd(self.gpu_kv_format):
Contributor Author

all just permute

Contributor Author

done

Comment thread lmcache/v1/gpu_connector/utils.py
Comment thread lmcache/v1/gpu_connector/utils.py Outdated
Comment thread lmcache/v1/gpu_connector/gpu_connectors.py
@sammshen sammshen requested a review from ApostaC March 21, 2026 06:36
Comment thread lmcache/v1/multiprocess/gpu_context.py
@sammshen sammshen added the "full" label (Run comprehensive tests on this PR) on Mar 23, 2026
@sammshen sammshen force-pushed the permute-contiguous-registration branch from 8171574 to dbcdd55 on March 24, 2026 19:11
Comment thread lmcache/v1/gpu_connector/utils.py
Comment thread lmcache/v1/gpu_connector/utils.py
Comment thread lmcache/integration/vllm/vllm_multi_process_adapter.py Outdated
Comment thread lmcache/v1/gpu_connector/gpu_connectors.py Outdated
@sammshen sammshen mentioned this pull request Mar 25, 2026
@ApostaC ApostaC (Contributor) left a comment

Some comments regarding the high-level placement of the modules.

Comment thread lmcache/integration/vllm/vllm_multi_process_adapter.py Outdated
# First Party
from lmcache.v1.gpu_connector.utils import (
ensure_contiguous_kv_caches,
try_get_vllm_kv_cache_layout,
Contributor

Since try_get_vllm_kv_cache_layout is related to vLLM, can we put it under lmcache/integration/vllm instead of in the gpu_connector module?

In this case, layout_hints itself becomes an LMCache-standard interface, and setting the layout hints becomes the responsibility of the serving-engine integration.

Contributor Author

great suggestion! will do

Comment on lines +38 to +44
def _vllm_layout_hints() -> LayoutHints:
"""Build layout_hints dict by querying vLLM at runtime."""
hints: LayoutHints = {}
kv_layout = try_get_vllm_kv_cache_layout()
if kv_layout is not None:
hints["kv_layout"] = kv_layout
return hints
Contributor

Following up on the above comment: this can be moved to the vLLM integration folder.

Comment thread lmcache/v1/gpu_connector/gpu_connectors.py Outdated
from lmcache.v1.memory_management import GPUMemoryAllocator # noqa: E501
from lmcache.v1.memory_management import MemoryFormat, MemoryObj
from lmcache.v1.metadata import LMCacheMetadata
from lmcache.v1.multiprocess.custom_types import LayoutHints
Contributor

It becomes a bit weird to have gpu_connector importing things from the multiprocess module. Do you have a better idea of where to place the type definition?

Contributor

Probably put it into gpu_connector/utils.py?

Contributor Author

great catch, let me double check all of the import locations again!

Comment thread lmcache/v1/gpu_connector/gpu_connectors.py Outdated
Comment thread lmcache/v1/gpu_connector/gpu_connectors.py
Comment thread lmcache/v1/gpu_connector/gpu_connectors.py
@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 2 potential issues.


``"HND"`` — heads before block-size (``VLLM_KV_CACHE_LAYOUT=HND``).
"""

kv_layout: Literal["NHD", "HND"]

Missing SPDX header comment in test handler helpers file

Low Severity

The new public class LayoutHints has a docstring but omits the Args / Returns / Exceptions sections required by the style guide for all new public types. The docstring describes the class purpose but only documents the kv_layout key informally in a Keys: section rather than as a standard docstring format. This is a minor documentation gap per the project's convention rules for public API documentation.


Triggered by project rule: LMCache Code Review Style Guide

Comment thread lmcache/integration/vllm/vllm_service_factory.py Outdated
@ApostaC ApostaC (Contributor) left a comment

LGTM!

@deng451e deng451e self-requested a review March 26, 2026 19:24
@sammshen sammshen merged commit 8769ef4 into LMCache:dev Mar 26, 2026
44 of 46 checks passed
jooho-XCENA pushed a commit to xcena-dev/LMCache that referenced this pull request Apr 2, 2026
* Support the HND format from vLLM

Signed-off-by: Samuel Shen <slshen@tensormesh.ai>
