[Core]: Support HND KV Format #2826
Conversation
Summary of Changes
Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request significantly enhances the system's flexibility by introducing comprehensive support for the Head-Number-Dimension (HND) Key-Value (KV) cache format, alongside the existing Number-Head-Dimension (NHD) format. This enables broader compatibility with various vLLM configurations and ensures that KV cache data is correctly structured and efficiently transferred, particularly for inter-process communication, by dynamically adapting to the detected memory layout and enforcing data contiguity.
# blocks to have coalesced memory accesses
# do NOT blindly call .contiguous() nor .permute()
# we WANT to fail here when our assumptions fail
assert tensor.storage_offset() == 0
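The comments above are worth unpacking: calling `.contiguous()` on a permuted view silently copies the data into a fresh allocation, which would move the pointers that the IPC path later captures. A small torch sketch of that failure mode (shapes are illustrative):

```python
import torch

# A contiguous buffer and an HND-style permuted view of it: same storage,
# different strides, no copy yet.
t = torch.arange(24).reshape(2, 3, 4)
v = t.permute(1, 0, 2)

assert t.storage_offset() == 0
assert not v.is_contiguous()  # the view is non-contiguous

# Blindly "fixing" it copies the data into a new allocation, so any
# previously captured pointer would now reference the wrong memory.
c = v.contiguous()
assert c.data_ptr() != t.data_ptr()
```

Failing loudly on `storage_offset() != 0` instead of silently copying keeps the captured device pointers trustworthy.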
Code Review
This pull request adds support for the HND KV-cache layout from vLLM. The changes are comprehensive, touching CUDA kernels, Python GPU connectors, and multiprocessing components to handle the new memory layout. Key changes include introducing new GPUKVFormat variants, updating offset calculation logic in CUDA kernels to handle HND, and adding logic to permute HND tensors to a contiguous format for IPC. The CudaIPCWrapper has also been simplified to enforce contiguity. Overall, the changes are well-implemented and consistent. I've provided a couple of suggestions to improve code maintainability by reducing duplication.
detected_format = None

if serving_engine == EngineType.VLLM:
    kv_layout = layout_hints.get("kv_layout")
This needs really good documentation to ensure that whoever passes the layout_hints knows exactly what needs to be provided.
@ApostaC introducing a new TypedDict! :)
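A minimal sketch of what such a TypedDict could look like (the key name matches the `kv_layout` hint used in this PR; the docstring wording is illustrative):

```python
from typing import Literal, TypedDict


class LayoutHints(TypedDict, total=False):
    """Optional layout hints passed by the serving-engine integration.

    Keys:
        kv_layout: "NHD" (tokens before heads) or "HND" (heads before
            tokens), mirroring vLLM's VLLM_KV_CACHE_LAYOUT setting.
    """

    kv_layout: Literal["NHD", "HND"]


# total=False makes every key optional, so an empty hints dict is valid.
hints: LayoutHints = {"kv_layout": "HND"}
empty: LayoutHints = {}
```

A TypedDict gives static checkers (and readers) the exact contract without changing the runtime dict-based plumbing.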
# Permute HND tensors to contiguous physical shape before IPC
# wrapping — CudaIPCWrapper asserts contiguity.
if kv_layout == "HND":
Do a blanket permute_kv_caches_to_contiguous instead, and add a warning for now when we detect non-contiguity in a case we haven't accounted for.
That is, the warning should say whether it is HND or non-contiguous for some unknown reason.
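One way to read the "blanket permute" suggestion above is a helper that permutes any stride-permuted view back to its physical order (a pure view operation, no copy) and warns when contiguity still cannot be restored. A sketch under that assumption, not the PR's actual implementation:

```python
import warnings

import torch


def permute_kv_caches_to_contiguous(t: torch.Tensor) -> torch.Tensor:
    """Reorder dimensions by descending stride so the logical order
    matches the physical memory layout. Pure view operation: no copy."""
    if t.is_contiguous():
        return t
    order = sorted(range(t.dim()), key=t.stride, reverse=True)
    v = t.permute(order)
    if not v.is_contiguous():
        # Not a simple dimension permutation (e.g. sliced storage):
        # a case we haven't accounted for, so warn instead of copying.
        warnings.warn("KV cache tensor is non-contiguous for an unknown reason")
    return v


# An HND-style view becomes contiguous again without any data movement.
hnd = torch.zeros(2, 16, 4, 8).permute(0, 2, 1, 3)
fixed = permute_kv_caches_to_contiguous(hnd)
assert fixed.is_contiguous()
assert fixed.data_ptr() == hnd.data_ptr()  # same storage, no copy
```

Because the result shares storage with the input, pointer capture for IPC stays valid, which is exactly why a permute is safe here while `.contiguous()` is not.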
self.gpu_kv_format = discover_gpu_kv_format(
    kv_caches, EngineType.VLLM, layout_hints=layout_hints
)
if is_hnd(self.gpu_kv_format):
Also do a blanket permutation here. Also: double-check whether we should discover first or permute first.
permute first semantics enforced
    kv_caches, EngineType.VLLM, layout_hints=layout_hints
)
assert_is_vllm_flash_attn_or_flash_infer(self.gpu_kv_format)
if is_hnd(self.gpu_kv_format):
Signed-off-by: Samuel Shen <slshen@tensormesh.ai>
Force-pushed from 8171574 to dbcdd55
ApostaC
left a comment
Some comments regarding high-level placement for the modules.
# First Party
from lmcache.v1.gpu_connector.utils import (
    ensure_contiguous_kv_caches,
    try_get_vllm_kv_cache_layout,
Since try_get_vllm_kv_cache_layout is related to vLLM, can we put it under lmcache/integration/vllm instead of in the gpu_connector module?
In that case, layout_hints itself becomes an LMCache-standard interface, and setting the layout hints becomes the responsibility of the serving-engine integration.
great suggestion! will do
def _vllm_layout_hints() -> LayoutHints:
    """Build layout_hints dict by querying vLLM at runtime."""
    hints: LayoutHints = {}
    kv_layout = try_get_vllm_kv_cache_layout()
    if kv_layout is not None:
        hints["kv_layout"] = kv_layout
    return hints
Following up on the above comment, this can be moved to the vLLM integration folder.
from lmcache.v1.memory_management import GPUMemoryAllocator  # noqa: E501
from lmcache.v1.memory_management import MemoryFormat, MemoryObj
from lmcache.v1.metadata import LMCacheMetadata
from lmcache.v1.multiprocess.custom_types import LayoutHints
It becomes a bit weird to have gpu_connector importing things from the multiprocess module. Do you have a better idea to place the type definition?
Probably put it into gpu_connector/utils.py?
great catch, let me double check all of the import locations again!
Cursor Bugbot has reviewed your changes and found 2 potential issues.
``"HND"`` — heads before block-size (``VLLM_KV_CACHE_LAYOUT=HND``).
"""

kv_layout: Literal["NHD", "HND"]
Missing SPDX header comment in test handler helpers file
Low Severity
The new public class LayoutHints has a docstring but omits the Args / Returns / Exceptions sections required by the style guide for all new public types. The docstring describes the class purpose but only documents the kv_layout key informally in a Keys: section rather than as a standard docstring format. This is a minor documentation gap per the project's convention rules for public API documentation.
Triggered by project rule: LMCache Code Review Style Guide
* Support the HND format from vLLM Signed-off-by: Samuel Shen <slshen@tensormesh.ai>
Test plan
VLLM_KV_CACHE_LAYOUT=HND (Qwen2.5-3B): basic completion, KV cache reuse, deterministic output after cache reset
NL_X_NB_TWO_NH_BS_HS)

Note
Medium Risk
Touches CUDA KV-cache transfer kernels and the multi-process registration protocol; incorrect layout detection or offset math could corrupt KV data or crash, but changes are localized and guarded by new checks/tests.
Overview
Adds end-to-end support for vLLM’s HND KV cache layout (heads-before-block) alongside existing NHD/MLA formats.
Updates the CUDA transfer path (multi_layer_kv_transfer / single_layer_kv_transfer) to compute correct offsets for the new HND GPUKVFormats, adds the required head_size plumbing/validation, and exposes the new formats and the head_size argument via pybind. Extends the vLLM integration and multi-process server protocol to pass layout_hints (e.g. {"kv_layout": "HND"}), auto-detect the layout at runtime, and permute non-contiguous vLLM HND tensors back to a contiguous physical view before pointer capture/format discovery. Adds unit tests covering HND round-trips and updates existing registration tests for the new payload shape.
Written by Cursor Bugbot for commit 9325b08. This will update automatically on new commits.
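The offset change the transfer kernels need can be illustrated in plain Python. This is a sketch of the indexing rule only (the real CUDA kernels operate on raw device pointers, and the function name is illustrative):

```python
def kv_elem_offset(block_id: int, token: int, head: int,
                   block_size: int, num_heads: int, head_size: int,
                   layout: str) -> int:
    """Element offset of (token, head) inside a paged KV cache.

    NHD: each block is laid out as [block_size, num_heads, head_size]
    HND: each block is laid out as [num_heads, block_size, head_size]
    """
    base = block_id * block_size * num_heads * head_size
    if layout == "NHD":
        return base + (token * num_heads + head) * head_size
    if layout == "HND":
        return base + (head * block_size + token) * head_size
    raise ValueError(f"unknown layout: {layout}")


# The same (block, token, head) lands at a different address per layout,
# which is why the kernels must know the format (and hence head_size).
nhd = kv_elem_offset(0, token=2, head=1,
                     block_size=16, num_heads=4, head_size=8, layout="NHD")
hnd = kv_elem_offset(0, token=2, head=1,
                     block_size=16, num_heads=4, head_size=8, layout="HND")
```

Note that NHD can compute its inner offset from `num_heads` alone, while HND swaps the two inner dimensions, so `head_size` and `block_size` both enter the term, matching the new head_size plumbing described above.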