[MP] feat: support different kv cache shape and dtype across layers #2926
maobaolong merged 2 commits into LMCache:dev
Conversation
Code Review
This pull request implements support for multiple KV layer groups with distinct shapes and dtypes. Key changes include group-aware memory copy operations in gpu_ops.py, a new method in KVLayerGroupsManager to build groups from tensor lists, and a refactored GPUCacheContext that manages per-group buffers and pointers. The multiprocess server now performs transfers iteratively across these groups. Feedback was provided to replace a platform-dependent and potentially unsafe use of array.array and torch.frombuffer with a direct torch.tensor call for collecting data pointers to ensure 64-bit compatibility and memory safety.
```python
import array
return torch.frombuffer(array.array("l", pointers), dtype=torch.long)
```
The use of array.array("l", ...) is platform-dependent; on some systems (like 64-bit Windows), long is 32-bit, which would truncate 64-bit pointers. Additionally, torch.frombuffer does not manage the lifetime of the underlying array.array object, which could lead to memory safety issues if the tensor is used after the temporary array is garbage collected. It is safer and more idiomatic to use torch.tensor(..., dtype=torch.long). This also removes the need for the local import array.
Suggested change:
```diff
-import array
-return torch.frombuffer(array.array("l", pointers), dtype=torch.long)
+return torch.tensor([kv_caches[i].data_ptr() for i in group.layer_indices], dtype=torch.long)
```
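For reference, a tiny self-contained illustration of the recommended pattern (the tensors below are placeholders, not the PR's variables):

```python
import torch

# Collect 64-bit data pointers directly into a torch.long tensor; no
# intermediate array.array and no lifetime issues from torch.frombuffer.
tensors = [torch.empty(4), torch.empty(8)]
pointers = torch.tensor([t.data_ptr() for t in tensors], dtype=torch.long)
assert pointers.dtype == torch.int64  # torch.long is always 64-bit
```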
Force-pushed e5ea198 to e53a747
Force-pushed e53a747 to 634f596
@liuyumoye It seems there is a conflict with dev now, would you like to resolve the conflict first? Hope to merge this PR so that MP mode can support DSA.
Force-pushed 75af69c to 14a720d
ApostaC left a comment
High-level comments:
- Tmp buffer and lmcache_async_memcpy_h2d/d2h should not be aware of kv cache group information. This can reduce the number of code changes by a lot.
- Please add some unit tests for 132 int8s to test the kernel support (a rough sketch follows this comment).
Please see the detailed comments below
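On the second point above, a rough sketch of the kind of round-trip check such a test might perform; this uses plain torch only and does not reproduce the lmc_ops kernel call that the real test would exercise:

```python
import torch

def test_int8_head_size_132_roundtrip():
    # head_size=132 with int8 elements is 132 bytes per head: divisible by 4
    # but not by 16, which is exactly the alignment case under discussion.
    if not torch.cuda.is_available():
        return  # skip on CPU-only machines
    src = torch.randint(-128, 128, (2, 256, 132), dtype=torch.int8)
    gpu = src.to("cuda", non_blocking=True)
    back = gpu.to("cpu")
    torch.cuda.synchronize()
    assert torch.equal(src, back)
```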
I don't think we need to touch this file.
lmcache_memcpy_async_h2d/d2h doesn't need to know the layout inside the memory object, and it should be called outside the for group in kv_groups: loop.
```python
# Backward-compat scalar aliases (group 0)
self.hidden_dim_size_ = self.hidden_dim_sizes_[0]
self.num_heads_ = self.group_num_heads_[0]
self.head_size_ = self.group_head_sizes_[0]
self.shape_desc_ = self.shape_descs_[0]
```
Do we really need to keep this backward compatibility? I feel like we can force all the code in server.py to use the new interfaces.
```diff
@@ -119,17 +154,27 @@ def __init__(
         0, self.block_size_, dtype=torch.long, device=self.device_
     ).unsqueeze(0)
     self.slot_mapping_tensor_ = (offsets + block_ids * self.block_size_).reshape(
-        (self.num_blocks, self.block_size_)
+        (self.num_blocks_, self.block_size_)
     )
```
Actually, this can be dropped.
And also the old slot_mapping-related APIs.
```diff
-        tmp_buffer_shape = self.get_kv_buffer_shape(
-            lmcache_chunk_size * self.max_batch_size
-        )
-        self.tmp_gpu_buffer_ = torch.empty(
-            tmp_buffer_shape, dtype=self.dtype, device=self.device_
-        )
+        self.tmp_gpu_buffers_: list[torch.Tensor] = [
+            torch.empty(
+                self.get_kv_buffer_shape(
+                    lmcache_chunk_size * self.max_batch_size, group_idx
+                ),
+                dtype=group.dtype,
+                device=self.device_,
+            )
+            for group_idx, group in enumerate(
+                self.kv_layer_groups_manager_.kv_layer_groups
+            )
+        ]
+        # Single-group alias for backward compatibility
+        self.tmp_gpu_buffer_ = self.tmp_gpu_buffers_[0]
```
As I mentioned above, we don't need to create a tmp_gpu_buffer for each group; a single "flat" one covering all the groups is enough.
We can have a helper function called something like _get_kv_buffer_shape_unified_group() to get the shapes.
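A rough sketch of what such a helper might look like, assuming the per-group attributes this PR introduces (kv_layer_groups_manager_.kv_layer_groups and get_kv_buffer_shape(num_tokens, group_idx)) and a flat uint8 layout; this is the reviewer's idea sketched out, not code from the PR:

```python
import math
import torch

def _get_kv_buffer_shape_unified_group(self, num_tokens: int) -> torch.Size:
    # Sum the byte sizes of every per-group buffer so one flat uint8 buffer
    # can hold a single chunk for all layer groups at once.
    total_bytes = 0
    for group_idx, group in enumerate(
        self.kv_layer_groups_manager_.kv_layer_groups
    ):
        shape = self.get_kv_buffer_shape(num_tokens, group_idx)
        elem_size = torch.empty((), dtype=group.dtype).element_size()
        total_bytes += math.prod(shape) * elem_size
    return torch.Size([total_bytes])
```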
```python
for group_idx in range(num_groups):
    tmp_buffers = gpu_context.get_tmp_gpu_buffer_batched(
        self.chunk_size, batch_len, group_idx
    )
    group_kv_pointers = gpu_context.get_group_kv_pointers(group_idx)

    # H2D copy for all chunks in the batch
    for tmp_buffer, memory_obj in zip(
        tmp_buffers, memory_obj_batch, strict=False
    ):
        lmcache_memcpy_async_h2d(memory_obj, tmp_buffer, group_idx)

    lmc_ops.multi_layer_block_kv_transfer(
        group_kv_pointers,
        [tb.data_ptr() for tb in tmp_buffers],
        chunk_block_ids_gpu,
        gpu_context.device,
        lmc_ops.TransferDirection.H2D,
        gpu_context.get_shape_desc(group_idx),
        self.chunk_size,
        gpu_context.gpu_kv_format_,
        skip_blocks_in_chunk,
    )
```
With my proposal above, the code will be something like this:
(The per-group tmp buffer allocation and per-group H2D copies shown above are dropped; the memcpy runs once on the flat buffer, and only the kernel launch is per-group.)

```python
# H2D copy for all chunks in the batch, using the single flat tmp buffer
# (no group awareness in the memcpy helper)
tmp_buffers = gpu_context.get_tmp_gpu_buffer_batched(
    self.chunk_size, batch_len
)
for tmp_buffer, memory_obj in zip(
    tmp_buffers, memory_obj_batch, strict=False
):
    lmcache_memcpy_async_h2d(memory_obj, tmp_buffer)

for group_idx in range(num_groups):
    group_kv_pointers = gpu_context.get_group_kv_pointers(group_idx)
    # New code to get buffer offset from gpu_context by group_idx
    tmp_buffer_offsets = gpu_context.get_tmp_gpu_buffer_offset(group_idx)
    lmc_ops.multi_layer_block_kv_transfer(
        group_kv_pointers,
        [tb.data_ptr() + tmp_buffer_offsets for tb in tmp_buffers],
        chunk_block_ids_gpu,
        gpu_context.device,
        lmc_ops.TransferDirection.H2D,
        gpu_context.get_shape_desc(group_idx),
        self.chunk_size,
        gpu_context.gpu_kv_format_,
        skip_blocks_in_chunk,
    )
```
Force-pushed 14a720d to ecf01b6
Force-pushed d4ba284 to c363473
Force-pushed c363473 to 2b762e3
```python
self.hidden_dim_sizes_.append(hidden_dim)
self.group_num_heads_.append(nh)
self.group_head_sizes_.append(hs)
```
Unused attributes stored but never read
Low Severity
group_num_heads_ and group_head_sizes_ are populated in the constructor but never read anywhere in the codebase. These lists are dead stores — the equivalent values (nh and hs) are already stored inside each PageBufferShapeDesc in shape_descs_, which is what callers actually use. Keeping unused state in the class adds confusion for future maintainers who may wonder where these are consumed.
Triggered by project rule: LMCache Code Review Style Guide
Reviewed by Cursor Bugbot for commit 2b762e3.
Force-pushed 2b762e3 to b108af3
Thanks for pointing that out! I've resolved the merge conflict with the latest dev branch. The PR is ready for review again. Please let me know if there are any other issues.
ApostaC left a comment
All are nit comments. Otherwise LGTM!
nit note: usually we put #define and #undef outside the function body.
```python
"""
assert memory_obj.tensor is not None
assert memory_obj.tensor.numel() == gpu_buffer.numel()
src_tensor = memory_obj.raw_data
```
nit: not sure whether MemoryObj.raw_data is a public & stable API or not. But I do see there is a data_ptr() property defined in the MemoryObj base class. We can use that directly when calling lmc_ops.lmcache_memcpy_async instead.
```python
"""
assert memory_obj.tensor is not None
assert memory_obj.tensor.numel() == gpu_buffer.numel()
dst_tensor = memory_obj.raw_data
```
```python
    given group."""
    return self.group_kv_pointers_[group_idx]

def get_tmp_gpu_buffer_flat(self, chunk_idx: int = 0) -> torch.Tensor:
```
nit: let's avoid using a default parameter for chunk_idx. We should make sure that the caller understands it needs to pass in chunk_idx because it directly relates to the batching logic.
```python
The returned slice has exactly ``tmp_chunk_bytes_`` bytes and its
layout matches ``MemoryObj.raw_data`` (groups concatenated in order),
so it can be copied to/from a MemoryObj with a single memcpy.
```
nit: Let's avoid using tmp_chunk_bytes_ and MemoryObj.raw_data in the docstring to avoid confusion for other developers. Ideally, we can say something like:
The returned tensor will fit a full memory object corresponding to ``self.chunk_size`` tokens.
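As a side note on the concatenated layout, a rough sketch of how a per-group view could be carved out of such a flat uint8 slice; every name below is illustrative (offsets, shapes, and dtypes are assumed to come from the per-group shape descriptors), not the PR's actual helper:

```python
import math
import torch

def group_view(flat_chunk: torch.Tensor, offsets: list[int],
               shapes: list[tuple[int, ...]], dtypes: list[torch.dtype],
               group_idx: int) -> torch.Tensor:
    # flat_chunk is a 1-D uint8 buffer with all groups concatenated in order.
    elem_size = torch.empty((), dtype=dtypes[group_idx]).element_size()
    nbytes = math.prod(shapes[group_idx]) * elem_size
    start = offsets[group_idx]
    # Reinterpret the byte slice as the group's dtype, then reshape it.
    # Assumes `start` is aligned to elem_size, as a packed layout would ensure.
    return (flat_chunk[start:start + nbytes]
            .view(dtypes[group_idx])
            .view(shapes[group_idx]))
```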
```python
num_elems = shape.numel()
return self.tmp_gpu_buffer_.flatten()[:num_elems].view(shape)

Returns a view of the temporary GPU buffer for the given group,
sized for a single request of ``num_tokens`` tokens (chunk 0).
```
nit: num_tokens --> lmcache_chunk_size. Also, the (chunk 0) at the end is a bit confusing.
```diff
     "num_layers": ctx.num_layers,
     "block_size": ctx.block_size,
-    "hidden_dim_size": ctx.hidden_dim_size,
+    "hidden_dim_sizes": str(ctx.hidden_dim_sizes_),
```
nit: we should not use private members here. Let's have a property defined as hidden_dim_sizes in GPUCacheContext
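A minimal sketch of such a property on GPUCacheContext (the name follows the reviewer's suggestion; it is illustrative, not the merged code):

```python
@property
def hidden_dim_sizes(self) -> list[int]:
    # Read-only public copy so describe()/server code never touches the
    # private hidden_dim_sizes_ list directly.
    return list(self.hidden_dim_sizes_)
```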
Force-pushed b108af3 to 91c4630
- gpu_ops: add group_idx param to lmcache_memcpy_async_h2d/d2h, use memory_obj.get_tensor(group_idx) instead of memory_obj.tensor
- kv_layer_groups: add build_kv_layer_groups_from_list() to group layers by (shape, dtype) from a plain tensor list
- gpu_context: introduce per-group shape_descs_, hidden_dim_sizes_, group_kv_pointers_, and tmp_gpu_buffers_; update get_kv_buffer_shape, get_tmp_gpu_buffer, get_tmp_gpu_buffer_batched to accept group_idx; add get_shape_desc(group_idx) and get_group_kv_pointers(group_idx)
- server: update get_layout_desc, _store_loop, _retrieve_loop to iterate over all groups; fix skip_tokens_in_chunk upper bound to use batch_len instead of _BATCH_SIZE

Signed-off-by: liuyumoye <adeline_ly2023@outlook.com>
Force-pushed 91c4630 to cbf4d52
Thanks for the review! All nit comments have been addressed.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
There are 2 total unresolved issues (including 1 from previous review).
Reviewed by Cursor Bugbot for commit cbf4d52.
```python
lmc_ops.TransferDirection.D2H,
gpu_context.get_shape_desc(group_idx),
self.chunk_size,
gpu_context.gpu_kv_format_,
```
Private member access across class boundary in enforced directory
Medium Severity
New code in server.py accesses gpu_context.gpu_kv_format_ (a private _-suffixed attribute) from outside GPUCacheContext. This violates the project's SLF rule, which is enforced by CI in lmcache/v1/multiprocess/. GPUCacheContext exposes gpu_kv_format_name() for the string name but has no public accessor for the format enum itself. A public property like gpu_kv_format is needed to pass the value to the kernel without cross-class private member access.
Triggered by project rule: LMCache Code Review Style Guide
Reviewed by Cursor Bugbot for commit cbf4d52.
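A public accessor along the lines Bugbot suggests could look like this (a sketch; the property name and its placement are assumed, not part of the PR):

```python
@property
def gpu_kv_format(self):
    # Expose the KV format enum publicly so server.py can pass it to the
    # kernel without reaching into the private gpu_kv_format_ attribute.
    return self.gpu_kv_format_
```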
maobaolong left a comment
@liuyumoye Thanks for this feature, LGTM.
```cpp
TORCH_CHECK(head_bytes % sizeof(uint16_t) == 0, "head_size * element_size (",
            head_bytes, ") must be divisible by 2 for vectorized access");

if (head_bytes % sizeof(uint4) == 0) {
```
@liuyumoye could we add some comments to indicate how many bytes?
```cpp
            head_bytes, ") must be divisible by 2 for vectorized access");

if (head_bytes % sizeof(uint4) == 0) {
  LAUNCH_TEMPLATED(uint4);
```
Besides, could we add 8-byte or 1-byte copies?
After discussing offline, there is no need to add 8-byte or 1-byte copies.
Add scalar type fallback hierarchy for block KV transfer kernel:

- head_bytes % 16 == 0 -> uint4 (16B, fastest)
- head_bytes % 4 == 0 -> uint32_t (4B)
- head_bytes % 2 == 0 -> uint16_t (2B)

This fixes the runtime error for MLA models where head_size=132 (uint8), giving head_bytes=132, which is not divisible by 16 but is divisible by 4.

Signed-off-by: liuyumoye <adeline_ly2023@outlook.com>
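To make the fallback rule concrete, here is a small Python illustration of the selection logic (the real dispatch happens in the CUDA kernel via the LAUNCH_TEMPLATED macro; this only sketches the arithmetic):

```python
def pick_vector_width(head_bytes: int) -> int:
    # Widest aligned type wins: 16B (uint4), then 4B (uint32_t), then 2B
    # (uint16_t); anything else fails, matching the kernel's TORCH_CHECK.
    if head_bytes % 16 == 0:
        return 16
    if head_bytes % 4 == 0:
        return 4
    if head_bytes % 2 == 0:
        return 2
    raise ValueError("head_bytes must be divisible by 2 for vectorized access")

# MLA example from the commit message: head_size=132 with uint8 elements.
assert pick_vector_width(132) == 4  # 132 is not divisible by 16, but is by 4
```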
Force-pushed cbf4d52 to 89ece3c
So can I use MP LMCache when I deploy GLM-5 now?
@princepride yes.
Thank you! Have you seen I joined the latest
I left comments on your Slack
Signed-off-by: baoloongmao <baoloongmao@tencent.com>
* fix typo bug

  Signed-off-by: princepride <wangzhipeng628@gmail.com>

* fix: rename hidden_dim_size to hidden_dim_sizes in describe and server

  Align with the rename introduced in #2926 where hidden_dim_size was changed to hidden_dim_sizes (List[int]) to support kv_groups.

  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
  Signed-off-by: princepride <wangzhipeng628@gmail.com>

* fix: update test fixture to use hidden_dim_sizes key

  Update test fixture and assertion in test_describe.py to match the hidden_dim_size -> hidden_dim_sizes rename from #2926.

  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
  Signed-off-by: princepride <wangzhipeng628@gmail.com>

---------

Signed-off-by: princepride <wangzhipeng628@gmail.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
…MCache#2926)

* multiprocess: support per-group KV cache transfer with group_idx

  - gpu_ops: add group_idx param to lmcache_memcpy_async_h2d/d2h, use memory_obj.get_tensor(group_idx) instead of memory_obj.tensor
  - kv_layer_groups: add build_kv_layer_groups_from_list() to group layers by (shape, dtype) from a plain tensor list
  - gpu_context: introduce per-group shape_descs_, hidden_dim_sizes_, group_kv_pointers_, and tmp_gpu_buffers_; update get_kv_buffer_shape, get_tmp_gpu_buffer, get_tmp_gpu_buffer_batched to accept group_idx; add get_shape_desc(group_idx) and get_group_kv_pointers(group_idx)
  - server: update get_layout_desc, _store_loop, _retrieve_loop to iterate over all groups; fix skip_tokens_in_chunk upper bound to use batch_len instead of _BATCH_SIZE

  Signed-off-by: liuyumoye <adeline_ly2023@outlook.com>

* fix: support vectorized KV transfer for non-16B-aligned head sizes

  Add scalar type fallback hierarchy for block KV transfer kernel:
  head_bytes % 16 == 0 -> uint4 (16B, fastest)
  head_bytes % 4 == 0 -> uint32_t (4B)
  head_bytes % 2 == 0 -> uint16_t (2B)

  This fixes the runtime error for MLA models where head_size=132 (uint8), giving head_bytes=132 which is not divisible by 16 but is divisible by 4.

  Signed-off-by: liuyumoye <adeline_ly2023@outlook.com>

---------

Signed-off-by: liuyumoye <adeline_ly2023@outlook.com>
Co-authored-by: liuyumoye <adeline_ly2023@outlook.com>


support different kv cache shape and dtype across layers
What this PR does / why we need it:
This PR adds support for heterogeneous KV cache shapes and dtypes across layers (e.g., models where different layers have different KV head dimensions or data types).
Previously, GPUCacheContext assumed all layers share the same shape and dtype. This PR introduces KVLayerGroupsManager to group layers by (shape, dtype), and updates the D2H/H2D transfer logic in server.py to iterate over each group independently, using per-group tmp_gpu_buffer, kv_pointers, and tensor views.
Key changes:
- kv_layer_groups.py: Add build_kv_layer_groups_from_list() to build layer groups from a raw list of KV cache tensors (no layer names required), grouping by (shape, dtype).
- gpu_context.py: Replace the single hidden_dim_size_ / tmp_gpu_buffer_ with per-group lists (hidden_dim_sizes_, tmp_gpu_buffers_); expose get_tmp_gpu_buffer(num_tokens, group_idx), get_kv_buffer_shape(num_tokens, group_idx), and a kv_layer_groups_manager property.
- server.py: Update get_layout_desc() to produce per-group shapes/dtypes; refactor the D2H (store) and H2D (retrieve) loops to iterate over all groups.
- memory_management.py: Improve error handling in the tensor property and get_tensor(): replace bare assert with a descriptive ValueError; make tensor fall back to get_tensor(0) in multi-group scenarios.
- mock_l2_adapter.py: Replace bare assert with ValueError for cleaner error messages.

Special notes for your reviewers:
- The grouping logic in build_kv_layer_groups_from_list() is order-preserving: groups are sorted by the first layer index they contain (a sketch of this grouping idea follows these notes).
- get_tmp_gpu_buffer and get_kv_buffer_shape are backward-compatible: group_idx defaults to 0, so single-group models are unaffected.
The tensor property on TensorMemoryObj now delegates to get_tensor(0) when per-group metadata is present, maintaining backward compatibility with callers that only use the single-tensor interface.
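A minimal sketch of the grouping idea under the behavior described above; the real build_kv_layer_groups_from_list() lives on KVLayerGroupsManager and returns group objects rather than bare index lists, so treat this purely as an illustration:

```python
import torch

def group_layers_by_shape_and_dtype(kv_caches: list[torch.Tensor]) -> list[list[int]]:
    # Group layer indices by (shape, dtype); dict insertion order keeps
    # groups sorted by the first layer index they contain.
    groups: dict[tuple, list[int]] = {}
    for layer_idx, kv in enumerate(kv_caches):
        key = (tuple(kv.shape), kv.dtype)
        groups.setdefault(key, []).append(layer_idx)
    return list(groups.values())

# Example: three fp16 layers plus one MLA-style uint8 layer with head size 132.
caches = [torch.empty(2, 8, 64, dtype=torch.float16) for _ in range(3)]
caches.append(torch.empty(2, 8, 132, dtype=torch.uint8))
assert group_layers_by_shape_and_dtype(caches) == [[0, 1, 2], [3]]
```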
Note
Medium Risk
Updates CUDA transfer kernels and the MP server store/retrieve path to handle per-layer-group shapes/dtypes, which can affect correctness and performance of GPU<->CPU KV transfers. Added tests mitigate risk but changes touch core memory copy and layout logic.
Overview
Enables multiprocessing cache store/retrieve to support heterogeneous KV cache shapes and dtypes across layers by grouping layers with identical (shape, dtype) and transferring each group independently. GPUCacheContext now builds KVLayerGroupsManager, maintains per-group PageBufferShapeDesc and pointer arrays, and replaces the old single temporary KV buffer with a flat uint8 chunk buffer that concatenates all groups (with helpers to view per-group/per-batch slices). server.py's layout description, store (D2H), and retrieve (H2D) paths are updated to iterate over groups and copy via the new flat buffer.

CUDA multi_layer_block_kv_transfer is generalized to dispatch over vector widths (uint4/uint32_t/uint16_t) based on alignment instead of requiring uint4, and the Python GPU memcpy helpers now validate sizes and copy raw bytes for non-lazy allocators. New unit tests cover multi-group temp-buffer layout and non-4-byte-aligned lmcache_memcpy_async copies (e.g. int8 hidden size 132).

Reviewed by Cursor Bugbot for commit 89ece3c. Bugbot is set up for automated code reviews on this repo.