[MP]feat: support different kv cache shape and dtype across layers #2926

Merged
maobaolong merged 2 commits into LMCache:dev from liuyumoye:lmcache_support_dsa on Apr 7, 2026

Conversation

@liuyumoye (Contributor) commented Apr 1, 2026

Support different KV cache shapes and dtypes across layers

What this PR does / why we need it:
This PR adds support for heterogeneous KV cache shapes and dtypes across layers (e.g., models where different layers have different KV head dimensions or data types).

Previously, GPUCacheContext assumed that all layers shared the same shape and dtype. This PR introduces KVLayerGroupsManager to group layers by (shape, dtype) and updates the D2H/H2D transfer logic in server.py to iterate over each group independently, using per-group tmp_gpu_buffer, kv_pointers, and tensor views.
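As a rough illustration of the grouping idea (a minimal sketch; the dataclass and function names below are illustrative stand-ins, not the actual LMCache classes):

```python
from dataclasses import dataclass, field

import torch


@dataclass
class KVLayerGroup:  # illustrative stand-in for LMCache's group type
    shape: tuple[int, ...]
    dtype: torch.dtype
    layer_indices: list[int] = field(default_factory=list)


def group_layers_by_shape_and_dtype(kv_caches: list[torch.Tensor]) -> list[KVLayerGroup]:
    """Group layers with identical (shape, dtype), preserving first-appearance order."""
    groups: dict[tuple, KVLayerGroup] = {}
    for idx, kv in enumerate(kv_caches):
        key = (tuple(kv.shape), kv.dtype)
        if key not in groups:
            groups[key] = KVLayerGroup(tuple(kv.shape), kv.dtype)
        groups[key].layer_indices.append(idx)
    # dicts preserve insertion order, so groups come out sorted by the
    # first layer index they contain, matching the PR's ordering guarantee
    return list(groups.values())
```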

Key changes:

kv_layer_groups.py: Add build_kv_layer_groups_from_list() to build layer groups from a raw list of KV cache tensors (no layer names required), grouping by (shape, dtype).

gpu_context.py: Replace single hidden_dim_size_ / tmp_gpu_buffer_ with per-group lists (hidden_dim_sizes_, tmp_gpu_buffers_); expose get_tmp_gpu_buffer(num_tokens, group_idx), get_kv_buffer_shape(num_tokens, group_idx), and kv_layer_groups_manager property.

server.py: Update get_layout_desc() to produce per-group shapes/dtypes; refactor D2H (store) and H2D (retrieve) loops to iterate over all groups.

memory_management.py: Improve error handling in tensor property and get_tensor() — replace bare assert with descriptive ValueError; make tensor fall back to get_tensor(0) in multi-group scenarios.

mock_l2_adapter.py: Replace bare assert with ValueError for cleaner error messages.

Special notes for your reviewers:
The grouping logic in build_kv_layer_groups_from_list() is order-preserving: groups are sorted by the first layer index they contain.

get_tmp_gpu_buffer and get_kv_buffer_shape are backward-compatible — group_idx defaults to 0, so single-group models are unaffected.

The tensor property on TensorMemoryObj now delegates to get_tensor(0) when per-group metadata is present, maintaining backward compatibility with callers that only use the single-tensor interface.
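A minimal sketch of that fallback, assuming a hypothetical per-group list field (group_tensors_ below is an illustrative name, not the actual member):

```python
import torch


class TensorMemoryObj:
    # group_tensors_: list of per-group tensor views (name is hypothetical)

    @property
    def tensor(self) -> torch.Tensor:
        # Multi-group objects keep old single-tensor callers working by
        # delegating to group 0.
        return self.get_tensor(0)

    def get_tensor(self, group_idx: int = 0) -> torch.Tensor:
        # Descriptive ValueError instead of a bare assert, per the PR.
        if not 0 <= group_idx < len(self.group_tensors_):
            raise ValueError(
                f"group_idx {group_idx} out of range: "
                f"object has {len(self.group_tensors_)} layer groups"
            )
        return self.group_tensors_[group_idx]
```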

If applicable:

  • this PR contains user-facing changes - docs added
  • this PR contains unit tests

Note

Medium Risk
Updates CUDA transfer kernels and the MP server store/retrieve path to handle per-layer-group shapes/dtypes, which can affect correctness and performance of GPU<->CPU KV transfers. Added tests mitigate risk but changes touch core memory copy and layout logic.

Overview
Enables multiprocessing cache store/retrieve to support heterogeneous KV cache shapes and dtypes across layers by grouping layers with identical (shape, dtype) and transferring each group independently.

GPUCacheContext now builds KVLayerGroupsManager, maintains per-group PageBufferShapeDesc and pointer arrays, and replaces the old single temporary KV buffer with a flat uint8 chunk buffer that concatenates all groups (with helpers to view per-group/per-batch slices). server.py’s layout description, store (D2H), and retrieve (H2D) paths are updated to iterate over groups and copy via the new flat buffer.
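In outline, the flat chunk buffer works like this (a sketch under assumed names such as hidden_dim; the real GPUCacheContext helpers differ in detail):

```python
import torch


def build_flat_chunk_buffer(groups, chunk_tokens: int, device: str = "cuda"):
    """One uint8 buffer concatenating every group's bytes for a chunk, plus offsets."""
    offsets, total = [], 0
    for g in groups:  # g.hidden_dim / g.dtype are assumed attributes
        nbytes = chunk_tokens * g.hidden_dim * g.dtype.itemsize
        offsets.append(total)
        total += nbytes
    return torch.empty(total, dtype=torch.uint8, device=device), offsets


def group_view(flat, offsets, groups, group_idx: int, chunk_tokens: int):
    """Reinterpret one group's byte slice with that group's own dtype and shape."""
    g = groups[group_idx]
    nbytes = chunk_tokens * g.hidden_dim * g.dtype.itemsize
    byte_slice = flat[offsets[group_idx] : offsets[group_idx] + nbytes]
    return byte_slice.view(g.dtype).view(chunk_tokens, g.hidden_dim)
```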

CUDA multi_layer_block_kv_transfer is generalized to dispatch over vector widths (uint4/uint32_t/uint16_t) based on alignment instead of requiring uint4, and Python GPU memcpy helpers now validate sizes and copy raw bytes for non-lazy allocators. New unit tests cover multi-group temp-buffer layout and non-4-byte-aligned lmcache_memcpy_async copies (e.g. int8 hidden size 132).
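The alignment dispatch reduces to a check like the following (a Python restatement of the C++ logic for clarity; the kernel itself compares against sizeof(uint4), sizeof(uint32_t), and sizeof(uint16_t)):

```python
def pick_vector_width(head_size: int, element_size: int) -> int:
    """Widest vector type whose size divides head_bytes: 16B uint4, 4B uint32_t, 2B uint16_t."""
    head_bytes = head_size * element_size
    for width in (16, 4, 2):
        if head_bytes % width == 0:
            return width
    raise ValueError(f"head_bytes ({head_bytes}) must be divisible by 2")


# e.g. the MLA case from the fix commit: head_size=132 with int8 elements
assert pick_vector_width(132, 1) == 4  # 132 % 16 != 0 but 132 % 4 == 0
```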

Reviewed by Cursor Bugbot for commit 89ece3c.

@gemini-code-assist (Bot) left a comment

Code Review

This pull request implements support for multiple KV layer groups with distinct shapes and dtypes. Key changes include group-aware memory copy operations in gpu_ops.py, a new method in KVLayerGroupsManager to build groups from tensor lists, and a refactored GPUCacheContext that manages per-group buffers and pointers. The multiprocess server now performs transfers iteratively across these groups. Feedback was provided to replace a platform-dependent and potentially unsafe use of array.array and torch.frombuffer with a direct torch.tensor call for collecting data pointers to ensure 64-bit compatibility and memory safety.

Comment thread lmcache/v1/kv_layer_groups.py Outdated
Comment on lines +205 to +206
```python
import array
return torch.frombuffer(array.array("l", pointers), dtype=torch.long)
```
Severity: medium

The use of array.array("l", ...) is platform-dependent; on some systems (like 64-bit Windows), long is 32-bit, which would truncate 64-bit pointers. Additionally, torch.frombuffer does not manage the lifetime of the underlying array.array object, which could lead to memory safety issues if the tensor is used after the temporary array is garbage collected. It is safer and more idiomatic to use torch.tensor(..., dtype=torch.long). This also removes the need for the local import array.

Suggested change:

```python
# before
import array
return torch.frombuffer(array.array("l", pointers), dtype=torch.long)

# after
return torch.tensor(
    [kv_caches[i].data_ptr() for i in group.layer_indices], dtype=torch.long
)
```
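As a side note on the suggestion: torch.tensor copies the pointer values into tensor-owned storage, so no temporary Python object has to outlive the tensor, and torch.long is always 64-bit regardless of platform, which avoids the truncation risk of array.array("l", ...).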

Comment thread lmcache/v1/multiprocess/server.py
Comment thread lmcache/v1/kv_layer_groups.py Outdated
@liuyumoye force-pushed the lmcache_support_dsa branch from e5ea198 to e53a747 on April 1, 2026 06:53
@liuyumoye changed the title from "feat: support different kv cache shape and dtype across layers" to "[MP]feat: support different kv cache shape and dtype across layers" on Apr 1, 2026
@liuyumoye force-pushed the lmcache_support_dsa branch from e53a747 to 634f596 on April 2, 2026 07:02
Comment thread lmcache/v1/multiprocess/server.py Outdated
@maobaolong (Collaborator) commented

@liuyumoye It seems this branch has conflicts with dev now; would you like to resolve them first? I hope to merge this PR so that MP mode can support DSA.

@liuyumoye force-pushed the lmcache_support_dsa branch 2 times, most recently from 75af69c to 14a720d on April 2, 2026 14:05
Comment thread lmcache/v1/gpu_connector/gpu_ops.py Outdated
Comment thread lmcache/v1/kv_layer_groups.py
@ApostaC (Contributor) left a comment

High-level comments:

  • The tmp buffer and lmcache_async_memcpy_h2d/d2h should not be aware of KV cache group information. This would reduce the number of code changes by a lot.
  • Please add a unit test with 132 int8s to exercise the kernel support.

Please see the detailed comments below

Comment thread csrc/mp_mem_kernels.cu
Comment (Contributor):

I don't think we need to touch this file.
lmcache_memcpy_async_h2d/d2h doesn't need to know the layout inside the memory object, and it should be called outside the `for group in kv_groups:` loop.

Comment thread lmcache/v1/multiprocess/gpu_context.py Outdated
Comment on lines +134 to +138
```python
# Backward-compat scalar aliases (group 0)
self.hidden_dim_size_ = self.hidden_dim_sizes_[0]
self.num_heads_ = self.group_num_heads_[0]
self.head_size_ = self.group_head_sizes_[0]
self.shape_desc_ = self.shape_descs_[0]
```
Comment (Contributor):

Do we really need to keep this backward compatibility? I feel like we can force all the code in server.py to use the new interfaces.

Comment thread lmcache/v1/multiprocess/gpu_context.py Outdated
Comment on lines 147 to 158
```diff
@@ -119,17 +154,27 @@ def __init__(
         0, self.block_size_, dtype=torch.long, device=self.device_
     ).unsqueeze(0)
     self.slot_mapping_tensor_ = (offsets + block_ids * self.block_size_).reshape(
-        (self.num_blocks, self.block_size_)
+        (self.num_blocks_, self.block_size_)
     )
```
Comment (Contributor):

Actually, this can be dropped.

Comment (Contributor):

And also the old slot_mapping-related APIs.

Comment thread lmcache/v1/multiprocess/gpu_context.py Outdated
Comment on lines +127 to +177
```python
tmp_buffer_shape = self.get_kv_buffer_shape(
    lmcache_chunk_size * self.max_batch_size
)
self.tmp_gpu_buffer_ = torch.empty(
    tmp_buffer_shape, dtype=self.dtype, device=self.device_
)
self.tmp_gpu_buffers_: list[torch.Tensor] = [
    torch.empty(
        self.get_kv_buffer_shape(
            lmcache_chunk_size * self.max_batch_size, group_idx
        ),
        dtype=group.dtype,
        device=self.device_,
    )
    for group_idx, group in enumerate(
        self.kv_layer_groups_manager_.kv_layer_groups
    )
]
# Single-group alias for backward compatibility
self.tmp_gpu_buffer_ = self.tmp_gpu_buffers_[0]
```
Comment (Contributor):

As I mentioned above, we don't need a tmp_gpu_buffer per group; a single "flat" one covering all the groups is enough.

Comment (Contributor):

We can have a helper function called something like _get_kv_buffer_shape_unified_group() to get the shapes.
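A possible shape helper along those lines (a sketch under assumed attribute names such as hidden_dim; not the merged code):

```python
def _get_kv_buffer_shape_unified_group(self, num_tokens: int) -> torch.Size:
    """Flat uint8 shape large enough to hold every group's bytes for num_tokens."""
    total_bytes = sum(
        num_tokens * group.hidden_dim * group.dtype.itemsize  # hidden_dim assumed
        for group in self.kv_layer_groups_manager_.kv_layer_groups
    )
    return torch.Size([total_bytes])
```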

Comment on lines +475 to +497
```python
for group_idx in range(num_groups):
    tmp_buffers = gpu_context.get_tmp_gpu_buffer_batched(
        self.chunk_size, batch_len, group_idx
    )
    group_kv_pointers = gpu_context.get_group_kv_pointers(group_idx)

    # H2D copy for all chunks in the batch
    for tmp_buffer, memory_obj in zip(
        tmp_buffers, memory_obj_batch, strict=False
    ):
        lmcache_memcpy_async_h2d(memory_obj, tmp_buffer, group_idx)

    lmc_ops.multi_layer_block_kv_transfer(
        group_kv_pointers,
        [tb.data_ptr() for tb in tmp_buffers],
        chunk_block_ids_gpu,
        gpu_context.device,
        lmc_ops.TransferDirection.H2D,
        gpu_context.get_shape_desc(group_idx),
        self.chunk_size,
        gpu_context.gpu_kv_format_,
        skip_blocks_in_chunk,
    )
```
Comment (Contributor):

With my proposal above, the code will be something like this:

Suggested change (replacing the per-group copy loop above):

```python
# H2D copy for all chunks in the batch
tmp_buffers = gpu_context.get_tmp_gpu_buffer_batched(
    self.chunk_size, batch_len
)
lmcache_memcpy_async_h2d(memory_obj, tmp_buffer, group_idx)
for group_idx in range(num_groups):
    group_kv_pointers = gpu_context.get_group_kv_pointers(group_idx)
    ### New code to get buffer offset from gpu_context by group_idx
    tmp_buffer_offsets = gpu_context.get_tmp_gpu_buffer_offset(group_idx)
    lmc_ops.multi_layer_block_kv_transfer(
        group_kv_pointers,
        [tb.data_ptr() + tmp_buffer_offsets for tb in tmp_buffers],
        chunk_block_ids_gpu,
        gpu_context.device,
        lmc_ops.TransferDirection.H2D,
        gpu_context.get_shape_desc(group_idx),
        self.chunk_size,
        gpu_context.gpu_kv_format_,
        skip_blocks_in_chunk,
    )
```

@liuyumoye force-pushed the lmcache_support_dsa branch from 14a720d to ecf01b6 on April 3, 2026 14:46
Comment thread lmcache/v1/multiprocess/gpu_context.py
Comment thread lmcache/v1/multiprocess/gpu_context.py Outdated
Comment thread lmcache/v1/multiprocess/server.py Outdated
@liuyumoye force-pushed the lmcache_support_dsa branch 3 times, most recently from d4ba284 to c363473 on April 7, 2026 02:52
Comment thread lmcache/v1/multiprocess/gpu_context.py
@liuyumoye force-pushed the lmcache_support_dsa branch from c363473 to 2b762e3 on April 7, 2026 02:59
@liuyumoye requested a review from ApostaC on April 7, 2026 02:59

```python
self.hidden_dim_sizes_.append(hidden_dim)
self.group_num_heads_.append(nh)
self.group_head_sizes_.append(hs)
```

Unused attributes stored but never read

Low Severity

group_num_heads_ and group_head_sizes_ are populated in the constructor but never read anywhere in the codebase. These lists are dead stores — the equivalent values (nh and hs) are already stored inside each PageBufferShapeDesc in shape_descs_, which is what callers actually use. Keeping unused state in the class adds confusion for future maintainers who may wonder where these are consumed.


Reviewed by Cursor Bugbot for commit 2b762e3.

@liuyumoye force-pushed the lmcache_support_dsa branch from 2b762e3 to b108af3 on April 7, 2026 03:18
@liuyumoye (Contributor, Author) commented

> @liuyumoye It seems this branch has conflicts with dev now; would you like to resolve them first? I hope to merge this PR so that MP mode can support DSA.

Thanks for pointing that out! I've resolved the merge conflict with the latest dev branch. The PR is ready for review again. Please let me know if there are any other issues.

@ApostaC (Contributor) left a comment

All are nit comments. Otherwise LGTM!

Comment thread csrc/mp_mem_kernels.cu
Comment (Contributor):

nit note: usually we put #define and #undef outside the function body.

Comment thread lmcache/v1/gpu_connector/gpu_ops.py Outdated
"""
assert memory_obj.tensor is not None
assert memory_obj.tensor.numel() == gpu_buffer.numel()
src_tensor = memory_obj.raw_data
Comment (Contributor):

nit: I'm not sure whether MemoryObj.raw_data is a public & stable API. But I do see a data_ptr() property defined in the MemoryObj base class; we can use that directly when calling lmc_ops.lmcache_memcpy_async instead.

Comment thread lmcache/v1/gpu_connector/gpu_ops.py Outdated
"""
assert memory_obj.tensor is not None
assert memory_obj.tensor.numel() == gpu_buffer.numel()
dst_tensor = memory_obj.raw_data
Comment (Contributor):

same nit comment as above

Comment thread lmcache/v1/multiprocess/gpu_context.py Outdated
given group."""
return self.group_kv_pointers_[group_idx]

def get_tmp_gpu_buffer_flat(self, chunk_idx: int = 0) -> torch.Tensor:
Comment (Contributor):

nit: let's avoid using a default parameter for chunk_idx. We should make sure that the caller understands it needs to pass in chunk_idx because it directly relates to the batching logic.

Comment thread lmcache/v1/multiprocess/gpu_context.py Outdated
Comment on lines +298 to +300
```
The returned slice has exactly ``tmp_chunk_bytes_`` bytes and its
layout matches ``MemoryObj.raw_data`` (groups concatenated in order),
so it can be copied to/from a MemoryObj with a single memcpy.
```
Comment (Contributor):

nit: Let's avoid using tmp_chunk_bytes_ and MemoryObj.raw_data in docstring to avoid confusion for other developers. Ideally, we can say something like:

The returned tensor will fit a memory full object corresponding ``self.chunk_size`` tokens.

Comment thread lmcache/v1/multiprocess/gpu_context.py Outdated
```diff
-num_elems = shape.numel()
-return self.tmp_gpu_buffer_.flatten()[:num_elems].view(shape)
+Returns a view of the temporary GPU buffer for the given group,
+sized for a single request of ``num_tokens`` tokens (chunk 0).
```
Comment (Contributor):

nit: num_tokens --> lmcache_chunk_size. Also, the (chunk 0) at the end is a bit confusing.

Comment thread lmcache/v1/multiprocess/server.py Outdated
"num_layers": ctx.num_layers,
"block_size": ctx.block_size,
"hidden_dim_size": ctx.hidden_dim_size,
"hidden_dim_sizes": str(ctx.hidden_dim_sizes_),
Comment (Contributor):

nit: we should not use private members here. Let's have a property defined as hidden_dim_sizes in GPUCacheContext.

@liuyumoye force-pushed the lmcache_support_dsa branch from b108af3 to 91c4630 on April 7, 2026 08:53

Commit: multiprocess: support per-group KV cache transfer with group_idx
- gpu_ops: add group_idx param to lmcache_memcpy_async_h2d/d2h,
  use memory_obj.get_tensor(group_idx) instead of memory_obj.tensor
- kv_layer_groups: add build_kv_layer_groups_from_list() to group
  layers by (shape, dtype) from a plain tensor list
- gpu_context: introduce per-group shape_descs_, hidden_dim_sizes_,
  group_kv_pointers_, and tmp_gpu_buffers_; update get_kv_buffer_shape,
  get_tmp_gpu_buffer, get_tmp_gpu_buffer_batched to accept group_idx;
  add get_shape_desc(group_idx) and get_group_kv_pointers(group_idx)
- server: update get_layout_desc, _store_loop, _retrieve_loop to
  iterate over all groups; fix skip_tokens_in_chunk upper bound to
  use batch_len instead of _BATCH_SIZE

Signed-off-by: liuyumoye <adeline_ly2023@outlook.com>
@liuyumoye force-pushed the lmcache_support_dsa branch from 91c4630 to cbf4d52 on April 7, 2026 08:59
@liuyumoye (Contributor, Author) commented

> All are nit comments. Otherwise LGTM!

Thanks for the review! All nit comments have been addressed.

@cursor (Bot) left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

There are 2 total unresolved issues (including 1 from previous review).

Reviewed by Cursor Bugbot for commit cbf4d52.

```python
    lmc_ops.TransferDirection.D2H,
    gpu_context.get_shape_desc(group_idx),
    self.chunk_size,
    gpu_context.gpu_kv_format_,
```

Private member access across class boundary in enforced directory

Medium Severity

New code in server.py accesses gpu_context.gpu_kv_format_ (a private _-suffixed attribute) from outside GPUCacheContext. This violates the project's SLF rule, which is enforced by CI in lmcache/v1/multiprocess/. GPUCacheContext exposes gpu_kv_format_name() for the string name but has no public accessor for the format enum itself. A public property like gpu_kv_format is needed to pass the value to the kernel without cross-class private member access.
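A public accessor along the lines the finding suggests would be a one-liner (a sketch, not the merged fix):

```python
class GPUCacheContext:
    # ... existing fields elided ...

    @property
    def gpu_kv_format(self):
        """Public read-only accessor for the private gpu_kv_format_ enum."""
        return self.gpu_kv_format_
```

server.py could then pass gpu_context.gpu_kv_format to the kernel without reaching across the class boundary.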

Additional Locations (1)

Reviewed by Cursor Bugbot for commit cbf4d52.

@maobaolong (Collaborator) left a comment

@liuyumoye Thanks for this feature, LGTM.

Comment thread csrc/mp_mem_kernels.cu
```cpp
TORCH_CHECK(head_bytes % sizeof(uint16_t) == 0, "head_size * element_size (",
            head_bytes, ") must be divisible by 2 for vectorized access");

if (head_bytes % sizeof(uint4) == 0) {
```
Comment (Collaborator):

@liuyumoye could we add some comments to indicate how many bytes?

Comment thread csrc/mp_mem_kernels.cu Outdated
head_bytes, ") must be divisible by 2 for vectorized access");

if (head_bytes % sizeof(uint4) == 0) {
LAUNCH_TEMPLATED(uint4);
Comment (Collaborator):

Besides, could we add an 8-byte or a 1-byte copy?

Comment (Collaborator):

After discussing offline, there is no need to add 8-byte or 1-byte variants.

Commit: fix: support vectorized KV transfer for non-16B-aligned head sizes

Add scalar type fallback hierarchy for block KV transfer kernel:
  head_bytes % 16 == 0  -> uint4    (16B, fastest)
  head_bytes % 4  == 0  -> uint32_t (4B)
  head_bytes % 2  == 0  -> uint16_t (2B)

This fixes the runtime error for MLA models where head_size=132 (uint8),
giving head_bytes=132 which is not divisible by 16 but is divisible by 4.

Signed-off-by: liuyumoye <adeline_ly2023@outlook.com>
@liuyumoye force-pushed the lmcache_support_dsa branch from cbf4d52 to 89ece3c on April 7, 2026 11:28
@chunxiaozheng (Collaborator) left a comment

LGTM!

@maobaolong enabled auto-merge (squash) on April 7, 2026 12:54
@github-actions (Bot) added the full (Run comprehensive tests on this PR) label on Apr 7, 2026
@maobaolong merged commit 28c33b9 into LMCache:dev on Apr 7, 2026
58 of 60 checks passed
@princepride (Contributor) commented

So can I use mp LMCache when I deploy GLM-5 now?

@maobaolong (Collaborator) commented

@princepride yes.

@princepride (Contributor) commented

> @princepride yes.

Thank you! Did you see that I joined the latest 月球大叔 live stream? I still remember that about a year ago we left comments under his channel 😊; it wasn't very long ago. BTW, I'd like to add you on WeChat.

@princepride (Contributor) commented

> @princepride yes.

I left comments on your Slack.

@deng451e mentioned this pull request on Apr 8, 2026
maobaolong added a commit to maobaolong/LMCache that referenced this pull request Apr 9, 2026
Signed-off-by: baoloongmao <baoloongmao@tencent.com>
princepride added a commit to princepride/LMCache that referenced this pull request Apr 9, 2026
Align with the rename introduced in LMCache#2926 where hidden_dim_size was
changed to hidden_dim_sizes (List[int]) to support kv_groups.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
princepride added a commit to princepride/LMCache that referenced this pull request Apr 9, 2026
Update test fixture and assertion in test_describe.py to match the
hidden_dim_size -> hidden_dim_sizes rename from LMCache#2926.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
princepride added a commit to princepride/LMCache that referenced this pull request Apr 9, 2026
Align with the rename introduced in LMCache#2926 where hidden_dim_size was
changed to hidden_dim_sizes (List[int]) to support kv_groups.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: princepride <wangzhipeng628@gmail.com>
princepride added a commit to princepride/LMCache that referenced this pull request Apr 9, 2026
Update test fixture and assertion in test_describe.py to match the
hidden_dim_size -> hidden_dim_sizes rename from LMCache#2926.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: princepride <wangzhipeng628@gmail.com>
maobaolong pushed a commit that referenced this pull request Apr 9, 2026
* fix typo bug

Signed-off-by: princepride <wangzhipeng628@gmail.com>

* fix: rename hidden_dim_size to hidden_dim_sizes in describe and server

Align with the rename introduced in #2926 where hidden_dim_size was
changed to hidden_dim_sizes (List[int]) to support kv_groups.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: princepride <wangzhipeng628@gmail.com>

* fix: update test fixture to use hidden_dim_sizes key

Update test fixture and assertion in test_describe.py to match the
hidden_dim_size -> hidden_dim_sizes rename from #2926.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: princepride <wangzhipeng628@gmail.com>

---------

Signed-off-by: princepride <wangzhipeng628@gmail.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Oasis-Git pushed a commit to Oasis-Git/LMCache that referenced this pull request Apr 13, 2026
…MCache#2926)

* multiprocess: support per-group KV cache transfer with group_idx

- gpu_ops: add group_idx param to lmcache_memcpy_async_h2d/d2h,
  use memory_obj.get_tensor(group_idx) instead of memory_obj.tensor
- kv_layer_groups: add build_kv_layer_groups_from_list() to group
  layers by (shape, dtype) from a plain tensor list
- gpu_context: introduce per-group shape_descs_, hidden_dim_sizes_,
  group_kv_pointers_, and tmp_gpu_buffers_; update get_kv_buffer_shape,
  get_tmp_gpu_buffer, get_tmp_gpu_buffer_batched to accept group_idx;
  add get_shape_desc(group_idx) and get_group_kv_pointers(group_idx)
- server: update get_layout_desc, _store_loop, _retrieve_loop to
  iterate over all groups; fix skip_tokens_in_chunk upper bound to
  use batch_len instead of _BATCH_SIZE

Signed-off-by: liuyumoye <adeline_ly2023@outlook.com>

* fix: support vectorized KV transfer for non-16B-aligned head sizes

Add scalar type fallback hierarchy for block KV transfer kernel:
  head_bytes % 16 == 0  -> uint4    (16B, fastest)
  head_bytes % 4  == 0  -> uint32_t (4B)
  head_bytes % 2  == 0  -> uint16_t (2B)

This fixes the runtime error for MLA models where head_size=132 (uint8),
giving head_bytes=132 which is not divisible by 16 but is divisible by 4.

Signed-off-by: liuyumoye <adeline_ly2023@outlook.com>

---------

Signed-off-by: liuyumoye <adeline_ly2023@outlook.com>
Co-authored-by: liuyumoye <adeline_ly2023@outlook.com>
Oasis-Git pushed a commit to Oasis-Git/LMCache that referenced this pull request Apr 13, 2026
ftian1 pushed a commit to ftian1/LMCache that referenced this pull request Apr 20, 2026
