
Fix: Safely handle layerwise cache shape dimensions in remote backend #2751

Merged
deng451e merged 37 commits into LMCache:dev from hlin99:ww11_PR_layerwise_remote
Apr 17, 2026

Conversation

@hlin99
Contributor

@hlin99 hlin99 commented Mar 12, 2026

Fixes #2752.

Pad the layerwise KV shape to 4D before transmission and strip the padding correctly in the remote backend.

This PR fixes the error below when layerwise=True with a remote backend:

(EngineCore_DP0 pid=45164) [2026-03-10 14:36:14,376] LMCache ERROR: Put task failed for key LayerCacheEngineKey(model_name='/workspace/Meta-Llama-3-8B-Instruct/', world_size=1, worker_id=0, chunk_hash=4132912831621080023, dtype=torch.bfloat16, request_configs=None, tags=None, _dtype_str='bfloat16', layer_id=8): Shape dimension should be 4 (remote_backend.py:196:lmcache.v1.storage_backend.remote_backend)
(EngineCore_DP0 pid=45164) [2026-03-10 14:36:14,377] LMCache ERROR: Put task failed for key LayerCacheEngineKey(model_name='/workspace/Meta-Llama-3-8B-Instruct/', world_size=1, worker_id=0, chunk_hash=4132912831621080023, dtype=torch.bfloat16, request_configs=None, tags=None, _dtype_str='bfloat16', layer_id=9): Shape dimension should be 4 (remote_backend.py:196:lmcache.v1.storage_backend.remote_backend)
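The pad/strip round-trip described above can be sketched as follows. The function names mirror the `pad_shape_to_4d` / `strip_shape_padding` helpers this PR ends up with, but the bodies here are illustrative only, not the actual LMCache implementation:

```python
def pad_shape_to_4d(shape: tuple) -> tuple:
    """Pad a 2D/3D layerwise KV shape to 4 ints with trailing zeros."""
    if len(shape) > 4:
        raise ValueError(f"Shape dimension should be <= 4, got {shape}")
    return tuple(shape) + (0,) * (4 - len(shape))

def strip_shape_padding(shape: tuple) -> tuple:
    """Drop the trailing zeros that were added as wire-format padding."""
    dims = list(shape)
    while dims and dims[-1] == 0:
        dims.pop()
    return tuple(dims)

# Round-trip: a 3D layerwise shape survives serialization unchanged,
# while a genuine 4D shape passes through untouched.
wire = pad_shape_to_4d((256, 2, 4096))        # (256, 2, 4096, 0)
assert strip_shape_padding(wire) == (256, 2, 4096)
assert pad_shape_to_4d((100, 2, 256, 128)) == (100, 2, 256, 128)
```

Note that the real helpers also have to handle the `[x, 0, 0, 0]` byte-object shapes discussed later in this thread, which is why trailing-zero stripping (rather than a stored rank) is the chosen convention.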


Note

Medium Risk
Touches wire-format shape serialization and byte sizing used by remote put/get; mistakes could corrupt data or break backward compatibility for existing remote caches. Changes are localized and covered by new unit tests for sub-4D shape round-trips.

Overview
Fixes remote-backend serialization and partial-chunk handling to work with layerwise KV caches and vLLM MLA.

Remote protocol messages (RemoteMetadata, ClientMetaMessage, ServerMetaMessage) now pad sub-4D shapes to 4 integers on write and strip trailing-zero padding on read, avoiding failures when layerwise caches are 2D/3D. Remote connector partial reads now compute the token dimension correctly for both 4D and layerwise shapes, and also update internal size accounting so that get_size()/byte_array match the truncated payload.

Separately, TensorMemoryObj.byte_array now uses the logical size (get_size()) instead of the raw buffer size to avoid leaking allocator padding, and the vLLM layerwise GPU connector relaxes KV-format assertions/allocations to support MLA (KV_MLA_FMT vs KV_T2D).

Reviewed by Cursor Bugbot for commit c19ec92.

@gemini-code-assist
Contributor

Summary of Changes


This pull request addresses a compatibility issue in the remote backend where 3D shapes used for layerwise caching needed to be adapted to a 4D protocol. The solution introduces explicit type checking for layerwise cache keys and helper functions to safely pad and unpad tensor shapes during data transmission and reception, ensuring correct data handling without relying on potentially ambiguous shape heuristics.

Highlights

  • Explicit Layerwise Cache Detection: The implementation now explicitly checks the cache key type (LayerCacheEngineKey) to determine if a cache is layerwise, rather than relying on shape heuristics. This prevents issues with legitimate non-layerwise 4D tensors that might have num_layers=1.
  • Shape Padding and Unpadding Helpers: Two new helper functions, _pad_shape_to_4d() and _unpad_shape_from_4d(), were added to handle the conversion between 3D layerwise cache shapes and the 4D shapes required by the remote backend protocol.
  • Remote Backend Integration: 3D layerwise shapes are now padded to 4D in LMCServerConnector.put() before transmission and unpadded back to their original dimensions in receive_all() upon receipt. The is_layerwise flag is correctly passed through put() and get() methods.
  • Backward Compatibility: The changes maintain backward compatibility with existing non-layerwise 4D shapes, ensuring no disruption to current functionality.


Changelog
  • lmcache/v1/storage_backend/connector/lm_connector.py
    • Imported LayerCacheEngineKey to enable explicit type checking for layerwise cache keys.
    • Added _pad_shape_to_4d and _unpad_shape_from_4d helper functions to manage shape transformations for protocol compatibility.
    • Modified receive_all to accept an is_layerwise parameter and utilize _unpad_shape_from_4d to restore original tensor shapes.
    • Updated put to determine if a key is layerwise, pad its shape to 4D using _pad_shape_to_4d, and send the padded shape.
    • Modified get to identify layerwise keys and pass this information to receive_all for correct shape handling.

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request correctly implements the padding and unpadding of tensor shapes for layerwise caching in the remote backend. By explicitly checking the key type (LayerCacheEngineKey), the changes avoid potential issues with shape-based heuristics. The new helper functions are well-defined, and their integration into LMCServerConnector's put and get methods is sound. My review includes one suggestion to improve code structure by encapsulating the new helper functions within the class that uses them.

Comment thread lmcache/v1/storage_backend/connector/lm_connector.py Outdated
@hlin99 hlin99 force-pushed the ww11_PR_layerwise_remote branch 2 times, most recently from 6ab791d to f658119 on March 12, 2026 05:56
@hlin99 hlin99 marked this pull request as draft March 24, 2026 11:34
Signed-off-by: Tony Lin <tony.lin@intel.com>
@hlin99 hlin99 force-pushed the ww11_PR_layerwise_remote branch from d5372b2 to 481aef3 on March 26, 2026 04:23
@hlin99 hlin99 marked this pull request as ready for review March 26, 2026 04:43
@sammshen
Contributor

@cursor review

Comment thread lmcache/v1/memory_management.py
Comment thread lmcache/v1/protocol.py Outdated
Comment thread lmcache/v1/gpu_connector/utils.py Outdated
@sammshen sammshen added the full Run comprehensive tests on this PR label Apr 1, 2026
Comment thread lmcache/v1/gpu_connector/utils.py
Comment thread lmcache/v1/protocol.py
hlin99 pushed a commit to hlin99/LMCache that referenced this pull request Apr 8, 2026
- Fix TensorMemoryObj.get_size() to use raw_data actual size instead of
  group_prefix_sum[-1], preventing out-of-bounds memory access when
  byte_array is called after reshape_partial_chunk truncates raw_data.
  group_prefix_sum is preserved for use by get_tensor(index).

- Refactor _pad_shape_to_4d and _strip_shape_padding from private static
  methods of RemoteMetadata to module-level functions (pad_shape_to_4d,
  strip_shape_padding), eliminating cross-class access to private methods
  from ClientMetaMessage and ServerMetaMessage.
Comment thread lmcache/v1/protocol.py Outdated
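The get_size() fix in the commit message above can be illustrated with a minimal stand-in class (not the real TensorMemoryObj, which wraps tensor buffers): after a partial-chunk truncation shrinks the underlying buffer, a stale prefix sum would report too many bytes, so sizing from the buffer itself is the safe choice.

```python
class TensorMemoryObj:
    """Illustrative stand-in for the real LMCache class."""

    def __init__(self, raw_data: bytearray, group_prefix_sum: list):
        self.raw_data = raw_data                  # flat byte buffer
        self.group_prefix_sum = group_prefix_sum  # kept for get_tensor(index)

    def get_size(self) -> int:
        # Before the fix this returned group_prefix_sum[-1]; once a
        # partial-chunk truncation shrinks raw_data, that stale value can
        # exceed the buffer and cause an out-of-bounds read. Using the
        # actual buffer size avoids that.
        return len(self.raw_data)

    @property
    def byte_array(self) -> memoryview:
        # Expose exactly get_size() bytes, never allocator padding.
        return memoryview(self.raw_data)[: self.get_size()]

obj = TensorMemoryObj(bytearray(32), group_prefix_sum=[32])
obj.raw_data = obj.raw_data[:8]   # simulate reshape_partial_chunk truncating
assert len(obj.byte_array) == 8   # stale prefix sum (32) is no longer used
```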
Collaborator

@DongDongJu DongDongJu left a comment


LGTM


num_tokens = len(slot_mapping_full)

mem_fmt = MemoryFormat.KV_MLA_FMT if self.use_mla else MemoryFormat.KV_T2D
Collaborator


👍

@DongDongJu
Collaborator

Hello Tony, thanks for the work.
One non-blocking question: pad_shape_to_4d now rejects mixed zero/non-zero shapes,
but the docstring only documents [x, 0, 0, 0] byte-object shapes.
Is this intentional?

Signed-off-by: Tony Lin <tony.lin@intel.com>
@hlin99
Contributor Author

hlin99 commented Apr 14, 2026

Hello Tony, thanks for the work. One non-blocking question: pad_shape_to_4d now rejects mixed zero/non-zero shapes, but the docstring only documents [x, 0, 0, 0] byte-object shapes. Is this intentional?

Hi @DongDongJu, thank you for pointing this out. The code logic was refined several times, but the docstring wasn't updated accordingly. I fixed it in the latest commit.

Collaborator

@deng451e deng451e left a comment


LGTM

@deng451e deng451e enabled auto-merge (squash) April 15, 2026 01:09
else:
# Layerwise 3D: [num_tokens, 2, hidden_dim]
# Layerwise MLA 2D: [num_tokens, hidden_dim]
token_dim = 0

Token dimension determined by shape length, not format

Medium Severity

reshape_partial_chunk infers token_dim solely from the number of shape dimensions rather than consulting the memory format. For any 3D shape (non-4D), it assumes token_dim = 0. This is correct for KV_2LTD ([num_tokens, 2, hidden_dim]) and MLA 2D, but wrong for KV_T2D ([2, num_tokens, hidden_dim]) where the token dimension is 1. The MemoryFormat.KV_T2D.token_dim() method confirms it returns 1. Using memory_obj.meta.fmt.token_dim() instead of branching on shape length would be safer and consistent with how other code resolves token positions.
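The reviewer's suggestion can be sketched as follows. MemoryFormat here is a minimal stand-in (the actual enum values in lmcache.v1.memory_management may differ), but it shows why resolving the token dimension from the format is safer than branching on len(shape):

```python
from enum import Enum

class MemoryFormat(Enum):
    KV_2LTD = 1     # [num_tokens, 2, hidden_dim]       -> token dim 0
    KV_T2D = 2      # [2, num_tokens, hidden_dim]       -> token dim 1
    KV_MLA_FMT = 3  # [num_tokens, hidden_dim] (MLA 2D) -> token dim 0

    def token_dim(self) -> int:
        return 1 if self is MemoryFormat.KV_T2D else 0

def truncated_shape(shape: tuple, fmt: MemoryFormat, num_tokens: int) -> tuple:
    """Truncate the token dimension of a partial chunk's shape."""
    dims = list(shape)
    dims[fmt.token_dim()] = num_tokens
    return tuple(dims)

# For KV_T2D, branching on len(shape) == 3 would wrongly pick dim 0;
# asking the format gives the correct dim 1.
assert truncated_shape((2, 256, 4096), MemoryFormat.KV_T2D, 100) == (2, 100, 4096)
assert truncated_shape((256, 2, 4096), MemoryFormat.KV_2LTD, 100) == (100, 2, 4096)
```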


Reviewed by Cursor Bugbot for commit d946696.

auto-merge was automatically disabled April 16, 2026 01:09

Head branch was pushed to by a user without write access

Signed-off-by: Tony Lin <tony.lin@intel.com>
Comment thread tests/v1/test_remote_metadata.py
@github-actions github-actions Bot removed the full Run comprehensive tests on this PR label Apr 16, 2026
Signed-off-by: Tony Lin <tony.lin@intel.com>

@cursor cursor Bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.

There are 2 total unresolved issues (including 1 from previous review).


Reviewed by Cursor Bugbot for commit bab0a7d.

Comment thread lmcache/v1/storage_backend/connector/base_connector.py Outdated
Signed-off-by: Tony Lin <tony.lin@intel.com>
@deng451e deng451e enabled auto-merge (squash) April 16, 2026 04:57
@deng451e deng451e added the full Run comprehensive tests on this PR label Apr 16, 2026
@deng451e deng451e merged commit 575a745 into LMCache:dev Apr 17, 2026
37 of 38 checks passed
@hlin99 hlin99 deleted the ww11_PR_layerwise_remote branch April 25, 2026 05:30

Labels

full Run comprehensive tests on this PR


Development

Successfully merging this pull request may close these issues.

layerwise is not working on remote backend

5 participants