
Fix: Safely handle layerwise cache shape dimensions in remote backend #2751

Merged
deng451e merged 37 commits into LMCache:dev from hlin99:ww11_PR_layerwise_remote
Apr 17, 2026

Conversation

@hlin99
Contributor

@hlin99 hlin99 commented Mar 12, 2026

Fixes #2752.

Pad the layerwise KV shape to 4D before transmission and strip the padding correctly in the remote backend.

This PR fixes the error below when layerwise=True with a remote backend:

(EngineCore_DP0 pid=45164) [2026-03-10 14:36:14,376] LMCache ERROR: Put task failed for key LayerCacheEngineKey(model_name='/workspace/Meta-Llama-3-8B-Instruct/', world_size=1, worker_id=0, chunk_hash=4132912831621080023, dtype=torch.bfloat16, request_configs=None, tags=None, _dtype_str='bfloat16', layer_id=8): Shape dimension should be 4 (remote_backend.py:196:lmcache.v1.storage_backend.remote_backend)
(EngineCore_DP0 pid=45164) [2026-03-10 14:36:14,377] LMCache ERROR: Put task failed for key LayerCacheEngineKey(model_name='/workspace/Meta-Llama-3-8B-Instruct/', world_size=1, worker_id=0, chunk_hash=4132912831621080023, dtype=torch.bfloat16, request_configs=None, tags=None, _dtype_str='bfloat16', layer_id=9): Shape dimension should be 4 (remote_backend.py:196:lmcache.v1.storage_backend.remote_backend)
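The pad/strip round-trip described above can be sketched as follows. The function names mirror the `pad_shape_to_4d` / `strip_shape_padding` helpers this PR ends up with, but the bodies here are illustrative only, not the actual LMCache implementation:

```python
def pad_shape_to_4d(shape: tuple) -> tuple:
    """Pad a 2D/3D layerwise KV shape to 4 ints with trailing zeros."""
    if len(shape) > 4:
        raise ValueError(f"Shape dimension should be <= 4, got {shape}")
    return tuple(shape) + (0,) * (4 - len(shape))

def strip_shape_padding(shape: tuple) -> tuple:
    """Drop the trailing zeros that were added as wire-format padding."""
    dims = list(shape)
    while dims and dims[-1] == 0:
        dims.pop()
    return tuple(dims)

# Round-trip: a 3D layerwise shape survives serialization unchanged,
# while a genuine 4D shape passes through untouched.
wire = pad_shape_to_4d((256, 2, 4096))        # (256, 2, 4096, 0)
assert strip_shape_padding(wire) == (256, 2, 4096)
assert pad_shape_to_4d((100, 2, 256, 128)) == (100, 2, 256, 128)
```

Note that the real helpers also have to handle the `[x, 0, 0, 0]` byte-object shapes discussed later in this thread, which is why trailing-zero stripping (rather than a stored rank) is the chosen convention.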


Note

Medium Risk
Touches wire-format shape serialization and byte sizing used by remote put/get; mistakes could corrupt data or break backward compatibility for existing remote caches. Changes are localized and covered by new unit tests for sub-4D shape round-trips.

Overview
Fixes remote-backend serialization and partial-chunk handling to work with layerwise KV caches and vLLM MLA.

Remote protocol messages (RemoteMetadata, ClientMetaMessage, ServerMetaMessage) now pad sub-4D shapes to 4 integers on write and strip trailing-zero padding on read, avoiding failures when layerwise caches are 2D/3D. Remote connector partial reads now compute the token dimension correctly for both 4D and layerwise shapes, and also update internal size accounting so that get_size()/byte_array match the truncated payload.

Separately, TensorMemoryObj.byte_array now uses the logical size (get_size()) instead of the raw buffer size to avoid leaking allocator padding, and the vLLM layerwise GPU connector relaxes KV-format assertions/allocations to support MLA (KV_MLA_FMT vs KV_T2D).

Reviewed by Cursor Bugbot for commit c19ec92.

@gemini-code-assist
Contributor

Summary of Changes


This pull request addresses a compatibility issue in the remote backend where 3D shapes used for layerwise caching needed to be adapted to a 4D protocol. The solution introduces explicit type checking for layerwise cache keys and helper functions to safely pad and unpad tensor shapes during data transmission and reception, ensuring correct data handling without relying on potentially ambiguous shape heuristics.

Highlights

  • Explicit Layerwise Cache Detection: The implementation now explicitly checks the cache key type (LayerCacheEngineKey) to determine if a cache is layerwise, rather than relying on shape heuristics. This prevents issues with legitimate non-layerwise 4D tensors that might have num_layers=1.
  • Shape Padding and Unpadding Helpers: Two new helper functions, _pad_shape_to_4d() and _unpad_shape_from_4d(), were added to handle the conversion between 3D layerwise cache shapes and the 4D shapes required by the remote backend protocol.
  • Remote Backend Integration: 3D layerwise shapes are now padded to 4D in LMCServerConnector.put() before transmission and unpadded back to their original dimensions in receive_all() upon receipt. The is_layerwise flag is correctly passed through put() and get() methods.
  • Backward Compatibility: The changes maintain backward compatibility with existing non-layerwise 4D shapes, ensuring no disruption to current functionality.


Changelog
  • lmcache/v1/storage_backend/connector/lm_connector.py
    • Imported LayerCacheEngineKey to enable explicit type checking for layerwise cache keys.
    • Added _pad_shape_to_4d and _unpad_shape_from_4d helper functions to manage shape transformations for protocol compatibility.
    • Modified receive_all to accept an is_layerwise parameter and utilize _unpad_shape_from_4d to restore original tensor shapes.
    • Updated put to determine if a key is layerwise, pad its shape to 4D using _pad_shape_to_4d, and send the padded shape.
    • Modified get to identify layerwise keys and pass this information to receive_all for correct shape handling.

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request correctly implements the padding and unpadding of tensor shapes for layerwise caching in the remote backend. By explicitly checking the key type (LayerCacheEngineKey), the changes avoid potential issues with shape-based heuristics. The new helper functions are well-defined, and their integration into LMCServerConnector's put and get methods is sound. My review includes one suggestion to improve code structure by encapsulating the new helper functions within the class that uses them.

Comment thread lmcache/v1/storage_backend/connector/lm_connector.py Outdated
@hlin99 hlin99 force-pushed the ww11_PR_layerwise_remote branch 2 times, most recently from 6ab791d to f658119 on March 12, 2026 05:56
@hlin99 hlin99 marked this pull request as draft March 24, 2026 11:34
Signed-off-by: Tony Lin <tony.lin@intel.com>
@hlin99 hlin99 force-pushed the ww11_PR_layerwise_remote branch from d5372b2 to 481aef3 on March 26, 2026 04:23
@hlin99 hlin99 marked this pull request as ready for review March 26, 2026 04:43
@sammshen
Contributor

@cursor review

Comment thread lmcache/v1/memory_management.py
Comment thread lmcache/v1/protocol.py Outdated
Comment thread lmcache/v1/gpu_connector/utils.py Outdated
@sammshen sammshen added the full Run comprehensive tests on this PR label Apr 1, 2026
Comment thread lmcache/v1/gpu_connector/utils.py
Comment thread lmcache/v1/protocol.py
hlin99 pushed a commit to hlin99/LMCache that referenced this pull request Apr 8, 2026
- Fix TensorMemoryObj.get_size() to use raw_data actual size instead of
  group_prefix_sum[-1], preventing out-of-bounds memory access when
  byte_array is called after reshape_partial_chunk truncates raw_data.
  group_prefix_sum is preserved for use by get_tensor(index).

- Refactor _pad_shape_to_4d and _strip_shape_padding from private static
  methods of RemoteMetadata to module-level functions (pad_shape_to_4d,
  strip_shape_padding), eliminating cross-class access to private methods
  from ClientMetaMessage and ServerMetaMessage.
Comment thread lmcache/v1/protocol.py Outdated
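The get_size() fix in the commit message above can be illustrated with a minimal stand-in class (not the real TensorMemoryObj, which wraps tensor buffers): after a partial-chunk truncation shrinks the underlying buffer, a stale prefix sum would report too many bytes, so sizing from the buffer itself is the safe choice.

```python
class TensorMemoryObj:
    """Illustrative stand-in for the real LMCache class."""

    def __init__(self, raw_data: bytearray, group_prefix_sum: list):
        self.raw_data = raw_data                  # flat byte buffer
        self.group_prefix_sum = group_prefix_sum  # kept for get_tensor(index)

    def get_size(self) -> int:
        # Before the fix this returned group_prefix_sum[-1]; once a
        # partial-chunk truncation shrinks raw_data, that stale value can
        # exceed the buffer and cause an out-of-bounds read. Using the
        # actual buffer size avoids that.
        return len(self.raw_data)

    @property
    def byte_array(self) -> memoryview:
        # Expose exactly get_size() bytes, never allocator padding.
        return memoryview(self.raw_data)[: self.get_size()]

obj = TensorMemoryObj(bytearray(32), group_prefix_sum=[32])
obj.raw_data = obj.raw_data[:8]   # simulate reshape_partial_chunk truncating
assert len(obj.byte_array) == 8   # stale prefix sum (32) is no longer used
```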
Collaborator

@DongDongJu DongDongJu left a comment


LGTM


num_tokens = len(slot_mapping_full)

mem_fmt = MemoryFormat.KV_MLA_FMT if self.use_mla else MemoryFormat.KV_T2D
Collaborator


👍

@DongDongJu
Collaborator

Hello Tony, thanks for the work.
One non-blocking question: pad_shape_to_4d now rejects mixed zero/non-zero shapes,
but the docstring only documents [x, 0, 0, 0] byte-object shapes.
Is this intentional?

Signed-off-by: Tony Lin <tony.lin@intel.com>
@hlin99
Contributor Author

hlin99 commented Apr 14, 2026

Hello Tony, thanks for the work. One non-blocking question: pad_shape_to_4d now rejects mixed zero/non-zero shapes, but the docstring only documents [x, 0, 0, 0] byte-object shapes. Is this intentional?

Hi @DongDongJu, thank you for pointing this out. The code logic was refined several times, but the docstring wasn't updated accordingly. I fixed it in the latest commit.

Collaborator

@deng451e deng451e left a comment


LGTM

@deng451e deng451e enabled auto-merge (squash) April 15, 2026 01:09
else:
# Layerwise 3D: [num_tokens, 2, hidden_dim]
# Layerwise MLA 2D: [num_tokens, hidden_dim]
token_dim = 0

Token dimension determined by shape length, not format

Medium Severity

reshape_partial_chunk infers token_dim solely from the number of shape dimensions rather than consulting the memory format. For any 3D shape (non-4D), it assumes token_dim = 0. This is correct for KV_2LTD ([num_tokens, 2, hidden_dim]) and MLA 2D, but wrong for KV_T2D ([2, num_tokens, hidden_dim]) where the token dimension is 1. The MemoryFormat.KV_T2D.token_dim() method confirms it returns 1. Using memory_obj.meta.fmt.token_dim() instead of branching on shape length would be safer and consistent with how other code resolves token positions.
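The reviewer's suggestion can be sketched as follows. MemoryFormat here is a minimal stand-in (the actual enum values in lmcache.v1.memory_management may differ), but it shows why resolving the token dimension from the format is safer than branching on len(shape):

```python
from enum import Enum

class MemoryFormat(Enum):
    KV_2LTD = 1     # [num_tokens, 2, hidden_dim]       -> token dim 0
    KV_T2D = 2      # [2, num_tokens, hidden_dim]       -> token dim 1
    KV_MLA_FMT = 3  # [num_tokens, hidden_dim] (MLA 2D) -> token dim 0

    def token_dim(self) -> int:
        return 1 if self is MemoryFormat.KV_T2D else 0

def truncated_shape(shape: tuple, fmt: MemoryFormat, num_tokens: int) -> tuple:
    """Truncate the token dimension of a partial chunk's shape."""
    dims = list(shape)
    dims[fmt.token_dim()] = num_tokens
    return tuple(dims)

# For KV_T2D, branching on len(shape) == 3 would wrongly pick dim 0;
# asking the format gives the correct dim 1.
assert truncated_shape((2, 256, 4096), MemoryFormat.KV_T2D, 100) == (2, 100, 4096)
assert truncated_shape((256, 2, 4096), MemoryFormat.KV_2LTD, 100) == (100, 2, 4096)
```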


Reviewed by Cursor Bugbot for commit d946696.

auto-merge was automatically disabled April 16, 2026 01:09

Head branch was pushed to by a user without write access

Signed-off-by: Tony Lin <tony.lin@intel.com>
Comment thread tests/v1/test_remote_metadata.py
@github-actions github-actions Bot removed the full Run comprehensive tests on this PR label Apr 16, 2026
Signed-off-by: Tony Lin <tony.lin@intel.com>

@cursor cursor Bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.

There are 2 total unresolved issues (including 1 from previous review).


Reviewed by Cursor Bugbot for commit bab0a7d.

Comment thread lmcache/v1/storage_backend/connector/base_connector.py Outdated
Signed-off-by: Tony Lin <tony.lin@intel.com>
@deng451e deng451e enabled auto-merge (squash) April 16, 2026 04:57
@deng451e deng451e added the full Run comprehensive tests on this PR label Apr 16, 2026
@deng451e deng451e merged commit 575a745 into LMCache:dev Apr 17, 2026
37 of 38 checks passed
@hlin99 hlin99 deleted the ww11_PR_layerwise_remote branch April 25, 2026 05:30

Labels

full Run comprehensive tests on this PR


Development

Successfully merging this pull request may close these issues.

layerwise is not working on remote backend

5 participants