[HW: XPU] Enable Layerwise XPU Connector #2611
Conversation
Summary of Changes

Hello @slokesha, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request integrates a new layerwise XPU connector into the LMCache system, significantly enhancing memory management capabilities for XPU devices. This change enables a more granular and optimized approach to handling KV caches, facilitating more efficient data transfer and resource utilization, which is crucial for improving the performance of large language models on Intel GPUs.
Activity
Force-pushed d9d23e4 to 4dac694
Code Review
This pull request introduces a VLLMPagedMemLayerwiseXPUConnector to support layerwise KV cache transfers on Intel XPUs using pure PyTorch operations, along with corresponding tests. The implementation is a great step towards broader hardware support.
My review has identified a few areas for improvement:
- In `batched_from_gpu`, the device-side staging buffer for the `use_gpu=True` path is allocated but never used, which is a bug and a missed performance optimization.
- The same method also contains an inefficient data transfer pattern for `slot_mapping` inside a loop.
- The new tests for the layerwise XPU connector only cover the `use_gpu=False` case; coverage for the `use_gpu=True` path should be added.
Addressing these points will improve the correctness, performance, and robustness of the new connector.
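The second point above can be illustrated with a small sketch. This is not the connector's actual code — the `Device` class is a stand-in that only counts host-to-device copies — but it shows why hoisting the `slot_mapping` transfer out of the per-layer loop turns O(num_layers) copies into one:

```python
class Device:
    """Stand-in for an XPU device: counts host-to-device copies."""
    def __init__(self):
        self.h2d_copies = 0

    def to_device(self, data):
        self.h2d_copies += 1          # models a tensor .to(device) call
        return list(data)

def copy_layers_naive(dev, slot_mapping, num_layers):
    # Anti-pattern flagged in the review: the same slot_mapping is
    # re-transferred on every layer iteration.
    for _ in range(num_layers):
        sm = dev.to_device(slot_mapping)
        # ... scatter this layer's KV using sm ...

def copy_layers_hoisted(dev, slot_mapping, num_layers):
    # Fix: transfer once before the loop and reuse the device copy.
    sm = dev.to_device(slot_mapping)
    for _ in range(num_layers):
        pass  # ... scatter this layer's KV using sm ...

naive, hoisted = Device(), Device()
copy_layers_naive(naive, [0, 1, 2], num_layers=32)
copy_layers_hoisted(hoisted, [0, 1, 2], num_layers=32)
assert naive.h2d_copies == 32 and hoisted.h2d_copies == 1
```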
Force-pushed 4dac694 to 7b669e9
Force-pushed 88cbb4b to ecc1ce2
Force-pushed 239ff2b to ab30d10
Force-pushed 08f2b43 to 30546f3
Force-pushed 30546f3 to bd8e439
DongDongJu left a comment:
Hello @slokesha , Thanks for the great work!
I left a few questions.
Please add the following tests:
- Multi-chunk starts/ends (e.g., two or three segments) for both layerwise and non-layerwise
- A `use_gpu=True` test, or force `use_gpu` to false when XPU is enabled.
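A multi-chunk starts/ends test along the lines requested might look like this sketch. The `split_chunks` helper is a hypothetical stand-in for the connector's chunking logic, not the PR's actual test code:

```python
def split_chunks(num_tokens, chunk_size):
    """Stand-in for the connector's chunking: return (start, end) pairs."""
    return [(s, min(s + chunk_size, num_tokens))
            for s in range(0, num_tokens, chunk_size)]

def test_multi_chunk_starts_ends():
    # Two full chunks plus a partial third segment (256-token chunks).
    chunks = split_chunks(num_tokens=600, chunk_size=256)
    assert chunks == [(0, 256), (256, 512), (512, 600)]
    # Segments must tile the token range with no gaps or overlaps.
    assert chunks[0][0] == 0 and chunks[-1][1] == 600
    for (s0, e0), (s1, e1) in zip(chunks, chunks[1:]):
        assert e0 == s1

test_multi_chunk_starts_ends()
```

The same segment boundaries would then be fed through both the layerwise and non-layerwise store/retrieve paths to compare results.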
Force-pushed bf6e158 to 5a8f875
@DongDongJu @sammshen, the new tests cover `use_gpu=True` and multi-chunk scenarios in test_xpu_connector.py. I'm not completely sure whether the benchmark test is necessary for this PR, but I've kept it for now. Happy to remove it if you think it doesn't belong in the test suite.
DongDongJu left a comment:
Hello @slokesha,
I left a few comments.
Force-pushed a8f715b to 3edd105
Force-pushed eabd5dc to bf3d18d
Force-pushed f0336b7 to 08be1d2
DongDongJu left a comment:
Please make the following changes for the use_xpu cleanup!
LGTM
Also, please run pre-commit.
…MPagedMemLayerwiseXPUConnector

Add VLLMPagedMemLayerwiseXPUConnector class (from PR LMCache#2611) and helper functions (_split_token2d_kv, _get_head_size_view) to enable an apples-to-apples performance comparison between the two layerwise connector implementations.

Co-authored-by: ftian1 <16394660+ftian1@users.noreply.github.com>
Agent-Logs-Url: https://github.com/ftian1/LMCache/sessions/9d73c790-ce1d-4dad-a3c8-f9746d8f1fd6
…se_gpu to use_xpu in layerwise XPU connector

Signed-off-by: slokesha <slokeshappa@habana.ai>
Head branch was pushed to by a user without write access
Signed-off-by: Spurthi Lokeshappa <spurthi.lokeshappa@intel.com>
Signed-off-by: slokesha <slokeshappa@habana.ai>
@sammshen and @DongDongJu, can we merge this PR?
rerunning CI
* Added Layerwise XPU Connector
* Addressed PR comments
* Fixed k_tok error
* Added _get_head_size_view() to take GPUKVFormat enum
* Addressed PR comments
* Added multi_chunk_test
* xpu: fix CPU staging pin memory, disk retrieve deadlock, and rename use_gpu to use_xpu in layerwise XPU connector
* Fixed pre-commit

---------

Signed-off-by: slokesha <slokeshappa@habana.ai>
Signed-off-by: slokesha <spurthi.lokeshappa@intel.com>
Signed-off-by: Spurthi Lokeshappa <spurthi.lokeshappa@intel.com>
In layerwise retrieval with the LocalCPU backend, the unpin loop at cache_engine.py:1040-1042 was designed for disk-loaded staging objects (added in LMCache#2611). However, LocalCPUBackend.batched_get_non_blocking() returns the same Python object from hot_cache that lookup(pin=True) had already pinned, causing retrieve_layer() to unpin it once, and then wait_for_save() to unpin it again via lookup_unpin() (LMCache#2786). This double unpin drives pin_count to -1 and, more critically, triggers a premature free() of the memory object (unpin() calls free() when both pin_count <= 0 and ref_count <= 0).

Fix: guard the unpin with location != 'LocalCPUBackend' so that only disk-loaded staging objects (LocalDisk, Maru, etc.) are unpinned here. LocalCPU objects retain their pin until lookup_unpin() releases them in wait_for_save(), preserving the correct single-free lifecycle.

Fixes LMCache#2954

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Yiyang Liu <37043548+ianliuy@users.noreply.github.com>
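The lifecycle described in that fix can be modeled with a minimal sketch. This is not LMCache's actual MemoryObj — the class below only mirrors the pin/free rule stated in the commit message (free when both pin_count <= 0 and ref_count <= 0) — and `retrieve_layer_unpin` is a hypothetical stand-in for the guarded loop:

```python
class MemoryObj:
    """Toy model of a pinned, ref-counted cache object."""
    def __init__(self):
        self.pin_count = 0
        self.ref_count = 0
        self.freed = False

    def pin(self):
        self.pin_count += 1

    def unpin(self):
        self.pin_count -= 1
        # Rule from the commit message: unpin() frees once both
        # pin_count and ref_count drop to <= 0.
        if self.pin_count <= 0 and self.ref_count <= 0:
            self.freed = True

def retrieve_layer_unpin(objs, locations):
    """Guarded unpin loop: skip LocalCPUBackend objects, whose pin is
    released later by lookup_unpin() in wait_for_save()."""
    for obj, loc in zip(objs, locations):
        if loc != "LocalCPUBackend":
            obj.unpin()

# LocalCPU object: pinned once by lookup(pin=True), must survive until
# wait_for_save() releases it.
cpu_obj = MemoryObj()
cpu_obj.pin()
retrieve_layer_unpin([cpu_obj], ["LocalCPUBackend"])
assert not cpu_obj.freed          # guard prevented the premature free
cpu_obj.unpin()                   # models lookup_unpin() in wait_for_save()
assert cpu_obj.pin_count == 0 and cpu_obj.freed
```

Without the guard, the same object would be unpinned twice, hitting pin_count == -1 and freeing while the retrieve path still holds it.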
What this PR does / why we need it:
This PR adds full Layerwise KV cache support for XPU connectors, bringing feature parity with the CUDA connector implementation.
It enables LMCache's use_layerwise=True workflow for XPU devices.
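The benefit of the layerwise workflow can be sketched abstractly. The functions below are illustrative only (not LMCache's API): they show that a layerwise schedule interleaves per-layer loads with compute, so the transfer of layer i+1 can overlap with compute on layer i, instead of the whole KV cache being fetched before any compute starts:

```python
def retrieve_monolithic(num_layers):
    # Baseline: fetch all layers, then compute; nothing overlaps.
    return ([("load", i) for i in range(num_layers)]
            + [("compute", i) for i in range(num_layers)])

def retrieve_layerwise(num_layers):
    # Layerwise: each layer's load is immediately followed by its
    # compute, exposing load/compute overlap to the runtime.
    schedule = []
    for i in range(num_layers):
        schedule.append(("load", i))
        schedule.append(("compute", i))
    return schedule

assert retrieve_monolithic(2) == [
    ("load", 0), ("load", 1), ("compute", 0), ("compute", 1)]
assert retrieve_layerwise(2) == [
    ("load", 0), ("compute", 0), ("load", 1), ("compute", 1)]
```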