[P/D][Nixl] Make kv cache register compatible with hybrid memory allocator#23079
Conversation
|
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either: Add 🚀 |
There was a problem hiding this comment.
Code Review
This pull request refactors the KV cache registration in NixlConnector to support a hybrid memory allocator by using KVCacheConfig. This simplifies the logic by removing device- and backend-specific inferences of the cache layout. The changes are well-aligned with the goal, but I've identified two issues. First, the register_kv_caches method in NixlConnector has a signature that is incompatible with its base class, which could cause runtime errors. Second, there's a critical assumption that all KV cache tensors are of the same size, which might not be true with a hybrid allocator and could lead to silent data corruption. I've provided suggestions to address both points.
|
cc @robertgshaw2-redhat @NickLucche @njhill PTAL |
|
Can you join #feat-hybrid-allocator-kv-connector in slack to collaborate on kv connector + hybrid allocator? |
|
cc @KuntaiDu |
There was a problem hiding this comment.
I think I like this initial refactoring, thanks for the work!
We still need to figure out the best logic+interface for sliding window attention layers. which is probably going to be the main thing about enabling hma here, but the block_len and kv cache sharing logic look good.
Can we add tests for the kv sharing case to the nixl suite?
Also, are all other nixl tests running fine with the changes?
Hey @NickLucche, thanks for the review! Before hma can be enabled in nixl connector, there is also work to update the start_load_kv. What is mainly missing is this part from the design doc: I added a unit test. For the integration test, is it mainly run_accuracy_test and run_edge_case_test? |
NickLucche
left a comment
There was a problem hiding this comment.
Thanks a lot for adding the test, looks good now!
For the integration test, is it mainly run_accuracy_test and run_edge_case_test
Yep we just have to make sure run_accuracy_test passes. Could you also test out if this PR works fine with heteroTP? Just set PREFILLER_TP_SIZE and DECODER_TP_SIZE.
I have very limited access these days sorry :(
I also left a comment about a small test refactoring in light of upcoming changes, hope that's ok.
Other than that this is LGTM.
Thanks @NickLucche! The unit test is updated now. I've run run_accuracy_test and validate it passes on prefill decode ratio (1,1) (2,2) (1,2) (2,4) (4,4). Please let me know if there is anything missing. |
NickLucche
left a comment
There was a problem hiding this comment.
LGTM, thanks for the patience @sfeng33 !
|
Thanks for the review! Since this PR requires approval from someone with write access, tagging @robertgshaw2-redhat and @njhill for a final look when you get a chance 🙏 |
Signed-off-by: sfeng33 <4florafeng@gmail.com>
…cator (vllm-project#23079) Signed-off-by: sfeng33 <4florafeng@gmail.com> Signed-off-by: root <xwq391974@alibaba-inc.com>
…cator (vllm-project#23079) Signed-off-by: sfeng33 <4florafeng@gmail.com>
…cator (vllm-project#23079) Signed-off-by: sfeng33 <4florafeng@gmail.com> Signed-off-by: Xiao Yu <xiao.yu@amd.com>
…cator (vllm-project#23079) Signed-off-by: sfeng33 <4florafeng@gmail.com>
…cator (vllm-project#23079) Signed-off-by: sfeng33 <4florafeng@gmail.com>
…cator (vllm-project#23079) Signed-off-by: sfeng33 <4florafeng@gmail.com>
…cator (vllm-project#23079) Signed-off-by: sfeng33 <4florafeng@gmail.com>
…cator (vllm-project#23079) Signed-off-by: sfeng33 <4florafeng@gmail.com>
…cator (vllm-project#23079) Signed-off-by: sfeng33 <4florafeng@gmail.com>
…cator (vllm-project#23079) Signed-off-by: sfeng33 <4florafeng@gmail.com>
…cator (vllm-project#23079) Signed-off-by: sfeng33 <4florafeng@gmail.com>
…cator (vllm-project#23079) Signed-off-by: sfeng33 <4florafeng@gmail.com>
Purpose
This PR refactors the register_kv_caches method for nixl_connector, so that it works with or without hybrid memory allocator (HMA).
Partially fix #22292.
Background
With HMA, for models with hybrid attention, there can be less the number of kv cache's physical tensors, e.g., gemma-3-4b-it's tensor count drops from 34 to 5, where different layers can the same kv cache tensor.
In nixl_connector's method register_kv_caches(), the related two functionalities are:
This PR
The full implementation is on L742-L761 in nixl_connector.py
Test Plan