feat(kv_cache): enable asymmetric store/retrieve storages in PD backend #2509
deng451e merged 28 commits into LMCache:dev
Conversation
Remove the restriction that prevented using `save_decode_cache` and
`remote_backend` simultaneously in Prefill-Decode (PD) separation scenarios.
This change introduces `pd_retrieve_locations` and `pd_store_location`
parameters to decouple the KV cache retrieval and storage logic. This
enables an asymmetric cache flow:
1. Prefill nodes transmit KV cache to Decode nodes via the PDBackend.
2. Decode nodes write back their generated KV cache to a remote backend
for subsequent prefill reuse.
3. In multi-turn dialogue scenarios, subsequent prefill requests retrieve
historical KV cache from the remote backend, significantly increasing
Prefix Cache hit rates and reducing TTFT.
This decoupling provides greater flexibility for cross-instance cache
management and improves overall pipeline efficiency in distributed
inference.
[ Compute Layer ]
+----------------------+ +------------------+
| Prefill Node | ===============>| Decode Node |
| (Hit-Remote & GenKV) | (1) PDBackend | (Hit-PD & GenKV) |
+-------^--------------+ +-------+----------+
| |
: :
------------|-----------------------------------|------------
[ Storage Layer ] |
| | (2) pd_store_location
| (3) pd_retrieve_locations | (Decode -> Pool)
| (Pool -> Prefill) |
| v
+-------+--------------------------------------------+
| Distributed Storage Pool |
| [Node A] [Node B] [Node C] [Node D] |
| <======= (Object Storage / NFS / DFS) =======> |
+----------------------------------------------------+
Workflow:
1. Prefill -> Decode (PDBackend): Initial KV transfer for the current turn.
2. Decode -> Remote (Store): Decode saves updated KV to NFS for persistence.
3. Remote -> Prefill (Retrieve): Next-turn prefill pulls from Remote,
drastically increasing Prefix Cache hit rate for multi-turn dialogues.
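The workflow above maps to per-node LMCache configs. A minimal sketch, assuming the parameter names from this PR plus the `kv_role`/`save_decode_cache` points raised in the discussion below; the remote URL and backend name are illustrative placeholders, not values from the PR:

```yaml
# Prefill node: serve the current turn, retrieve historical KV from the pool
enable_pd: true
pd_retrieve_locations: ["RemoteBackend"]       # (3) Pool -> Prefill
remote_url: "redis://<storage-pool-host>:6379" # placeholder

# Decode node: receive KV via PDBackend, write generated KV back to the pool
enable_pd: true
kv_role: "kv_both"          # per the discussion, not kv_consumer
save_decode_cache: true
pd_store_location: "RemoteBackend"             # (2) Decode -> Pool
remote_url: "redis://<storage-pool-host>:6379" # placeholder
```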
Signed-off-by: Tony Lin <tony.lin@intel.com>
Summary of Changes
Hello @hlin99, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request significantly enhances the Prefill-Decode (PD) backend's flexibility by introducing new configuration options that allow asymmetric management of the Key-Value (KV) cache. By decoupling the storage and retrieval mechanisms, the system can now leverage remote backends more effectively, particularly in multi-turn dialogue scenarios. The change is designed to improve overall pipeline efficiency, boost Prefix Cache hit rates, and reduce Time To First Token (TTFT) by facilitating reuse of historical KV cache across stages and instances.
Code Review
This pull request introduces a significant feature to enable asymmetric save/remote storage in the PD backend, decoupling KV cache retrieval and storage logic. The changes in lmcache/v1/config.py correctly add the new pd_retrieve_locations and pd_store_location configurations and remove previous assertions that restricted this functionality. In lmcache/v1/storage_backend/storage_manager.py, these new configurations are integrated to serve as default locations for various cache operations. The implementation is logical and well-aligned with the detailed description. I have one suggestion for improvement in storage_manager.py to handle configuration initialization more robustly. Overall, this is a solid contribution that enhances flexibility for distributed inference.
Signed-off-by: Tony Lin <tony.lin@intel.com>
did you turn `SAVE_DECODE_CACHE` on?
can you run a minimal test (sending the same request twice with local_cpu: false) showing prefill is retrieving decoded tokens from remote?
Yes, SAVE_DECODE_CACHE is turned on. I also added a config example in the PR; please refer to https://github.com/LMCache/LMCache/pull/2509/changes#diff-071ef93157304173f68c621a88bf5a14394f9bd60c54bc502d061e2aee87b0dd The only config that is not in the PR: the decode kv_role needs to be kv_both rather than the current kv_consumer, since the PD backend is responsible for P->D.
Hi Samuel, I ran a similar test and collected logs for a multi-round conversation with LocalCPUBackend enabled. Environment
Round 1 (cold start)
Summary:
Prefill logs: Decode:
Decode logs: Round 2
Summary:
Prefill logs: Round 3
Summary:
Prefill logs: Conclusion
Hi @sammshen, I tried to turn off the CPU backend and leave only the remote backend as you suggested, but eventually I understood that the CPU backend must be on together with the remote backend; force-turning CPU off hits a separate issue (#2556), which I fixed in #2557. I think a separate PR for a different issue is preferred, but if you would like it merged into this one, let me know. Many thanks for your time reviewing PRs!
Hi @sammshen, thanks for your time reviewing. Are there any other questions or concerns you need me to address?
@feixiangpeng PTAL
Hi @hlin99, thank you for the proposal; I appreciate the idea and the effort behind it. I do have a few concerns regarding the current design and its impact on end-to-end performance.
In a distributed serving setup, it may be more effective to adaptively dispatch these memory-bound prefill requests directly to decoder nodes, where they can be better batched with similar workloads and where cache hits can be populated locally without remote traversal overhead. This could preserve both cache efficiency and the intended performance benefits of disaggregated execution.
Additionally, in your PD system, do you turn on prefix caching on prefill?
In addition, hitting those KV on decode nodes also requires this patch, since today save_decode_cache is off when PD is turned on, which means decode won't save any KV anywhere. Lastly, I want to highlight that this solution DOES NOT break anything we have; it just offers additional options. People can decide to turn it on or not depending on their platform. The remote backend is just my example, not the only option: if you look at the PR, people can select other backends as they want for their platform/system. And let's not assume the remote backend is always slow. AFAIK, many CSPs have very fast distributed storage systems for this kind of KV storage, and we pay a very small overhead (several ms) to buy big value (saving seconds).
In a coding scenario, for the Nth round of conversation, all KV up to round N-2 is hit in the local CPU backend (assuming its capacity is big enough) and round N-1 KV is hit in the remote backend, so the majority of KV is retrieved from the faster backend. Hope this eases the concern.
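The layering described here (older rounds in local CPU, the previous round in remote) amounts to an ordered lookup across backends. A minimal sketch with hypothetical names, not the actual LMCache code:

```python
def tiered_get(key, backends):
    """Try backends in priority order; return (backend_name, value) on the
    first hit, or (None, None) on a miss everywhere.

    `backends` is an ordered dict of name -> {key: value} stores,
    e.g. the local CPU backend first, then the remote backend.
    """
    for name, store in backends.items():
        if key in store:
            return name, store[key]
    return None, None


backends = {
    "LocalCPUBackend": {"round_1_kv": "kv-a", "round_2_kv": "kv-b"},  # rounds <= N-2
    "RemoteBackend": {"round_3_kv": "kv-c"},                          # round N-1
}

print(tiered_get("round_2_kv", backends))  # served by the faster local backend
print(tiered_get("round_3_kv", backends))  # falls through to the remote backend
```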
    storage manager) or has been stored (handled by storage backend).
    """
    # The dictionary from backend cname to objects and keys
    if location is None:
is location None if and only if pd?
To minimize the scope of this PR, right now the retrieve and store locations are effective only when the PD backend is turned on.
    # Search all backends for blocking get
    for backend_name, backend in self.get_active_storage_backends(location):
why are we passing in the pd_retrieve_locations in the general get?
Just to leverage what we have in the current design and avoid bigger arch-level changes.
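The defaulting being discussed (the general get searches all backends unless a location list narrows it) can be sketched like this; the function shape and names are illustrative, not the actual LMCache implementation:

```python
def get_active_storage_backends(backends, location=None):
    """Yield (name, backend) pairs, optionally filtered by a location list.

    With location=None the original behavior (search every backend) is kept;
    in the PD path the configured pd_retrieve_locations are passed in.
    """
    for name, backend in backends.items():
        if location is None or name in location:
            yield name, backend


backends = {"LocalCPUBackend": object(), "RemoteBackend": object(), "PDBackend": object()}

# Non-PD path: unchanged, searches everything.
all_names = [n for n, _ in get_active_storage_backends(backends)]

# PD path: restricted to the configured retrieve locations.
pd_names = [n for n, _ in get_active_storage_backends(backends, location=["RemoteBackend"])]
```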
    self.pd_store_location = None

    if self.enable_pd:
        # these params are only effective under PD
what about generalizing retrieve_locations and store_locations since you're passing it into get_active_storage_backends anyways?
Yes. I have two concerns about making the current changes effective only under PD, but I am happy to extend the scope of my PR if others do not share them:
- To minimize the change, avoid potential regressions, and avoid any arch-level changes, I leverage the interface lmcache already has today to break the limitation of not allowing decode tokens to be saved, making sure we add features without breaking anything we have today.
- The retrieve and store locations come from the general get and store, so they need not be limited to PD only. But I am not quite sure whether there is any history around these params, and I am a little worried about configuring them in general cases.
If the suggestion is to open these retrieve and store locations up in general, I am happy to do it and will update the PR. Please advise. Thx. @sammshen
It is unclean to have something PD-specific in the general-purpose get() and store().
Would it be possible to pass in the location every time we call storage_manager.get() and .store() in the PD case?
Got it, thanks for the good suggestion. Will update it.
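The suggestion (keep get()/store() generic and pass the location only at PD call sites) could look roughly like this; the class shape and method signatures are hypothetical, not the actual lmcache API:

```python
class StorageManager:
    """Toy storage manager: backends are plain name -> dict stores."""

    def __init__(self, backends):
        self.backends = backends

    def get(self, key, location=None):
        # Generic get: no PD-specific state; callers narrow the search if needed.
        for name, store in self.backends.items():
            if location is not None and name not in location:
                continue
            if key in store:
                return store[key]
        return None

    def store(self, key, value, location=None):
        # Generic store: write to the named backends, or the first one by default.
        targets = location or [next(iter(self.backends))]
        for name in targets:
            self.backends[name][key] = value


mgr = StorageManager({"LocalCPUBackend": {}, "RemoteBackend": {}})

# PD decode path: write generated KV only to the remote pool.
mgr.store("turn1_kv", b"kv-bytes", location=["RemoteBackend"])

# PD prefill path: retrieve only from the configured locations.
hit = mgr.get("turn1_kv", location=["RemoteBackend"])
```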
@sammshen Hi Sam, I updated the PR according to your suggestions. Could you take a look again? Thx.
Signed-off-by: Tony Lin <tony.lin@intel.com>
Hi @sammshen, the PR has 2 approvals, but it looks like we need a 2nd committer approval to meet the merge requirement. Can you help assign the right person? Many thanks.
@DongDongJu would you like to take a quick look at this PR?
Sorry for the late notice. Let me check it now.
DongDongJu
left a comment
Hello @hlin99,
Thanks for this updates!
From my understanding, the synchronous retrieve path needs a follow-up patch for _process_tokens_internal(), but this PR makes small enough patches to utilize the PD backend and remote backend at the same time.
Generally looks good to me. I left just one comment.
Signed-off-by: Tony Lin <tony.lin@intel.com>
Hi @sammshen, already rebased to the latest. Apart from the k3 failure, which is expected, and a unit-test failure that seems irrelevant to the PR, all other CIs passed. Let me know if anything is missing to merge this PR? Thanks.
…nd (LMCache#2509) * feat(kv_cache): enable asymmetric save/remote storage in PD backend Remove the restriction that prevented using `save_decode_cache` and `remote_backend` simultaneously in Prefill-Decode (PD) separation scenarios. This change introduces `pd_retrieve_locations` and `pd_store_location` parameters to decouple the KV cache retrieval and storage logic. This enables an asymmetric cache flow: 1. Prefill nodes transmit KV cache to Decode nodes via the PDBackend. 2. Decode nodes write back their generated KV cache to a remote backend for subsequent prefill reuse. 3. In multi-turn dialogue scenarios, subsequent prefill requests retrieve historical KV cache from the remote backend, significantly increasing Prefix Cache hit rates and reducing TTFT This decoupling provides greater flexibility for cross-instance cache management and improves overall pipeline efficiency in distributed inference. [ Compute Layer ] +----------------------+ +------------------+ | Prefill Node | ===============>| Decode Node | | (Hit-Remote & GenKV) | (1) PDBackend | (Hit-PD & GenKV) | +-------^--------------+ +-------+----------+ | | : : ------------|-----------------------------------|------------ [ Storage Layer ] | | | (2) pd_store_location | (3) pd_retrieve_locations | (Decode -> Pool) | (Pool -> Prefill) | | v +-------+--------------------------------------------+ | Distributed Storage Pool | | [Node A] [Node B] [Node C] [Node D] | | <======= (Object Storage / NFS / DFS) =======> | +----------------------------------------------------+ Workflow: 1. Prefill -> Decode (PDBackend): Initial KV transfer for the current turn. 2. Decode -> Remote (Store): Decode saves updated KV to NFS for persistence. 3. Remote -> Prefill (Retrieve): Next-turn prefill pulls from Remote, drastically increasing Prefix Cache hit rate for multi-turn dialogues. 
Signed-off-by: Tony Lin <tony.lin@intel.com> * small refactor Signed-off-by: Tony Lin <tony.lin@intel.com> * config examples for pd + remote backends Signed-off-by: Tony Lin <tony.lin@intel.com> * refactor: rename pd_retrieve_locations/pd_store_location to retrieve_locations/store_location Remove the PD-specific prefix to make the retrieve/store locations generic instead of being limited to PD only. This breaks the PD-only feature restriction and allows the mechanism to be reused by other roles/components. Signed-off-by: Tony Lin <tony.lin@intel.com> * move retrieve & store locations from storage manger to cache engine Signed-off-by: Tony Lin <tony.lin@intel.com> * add para validation check Signed-off-by: Tony Lin <tony.lin@intel.com> * config: replace hardcoded IP with placeholder in decoder remote configs Signed-off-by: Tony Lin <tony.lin@intel.com> * resolve conflicts and rebase to the latest Signed-off-by: Tony Lin <tony.lin@intel.com> * address review comments Signed-off-by: Tony Lin <tony.lin@intel.com> * add description in configurations.rst Signed-off-by: Tony Lin <tony.lin@intel.com> --------- Signed-off-by: Tony Lin <tony.lin@intel.com> Co-authored-by: deng451e <57919305+deng451e@users.noreply.github.com>
…nd (LMCache#2509) * feat(kv_cache): enable asymmetric save/remote storage in PD backend Remove the restriction that prevented using `save_decode_cache` and `remote_backend` simultaneously in Prefill-Decode (PD) separation scenarios. This change introduces `pd_retrieve_locations` and `pd_store_location` parameters to decouple the KV cache retrieval and storage logic. This enables an asymmetric cache flow: 1. Prefill nodes transmit KV cache to Decode nodes via the PDBackend. 2. Decode nodes write back their generated KV cache to a remote backend for subsequent prefill reuse. 3. In multi-turn dialogue scenarios, subsequent prefill requests retrieve historical KV cache from the remote backend, significantly increasing Prefix Cache hit rates and reducing TTFT This decoupling provides greater flexibility for cross-instance cache management and improves overall pipeline efficiency in distributed inference. [ Compute Layer ] +----------------------+ +------------------+ | Prefill Node | ===============>| Decode Node | | (Hit-Remote & GenKV) | (1) PDBackend | (Hit-PD & GenKV) | +-------^--------------+ +-------+----------+ | | : : ------------|-----------------------------------|------------ [ Storage Layer ] | | | (2) pd_store_location | (3) pd_retrieve_locations | (Decode -> Pool) | (Pool -> Prefill) | | v +-------+--------------------------------------------+ | Distributed Storage Pool | | [Node A] [Node B] [Node C] [Node D] | | <======= (Object Storage / NFS / DFS) =======> | +----------------------------------------------------+ Workflow: 1. Prefill -> Decode (PDBackend): Initial KV transfer for the current turn. 2. Decode -> Remote (Store): Decode saves updated KV to NFS for persistence. 3. Remote -> Prefill (Retrieve): Next-turn prefill pulls from Remote, drastically increasing Prefix Cache hit rate for multi-turn dialogues. 
Signed-off-by: Tony Lin <tony.lin@intel.com> * small refactor Signed-off-by: Tony Lin <tony.lin@intel.com> * config examples for pd + remote backends Signed-off-by: Tony Lin <tony.lin@intel.com> * refactor: rename pd_retrieve_locations/pd_store_location to retrieve_locations/store_location Remove the PD-specific prefix to make the retrieve/store locations generic instead of being limited to PD only. This breaks the PD-only feature restriction and allows the mechanism to be reused by other roles/components. Signed-off-by: Tony Lin <tony.lin@intel.com> * move retrieve & store locations from storage manger to cache engine Signed-off-by: Tony Lin <tony.lin@intel.com> * add para validation check Signed-off-by: Tony Lin <tony.lin@intel.com> * config: replace hardcoded IP with placeholder in decoder remote configs Signed-off-by: Tony Lin <tony.lin@intel.com> * resolve conflicts and rebase to the latest Signed-off-by: Tony Lin <tony.lin@intel.com> * address review comments Signed-off-by: Tony Lin <tony.lin@intel.com> * add description in configurations.rst Signed-off-by: Tony Lin <tony.lin@intel.com> --------- Signed-off-by: Tony Lin <tony.lin@intel.com> Co-authored-by: deng451e <57919305+deng451e@users.noreply.github.com> Signed-off-by: Aaron Wu <aaron.wu@dell.com>
…nd (LMCache#2509) * feat(kv_cache): enable asymmetric save/remote storage in PD backend Remove the restriction that prevented using `save_decode_cache` and `remote_backend` simultaneously in Prefill-Decode (PD) separation scenarios. This change introduces `pd_retrieve_locations` and `pd_store_location` parameters to decouple the KV cache retrieval and storage logic. This enables an asymmetric cache flow: 1. Prefill nodes transmit KV cache to Decode nodes via the PDBackend. 2. Decode nodes write back their generated KV cache to a remote backend for subsequent prefill reuse. 3. In multi-turn dialogue scenarios, subsequent prefill requests retrieve historical KV cache from the remote backend, significantly increasing Prefix Cache hit rates and reducing TTFT This decoupling provides greater flexibility for cross-instance cache management and improves overall pipeline efficiency in distributed inference. [ Compute Layer ] +----------------------+ +------------------+ | Prefill Node | ===============>| Decode Node | | (Hit-Remote & GenKV) | (1) PDBackend | (Hit-PD & GenKV) | +-------^--------------+ +-------+----------+ | | : : ------------|-----------------------------------|------------ [ Storage Layer ] | | | (2) pd_store_location | (3) pd_retrieve_locations | (Decode -> Pool) | (Pool -> Prefill) | | v +-------+--------------------------------------------+ | Distributed Storage Pool | | [Node A] [Node B] [Node C] [Node D] | | <======= (Object Storage / NFS / DFS) =======> | +----------------------------------------------------+ Workflow: 1. Prefill -> Decode (PDBackend): Initial KV transfer for the current turn. 2. Decode -> Remote (Store): Decode saves updated KV to NFS for persistence. 3. Remote -> Prefill (Retrieve): Next-turn prefill pulls from Remote, drastically increasing Prefix Cache hit rate for multi-turn dialogues. 
Signed-off-by: Tony Lin <tony.lin@intel.com> * small refactor Signed-off-by: Tony Lin <tony.lin@intel.com> * config examples for pd + remote backends Signed-off-by: Tony Lin <tony.lin@intel.com> * refactor: rename pd_retrieve_locations/pd_store_location to retrieve_locations/store_location Remove the PD-specific prefix to make the retrieve/store locations generic instead of being limited to PD only. This breaks the PD-only feature restriction and allows the mechanism to be reused by other roles/components. Signed-off-by: Tony Lin <tony.lin@intel.com> * move retrieve & store locations from storage manger to cache engine Signed-off-by: Tony Lin <tony.lin@intel.com> * add para validation check Signed-off-by: Tony Lin <tony.lin@intel.com> * config: replace hardcoded IP with placeholder in decoder remote configs Signed-off-by: Tony Lin <tony.lin@intel.com> * resolve conflicts and rebase to the latest Signed-off-by: Tony Lin <tony.lin@intel.com> * address review comments Signed-off-by: Tony Lin <tony.lin@intel.com> * add description in configurations.rst Signed-off-by: Tony Lin <tony.lin@intel.com> --------- Signed-off-by: Tony Lin <tony.lin@intel.com> Co-authored-by: deng451e <57919305+deng451e@users.noreply.github.com> put installation compatibility table into csv Signed-off-by: deng451e <838677410@qq.com> docs: make compat table scrollable Signed-off-by: deng451e <838677410@qq.com>
…nd (LMCache#2509) * feat(kv_cache): enable asymmetric save/remote storage in PD backend Remove the restriction that prevented using `save_decode_cache` and `remote_backend` simultaneously in Prefill-Decode (PD) separation scenarios. This change introduces `pd_retrieve_locations` and `pd_store_location` parameters to decouple the KV cache retrieval and storage logic. This enables an asymmetric cache flow: 1. Prefill nodes transmit KV cache to Decode nodes via the PDBackend. 2. Decode nodes write back their generated KV cache to a remote backend for subsequent prefill reuse. 3. In multi-turn dialogue scenarios, subsequent prefill requests retrieve historical KV cache from the remote backend, significantly increasing Prefix Cache hit rates and reducing TTFT This decoupling provides greater flexibility for cross-instance cache management and improves overall pipeline efficiency in distributed inference. [ Compute Layer ] +----------------------+ +------------------+ | Prefill Node | ===============>| Decode Node | | (Hit-Remote & GenKV) | (1) PDBackend | (Hit-PD & GenKV) | +-------^--------------+ +-------+----------+ | | : : ------------|-----------------------------------|------------ [ Storage Layer ] | | | (2) pd_store_location | (3) pd_retrieve_locations | (Decode -> Pool) | (Pool -> Prefill) | | v +-------+--------------------------------------------+ | Distributed Storage Pool | | [Node A] [Node B] [Node C] [Node D] | | <======= (Object Storage / NFS / DFS) =======> | +----------------------------------------------------+ Workflow: 1. Prefill -> Decode (PDBackend): Initial KV transfer for the current turn. 2. Decode -> Remote (Store): Decode saves updated KV to NFS for persistence. 3. Remote -> Prefill (Retrieve): Next-turn prefill pulls from Remote, drastically increasing Prefix Cache hit rate for multi-turn dialogues. 
Signed-off-by: Tony Lin <tony.lin@intel.com> * small refactor Signed-off-by: Tony Lin <tony.lin@intel.com> * config examples for pd + remote backends Signed-off-by: Tony Lin <tony.lin@intel.com> * refactor: rename pd_retrieve_locations/pd_store_location to retrieve_locations/store_location Remove the PD-specific prefix to make the retrieve/store locations generic instead of being limited to PD only. This breaks the PD-only feature restriction and allows the mechanism to be reused by other roles/components. Signed-off-by: Tony Lin <tony.lin@intel.com> * move retrieve & store locations from storage manger to cache engine Signed-off-by: Tony Lin <tony.lin@intel.com> * add para validation check Signed-off-by: Tony Lin <tony.lin@intel.com> * config: replace hardcoded IP with placeholder in decoder remote configs Signed-off-by: Tony Lin <tony.lin@intel.com> * resolve conflicts and rebase to the latest Signed-off-by: Tony Lin <tony.lin@intel.com> * address review comments Signed-off-by: Tony Lin <tony.lin@intel.com> * add description in configurations.rst Signed-off-by: Tony Lin <tony.lin@intel.com> --------- Signed-off-by: Tony Lin <tony.lin@intel.com> Co-authored-by: deng451e <57919305+deng451e@users.noreply.github.com> put installation compatibility table into csv Signed-off-by: deng451e <838677410@qq.com> docs: make compat table scrollable Signed-off-by: deng451e <838677410@qq.com>
…nd (LMCache#2509) * feat(kv_cache): enable asymmetric save/remote storage in PD backend Remove the restriction that prevented using `save_decode_cache` and `remote_backend` simultaneously in Prefill-Decode (PD) separation scenarios. This change introduces `pd_retrieve_locations` and `pd_store_location` parameters to decouple the KV cache retrieval and storage logic. This enables an asymmetric cache flow: 1. Prefill nodes transmit KV cache to Decode nodes via the PDBackend. 2. Decode nodes write back their generated KV cache to a remote backend for subsequent prefill reuse. 3. In multi-turn dialogue scenarios, subsequent prefill requests retrieve historical KV cache from the remote backend, significantly increasing Prefix Cache hit rates and reducing TTFT This decoupling provides greater flexibility for cross-instance cache management and improves overall pipeline efficiency in distributed inference. [ Compute Layer ] +----------------------+ +------------------+ | Prefill Node | ===============>| Decode Node | | (Hit-Remote & GenKV) | (1) PDBackend | (Hit-PD & GenKV) | +-------^--------------+ +-------+----------+ | | : : ------------|-----------------------------------|------------ [ Storage Layer ] | | | (2) pd_store_location | (3) pd_retrieve_locations | (Decode -> Pool) | (Pool -> Prefill) | | v +-------+--------------------------------------------+ | Distributed Storage Pool | | [Node A] [Node B] [Node C] [Node D] | | <======= (Object Storage / NFS / DFS) =======> | +----------------------------------------------------+ Workflow: 1. Prefill -> Decode (PDBackend): Initial KV transfer for the current turn. 2. Decode -> Remote (Store): Decode saves updated KV to NFS for persistence. 3. Remote -> Prefill (Retrieve): Next-turn prefill pulls from Remote, drastically increasing Prefix Cache hit rate for multi-turn dialogues. 
Signed-off-by: Tony Lin <tony.lin@intel.com> * small refactor Signed-off-by: Tony Lin <tony.lin@intel.com> * config examples for pd + remote backends Signed-off-by: Tony Lin <tony.lin@intel.com> * refactor: rename pd_retrieve_locations/pd_store_location to retrieve_locations/store_location Remove the PD-specific prefix to make the retrieve/store locations generic instead of being limited to PD only. This breaks the PD-only feature restriction and allows the mechanism to be reused by other roles/components. Signed-off-by: Tony Lin <tony.lin@intel.com> * move retrieve & store locations from storage manger to cache engine Signed-off-by: Tony Lin <tony.lin@intel.com> * add para validation check Signed-off-by: Tony Lin <tony.lin@intel.com> * config: replace hardcoded IP with placeholder in decoder remote configs Signed-off-by: Tony Lin <tony.lin@intel.com> * resolve conflicts and rebase to the latest Signed-off-by: Tony Lin <tony.lin@intel.com> * address review comments Signed-off-by: Tony Lin <tony.lin@intel.com> * add description in configurations.rst Signed-off-by: Tony Lin <tony.lin@intel.com> --------- Signed-off-by: Tony Lin <tony.lin@intel.com> Co-authored-by: deng451e <57919305+deng451e@users.noreply.github.com>
…nd (LMCache#2509) * feat(kv_cache): enable asymmetric save/remote storage in PD backend Remove the restriction that prevented using `save_decode_cache` and `remote_backend` simultaneously in Prefill-Decode (PD) separation scenarios. This change introduces `pd_retrieve_locations` and `pd_store_location` parameters to decouple the KV cache retrieval and storage logic. This enables an asymmetric cache flow: 1. Prefill nodes transmit KV cache to Decode nodes via the PDBackend. 2. Decode nodes write back their generated KV cache to a remote backend for subsequent prefill reuse. 3. In multi-turn dialogue scenarios, subsequent prefill requests retrieve historical KV cache from the remote backend, significantly increasing Prefix Cache hit rates and reducing TTFT This decoupling provides greater flexibility for cross-instance cache management and improves overall pipeline efficiency in distributed inference. [ Compute Layer ] +----------------------+ +------------------+ | Prefill Node | ===============>| Decode Node | | (Hit-Remote & GenKV) | (1) PDBackend | (Hit-PD & GenKV) | +-------^--------------+ +-------+----------+ | | : : ------------|-----------------------------------|------------ [ Storage Layer ] | | | (2) pd_store_location | (3) pd_retrieve_locations | (Decode -> Pool) | (Pool -> Prefill) | | v +-------+--------------------------------------------+ | Distributed Storage Pool | | [Node A] [Node B] [Node C] [Node D] | | <======= (Object Storage / NFS / DFS) =======> | +----------------------------------------------------+ Workflow: 1. Prefill -> Decode (PDBackend): Initial KV transfer for the current turn. 2. Decode -> Remote (Store): Decode saves updated KV to NFS for persistence. 3. Remote -> Prefill (Retrieve): Next-turn prefill pulls from Remote, drastically increasing Prefix Cache hit rate for multi-turn dialogues. 
Signed-off-by: Tony Lin <tony.lin@intel.com> * small refactor Signed-off-by: Tony Lin <tony.lin@intel.com> * config examples for pd + remote backends Signed-off-by: Tony Lin <tony.lin@intel.com> * refactor: rename pd_retrieve_locations/pd_store_location to retrieve_locations/store_location Remove the PD-specific prefix to make the retrieve/store locations generic instead of being limited to PD only. This breaks the PD-only feature restriction and allows the mechanism to be reused by other roles/components. Signed-off-by: Tony Lin <tony.lin@intel.com> * move retrieve & store locations from storage manger to cache engine Signed-off-by: Tony Lin <tony.lin@intel.com> * add para validation check Signed-off-by: Tony Lin <tony.lin@intel.com> * config: replace hardcoded IP with placeholder in decoder remote configs Signed-off-by: Tony Lin <tony.lin@intel.com> * resolve conflicts and rebase to the latest Signed-off-by: Tony Lin <tony.lin@intel.com> * address review comments Signed-off-by: Tony Lin <tony.lin@intel.com> * add description in configurations.rst Signed-off-by: Tony Lin <tony.lin@intel.com> --------- Signed-off-by: Tony Lin <tony.lin@intel.com> Co-authored-by: deng451e <57919305+deng451e@users.noreply.github.com>

