Refactor: sequential key polling for distributed key synchronization#2478
hlin99 wants to merge 14 commits into LMCache:dev from
Conversation
Modified the cache lookup logic to handle synchronization delays in distributed environments. By introducing a polling mechanism, slow devices or lagging storage nodes are given a grace period to synchronize missing keys.

Key changes:
- Introduced `lookup_poll_key_timeout_ms` and `lookup_poll_key_intervals_ms` to configure retry behavior (both default to 0, maintaining the legacy one-shot behavior).
- Implemented a nested polling loop that ensures all layers of a specific chunk are synchronized (`hit_chunks == self.num_layers`) before advancing.
- Integrated `time.monotonic()` for timeout tracking that is robust against distributed clock shifts.

This ensures data consistency and prefix-matching integrity in scenarios where storage layers exhibit eventual consistency.

Signed-off-by: Tony Lin <tony.lin@intel.com>
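The polling loop described in the PR could be sketched roughly as follows. This is an illustrative reconstruction, not the actual LMCache code: `backend.contains`, `InMemoryBackend`, and the function name are assumptions; only `lookup_poll_key_timeout_ms`, `lookup_poll_key_intervals_ms`, `num_layers`, and the use of `time.monotonic()` come from the PR description.

```python
import time


def lookup_with_polling(backend, chunk_keys, num_layers,
                        timeout_ms=0, interval_ms=10):
    """Count leading chunks whose layers are all present.

    timeout_ms == 0 reproduces the legacy one-shot behavior:
    each chunk is checked once, with no grace period.
    """
    matched = 0
    for key in chunk_keys:
        deadline = time.monotonic() + timeout_ms / 1000
        while True:
            # A chunk counts as a hit only when every layer has synced.
            hit_layers = sum(
                1 for layer in range(num_layers)
                if backend.contains((key, layer))
            )
            if hit_layers == num_layers:
                matched += 1
                break
            if time.monotonic() >= deadline:
                # Prefix matching: stop at the first incomplete chunk.
                return matched
            time.sleep(interval_ms / 1000)
    return matched


class InMemoryBackend:
    """Toy stand-in for a remote store, for demonstration only."""

    def __init__(self, present=()):
        self.present = set(present)

    def contains(self, key):
        return key in self.present
```

With `timeout_ms=0` the deadline is already in the past after the first check, so each chunk is probed exactly once; a nonzero timeout gives a lagging store time to surface the remaining layers before the prefix match is cut short.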
Summary of Changes

Hello @hlin99, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request refactors the cache lookup mechanism to enhance robustness in distributed environments where storage layers might exhibit eventual consistency. By introducing a configurable polling strategy, the system can now gracefully handle temporary synchronization delays, ensuring that all necessary cache layers are available before a lookup completes, thereby improving data consistency and prefix-matching integrity.
Code Review
The pull request introduces a sequential key polling mechanism to improve data consistency in distributed environments, particularly for slow or lagging storage nodes. It adds lookup_poll_key_intervals_ms and lookup_poll_key_timeout_ms configuration parameters, defaulting to 0 to maintain existing one-shot lookup behavior. The implementation integrates time.monotonic() for robust timeout tracking. The changes are well-aligned with the stated goals of enhancing data consistency and prefix-matching integrity.
hi @maobaolong @sammshen, would you like to review and comment on whether this change can help increase the robustness and scalability of remote backends? I found such issues on the mooncake backend, where the receiver may have some latency before it can successfully poll the key. That's why adding a polling timeout may help. Many thanks!
maobaolong
left a comment
@hlin99 Would you like to add an introduction for the new configuration options to docs/source/api_reference/configurations.rst as well?
Sure. Will update. Thanks for the comments!
Signed-off-by: Tony Lin <tony.lin@intel.com>
The doc has been updated. Please review and let me know if anything else is needed. Thanks, @maobaolong.
Do we want this, or do we want lookup to fail fast?
hi @sammshen, in some cases a remote backend is combined with a proxy server as a PD (prefill/decode) system. The proxy server is not aware of when prefill has finished putting KV, so it cannot notify decode; it relies on decode polling KV from the remote. The latency is not high, but the current implementation without a poll timeout can fail the get-KV call. I encountered such an issue with mooncake + the vLLM proxy server, and this PR offers a mechanism in LMCache to overcome the issue if others encounter the same.
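The race described above can be reproduced in miniature: decode issues its lookup before prefill's put has landed in the remote store, so a one-shot check misses a key that a short poll would have found. Everything here is illustrative (a dict standing in for the remote store); none of these names are LMCache APIs.

```python
import threading
import time

store = {}
lock = threading.Lock()


def delayed_put(key, value, delay_s):
    """Simulate prefill's put arriving after some network latency."""
    time.sleep(delay_s)
    with lock:
        store[key] = value


def one_shot_lookup(key):
    """Legacy behavior: check once and give up."""
    with lock:
        return key in store


def polling_lookup(key, timeout_ms, interval_ms):
    """Proposed behavior: retry until the key appears or we time out."""
    deadline = time.monotonic() + timeout_ms / 1000
    while True:
        with lock:
            if key in store:
                return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_ms / 1000)


# Prefill's put lands ~50 ms from now; decode looks up immediately.
putter = threading.Thread(target=delayed_put, args=("kv", b"...", 0.05))
putter.start()
missed = one_shot_lookup("kv")   # put is still in flight
found = polling_lookup("kv", timeout_ms=500, interval_ms=10)
putter.join()
```

Here the one-shot lookup fails while the polling lookup succeeds well within its timeout, which is the behavior gap the PR's configuration is meant to close.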
Quoted diff context:

```python
else:
    break
# ...
time.sleep(self.lookup_poll_key_intervals_ms / 1000)
```
lookup is synchronous so this could delay the scheduler
Can you show a concrete example? Sorry, I don't fully understand.
hi @sammshen, whether the issue exists depends on how the PD proxy server is implemented; a diagram is shown below. If the proxy server is notified at the same time the KV store begins, the get-KV from decode may happen earlier than the KV is ready in the backend. Hopefully this clarifies the issue.
The issue is not caused by LMCache, and since the proxy server in the LMCache example is notified after the KV store completes, the issue won't appear there. But proxy servers can be implemented in different ways, so this PR offers a configuration that can fit any implementation by tuning the configs.
sammshen
left a comment
I'm hesitant to approve this PR due to the while loop on the critical lookup path. I see the motivation, but I don't see any issues directly motivating the need for this change. @maobaolong WDYT?
