Sync lookup, and move prefetch to retrieve #2769
yoo-kumaneko wants to merge 2 commits into LMCache:dev
Conversation
Align the LMCache MP server's lookup/prefetch/retrieve pipeline with the OffloadingConnector design:

- Add a SYNC_LOOKUP protocol message: a blocking RPC that performs the L1 prefix scan and the L2 existence check (with pin) in a single round-trip, returning the hit count directly. This eliminates the QUERY_PREFETCH_STATUS polling loop from the scheduler hot path.
- Merge L2-to-L1 prefetch into RETRIEVE: when the RETRIEVE RPC arrives, it first executes any pending L2→L1 data movement (from the prior SYNC_LOOKUP), then performs the L1→GPU copy. This overlaps the full cache load with the forward pass of other scheduled requests.
- Add synchronous APIs to PrefetchController (synchronous_lookup, execute_load_phase, unlock_lookup_results) and StorageManager (synchronous_lookup_and_lock, execute_prefetch_load, unlock_l2_lookups).
- Update the adapter to use SYNC_LOOKUP and store hit counts directly instead of job IDs, removing the two-phase lookup/poll pattern (see the sketch below).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
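To make the new flow concrete, here is a minimal sketch of the single round-trip lookup and the merged retrieve path. `SYNC_LOOKUP`, `RETRIEVE`, and the hit-count semantics come from this PR; the `MPClient` transport, the function signatures, and the argument names are illustrative assumptions, not the actual LMCache API.

```python
from typing import Any, Dict, List, Protocol


class MPClient(Protocol):
    """Hypothetical blocking RPC transport to the LMCache MP server."""

    def call(self, op: str, **kwargs: Any) -> Any: ...


def scheduler_lookup(client: MPClient, request_id: str,
                     block_hashes: List[bytes],
                     hit_counts: Dict[str, int]) -> int:
    # One blocking round-trip: the server runs the L1 prefix scan and the
    # L2 existence check (pinning L2 entries) and returns the hit count.
    # No job id and no QUERY_PREFETCH_STATUS polling in the scheduler hot path.
    hits: int = client.call("SYNC_LOOKUP", hashes=block_hashes)
    hit_counts[request_id] = hits  # the adapter stores the count directly
    return hits


def worker_retrieve(client: MPClient, block_hashes: List[bytes]) -> int:
    # RETRIEVE first drains any pending L2->L1 movement recorded by the
    # prior SYNC_LOOKUP, then performs the L1->GPU copy, overlapping the
    # whole cache load with the forward pass of other scheduled requests.
    return client.call("RETRIEVE", hashes=block_hashes)
```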
… read lock count

In buffer-only mode with MLA (tp > world_size), the L2-to-L1 load phase acquired only one read lock per key, but the tp workers each independently called finish_read. The first release deleted the temporary object, and the remaining workers hit "finish read on non-existing key" warnings. Propagate extra_count from the pending lookup state through execute_prefetch_load → execute_load_phase → finish_write_and_reserve_read so that 1 + extra_count locks are acquired, matching the number of consuming workers (see the sketch below).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
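A hedged sketch of the fix: `extra_count`, `execute_load_phase`, `finish_read`, and `finish_write_and_reserve_read` are names from this PR, but the class shapes and bodies below are illustrative stand-ins, not the real LMCache implementation.

```python
import threading
from dataclasses import dataclass


@dataclass
class PendingLookup:
    """Illustrative stand-in for the pending lookup state kept per key."""
    key: str
    extra_count: int  # consuming tp workers beyond the first


class TempL1Object:
    """Toy temporary L1 object loaded from L2 during the prefetch load phase."""

    def __init__(self, read_locks: int) -> None:
        self._mu = threading.Lock()
        self._read_locks = read_locks
        self.freed = False

    def finish_read(self) -> None:
        # Each tp worker releases one read lock; the last release frees the
        # object. Before the fix only one lock existed, so every worker after
        # the first saw "finish read on non-existing key".
        with self._mu:
            self._read_locks -= 1
            if self._read_locks == 0:
                self.freed = True


def finish_write_and_reserve_read(key: str, extra_count: int = 0) -> TempL1Object:
    # One base lock plus extra_count, matching the number of consuming workers.
    return TempL1Object(read_locks=1 + extra_count)


def execute_load_phase(pending: PendingLookup) -> TempL1Object:
    # Propagate extra_count from the pending lookup state so that
    # 1 + extra_count read locks are acquired (buffer-only MLA mode,
    # tp > world_size).
    return finish_write_and_reserve_read(pending.key,
                                         extra_count=pending.extra_count)
```

With four tp workers consuming the same key, extra_count = 3 yields four read locks, so the temporary object is freed only after the fourth finish_read rather than the first.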
Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly refactors the cache lookup mechanism within LMCache, moving from an asynchronous, two-phase lookup protocol to a more streamlined synchronous approach. This change simplifies the client-side interaction by providing immediate hit-count results, while the server-side components now include dedicated synchronous methods for managing L1 and L2 cache lookups and data loading. The overall goal is to improve the efficiency and reduce the complexity of cache prefetching and retrieval operations.
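For contrast, a sketch of the two-phase pattern the PR removes. `QUERY_PREFETCH_STATUS` is the protocol message named in the PR; the phase-1 message name, the return shape, and the polling interval are assumptions made for illustration.

```python
import time
from typing import Any, List


def old_two_phase_lookup(client: Any, block_hashes: List[bytes]) -> int:
    # Phase 1 (assumed shape): enqueue an async lookup and get a job id back.
    job_id = client.call("LOOKUP", hashes=block_hashes)

    # Phase 2: poll QUERY_PREFETCH_STATUS from the scheduler hot path until
    # the prefetch resolves -- the round-trips SYNC_LOOKUP collapses to one.
    while True:
        done, hits = client.call("QUERY_PREFETCH_STATUS", job_id=job_id)
        if done:
            return hits
        time.sleep(0.001)  # busy-wait interval, purely illustrative
```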