
Sync lookup, and move prefetch to retrieve#2769

Closed
yoo-kumaneko wants to merge 2 commits into LMCache:dev from yoo-kumaneko:sync-lookup-prefetch-on-retrieve-pre-rebase

Conversation

@yoo-kumaneko
Contributor

What this PR does / why we need it:

Special notes for your reviewers:

If applicable:

  • this PR contains user facing changes - docs added
  • this PR contains unit tests

root and others added 2 commits March 13, 2026 17:21
Align the LMCache MP server's lookup/prefetch/retrieve pipeline with
the OffloadingConnector design:

- Add SYNC_LOOKUP protocol: a blocking RPC that performs L1 prefix scan
  and L2 existence check (with pin) in a single round-trip, returning
  the hit count directly. Eliminates the QUERY_PREFETCH_STATUS polling
  loop from the scheduler hot path.

- Merge L2-to-L1 prefetch into RETRIEVE: when the RETRIEVE RPC arrives,
  it first executes any pending L2→L1 data movement (from the prior
  SYNC_LOOKUP), then performs the L1→GPU copy. This overlaps the full
  cache load with the forward pass of other scheduled requests.

- Add synchronous API to PrefetchController (synchronous_lookup,
  execute_load_phase, unlock_lookup_results) and StorageManager
  (synchronous_lookup_and_lock, execute_prefetch_load,
  unlock_l2_lookups).

- Update the adapter to use SYNC_LOOKUP and store hit counts directly
  instead of job IDs, removing the two-phase lookup/poll pattern.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
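The single-round-trip shape of SYNC_LOOKUP can be sketched as follows. This is an illustrative toy, not LMCache's actual server code: `RequestType`, `MiniCacheServer`, and the in-memory key sets are hypothetical stand-ins for the real protocol enum and storage layers.

```python
# Hypothetical sketch of the SYNC_LOOKUP round-trip: one blocking call that
# scans L1, checks (and pins) L2, and returns the hit count directly.
from dataclasses import dataclass, field
from enum import Enum, auto


class RequestType(Enum):
    RETRIEVE = auto()
    SYNC_LOOKUP = auto()  # new blocking RPC (illustrative enum)


@dataclass
class MiniCacheServer:
    l1_keys: set = field(default_factory=set)
    l2_keys: set = field(default_factory=set)
    pinned: set = field(default_factory=set)

    def sync_lookup(self, keys):
        """L1 prefix scan plus L2 existence check (with pin), in one call."""
        hits = 0
        for key in keys:
            if key in self.l1_keys:
                hits += 1
            elif key in self.l2_keys:
                self.pinned.add(key)  # pin so the object survives until RETRIEVE
                hits += 1
            else:
                break  # prefix lookup stops at the first miss
        return hits  # hit count returned directly; no QUERY_PREFETCH_STATUS polling
```

The point of the sketch is the return value: because the hit count comes back in the same round-trip, the scheduler never enters a polling loop.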
… read lock count

In buffer-only mode with MLA (tp > world_size), the L2-to-L1 load
phase only acquired 1 read lock per key, but tp workers each
independently called finish_read. The first release deleted the
temporary object; remaining workers hit "finish read on non-existing
key" warnings. Propagate extra_count from the pending lookup state
through execute_prefetch_load -> execute_load_phase ->
finish_write_and_reserve_read so that 1 + extra_count locks are
acquired, matching the number of consuming workers.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
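The lock-count bug above reduces to simple reference counting. The sketch below is a minimal model, not LMCache's implementation: `TempObject` and its method names mirror `finish_write_and_reserve_read`/`finish_read` from the commit message, but the surrounding storage machinery is elided.

```python
# Minimal model of the read-lock fix: reserve 1 + extra_count read locks so
# that every tp worker's finish_read finds a lock to release, and only the
# last release frees the temporary object.
class TempObject:
    def __init__(self):
        self.read_locks = 0
        self.deleted = False

    def finish_write_and_reserve_read(self, extra_count=0):
        # With only 1 lock, the first worker's release deletes the object
        # and later workers hit "finish read on non-existing key".
        self.read_locks = 1 + extra_count

    def finish_read(self):
        if self.deleted:
            raise KeyError("finish read on non-existing key")
        self.read_locks -= 1
        if self.read_locks == 0:
            self.deleted = True  # last reader frees the temporary object
```

With `extra_count = tp_workers - 1`, the lock count matches the number of consuming workers, which is exactly the invariant the commit restores.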
@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly refactors the cache lookup mechanism within LMCache, moving from an asynchronous, two-phase lookup protocol to a more streamlined synchronous approach. This change simplifies the client-side interaction by providing immediate hit count results, while the server-side components now include dedicated synchronous methods for managing L1 and L2 cache lookups and data loading. The overall goal is to improve the efficiency and reduce the complexity of cache prefetching and retrieval operations.

Highlights

  • Synchronous Lookup Protocol: Introduced a new SYNC_LOOKUP request type and protocol, replacing the previous two-phase asynchronous lookup mechanism. This new approach allows for a single round-trip to determine cache hit counts, simplifying the client-side logic and eliminating the need for polling.
  • Refactored VLLM Integration Adapter: The VLLMMultiProcessAdapter was updated to utilize the new synchronous lookup. This involved replacing the internal state tracking for job IDs and finished jobs with a direct mapping to hit counts, and modifying the lookup and result checking methods accordingly.
  • Synchronous Prefetch and Load in Storage Management: New synchronous methods (synchronous_lookup_and_lock, execute_prefetch_load, unlock_l2_lookups) were added to the StorageManager and PrefetchController to handle the L1 and L2 cache interactions. These methods perform blocking operations to ensure immediate results for the synchronous lookup flow.
  • Handoff Mechanism for Pending Lookups: A mechanism was implemented in the MPCacheEngine to save and retrieve pending lookup state (_PendingLookupState) between the SYNC_LOOKUP request and the subsequent RETRIEVE or FREE_LOOKUP_LOCKS operations. This ensures that L2 prefetch data can be correctly processed or released.
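The handoff in the last highlight can be sketched as a dict guarded by a lock, keyed by request ID, where state is written by SYNC_LOOKUP and consumed by whichever of RETRIEVE or FREE_LOOKUP_LOCKS arrives. This is a simplified model: `_PendingLookupState` and the method names follow the PR's description, but the fields and bodies here are illustrative.

```python
# Sketch of the pending-lookup handoff between SYNC_LOOKUP and the later
# RETRIEVE / FREE_LOOKUP_LOCKS request for the same request ID.
import threading
from dataclasses import dataclass


@dataclass
class _PendingLookupState:
    l2_hit_keys: list  # keys pinned in L2 during sync_lookup (illustrative field)


class MiniEngine:
    def __init__(self):
        self._pending_lookups = {}
        self._pending_lookups_lock = threading.Lock()
        self.loaded = []    # stand-in for L2 -> L1 data movement
        self.unpinned = []  # stand-in for released L2 pins

    def sync_lookup(self, req_id, l2_hit_keys):
        with self._pending_lookups_lock:
            self._pending_lookups[req_id] = _PendingLookupState(l2_hit_keys)

    def retrieve(self, req_id):
        with self._pending_lookups_lock:
            state = self._pending_lookups.pop(req_id, None)
        if state is not None:
            self.loaded.extend(state.l2_hit_keys)  # execute pending L2 -> L1 load
        # ...then perform the L1 -> GPU copy for this request...

    def free_lookup_locks(self, req_id):
        with self._pending_lookups_lock:
            state = self._pending_lookups.pop(req_id, None)
        if state is not None:
            self.unpinned.extend(state.l2_hit_keys)  # release pins without loading
```

Popping under the lock makes the two consumers mutually exclusive: pinned L2 data is either loaded by RETRIEVE or released by FREE_LOOKUP_LOCKS, never both.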


Changelog
  • lmcache/integration/vllm/vllm_multi_process_adapter.py
    • Replaced _lookup_job_ids and _finished_lookup_jobs with a single _lookup_hit_counts dictionary to store synchronous lookup results.
    • Updated maybe_submit_lookup_request to send a SYNC_LOOKUP request and directly store the returned hit count.
    • Simplified check_lookup_result to retrieve the hit count directly from _lookup_hit_counts without any network calls or polling.
    • Modified cleanup_lookup_result to simply remove the request ID from _lookup_hit_counts.
  • lmcache/v1/distributed/storage_controllers/prefetch_controller.py
    • Imported the time module for busy-polling in synchronous operations.
    • Added synchronous_lookup method to submit lookup_and_lock tasks to L2 adapters and busy-poll for their results.
    • Implemented execute_load_phase to synchronously perform the L2-to-L1 data loading, including plan selection, L1 buffer reservation, and L2 load task submission.
    • Introduced unlock_lookup_results and an internal _unlock_results helper to release L2 pins from synchronous lookups without proceeding to load.
  • lmcache/v1/distributed/storage_manager.py
    • Imported Bitmap for handling bitmap operations.
    • Added synchronous_lookup_and_lock to perform a combined L1 and L2 prefix lookup, pinning L2 objects and returning hit counts.
    • Implemented execute_prefetch_load to facilitate L2-to-L1 data movement based on prior synchronous lookup results.
    • Added unlock_l2_lookups to release L2 pins when a synchronous lookup is not followed by a load.
  • lmcache/v1/multiprocess/protocols/base.py
    • Added SYNC_LOOKUP as a new RequestType enum member.
  • lmcache/v1/multiprocess/protocols/engine.py
    • Exported SYNC_LOOKUP in the module's __all__ list.
    • Defined the SYNC_LOOKUP protocol, specifying its payload (KeyType, int) and response (int) and marking it as a BLOCKING handler type.
  • lmcache/v1/multiprocess/server.py
    • Imported Bitmap for use in lookup state.
    • Defined _PendingLookupState dataclass to store state for synchronous lookup handoff.
    • Initialized _pending_lookups dictionary and _pending_lookups_lock to manage synchronous lookup state.
    • Modified the retrieve method to check for and execute pending L2 prefetch loads from _pending_lookups.
    • Added sync_lookup method to handle the new SYNC_LOOKUP request, performing L1 and L2 lookups and storing pending state.
    • Updated free_lookup_locks to release any pending L2 pins stored in _pending_lookups.
    • Registered engine.sync_lookup as the handler for the SYNC_LOOKUP request type.
  • tests/v1/multiprocess/test_free_locks.py
    • Updated test assertions to reference _lookup_hit_counts instead of _lookup_job_ids.
    • Adjusted mock return values to reflect hit counts (chunks) instead of job IDs.
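The busy-poll pattern the changelog attributes to `synchronous_lookup` can be sketched as below. This is an assumption-laden illustration, not the PrefetchController's code: the adapters are modeled as plain callables, and the timeout and sleep interval are invented parameters.

```python
# Sketch of synchronous_lookup's busy-poll: submit lookup_and_lock work to
# every L2 adapter, then poll until all results are in, blocking the caller.
import time
from concurrent.futures import ThreadPoolExecutor


def synchronous_lookup(adapters, keys, poll_interval=0.001, timeout=5.0):
    """Fan out a lookup to each adapter and busy-poll for all results."""
    pool = ThreadPoolExecutor(max_workers=len(adapters))
    futures = [pool.submit(adapter, keys) for adapter in adapters]
    deadline = time.monotonic() + timeout
    while not all(f.done() for f in futures):
        if time.monotonic() > deadline:
            raise TimeoutError("L2 lookup did not finish in time")
        time.sleep(poll_interval)  # short sleep keeps the poll loop from spinning hot
    pool.shutdown()
    return [f.result() for f in futures]
```

Blocking here is deliberate: the synchronous flow trades a brief wait inside the RPC handler for the removal of the client-side poll loop.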

@gemini-code-assist
Contributor

Warning

Gemini encountered an error creating the review. You can try again by commenting /gemini review.

@yoo-kumaneko yoo-kumaneko changed the title Sync lookup prefetch on retrieve pre rebase Sync lookup and move prefetch to retrieve Mar 13, 2026
@yoo-kumaneko yoo-kumaneko changed the title Sync lookup and move prefetch to retrieve Sync lookup, and move prefetch to retrieve Mar 13, 2026
@yoo-kumaneko yoo-kumaneko marked this pull request as draft March 13, 2026 10:14
