[MP][Core] Update the workflow for lookup to avoid busy loop #2710
ApostaC merged 6 commits into LMCache:dev from
Conversation
Signed-off-by: ApostaC <yihua98@uchicago.edu>
Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request addresses a critical performance issue in LMCache's lookup operation by replacing a blocking busy loop with a non-blocking, two-phase protocol. The server can now submit prefetch tasks and immediately return a job ID, letting clients poll asynchronously for results without monopolizing thread pool workers. The refactoring ensures that other essential blocking operations can execute concurrently, leading to a more responsive and efficient cache server architecture.
Activity
Code Review
This pull request refactors the lookup operation into a two-phase, non-blocking protocol, resolving the busy-loop issue so that a lookup no longer blocks a thread pool worker. However, the new stateful mechanism on the server introduces significant security vulnerabilities: resource exhaustion (a memory leak) and predictable resource identifiers (IDOR). Prefetch jobs are stored in a dictionary without a TTL or size limit, and their IDs are easily guessable, which could lead to cross-client interference and denial of service. Additionally, I suggest refactoring some duplicated code to improve maintainability.
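The submit-then-poll split described above can be illustrated with a minimal sketch. This is not the PR's actual `MPCacheEngine`; the `TwoPhaseLookup` class and its internals are invented here, with only the `lookup` / `query_prefetch_status` split taken from the PR:

```python
from concurrent.futures import Future, ThreadPoolExecutor


class TwoPhaseLookup:
    """Sketch of the two-phase protocol: LOOKUP returns a job ID at once;
    QUERY_PREFETCH_STATUS peeks at the background work without blocking."""

    def __init__(self) -> None:
        self._pool = ThreadPoolExecutor(max_workers=1)
        self._jobs: dict[int, Future] = {}
        self._next_id = 0

    def lookup(self, work) -> int:
        """Phase 1: submit the prefetch work and return a job ID immediately."""
        job_id = self._next_id
        self._next_id += 1
        self._jobs[job_id] = self._pool.submit(work)
        return job_id

    def query_prefetch_status(self, job_id: int):
        """Phase 2: non-blocking poll; returns None while work is pending."""
        fut = self._jobs.get(job_id)
        if fut is None or not fut.done():
            return None  # still running (or unknown); client polls again
        del self._jobs[job_id]  # exactly-once: entry removed on delivery
        return fut.result()
```

Because `query_prefetch_status` only inspects completed futures, it can run on the main loop without ever tying up a worker, which is the core of the fix.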
```python
# Prefetch job tracking for two-phase lookup
# TODO: implement periodic cleanup of stale _prefetch_jobs entries
# for crash resilience (e.g., client calls lookup but never queries)
self._prefetch_jobs: dict[int, _PrefetchJob] = {}
```
The _prefetch_jobs dictionary stores state for every LOOKUP request, but entries are only removed when query_prefetch_status is called and returns a non-None result. A malicious or buggy client can exhaust server memory by sending many LOOKUP requests without querying their status, leading to a Denial of Service (DoS). The PR description acknowledges this with a TODO, but the vulnerability is introduced by this change.
```python
job_id = self._next_prefetch_job_id
self._next_prefetch_job_id += 1
```
The prefetch_job_id is an incrementing integer, making it easily predictable. Since query_prefetch_status removes the job entry from the server's state upon a successful query, an attacker can guess job IDs and 'steal' the results of other clients' prefetch jobs. This results in a Denial of Service for the legitimate client, as their subsequent query will find no job entry and return 0.
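A minimal sketch of unguessable job IDs, assuming the server can move from a counter to random identifiers (the `JobIdAllocator` class here is invented for illustration):

```python
import secrets


class JobIdAllocator:
    """Allocates 128-bit random job IDs instead of an incrementing counter,
    so one client cannot guess (and consume) another client's job entry."""

    def __init__(self) -> None:
        self._live: set[int] = set()

    def allocate(self) -> int:
        while True:
            job_id = secrets.randbits(128)
            if job_id not in self._live:  # collisions are astronomically rare
                self._live.add(job_id)
                return job_id

    def release(self, job_id: int) -> None:
        self._live.discard(job_id)
```

`secrets.randbits` draws from a CSPRNG, so IDs cannot be predicted from previously observed ones, unlike the sequential scheme the comment describes.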
```diff
-return 0
+return self._register_prefetch_job(
+    _PrefetchJob(
+        handle=PrefetchHandle(
+            request_id=-1,
+            l1_prefix_hit_count=0,
+            total_requested_keys=0,
+            submit_time=time.monotonic(),
+        ),
+        world_size=1,
+        request_id=key.request_id,
+    )
+)
```
```python
# Prepare for the obj keys
ipc_keys.extend(key.to_hash_keys(self.token_hasher))
if not ipc_keys:
    log_telemetry(
        make_end_event(
            "lookup",
            key.request_id,
            error="no_ipc_keys_generated",
        )
    )
    return self._register_prefetch_job(
        _PrefetchJob(
            handle=PrefetchHandle(
                request_id=-1,
                l1_prefix_hit_count=0,
                total_requested_keys=0,
                submit_time=time.monotonic(),
            ),
            world_size=1,
            request_id=key.request_id,
        )
    )
```
There's duplicated code for creating and registering a dummy prefetch job in these error-handling paths. This can be refactored into a helper method to improve code clarity and maintainability.
You can add a helper method like this to the MPCacheEngine class:
```python
def _create_and_register_dummy_job(self, request_id: str) -> int:
    """Creates and registers a dummy prefetch job for error cases."""
    dummy_job = _PrefetchJob(
        handle=PrefetchHandle(
            request_id=-1,
            l1_prefix_hit_count=0,
            total_requested_keys=0,
            submit_time=time.monotonic(),
        ),
        world_size=1,
        request_id=request_id,
    )
    return self._register_prefetch_job(dummy_job)
```

Then, you can simplify the lookup method by calling this new helper:
```python
return self._create_and_register_dummy_job(key.request_id)
```

```python
# Prepare for the obj keys
ipc_keys.extend(key.to_hash_keys(self.token_hasher))
if not ipc_keys:
    return self._create_and_register_dummy_job(key.request_id)
```

Signed-off-by: Yihua Cheng <yihua98@uchicago.edu>
…#2710) * Separate lookup and prefetch to avoid busy loop Signed-off-by: Yihua Cheng <yihua98@uchicago.edu>
…#2710) * Separate lookup and prefetch to avoid busy loop Signed-off-by: Yihua Cheng <yihua98@uchicago.edu> Signed-off-by: shaoxiawjc <wjc2800@163.com>
…#2710) * Separate lookup and prefetch to avoid busy loop Signed-off-by: Yihua Cheng <yihua98@uchicago.edu> Signed-off-by: Aaron Wu <aaron.wu@dell.com>
What this PR does / why we need it:

Splits the MP lookup operation into a two-phase, non-blocking protocol to eliminate the busy loop in MPCacheEngine.lookup().

Previously, lookup() called StorageManager.submit_prefetch_task() and then busy-looped on query_prefetch_status() until the result was ready. With max_workers=1 (the default), this blocked the sole thread pool worker, preventing all other BLOCKING requests (STORE, RETRIEVE, FREE_LOOKUP_LOCKS, END_SESSION) from executing concurrently.

Now:

- LOOKUP submits the prefetch task, stores the PrefetchHandle server-side, and returns a prefetch job ID immediately, with no busy loop.
- QUERY_PREFETCH_STATUS returns int | None. It runs in the main ZMQ loop so it never consumes a thread pool worker, and it auto-cleans the job entry when a non-None result is returned (exactly-once semantics).

The client adapter (LMCacheMPSchedulerAdapter) blocks on LOOKUP to get the job ID in maybe_submit_lookup_request(), then sends blocking QUERY_PREFETCH_STATUS polls in check_lookup_result().

Special notes for your reviewers:

- The telemetry span is renamed from "lookup" to "lookup_and_prefetch" since the span now covers both phases.
- If a client calls LOOKUP but never calls QUERY_PREFETCH_STATUS, the _prefetch_jobs entry leaks. A TODO is left for periodic TTL-based cleanup in a future PR.
- cb_lookup_pre_computed still has its own internal busy loop; out of scope for this PR.
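The client-side polling flow described above might look like the following sketch. `check_lookup_result` here is a standalone illustration with an invented signature, not the adapter's real method; the polling interval and timeout are assumed parameters:

```python
import time


def check_lookup_result(query_status, job_id: int,
                        poll_interval: float = 0.01,
                        timeout: float = 5.0) -> int:
    """Repeatedly issue blocking QUERY_PREFETCH_STATUS round-trips until
    the server returns a non-None hit count (which also deletes the
    server-side job entry under exactly-once semantics)."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        result = query_status(job_id)  # one blocking request/reply
        if result is not None:
            return result
        time.sleep(poll_interval)  # yield between polls; no busy loop
    raise TimeoutError(f"prefetch job {job_id} did not finish in {timeout}s")
```

Sleeping between polls is what distinguishes this from the old busy loop: the client waits on the wire and the clock rather than spinning, and the server-side worker stays free for other blocking requests.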