Skip to content

[MP][Telemetry] Hot-fix to enable the telemetry logging for store#2707

Merged
ApostaC merged 2 commits intoLMCache:devfrom
ApostaC:local-dev/mp-telemetry-2
Mar 7, 2026
Merged

[MP][Telemetry] Hot-fix to enable the telemetry logging for store#2707
ApostaC merged 2 commits intoLMCache:devfrom
ApostaC:local-dev/mp-telemetry-2

Conversation

@ApostaC
Copy link
Copy Markdown
Contributor

@ApostaC ApostaC commented Mar 6, 2026

What this PR does / why we need it:

Re-enables telemetry logging for store operations in the CuPy stream, which was previously disabled due to a hanging issue in certain environments.

The fix introduces a _can_log_store flag that gates store telemetry logging until after the first retrieve operation completes. This ensures a retrieve operation is always logged in the CuPy stream before any store operation, which avoids the hanging behavior.

Special notes for your reviewers:

The root cause of the hang appears to be related to logging a store telemetry event in the CuPy stream before any retrieve event has been logged. The flag-based ordering constraint is a minimal fix that preserves the telemetry data while avoiding the problematic code path.

If applicable:

  • this PR contains user facing changes - docs added
  • this PR contains unit tests

Signed-off-by: ApostaC <yihua98@uchicago.edu>
@gemini-code-assist
Copy link
Copy Markdown
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses a critical issue where telemetry logging for store operations in the CuPy stream caused the process to hang in certain environments. The changes re-enable this essential telemetry by introducing a safeguard that ensures a retrieve operation is always logged before any store operation, thereby preventing the hanging behavior and allowing for continued data collection.

Highlights

  • Telemetry Re-enablement: Re-enabled telemetry logging for store operations in the CuPy stream, which was previously disabled due to a hanging issue.
  • Conditional Logging Mechanism: Introduced a _can_log_store flag to gate store telemetry logging, ensuring it only occurs after the first retrieve operation completes to prevent hangs.
  • Root Cause Mitigation: Implemented a minimal fix to preserve telemetry data by enforcing an ordering constraint, avoiding the problematic code path where store telemetry logging before a retrieve event caused process hangs.

🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console.

Changelog
  • lmcache/v1/multiprocess/server.py
    • Initialized a new instance variable _can_log_store to False in the __init__ method.
    • Modified the store method to conditionally log telemetry events (start and end) only when _can_log_store is True.
    • Updated the _retrieve_loop method to set _can_log_store to True after a successful retrieve operation.
Activity
  • No specific activity (comments, reviews, etc.) was provided in the context.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Signed-off-by: ApostaC <yihua98@uchicago.edu>
@ApostaC ApostaC added the full Run comprehensive tests on this PR label Mar 6, 2026
@YaoJiayi YaoJiayi self-requested a review March 6, 2026 23:17
Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request re-enables telemetry logging for store operations by introducing a _can_log_store flag, using a sensible workaround of waiting for the first retrieve operation to enable store logging to address a hanging issue. However, a critical security concern has been identified: the implementation relies on unvalidated client-provided identifiers (instance_id) to access sensitive GPU contexts, exposing the system to unauthorized access and potential data leakage between different client processes. Additionally, consider improving the comment for the new _can_log_store flag for clarity and cleaning up outdated comments in the store method regarding disabled telemetry.

"store",
key.request_id,
stored_count=len(reserved_dict),
device=str(gpu_context.device),
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

security-high high

The store method uses a client-provided instance_id to access GPU contexts from the self.gpu_contexts dictionary without any authentication or authorization checks. This allows a malicious client to perform operations or trigger telemetry events on a GPU context belonging to another process. This is particularly risky as the server is exposed via ZMQ over TCP without any built-in security, making it vulnerable to Insecure Direct Object Reference (IDOR) attacks.

Comment on lines +181 to +183
# FIX: fix the problem of telemetry logging in cupy stream
# Need to log a retrieve operation before logging any store
# operation in the cupy stream
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

This comment is a bit informal and could be more descriptive for future maintainers. It's good practice to explain the 'why' behind a workaround. Consider replacing it with a more detailed explanation of the underlying issue and how this flag resolves it.

Suggested change
# FIX: fix the problem of telemetry logging in cupy stream
# Need to log a retrieve operation before logging any store
# operation in the cupy stream
# Workaround for a hanging issue where logging a 'store' telemetry event
# before any 'retrieve' event can cause the process to hang. This flag
# delays logging 'store' events until after the first 'retrieve' operation.

Copy link
Copy Markdown
Contributor

@KuntaiDu KuntaiDu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@ApostaC ApostaC enabled auto-merge (squash) March 6, 2026 23:20
Copy link
Copy Markdown
Contributor

@sammshen sammshen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@ApostaC ApostaC merged commit ed7b386 into LMCache:dev Mar 7, 2026
25 of 29 checks passed
mauryaavinash95 pushed a commit to mauryaavinash95/LMCache that referenced this pull request Mar 7, 2026
shaoxiawjc pushed a commit to shaoxiawjc/LMCache that referenced this pull request Mar 11, 2026
…Cache#2707)

Signed-off-by: ApostaC <yihua98@uchicago.edu>
Signed-off-by: shaoxiawjc <wjc2800@163.com>
realAaronWu pushed a commit to realAaronWu/LMCache that referenced this pull request Mar 20, 2026
…Cache#2707)

Signed-off-by: ApostaC <yihua98@uchicago.edu>
Signed-off-by: Aaron Wu <aaron.wu@dell.com>
jooho-XCENA pushed a commit to xcena-dev/LMCache that referenced this pull request Apr 2, 2026
jooho-XCENA pushed a commit to xcena-dev/LMCache that referenced this pull request Apr 2, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

full Run comprehensive tests on this PR

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants