[MP][Telemetry] Hot-fix to enable the telemetry logging for store#2707
[MP][Telemetry] Hot-fix to enable the telemetry logging for store#2707ApostaC merged 2 commits intoLMCache:devfrom
Conversation
Signed-off-by: ApostaC <yihua98@uchicago.edu>
Summary of ChangesHello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request addresses a critical issue where telemetry logging for store operations in the CuPy stream caused the process to hang in certain environments. The changes re-enable this essential telemetry by introducing a safeguard that ensures a retrieve operation is always logged before any store operation, thereby preventing the hanging behavior and allowing for continued data collection. Highlights
🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console. Changelog
Activity
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
Signed-off-by: ApostaC <yihua98@uchicago.edu>
There was a problem hiding this comment.
Code Review
This pull request re-enables telemetry logging for store operations by introducing a _can_log_store flag, using a sensible workaround of waiting for the first retrieve operation to enable store logging to address a hanging issue. However, a critical security concern has been identified: the implementation relies on unvalidated client-provided identifiers (instance_id) to access sensitive GPU contexts, exposing the system to unauthorized access and potential data leakage between different client processes. Additionally, consider improving the comment for the new _can_log_store flag for clarity and cleaning up outdated comments in the store method regarding disabled telemetry.
| "store", | ||
| key.request_id, | ||
| stored_count=len(reserved_dict), | ||
| device=str(gpu_context.device), |
There was a problem hiding this comment.
The store method uses a client-provided instance_id to access GPU contexts from the self.gpu_contexts dictionary without any authentication or authorization checks. This allows a malicious client to perform operations or trigger telemetry events on a GPU context belonging to another process. This is particularly risky as the server is exposed via ZMQ over TCP without any built-in security, making it vulnerable to Insecure Direct Object Reference (IDOR) attacks.
| # FIX: fix the problem of telemetry logging in cupy stream | ||
| # Need to log a retrieve operation before logging any store | ||
| # operation in the cupy stream |
There was a problem hiding this comment.
This comment is a bit informal and could be more descriptive for future maintainers. It's good practice to explain the 'why' behind a workaround. Consider replacing it with a more detailed explanation of the underlying issue and how this flag resolves it.
| # FIX: fix the problem of telemetry logging in cupy stream | |
| # Need to log a retrieve operation before logging any store | |
| # operation in the cupy stream | |
| # Workaround for a hanging issue where logging a 'store' telemetry event | |
| # before any 'retrieve' event can cause the process to hang. This flag | |
| # delays logging 'store' events until after the first 'retrieve' operation. |
…Cache#2707) Signed-off-by: ApostaC <yihua98@uchicago.edu>
…Cache#2707) Signed-off-by: ApostaC <yihua98@uchicago.edu> Signed-off-by: shaoxiawjc <wjc2800@163.com>
…Cache#2707) Signed-off-by: ApostaC <yihua98@uchicago.edu> Signed-off-by: Aaron Wu <aaron.wu@dell.com>
…Cache#2707) Signed-off-by: ApostaC <yihua98@uchicago.edu>
…Cache#2707) Signed-off-by: ApostaC <yihua98@uchicago.edu>
What this PR does / why we need it:
Re-enables telemetry logging for store operations in the CuPy stream, which was previously disabled due to a hanging issue in certain environments.
The fix introduces a
_can_log_storeflag that gates store telemetry logging until after the first retrieve operation completes. This ensures a retrieve operation is always logged in the CuPy stream before any store operation, which avoids the hanging behavior.Special notes for your reviewers:
The root cause of the hang appears to be related to logging a store telemetry event in the CuPy stream before any retrieve event has been logged. The flag-based ordering constraint is a minimal fix that preserves the telemetry data while avoiding the problematic code path.
If applicable: