[MP][Debuggability] Introduce status report subsystem for MP-mode#2699

Merged
ApostaC merged 6 commits into LMCache:dev from
ApostaC:local-dev/mp-state-report
Mar 9, 2026
Conversation

@ApostaC
Contributor

@ApostaC ApostaC commented Mar 5, 2026

What this PR does / why we need it:

Adds a composable report_status() -> dict interface across all MP-mode components for production debugging and introspection. Currently the only introspection tools are memcheck() (returns bool) and debug() (returns "OK"), which are insufficient for diagnosing issues in the multi-tier storage pipeline.

Each component implements report_status() returning a dict with is_healthy: bool plus component-specific metrics. Parents aggregate children's reports as nested dicts, with health propagating upward (any unhealthy child → parent unhealthy).
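As a rough sketch of this composition pattern (class and field names here are illustrative, not LMCache's actual API):

```python
# Illustrative sketch of the composable report_status() pattern:
# leaves report their own metrics, parents nest children's reports
# and propagate health upward. Names are hypothetical.
from typing import Any


class Leaf:
    """A leaf component reporting its own health and metrics."""

    def __init__(self, healthy: bool, objects: int):
        self._healthy = healthy
        self._objects = objects

    def report_status(self) -> dict[str, Any]:
        return {"is_healthy": self._healthy, "object_count": self._objects}


class Parent:
    """A parent aggregates children's reports as nested dicts; any
    unhealthy child makes the parent unhealthy."""

    def __init__(self, **children):
        self._children = children

    def report_status(self) -> dict[str, Any]:
        status: dict[str, Any] = {"is_healthy": True}
        for name, child in self._children.items():
            child_status = child.report_status()
            status[name] = child_status
            # Health propagates upward: one unhealthy child flips the parent.
            status["is_healthy"] &= child_status["is_healthy"]
        return status
```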

Components instrumented (bottom-up):

  • L1Manager: object counts, lock counts (write/read/temporary), memory usage, TTL config
  • L2 adapters: stored object count, locked keys, capacity (abstract method on interface; MockL2Adapter implemented)
  • StoreController: thread alive, pending keys, in-flight task count (via shadow counters — no new locks on critical path)
  • PrefetchController: thread alive, submission/pending/in-flight/completed queue sizes, phase breakdown (lookup vs load)
  • EvictionController: thread alive, policy config
  • StorageManager: aggregates all children
  • MPCacheEngine / BlendEngine: engine type, chunk size, hash algorithm, GPU contexts, active sessions + storage manager subtree

New HTTP endpoint: GET /api/status returns the full JSON status tree.
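The endpoint itself can be little more than a JSON dump of the root component's report. A stdlib-only sketch (the real http_server is likely structured differently; `make_handler` and `engine` are illustrative names):

```python
# Minimal sketch of a GET /api/status handler using only the stdlib.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def make_handler(engine):
    """Build a handler class bound to a component exposing report_status()."""

    class StatusHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path == "/api/status":
                body = json.dumps(engine.report_status()).encode()
                self.send_response(200)
                self.send_header("Content-Type", "application/json")
                self.send_header("Content-Length", str(len(body)))
                self.end_headers()
                self.wfile.write(body)
            else:
                self.send_error(404)

        def log_message(self, *args):
            # Keep the example quiet; the real server logs normally.
            pass

    return StatusHandler
```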

New CLI tool: python -m lmcache.tools.mp_status_viewer [--url URL] [--json] fetches and pretty-prints the status.

Leaf helpers: TokenHasher.hash_algorithm_name, SessionManager.active_count()

Usage example:

# Start LMCache MP mode with http server (default http port 8000)
python3 -m lmcache.v1.multiprocess.http_server --l1-size 70 --eviction-policy LRU 

# Pretty-printed view (default)
python -m lmcache.tools.mp_status_viewer

# Raw JSON (for scripting / monitoring)
python -m lmcache.tools.mp_status_viewer --json | jq '.storage_manager.l1_manager'

# Custom endpoint
python -m lmcache.tools.mp_status_viewer --url http://my-host:9000/api/status

# Or just curl the endpoint directly
curl -s localhost:8000/api/status | jq
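For scripting against the endpoint directly from Python, a stdlib-only fetch and pretty-printer might look like the following (the default URL matches the example above; function names are illustrative, not the viewer's actual internals):

```python
# Fetch the status tree and pretty-print it with indentation,
# using only the stdlib (no requests/rich dependency).
import json
import urllib.request


def fetch_status(url: str = "http://localhost:8000/api/status") -> dict:
    with urllib.request.urlopen(url, timeout=5) as resp:
        return json.loads(resp.read().decode())


def pretty_print(status: dict, indent: int = 0) -> None:
    for key, value in status.items():
        if isinstance(value, dict):
            # Nested child report: print the key, recurse deeper.
            print(" " * indent + f"{key}:")
            pretty_print(value, indent + 2)
        else:
            print(" " * indent + f"{key}: {value}")
```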


Special notes for your reviewers:

  • Shadow counters in StoreController and PrefetchController are updated in the background loop thread only — no new locks on the controller critical path. Existing lightweight locks (submission queue, results queue, listener) are reused for the few fields that need them.
  • L1Manager's report_status() uses the existing @l1_mgr_synchronized decorator and iterates _objects checking TTLLock.is_locked() per entry. This is O(n) but the endpoint is called infrequently (debug use).
  • The CLI viewer uses only stdlib (urllib.request, json) — no requests or rich dependency.
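The shadow-counter idea in the first bullet can be sketched as follows (field names are illustrative, not LMCache's actual attributes):

```python
# Shadow counters: plain ints mutated only by the background loop
# thread, read lock-free by report_status(). Readers may see slightly
# stale values, which is acceptable for a debug endpoint.
import threading


class StoreControllerSketch:
    def __init__(self):
        self._in_flight = 0   # shadow counter, loop thread only
        self._completed = 0   # shadow counter, loop thread only
        self._thread = threading.Thread(target=self._loop, daemon=True)

    def _loop(self):
        # Only this thread mutates the counters, so no new lock is
        # taken on the critical path.
        ...

    def report_status(self) -> dict:
        return {
            "is_healthy": self._thread.is_alive(),
            "in_flight_tasks": self._in_flight,
            "completed_tasks": self._completed,
        }
```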

If applicable:

  • this PR contains user-facing changes - docs added
  • this PR contains unit tests

ApostaC added 4 commits March 5, 2026 04:58
Signed-off-by: ApostaC <yihua98@uchicago.edu>
Signed-off-by: ApostaC <yihua98@uchicago.edu>
Signed-off-by: ApostaC <yihua98@uchicago.edu>
@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a comprehensive status reporting subsystem for MP-mode components, enhancing debuggability and introspection. It provides a structured way to monitor the health and performance of various components through a new HTTP endpoint and a CLI tool, facilitating easier diagnosis of issues in the multi-tier storage pipeline.

Highlights

  • Status Reporting Interface: Introduced a composable report_status() -> dict interface across all MP-mode components for production debugging and introspection.
  • Instrumentation: Instrumented key components like L1Manager, L2 adapters, StoreController, PrefetchController, EvictionController, and StorageManager to provide detailed status reports.
  • New HTTP Endpoint and CLI Tool: Added a new HTTP endpoint (GET /api/status) to return the full JSON status tree and a CLI tool (python -m lmcache.tools.mp_status_viewer) for fetching and displaying the status.


Changelog
  • Instrumentation
    • L1Manager
    • L2 adapters
    • StoreController
    • PrefetchController
    • EvictionController
    • StorageManager
    • MPCacheEngine / BlendEngine
  • New Features
    • Introduced report_status() interface for MP-mode components
    • Added HTTP endpoint /api/status
    • Created CLI tool lmcache.tools.mp_status_viewer
Activity
  • Implemented report_status() methods in various components
  • Created a new HTTP endpoint to expose the status information
  • Developed a CLI tool for easy status viewing
  • Added unit tests for the new functionality

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces a comprehensive status reporting subsystem, a valuable addition for debuggability and introspection of the multi-process cache components. The implementation is well-designed, utilizing a composable report_status() interface and shadow counters, and the changes are well-tested. However, a critical security concern has been identified: the new /api/status endpoint exposes sensitive internal state without any authentication or authorization, which could lead to information exposure or potential denial of service if the service is exposed to untrusted networks. Additionally, there is a minor suggestion to improve the robustness of the mp_status_viewer CLI tool.

Comment thread lmcache/v1/multiprocess/http_server.py
Comment thread lmcache/tools/mp_status_viewer/__main__.py
@ApostaC
Contributor Author

ApostaC commented Mar 5, 2026

TODO:

  • Give the status viewer tool a UI
  • Combine it with the telemetry viewer

@ApostaC
Contributor Author

ApostaC commented Mar 5, 2026

@maobaolong Please feel free to take a look at this and leave your thoughts. It's the health check functionality for multi-process mode.

@ApostaC ApostaC requested review from KuntaiDu and sammshen March 6, 2026 21:25
status["cb_registered_gpu_ids"] = list(self._cb_gpu_contexts.keys())
status["cb_gpu_context_meta"] = {
str(gpu_id): {"model_name": meta[0], "world_size": meta[1]}
for gpu_id, meta in self._cb_gpu_context_meta.items()
Contributor


if you modify the gpu context here you probably need to lock it up?

"registered_gpu_ids": list(self.gpu_contexts.keys()),
"gpu_context_meta": {
str(gpu_id): {"model_name": meta[0], "world_size": meta[1]}
for gpu_id, meta in self.gpu_context_meta.items()
Contributor


same comment here, gpu_context needs protection?

Contributor


@ApostaC is this lock not needed? like a CacheContext lock

Contributor

@sammshen sammshen left a comment


lgtm, just a small lock comment

Collaborator

@maobaolong maobaolong left a comment


@ApostaC This looks great!

I'd just suggest renaming it and making the HTTP server the default behavior; otherwise, LGTM.

Never mind; feel free to merge this PR first. A bundle of features based on this will come soon.

@ApostaC ApostaC enabled auto-merge (squash) March 9, 2026 19:13
@github-actions github-actions Bot added the full Run comprehensive tests on this PR label Mar 9, 2026
@ApostaC ApostaC merged commit 98c337d into LMCache:dev Mar 9, 2026
35 of 38 checks passed
shaoxiawjc pushed a commit to shaoxiawjc/LMCache that referenced this pull request Mar 11, 2026
…Cache#2699)

* [Add] status report
* [Add] tool to report status

Signed-off-by: ApostaC <yihua98@uchicago.edu>
Signed-off-by: shaoxiawjc <wjc2800@163.com>
realAaronWu pushed a commit to realAaronWu/LMCache that referenced this pull request Mar 20, 2026
…Cache#2699)

* [Add] status report
* [Add] tool to report status

Signed-off-by: ApostaC <yihua98@uchicago.edu>
Signed-off-by: Aaron Wu <aaron.wu@dell.com>
jooho-XCENA pushed a commit to xcena-dev/LMCache that referenced this pull request Apr 2, 2026
…Cache#2699)

* [Add] status report
* [Add] tool to report status

Signed-off-by: ApostaC <yihua98@uchicago.edu>