Refactor PrometheusController into global singleton with self-registration #2659

Merged
KuntaiDu merged 5 commits into LMCache:dev from ApostaC:local-dev/observability-refactor
Mar 1, 2026

@ApostaC ApostaC commented Feb 28, 2026

Summary

Resolves #2650

  • Decouple PrometheusController from StorageControllerInterface: it is now a standalone class with no knowledge of specific managers
  • Global singleton pattern: init_prometheus_controller(config) creates the singleton at server startup; get_prometheus_controller() provides access from anywhere
  • Self-registration: each module (L1Manager, StorageManager) creates and registers its own stats logger via get_prometheus_controller().register_logger()
  • Lifecycle moved to server layer: run_cache_server() in both server.py and blend_server.py now owns init → start → stop
  • PrometheusConfig decoupled from StorageManagerConfig: config parsing uses its own add_prometheus_args() / parse_args_to_prometheus_config() helpers
  • Clean up StorageControllerInterface: remove storage_manager parameter since PrometheusController was the only consumer; EvictionController only needs l1_manager
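
A minimal sketch of the pattern described above, not the code from this PR. The names `PrometheusController`, `init_prometheus_controller`, `get_prometheus_controller`, `register_logger`, `L1Manager`, `StorageManager` and `run_cache_server` come from the summary. Everything else is an assumption made for illustration: the `StatsLogger` stand-in, the dict config, `registered_names()`, the `RuntimeError` on uninitialized access, and `StorageManager` constructing an `L1Manager`.

```python
import threading
from typing import List, Optional


class StatsLogger:
    """Hypothetical stand-in for a module's stats logger."""

    def __init__(self, name: str) -> None:
        self.name = name


class PrometheusController:
    """Standalone controller: it holds loggers and knows nothing about managers."""

    def __init__(self, config: dict) -> None:
        self.config = config
        self.running = False
        self._lock = threading.Lock()
        self._loggers: List[StatsLogger] = []

    def register_logger(self, logger: StatsLogger) -> None:
        # The lock keeps registration safe if modules are built on several threads.
        with self._lock:
            self._loggers.append(logger)

    def registered_names(self) -> List[str]:
        # Illustrative helper, not part of the API named in the summary.
        with self._lock:
            return [logger.name for logger in self._loggers]

    def start(self) -> None:
        self.running = True

    def stop(self) -> None:
        self.running = False


_controller: Optional[PrometheusController] = None


def init_prometheus_controller(config: dict) -> PrometheusController:
    """Create the process-wide instance; meant to be called once at startup."""
    global _controller
    _controller = PrometheusController(config)
    return _controller


def get_prometheus_controller() -> PrometheusController:
    # Assumption: fail loudly when init was skipped. The real accessor may differ.
    if _controller is None:
        raise RuntimeError("init_prometheus_controller() was not called")
    return _controller


class L1Manager:
    def __init__(self) -> None:
        # Self-registration: the module creates and registers its own logger.
        get_prometheus_controller().register_logger(StatsLogger("l1_manager"))


class StorageManager:
    def __init__(self) -> None:
        self.l1_manager = L1Manager()
        get_prometheus_controller().register_logger(StatsLogger("storage_manager"))


def run_cache_server(prometheus_config: dict) -> List[str]:
    """The server layer owns the lifecycle: init, then start, then stop."""
    controller = init_prometheus_controller(prometheus_config)
    StorageManager()  # engine creation; loggers register themselves here
    controller.start()
    try:
        return controller.registered_names()  # placeholder for serving requests
    finally:
        controller.stop()
```

In this sketch the controller never imports a manager, so adding a new instrumented module only requires a `register_logger()` call in that module's constructor.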

Test plan

  • pytest -xvs tests/v1/mp_observability/ — 47 passed
  • pytest -xvs tests/v1/distributed/ — 146 passed

@ApostaC ApostaC requested a review from royyhuang February 28, 2026 06:06
@gemini-code-assist

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly refactors the Prometheus observability stack by introducing a global singleton pattern for the PrometheusController. This change improves modularity and simplifies how different components integrate with Prometheus metrics. By centralizing the controller's lifecycle and enabling self-registration for metric loggers, the system becomes more robust and easier to extend, ensuring that Prometheus metrics are consistently collected and managed across the application.

Highlights

  • PrometheusController Refactor: The PrometheusController has been completely refactored into a standalone class, decoupled from StorageControllerInterface, and now operates as a global singleton. This change centralizes Prometheus metric management.
  • Self-Registration for Metrics: Modules like L1Manager and StorageManager no longer directly manage the Prometheus controller. Instead, they now self-register their respective stats loggers with the global PrometheusController instance, simplifying integration.
  • Centralized Lifecycle Management: The initialization, starting, and stopping of the PrometheusController have been moved to the server layer, specifically within run_cache_server in server.py and blend_server.py, ensuring consistent lifecycle management.
  • Decoupled Prometheus Configuration: Prometheus configuration (PrometheusConfig) has been separated from StorageManagerConfig, gaining its own dedicated argument parsing helpers (add_prometheus_args, parse_args_to_prometheus_config) for better modularity.
  • Observability Module Renaming: The distributed/observability directory has been renamed to mp_observability to better reflect its scope, and all related files (loggers, stats, documentation) have been moved accordingly.
  • Updated Testing Strategy: Existing Prometheus controller tests were removed and a new comprehensive test suite was added for the singleton PrometheusController. A new conftest.py fixture was introduced to mock Prometheus classes, preventing metric registry collisions in distributed tests.
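
To make the decoupled configuration concrete, here is a hedged sketch of what dedicated parsing helpers might look like. Only the names `PrometheusConfig`, `add_prometheus_args` and `parse_args_to_prometheus_config` appear in this PR. The fields, the flag names and the defaults below are hypothetical and are not taken from the LMCache source.

```python
import argparse
from dataclasses import dataclass


@dataclass
class PrometheusConfig:
    # Field names and defaults are illustrative assumptions.
    enabled: bool = True
    port: int = 9090
    log_interval: float = 10.0


def add_prometheus_args(parser: argparse.ArgumentParser) -> argparse.ArgumentParser:
    """Attach Prometheus flags to any parser, independent of storage flags."""
    group = parser.add_argument_group("prometheus")
    # Hypothetical flag names, chosen only for this sketch.
    group.add_argument("--prometheus-port", type=int, default=9090)
    group.add_argument("--prometheus-log-interval", type=float, default=10.0)
    group.add_argument("--disable-prometheus", action="store_true")
    return parser


def parse_args_to_prometheus_config(args: argparse.Namespace) -> PrometheusConfig:
    """Build the config from parsed args without touching StorageManagerConfig."""
    return PrometheusConfig(
        enabled=not args.disable_prometheus,
        port=args.prometheus_port,
        log_interval=args.prometheus_log_interval,
    )
```

A server entry point could then call `add_prometheus_args(parser)` next to its storage manager arguments and pass the resulting config to `run_cache_server`.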

Changelog
  • lmcache/v1/distributed/config.py
    • Removed the PrometheusConfig dataclass.
    • Removed the prometheus_config field from StorageManagerConfig.
    • Removed Prometheus-related argument parsing from add_storage_manager_args and parse_args_to_config.
  • lmcache/v1/distributed/l1_manager.py
    • Imported L1ManagerStatsLogger and get_prometheus_controller.
    • Implemented self-registration for L1ManagerStatsLogger with the global Prometheus controller in the constructor.
  • lmcache/v1/distributed/observability/prometheus_controller.py
    • Removed the old PrometheusController implementation.
  • lmcache/v1/distributed/storage_manager.py
    • Imported StorageManagerStatsLogger and get_prometheus_controller.
    • Removed direct instantiation and management of the PrometheusController.
    • Implemented self-registration for StorageManagerStatsLogger with the global Prometheus controller in the constructor.
    • Removed the explicit _prometheus_controller.stop() call during shutdown.
  • lmcache/v1/mp_observability/LOGGER_GUIDE.md
    • Renamed from lmcache/v1/distributed/observability/LOGGER_GUIDE.md.
    • Updated file paths to reflect the new mp_observability directory.
    • Revised Step 3 to describe the new self-registration pattern for loggers with the global PrometheusController.
  • lmcache/v1/mp_observability/METRICS.md
    • Renamed from lmcache/v1/distributed/observability/METRICS.md.
  • lmcache/v1/mp_observability/config.py
    • Added a new file defining PrometheusConfig dataclass.
    • Added DEFAULT_PROMETHEUS_CONFIG.
    • Added add_prometheus_args and parse_args_to_prometheus_config functions for argument parsing.
  • lmcache/v1/mp_observability/logger/integrator_stats_logger.py
    • Renamed from lmcache/v1/distributed/observability/logger/integrator_stats_logger.py.
  • lmcache/v1/mp_observability/logger/l1_stats_logger.py
    • Renamed from lmcache/v1/distributed/observability/logger/l1_stats_logger.py.
    • Updated import paths to mp_observability.
  • lmcache/v1/mp_observability/logger/l2_stats_logger.py
    • Renamed from lmcache/v1/distributed/observability/logger/l2_stats_logger.py.
    • Updated import paths to mp_observability.
  • lmcache/v1/mp_observability/logger/mp_server_logger.py
    • Renamed from lmcache/v1/distributed/observability/logger/mp_server_logger.py.
  • lmcache/v1/mp_observability/logger/prometheus_logger.py
    • Renamed from lmcache/v1/distributed/observability/logger/prometheus_logger.py.
  • lmcache/v1/mp_observability/logger/storage_manager_stats_logger.py
    • Renamed from lmcache/v1/distributed/observability/logger/storage_manager_stats_logger.py.
    • Updated import paths to mp_observability.
  • lmcache/v1/mp_observability/prometheus_controller.py
    • Added a new file implementing the refactored PrometheusController class.
    • Implemented register_logger for thread-safe logger registration.
    • Implemented start and stop methods for controller lifecycle.
    • Provided global singleton accessors: get_prometheus_controller and init_prometheus_controller.
  • lmcache/v1/mp_observability/stats/l1_stats.py
    • Renamed from lmcache/v1/distributed/observability/stats/l1_stats.py.
  • lmcache/v1/mp_observability/stats/mp_server_stats.py
    • Renamed from lmcache/v1/distributed/observability/stats/mp_server_stats.py.
  • lmcache/v1/mp_observability/stats/storage_manager_stats.py
    • Renamed from lmcache/v1/distributed/observability/stats/storage_manager_stats.py.
  • lmcache/v1/mp_observability/stats/vllm_integrator_stats.py
    • Renamed from lmcache/v1/distributed/observability/stats/vllm_integrator_stats.py.
  • lmcache/v1/multiprocess/blend_server.py
    • Imported new Prometheus configuration and controller functions.
    • Updated run_cache_server to accept prometheus_config.
    • Initialized the global Prometheus controller at the start of run_cache_server.
    • Started the Prometheus controller after engine creation.
    • Stopped the Prometheus controller during server shutdown.
    • Integrated Prometheus argument parsing in the main execution block.
  • lmcache/v1/multiprocess/http_server.py
    • Imported DEFAULT_PROMETHEUS_CONFIG and get_prometheus_controller.
    • Passed DEFAULT_PROMETHEUS_CONFIG to run_cache_server.
    • Stopped the Prometheus controller during HTTP server shutdown.
  • lmcache/v1/multiprocess/server.py
    • Imported new Prometheus configuration and controller functions.
    • Updated run_cache_server to accept prometheus_config.
    • Initialized the global Prometheus controller at the start of run_cache_server.
    • Updated Prometheus HTTP server start logic to use the new prometheus_config.
    • Started the Prometheus controller after engine creation.
    • Stopped the Prometheus controller during server shutdown.
    • Integrated Prometheus argument parsing in parse_args and the main execution block.
  • tests/v1/distributed/conftest.py
    • Added a new file with a pytest fixture to mock Prometheus classes, preventing metric registry collisions in distributed tests.
  • tests/v1/distributed/observability/test_prometheus_controller.py
    • Removed the old test file for the previous PrometheusController implementation.
  • tests/v1/mp_observability/test_l1_stats_logger.py
    • Renamed from tests/v1/distributed/observability/test_l1_stats_logger.py.
    • Updated import paths.
  • tests/v1/mp_observability/test_l2_stats_logger.py
    • Renamed from tests/v1/distributed/observability/test_l2_stats_logger.py.
    • Updated import paths.
  • tests/v1/mp_observability/test_prometheus_controller.py
    • Added a new test file for the refactored PrometheusController singleton, covering its new functionality and global access.
  • tests/v1/mp_observability/test_storage_manager_stats_logger.py
    • Renamed from tests/v1/distributed/observability/test_storage_manager_stats_logger.py.
    • Updated import paths.
Ignored Files
  • Ignored by pattern: .github/workflows/** (1)
    • .github/workflows/test.yml
Activity
  • The pull request introduces a significant refactoring of the Prometheus observability components.
  • The Prometheus controller's design was changed from a coupled component to a global singleton with self-registration capabilities.
  • Configuration and lifecycle management for Prometheus were centralized and decoupled from individual managers.
  • Several files related to observability were renamed and their import paths updated to reflect a new module structure.
  • New test files were added and old ones removed to align with the refactored Prometheus controller and ensure proper functionality.


@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces a significant and well-executed refactoring of the PrometheusController. Moving to a global singleton with a self-registration pattern decouples the observability logic from the core storage managers, which is a great improvement for modularity and maintainability. The lifecycle management (init, start, stop) is now cleanly handled at the server layer, and the configuration has been centralized. The updated tests, including the new fixtures for test isolation, are thorough and well-written. I have one suggestion to improve thread safety during shutdown, but overall this is an excellent change.

Comment on lines +64 to +65
    for pl in self.all_loggers:
        pl.unregister()

medium

There's a potential race condition here. The self.all_loggers list is being iterated over without holding the lock, while another thread could be calling register_logger() which modifies the list. This could lead to unexpected behavior during shutdown. To ensure thread safety, it's better to create a snapshot of the list under the lock before iterating.

Suggested change
-        for pl in self.all_loggers:
-            pl.unregister()
+        with self._lock:
+            # Create a snapshot for thread-safe iteration.
+            loggers_to_unregister = list(self.all_loggers)
+        for pl in loggers_to_unregister:
+            pl.unregister()
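
The snapshot-under-lock idea from this review comment can be tried in isolation with the sketch below. `Controller` and `FakeLogger` are hypothetical stand-ins and the return value of `stop()` exists only for illustration. Only `self._lock`, `self.all_loggers`, `register_logger()` and `unregister()` mirror names quoted in the review.

```python
import threading
from typing import List


class FakeLogger:
    """Hypothetical logger that records whether unregister() was called."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.unregistered = False

    def unregister(self) -> None:
        self.unregistered = True


class Controller:
    def __init__(self) -> None:
        self._lock = threading.Lock()
        self.all_loggers: List[FakeLogger] = []

    def register_logger(self, pl: FakeLogger) -> None:
        with self._lock:
            self.all_loggers.append(pl)

    def stop(self) -> int:
        # Copy the list while holding the lock, then release it before calling
        # into the loggers so unregister() never runs with the lock held.
        with self._lock:
            loggers_to_unregister = list(self.all_loggers)
        for pl in loggers_to_unregister:
            pl.unregister()
        return len(loggers_to_unregister)
```

One consequence of this design: a logger registered after the snapshot is taken is not unregistered by that `stop()` call, so registration during shutdown may still need to be ruled out by other means.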

Signed-off-by: ApostaC <yihua98@uchicago.edu>
@ApostaC ApostaC force-pushed the local-dev/observability-refactor branch from 277e103 to 1101896 Compare February 28, 2026 23:44
Signed-off-by: ApostaC <yihua98@uchicago.edu>
…age manager

Signed-off-by: ApostaC <yihua98@uchicago.edu>
@ApostaC ApostaC added the full Run comprehensive tests on this PR label Mar 1, 2026
…/LMCache into local-dev/observability-refactor

@sammshen sammshen left a comment


LGTM!


@KuntaiDu KuntaiDu left a comment


lgtm

@KuntaiDu KuntaiDu enabled auto-merge (squash) March 1, 2026 00:15
@KuntaiDu KuntaiDu merged commit ea1317a into LMCache:dev Mar 1, 2026
33 of 34 checks passed
hlin99 pushed a commit to hlin99/LMCache that referenced this pull request Mar 2, 2026
…ation (LMCache#2659)

* refactor the controller to become the singleton

Signed-off-by: ApostaC <yihua98@uchicago.edu>

* fix precommit issues

Signed-off-by: ApostaC <yihua98@uchicago.edu>

* rollback the storage controller interface and remove the need of storage manager

Signed-off-by: ApostaC <yihua98@uchicago.edu>

---------

Signed-off-by: ApostaC <yihua98@uchicago.edu>
Development

Successfully merging this pull request may close these issues.

[Refactor] Prometheus controller: global singleton with module self-registration