[MP][Observability][2/3] Migrate L1 + SM to EventBus + OTel, remove old Prometheus pipeline #2794
Conversation
Summary of Changes: Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request significantly refactors the observability infrastructure for LMCache's multiprocess (MP) mode. It transitions the L1Manager and StorageManager components to a modern, event-driven system that uses an EventBus for pub/sub dispatch and OpenTelemetry for metrics instrumentation. This change streamlines how operational data is collected and exported, providing greater flexibility and adherence to industry standards while removing a legacy Prometheus-specific pipeline.
Activity
Code Review
This pull request is a significant and well-executed refactoring of the observability system, migrating from a custom Prometheus pipeline to a more flexible EventBus and OpenTelemetry-based architecture. The changes are clean, well-documented with a new README.md and a thorough design document, and correctly remove the old, now-redundant components. My feedback focuses on a few areas to further improve maintainability and testing robustness: strengthening the new metrics tests to assert on output, consolidating duplicated initialization logic, and clarifying the use of an internal OpenTelemetry API in the logging subscribers. Overall, this is a high-quality change that modernizes the observability stack.
```python
def test_read_finished_increments_counter(self, bus, subscriber):
    bus.start()
    bus.publish(_make_event(EventType.L1_READ_FINISHED, _make_keys(5)))
    time.sleep(0.15)
    bus.stop()
    # Verify the counter was called — OTel counters are real objects,
    # we check via the internal measurement
    # (in a real integration test we'd scrape /metrics)
```
This test verifies that publishing an event doesn't cause a crash, but as the comment notes, it doesn't assert that the OTel counter is actually incremented. To make these tests more robust and provide stronger guarantees that metrics are being correctly generated, you could use an InMemoryMetricReader from the OTel SDK. This would allow you to collect the emitted metrics in memory and assert on their values.
```python
def test_read_prefetched_increments_counters(self, bus, subscriber):
    bus.start()
    bus.publish(
        _make_sm_event(
            EventType.SM_READ_PREFETCHED,
            succeeded=_make_keys(3),
            failed=_make_keys(2),
        )
    )
    time.sleep(0.15)
    bus.stop()
```
This test verifies that publishing an event doesn't cause a crash, but it doesn't assert that the OpenTelemetry counters are actually incremented. To make this test more robust, you could use an InMemoryMetricReader from the OTel SDK to collect the emitted metrics and then assert on their values (e.g., that the request counter was incremented by 1, succeeded keys by 3, and failed keys by 2). This would provide stronger guarantees that the metrics are being correctly generated.
```python
# Initialize EventBus and register observability subscribers
# First Party
from lmcache.v1.mp_observability.event_bus import (
    EventBusConfig,
    init_event_bus,
)
from lmcache.v1.mp_observability.subscribers.metrics.l1 import (
    L1MetricsSubscriber,
)
from lmcache.v1.mp_observability.subscribers.metrics.sm import (
    SMMetricsSubscriber,
)

# Set up OTel MeterProvider BEFORE creating subscribers so that
# module-level get_meter() calls bind to the real provider
if prometheus_config.enabled:
    prometheus_client.start_http_server(prometheus_config.port)
    logger.info(
        "Prometheus metrics available at http://0.0.0.0:%d/metrics",
        prometheus_config.port,
    )
    # First Party
    from lmcache.v1.mp_observability.otel_init import init_otel_metrics

    init_otel_metrics(prometheus_port=prometheus_config.port)

bus = init_event_bus(EventBusConfig(enabled=prometheus_config.enabled))
bus.register_subscriber(L1MetricsSubscriber())
bus.register_subscriber(SMMetricsSubscriber())
bus.start()
```
This block of code for initializing the event bus, OTel, and subscribers is duplicated across server.py, blend_server.py, and blend_server_v2.py. To improve maintainability and reduce code duplication, consider refactoring this logic into a dedicated helper function within the lmcache.v1.mp_observability module. This function could encapsulate the setup process, making the server startup files cleaner and easier to manage.
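One possible shape for such a helper, sketched with dependency injection so the servers can pass the real `init_event_bus` and subscriber instances. All names here (`init_mp_observability`, the stub bus) are assumptions for illustration, not the PR's API.

```python
import logging

logger = logging.getLogger("mp_observability")

def init_mp_observability(enabled, port, bus_factory, subscribers,
                          start_metrics_server=None):
    """Hypothetical helper consolidating the duplicated startup logic.

    Returns the started bus, or None when observability is disabled.
    bus_factory and subscribers are injected so server.py can pass the
    real init_event_bus / L1MetricsSubscriber / SMMetricsSubscriber.
    """
    if not enabled:
        return None
    if start_metrics_server is not None:
        start_metrics_server(port)  # e.g. prometheus_client.start_http_server
        logger.info("Metrics available at http://0.0.0.0:%d/metrics", port)
    bus = bus_factory()
    for sub in subscribers:
        bus.register_subscriber(sub)
    bus.start()
    return bus

# Minimal stub demonstrating the call shape the servers would use
class StubBus:
    def __init__(self):
        self.subs, self.started = [], False
    def register_subscriber(self, sub):
        self.subs.append(sub)
    def start(self):
        self.started = True

bus = init_mp_observability(True, 9090, StubBus, ["l1", "sm"])
assert bus.started and bus.subs == ["l1", "sm"]
assert init_mp_observability(False, 9090, StubBus, []) is None
```

Each server file would then reduce to a single call, with the OTel provider setup either folded into the helper or invoked just before it.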
```python
try:
    # Third Party
    from opentelemetry.sdk._logs import LoggingHandler

    _otel_handler = LoggingHandler(level=logging.DEBUG)
    logger.addHandler(_otel_handler)
except ImportError:
    pass
```
This code imports LoggingHandler from opentelemetry.sdk._logs, which is an internal module. Relying on internal modules can be risky as they are not subject to semantic versioning and can change without notice. Since opentelemetry-sdk is now a direct dependency, the try...except ImportError is likely no longer necessary.
If using LoggingHandler is intentional for a lightweight setup, it would be beneficial to remove the try...except and add a comment explaining the rationale for using an internal API.
```python
# Third Party
# NOTE: Using LoggingHandler from an internal OTel module for a lightweight
# bridge to OTel Logs, which is less overhead than the full LoggingInstrumentor.
from opentelemetry.sdk._logs import LoggingHandler

_otel_handler = LoggingHandler(level=logging.DEBUG)
logger.addHandler(_otel_handler)
```

[MP][Observability] Migrate L1 + SM metrics to EventBus + OTel, remove old Prometheus pipeline

Replace the old Listener/Stats/PrometheusController observability system with the EventBus + OpenTelemetry pipeline for MP mode.

Key changes:
- L1Manager publishes events directly to EventBus alongside listener iteration (listeners stay for business-logic consumers: StoreListener, EvictionPolicy)
- StorageManager fully migrated: listener iteration replaced with bus.publish()
- New OTel metrics subscribers (L1MetricsSubscriber, SMMetricsSubscriber)
- New logging subscribers with OTel LoggingHandler bridge
- Dual metrics export: OTLP push (OTEL_EXPORTER_OTLP_ENDPOINT) or Prometheus pull fallback (default, no collector needed)
- Removed: PrometheusController, stats dataclasses, old stats loggers, old tests
- Updated: METRICS.md, README.md (replaces LOGGER_GUIDE.md), REFACT_DESIGN.md

Signed-off-by: royyhuang <roy.y.huang@gmail.com>
…metry subdirs

Reorganize subscribers for cleaner project layout:
- subscribers/metrics/l1.py, sm.py
- subscribers/logging/l1.py, sm.py
- subscribers/telemetry/ (future PR 3)

Update all imports in server, blend_server, blend_server_v2, and tests.

Signed-off-by: royyhuang <roy.y.huang@gmail.com>
Force-pushed from 36da9c7 to f3232a8
Signed-off-by: royyhuang <roy.y.huang@gmail.com>
```python
# First Party
from lmcache.v1.mp_observability.otel_init import init_otel_metrics
```
nit: we can do the import at the top of the file
```python
# First Party
from lmcache.v1.mp_observability.otel_init import init_otel_metrics
```
Same as above: we can import at the top
```python
# First Party
from lmcache.v1.mp_observability.otel_init import init_otel_metrics
```
nit: Same import issue here
We need `__init__.py` in the subfolders
```python
self._event_bus.publish(
    Event(
        event_type=EventType.L1_KEYS_EVICTED,
```
Can we have L1_KEYS_DELETED instead of L1_KEYS_EVICTED?
The main reason is that not all deletions are evictions.
We probably need to rethink what's the best definition of "failed keys". Let's take this offline (the current PR is okay)
```python
self._event_bus.publish(
    Event(
        event_type=EventType.SM_READ_PREFETCHED_FINISHED,
        metadata={
            "succeeded_keys": good_keys,
            "failed_keys": bad_keys,
        },
    )
)
```
The semantics are a bit weird here. We emit the same SM_READ_PREFETCHED_FINISHED event in the finish_read_prefetched function below. Ideally here we need a different event
We probably need to discuss how to define the "failures" and how to correctly capture them
Signed-off-by: royyhuang <roy.y.huang@gmail.com>
…lity_pr2' into refact/mp_observability_pr2
…ld Prometheus pipeline (LMCache#2794)

* [MP][Observability] Migrate L1 + SM metrics to EventBus + OTel, remove old Prometheus pipeline

Replace the old Listener/Stats/PrometheusController observability system with the EventBus + OpenTelemetry pipeline for MP mode.

Signed-off-by: royyhuang <roy.y.huang@gmail.com>
Summary
- L1Manager publishes events directly to the EventBus; StorageManager's listener iteration is replaced with `bus.publish()`
- New OTel metrics subscribers (`L1MetricsSubscriber`, `SMMetricsSubscriber`) and logging subscribers with an OTel `LoggingHandler` bridge
- Dual metrics export: OTLP push (`OTEL_EXPORTER_OTLP_ENDPOINT`) or Prometheus pull fallback (default)
- Removed: `PrometheusController`, stats dataclasses, stats loggers
- Subscribers reorganized into `metrics/`, `logging/`, `telemetry/` subdirs
- Updated: `README.md` (replaces `LOGGER_GUIDE.md`), `METRICS.md`, `REFACT_DESIGN.md`

Depends on #2792. This is PR 2 of 3.
Test plan
- Unit tests: `uv run python -m pytest tests/v1/mp_observability/ --noconftest`
- Metrics endpoint: `curl localhost:9090/metrics`