[MP][Observability][3/3] Migrate MP server telemetry to EventBus, unify config by royyhuang · Pull Request #2806 · LMCache/LMCache

royyhuang · 2026-03-18T00:38:24Z

Summary

Migrate all 6 MP server telemetry call sites (store/retrieve/lookup START/END) from the old TelemetryController to EventBus.publish(Event(...))
Add MPServerLoggingSubscriber (debug logs) and MPServerTracingSubscriber (OTel spans from START/END pairs)
Replace PrometheusConfig + TelemetryConfig with unified ObservabilityConfig controlling metrics, logging, and tracing via CLI flags
Move OTLP endpoint selection from OTEL_EXPORTER_OTLP_ENDPOINT env var to --otlp-endpoint config option
Delete entire telemetry/ subdirectory (controller, processors, config, event model) and its tests
Clean up docs: REFACT_DESIGN.md → DESIGN.md (architecture only), new EVENTS.md for metadata contracts
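
For readers unfamiliar with the pattern, the publish/subscribe flow described in the summary can be sketched roughly as follows. This is a toy, synchronous illustration only: the `Event` fields, the event name `mp.store.start`, and the `subscribe` method are assumptions made for the example and are not taken from the LMCache code, which (given its `max_queue_size` setting) presumably queues events rather than dispatching them inline.

```python
import time
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List


@dataclass
class Event:
    """Hypothetical event record; the real Event class may differ."""

    name: str  # illustrative name such as "mp.store.start"
    session_id: str  # lets a subscriber correlate START with END
    metadata: Dict[str, Any] = field(default_factory=dict)
    timestamp: float = field(default_factory=time.time)


class EventBus:
    """Toy bus: publish() hands the event to every subscriber of its name."""

    def __init__(self) -> None:
        self._subscribers: Dict[str, List[Callable[[Event], None]]] = {}

    def subscribe(self, name: str, callback: Callable[[Event], None]) -> None:
        self._subscribers.setdefault(name, []).append(callback)

    def publish(self, event: Event) -> None:
        # Events with no subscriber are silently ignored.
        for callback in self._subscribers.get(event.name, []):
            callback(event)


# Usage: a call site publishes, a subscriber (here a plain list) records.
bus = EventBus()
seen: List[Event] = []
bus.subscribe("mp.store.start", seen.append)
bus.publish(Event("mp.store.start", "req-1", {"num_keys": 4}))
bus.publish(Event("mp.unrelated", "req-1"))  # no subscriber, dropped
```

The point of the design is that call sites only know about `publish`; logging, metrics, and tracing are added or removed by registering subscribers.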

Changes

New files

subscribers/logging/mp_server.py — MPServerLoggingSubscriber
subscribers/tracing/mp_server.py — MPServerTracingSubscriber
DESIGN.md — architecture design doc (replaces REFACT_DESIGN.md)
EVENTS.md — event metadata contracts

Modified files

server.py, blend_server.py, blend_server_v2.py, http_server.py — use ObservabilityConfig, conditional subscriber registration, OTLP endpoint passthrough
config.py — ObservabilityConfig with add_observability_args() / parse_args_to_observability_config()
otel_init.py — remove env var fallback, config-driven OTLP endpoint
README.md — full config reference (CLI flags, config fields, export modes, tracing)
METRICS.md — add MP server events, link to EVENTS.md
All subscriber __init__.py — add __all__ exports

Deleted files

telemetry/ directory (7 source files + design doc)
tests/v1/mp_observability/telemetry/ (3 test files)

Test plan

uv run python -m pytest tests/v1/mp_observability/ --noconftest -v — 55 tests pass
Pre-commit hooks pass
OTLP collector verified: all 9 metrics (l1_read_keys, l1_write_keys, l1_evicted_keys, sm_read_requests, sm_read_succeed_keys, sm_read_failed_keys, sm_write_requests, sm_write_succeed_keys, sm_write_failed_keys) received with real vLLM traffic
All 3 span types verified: mp.store (115), mp.lookup_prefetch (28), mp.retrieve (3) with correct attributes and sub-ms GPU-accurate timestamps

chatgpt-codex-connector · 2026-03-18T00:38:29Z

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.
Credits must be used to enable repository wide code reviews.

gemini-code-assist · 2026-03-18T00:39:52Z

Summary of Changes

Hello, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly refactors the MP server's observability system by transitioning from a legacy TelemetryController and separate configurations to a modern, event-driven architecture based on an EventBus and a unified ObservabilityConfig. This change centralizes the collection and processing of metrics, logs, and traces, making the system more modular, extensible, and aligned with OpenTelemetry standards for improved monitoring and debugging capabilities.

Highlights

Telemetry Migration: Migrated all MP server telemetry call sites (store/retrieve/lookup START/END) from the legacy TelemetryController to a new EventBus.publish(Event(...)) mechanism.
New Observability Subscribers: Introduced MPServerLoggingSubscriber for debug logs and MPServerTracingSubscriber for OpenTelemetry spans, processing START/END event pairs.
Configuration Unification: Replaced separate PrometheusConfig and TelemetryConfig with a unified ObservabilityConfig to control metrics, logging, and tracing via CLI flags.
OTLP Endpoint Configuration: Moved OTLP endpoint selection from the OTEL_EXPORTER_OTLP_ENDPOINT environment variable to a dedicated --otlp-endpoint configuration option.
Codebase Simplification: Deleted the entire telemetry/ subdirectory, including its controller, processors, configuration, event model, and associated tests, streamlining the observability implementation.
Documentation Updates: Cleaned up and refactored design documentation (REFACT_DESIGN.md to DESIGN.md) and added a new EVENTS.md document for event metadata contracts.
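
As a rough illustration of how a subscriber might turn START/END event pairs into spans, the sketch below pairs events by operation name and request id and records a duration. It deliberately avoids the OpenTelemetry API; `SpanRecord`, `PairingSubscriber`, and the `(op, request_id)` key are hypothetical names chosen for the example, not the structure of `MPServerTracingSubscriber`.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class SpanRecord:
    """Stand-in for a finished span."""

    name: str
    start: float
    end: float

    @property
    def duration(self) -> float:
        return self.end - self.start


class PairingSubscriber:
    """Remembers START timestamps and closes them when the END arrives."""

    def __init__(self) -> None:
        self._open: Dict[Tuple[str, str], float] = {}
        self.finished: List[SpanRecord] = []

    def on_start(self, op: str, request_id: str, ts: float) -> None:
        self._open[(op, request_id)] = ts

    def on_end(self, op: str, request_id: str, ts: float) -> None:
        start = self._open.pop((op, request_id), None)
        if start is None:
            return  # END without a matching START: nothing to close
        self.finished.append(SpanRecord(op, start, ts))


sub = PairingSubscriber()
sub.on_start("mp.store", "req-1", 1.0)
sub.on_end("mp.store", "req-1", 1.5)
sub.on_end("mp.retrieve", "req-2", 2.0)  # unmatched END is ignored
```

A production subscriber would also need to bound or expire the table of open entries so that a lost END event cannot leak memory.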


Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩

gemini-code-assist

Code Review

This pull request is a significant and well-executed refactoring of the observability system, migrating from a custom TelemetryController to a more robust and extensible EventBus architecture based on OpenTelemetry. The changes are comprehensive, including updating all telemetry call sites, introducing a unified ObservabilityConfig, and adding excellent documentation for the new system (DESIGN.md, EVENTS.md, README.md).

My review focuses on a few areas for improvement to enhance configurability and maintainability:

Exposing the event bus queue size as a CLI argument.
Preserving type information for OpenTelemetry span attributes.
Reducing code duplication in the new logging subscribers.

Overall, this is a high-quality contribution that greatly improves the observability infrastructure.

gemini-code-assist · 2026-03-18T00:48:25Z

+def add_observability_args(
    parser: argparse.ArgumentParser,
 ) -> argparse.ArgumentParser:
-    """
-    Add Prometheus configuration arguments to an existing parser.
+    """Add observability configuration arguments to an existing parser.

    Args:
        parser: The argument parser to add arguments to.

    Returns:
-        argparse.ArgumentParser: The same parser with Prometheus arguments added.
+        The same parser with observability arguments added.
    """
-    prometheus_group = parser.add_argument_group(
-        "Prometheus Observability", "Configuration for Prometheus metrics"
+    group = parser.add_argument_group(
+        "Observability", "Configuration for metrics, logging, and tracing"
+    )
+    group.add_argument(
+        "--disable-observability",
+        action="store_true",
+        default=False,
+        help="Disable the observability EventBus entirely.",
    )
-    prometheus_group.add_argument(
-        "--disable-prometheus",
+    group.add_argument(
+        "--disable-metrics",
        action="store_true",
        default=False,
-        help="Disable Prometheus metrics collection and HTTP server.",
+        help="Disable metrics subscribers (OTel counters).",
    )
-    prometheus_group.add_argument(
+    group.add_argument(
+        "--disable-logging",
+        action="store_true",
+        default=False,
+        help="Disable logging subscribers.",
+    )
+    group.add_argument(
+        "--enable-tracing",
+        action="store_true",
+        default=False,
+        help="Enable span subscribers (OTel traces). Disabled by default.",
+    )
+    group.add_argument(
+        "--otlp-endpoint",
+        type=str,
+        default=None,
+        help=(
+            "OTLP gRPC endpoint (e.g. http://localhost:4317). "
+            "When set, metrics/traces are pushed to an OTel collector. "
+            "When unset, falls back to Prometheus pull mode."
+        ),
+    )
+    group.add_argument(
        "--prometheus-port",
        type=int,
        default=9090,
-        help="Port to expose the Prometheus /metrics endpoint on. Default is 9090.",
-    )
-    prometheus_group.add_argument(
-        "--prometheus-log-interval",
-        type=float,
-        default=10.0,
-        help="How often (in seconds) to flush stats to Prometheus. Default is 10.0.",
+        help=(
+            "Port for the Prometheus /metrics endpoint. "
+            "Only used when --otlp-endpoint is not set. Default is 9090."
+        ),
    )
    return parser


The max_queue_size for the EventBus is defined in ObservabilityConfig but is not configurable via a CLI argument. The previous telemetry system allowed configuring this via --telemetry-max-queue-size. It would be beneficial to expose this configuration option again to allow tuning the event bus behavior under different loads.

Consider adding a --event-bus-queue-size argument to add_observability_args and using it in parse_args_to_observability_config.
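
A minimal sketch of what this suggestion could look like is shown below. The flag name comes from the review comment; the stand-in `ObservabilityConfig`, its default of 10000, the helper names, and the positive-value check are assumptions for illustration and may not match what was eventually merged.

```python
import argparse
from dataclasses import dataclass


@dataclass
class ObservabilityConfig:
    """Stand-in holding only the field discussed here."""

    max_queue_size: int = 10000  # assumed default, not taken from the PR


def add_queue_size_arg(parser: argparse.ArgumentParser) -> argparse.ArgumentParser:
    parser.add_argument(
        "--event-bus-queue-size",
        type=int,
        default=ObservabilityConfig.max_queue_size,
        help="Maximum number of events buffered by the EventBus.",
    )
    return parser


def parse_queue_size(args: argparse.Namespace) -> ObservabilityConfig:
    # argparse maps --event-bus-queue-size to args.event_bus_queue_size.
    if args.event_bus_queue_size <= 0:
        raise ValueError("--event-bus-queue-size must be a positive integer")
    return ObservabilityConfig(max_queue_size=args.event_bus_queue_size)


parser = add_queue_size_arg(argparse.ArgumentParser())
tuned = parse_queue_size(parser.parse_args(["--event-bus-queue-size", "256"]))
default = parse_queue_size(parser.parse_args([]))
```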

gemini-code-assist · 2026-03-18T00:48:25Z

+try:
+    # Third Party
+    from opentelemetry.sdk._logs import LoggingHandler
+
+    _otel_handler = LoggingHandler(level=logging.DEBUG)
+    logger.addHandler(_otel_handler)
+except ImportError:
+    pass


This block of code to set up the OpenTelemetry LoggingHandler is repeated across all three new logging subscribers (l1.py, mp_server.py, and sm.py). To improve maintainability and reduce code duplication, consider refactoring this into a shared helper function, for example in lmcache.logging, that initializes a logger and attaches the OTel handler.
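
One possible shape for such a helper is sketched here. The function name `attach_otel_handler` and its return convention are hypothetical; the only library call used is the `LoggingHandler` import already present in the diff above, and the helper degrades to a no-op when the OpenTelemetry SDK is not installed.

```python
import logging


def attach_otel_handler(logger: logging.Logger, level: int = logging.DEBUG) -> bool:
    """Attach an OTel LoggingHandler at most once.

    Returns False when the OpenTelemetry SDK is not installed.
    """
    try:
        # Third Party (optional dependency)
        from opentelemetry.sdk._logs import LoggingHandler
    except ImportError:
        return False

    # Avoid stacking duplicate handlers when several modules call this.
    if any(isinstance(h, LoggingHandler) for h in logger.handlers):
        return True

    logger.addHandler(LoggingHandler(level=level))
    return True


demo_logger = logging.getLogger("mp_observability.demo")
first = attach_otel_handler(demo_logger)
second = attach_otel_handler(demo_logger)  # idempotent second call
```

Each subscriber module would then call the helper once instead of repeating the try/except block.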

gemini-code-assist · 2026-03-18T00:48:25Z

+        for k, v in event.metadata.items():
+            span.set_attribute(k, str(v))


Converting all metadata values to strings with str(v) before setting them as span attributes causes loss of type information. For example, integer values like stored_count will be stored as strings, which prevents numerical analysis (e.g., sums, averages) in tracing backends. The OpenTelemetry set_attribute method can handle int, float, and bool types directly. It's better to pass the values as-is and rely on the event producer to adhere to the documented metadata contract.

This logic is also duplicated on lines 94-95. You could extract it to a helper method.

Suggested change

-        for k, v in event.metadata.items():
-            span.set_attribute(k, str(v))
+        for k, v in event.metadata.items():
+            span.set_attribute(k, v)
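
A middle ground between the two versions, sketched below under assumptions, is to pass through the value types that OpenTelemetry attributes accept (str, bool, int, float, and homogeneous sequences of those) and stringify only everything else, such as None or dicts. The helper name `to_span_attribute` and the `FakeSpan` stand-in are hypothetical and exist only so the example runs without the OTel SDK.

```python
from typing import Any, Dict

_PRIMITIVES = (str, bool, int, float)


def to_span_attribute(value: Any) -> Any:
    """Keep OTel-compatible values as-is; fall back to str() otherwise."""
    if isinstance(value, _PRIMITIVES):
        return value
    if isinstance(value, (list, tuple)):
        same_type = len({type(v) for v in value}) <= 1
        if same_type and all(isinstance(v, _PRIMITIVES) for v in value):
            return list(value)
    return str(value)


class FakeSpan:
    """Stand-in exposing the same set_attribute(key, value) call shape."""

    def __init__(self) -> None:
        self.attributes: Dict[str, Any] = {}

    def set_attribute(self, key: str, value: Any) -> None:
        self.attributes[key] = value


span = FakeSpan()
metadata = {"stored_count": 115, "device": "cuda:0", "extra": None, "ids": (1, 2)}
for k, v in metadata.items():
    span.set_attribute(k, to_span_attribute(v))
```

With this, `stored_count` stays an integer for aggregation in the tracing backend, while unexpected types are still recorded rather than rejected.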

sammshen · 2026-03-18T00:55:52Z

@codex review

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c2eba00b0f




KuntaiDu

Otherwise LGTM

ApostaC

Please see the comments below.

Regarding the user-facing changes, here's the summary from cc:


ApostaC

LGTM!


…fy config (LMCache#2806)

* [MP][Observability] Migrate MP server telemetry to EventBus, unify config, remove old telemetry system

- Migrate 6 telemetry call sites in MPCacheEngine (store/retrieve/lookup) from log_telemetry(make_start_event/make_end_event) to EventBus.publish(Event)
- Add MPServerLoggingSubscriber for debug logging of all MP server events
- Add MPServerTracingSubscriber for OTel spans from START/END event pairs
- Replace PrometheusConfig + TelemetryConfig with unified ObservabilityConfig (--disable-observability, --disable-metrics, --disable-logging, --enable-tracing, --otlp-endpoint, --prometheus-port)
- Move OTLP endpoint from env var to config option (--otlp-endpoint)
- Wire init_otel_tracing() when tracing is enabled
- Conditionally register subscribers based on config toggles
- Add __all__ exports to all subscriber __init__.py files
- Delete entire telemetry/ subdirectory and its tests
- Rename REFACT_DESIGN.md to DESIGN.md (architecture doc only)
- Split event metadata contracts into EVENTS.md
- Update README.md with full config reference
- Update METRICS.md with MP server event types

Signed-off-by: royyhuang <roy.y.huang@gmail.com>

* [MP][Observability] Fix mypy: update test files to use ObservabilityConfig

Replace DEFAULT_PROMETHEUS_CONFIG / prometheus_config with DEFAULT_OBSERVABILITY_CONFIG / obs_config in test_cache_server, test_blend_server, and test_blend_server_v2.

Signed-off-by: royyhuang <roy.y.huang@gmail.com>

* add back the lost logging subscribers

Signed-off-by: royyhuang <roy.y.huang@gmail.com>

* add event bus queue size configurable from cli args

Signed-off-by: royyhuang <roy.y.huang@gmail.com>

* [MP][Observability] Address PR review comments

- Move OTel LoggingHandler into init_logger so all loggers get OTel forwarding automatically; remove duplicated setup from 3 subscriber files. Log level now respects LMCACHE_LOG_LEVEL instead of being hardcoded to DEBUG.
- Add global observability flag (is_observability_enabled) to skip launch_host_func calls in CUDA streams when observability is disabled.
- Validate that --enable-tracing requires --otlp-endpoint at startup.
- Create AGENTS.override.md for mp_observability module.
- Rewrite observability docs to match new EventBus CLI args and document all three modes (metrics, logging, tracing).

Signed-off-by: royyhuang <roy.y.huang@gmail.com>

* [MP][Observability] Add publish_on_stream util to EventBus

Extract the repeated pattern of checking observability + calling launch_host_func into EventBus.publish_on_stream(stream, event). Callers no longer need to manually check the flag before scheduling host functions on CUDA streams.

Signed-off-by: royyhuang <roy.y.huang@gmail.com>

* [CLI] Update server command for new observability config API

Replace removed add_prometheus_args/add_telemetry_args with add_observability_args, and parse_args_to_prometheus_config/parse_args_to_telemetry_config with parse_args_to_observability_config. Update test assertions to match new kwarg name (obs_config).

Signed-off-by: royyhuang <roy.y.huang@gmail.com>

* [MP][Core] Block id based kernel for MP mode (LMCache#2838)

* add new kernels and unit tests for mp mode

Signed-off-by: ApostaC <yihua98@uchicago.edu>

* change the block ids to be on gpu tensor

Signed-off-by: Yihua Cheng <yihua98@uchicago.edu>

---------

Signed-off-by: royyhuang <roy.y.huang@gmail.com>
Signed-off-by: Yihua Cheng <yihua98@uchicago.edu>
Co-authored-by: Yihua Cheng <yihua98@uchicago.edu>
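
The startup check mentioned in the review-response commit ("Validate that --enable-tracing requires --otlp-endpoint") might look roughly like the sketch below. The two flags are the ones shown in the `add_observability_args` diff; the function name `validate_tracing_args` and the choice of `ValueError` are assumptions, since the commit message does not say how the validation is implemented.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Only the two flags relevant to the check, as defined in the diff.
    parser = argparse.ArgumentParser()
    parser.add_argument("--enable-tracing", action="store_true", default=False)
    parser.add_argument("--otlp-endpoint", type=str, default=None)
    return parser


def validate_tracing_args(args: argparse.Namespace) -> None:
    """Fail fast: spans have nowhere to go without an OTLP endpoint."""
    if args.enable_tracing and not args.otlp_endpoint:
        raise ValueError("--enable-tracing requires --otlp-endpoint to be set")


parser = build_parser()
ok_args = parser.parse_args(
    ["--enable-tracing", "--otlp-endpoint", "http://localhost:4317"]
)
validate_tracing_args(ok_args)  # passes silently
```

Checking at startup surfaces the misconfiguration immediately instead of silently producing no traces.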

royyhuang force-pushed the refact/mp_observability_pr3 branch from 7bdcf9f to c2eba00 Compare March 18, 2026 00:47

royyhuang changed the title ~~[MP][Observability] Migrate MP server telemetry to EventBus, unify config~~ [MP][Observability][3/3] Migrate MP server telemetry to EventBus, unify config Mar 18, 2026

gemini-code-assist Bot reviewed Mar 18, 2026


chatgpt-codex-connector Bot reviewed Mar 18, 2026


Comment thread lmcache/v1/multiprocess/server.py Outdated

Comment thread lmcache/v1/mp_observability/subscribers/tracing/mp_server.py

royyhuang added 2 commits March 19, 2026 23:08

royyhuang force-pushed the refact/mp_observability_pr3 branch from ab14d6d to fadd426 Compare March 19, 2026 23:08

royyhuang added 2 commits March 20, 2026 00:21

add back the lost logging subscribers

e2c6347

Signed-off-by: royyhuang <roy.y.huang@gmail.com>

add event bus queue size configurable from cli args

2533197

Signed-off-by: royyhuang <roy.y.huang@gmail.com>

royyhuang requested review from ApostaC and KuntaiDu March 20, 2026 23:30

KuntaiDu reviewed Mar 23, 2026


Comment thread lmcache/v1/mp_observability/subscribers/logging/mp_server.py Outdated

KuntaiDu approved these changes Mar 23, 2026


ApostaC reviewed Mar 25, 2026


royyhuang and others added 8 commits March 26, 2026 18:37

Merge branch 'dev' into refact/mp_observability_pr3

f826d49

Signed-off-by: royyhuang <roy.y.huang@gmail.com> # Conflicts: # lmcache/v1/multiprocess/blend_server.py # tests/v1/multiprocess/test_blend_server.py

Merge branch 'dev' into refact/mp_observability_pr3

9e99fed

Merge branch 'dev' into refact/mp_observability_pr3

4b4ffa8

[MP][Core] Block id based kernel for MP mode (LMCache#2838)

3cf95ea

* add new kernels and unit tests for mp mode Signed-off-by: ApostaC <yihua98@uchicago.edu> * change the block ids to be on gpu tensor Signed-off-by: Yihua Cheng <yihua98@uchicago.edu>

Merge remote-tracking branch 'upstream/dev' into refact/mp_observabil…

bd99489

…ity_pr3 Signed-off-by: royyhuang <roy.y.huang@gmail.com> # Conflicts: # lmcache/v1/multiprocess/server.py

royyhuang enabled auto-merge (squash) March 26, 2026 22:25

github-actions Bot added the full Run comprehensive tests on this PR label Mar 26, 2026

royyhuang added the mp Buildkite trigger for multi-processing mode test label Mar 26, 2026

ApostaC approved these changes Mar 26, 2026


royyhuang added 3 commits March 26, 2026 17:20

Merge branch 'dev' into refact/mp_observability_pr3

216eafe

Merge remote-tracking branch 'upstream/dev' into refact/mp_observabil…

a6b67ea

…ity_pr3 Signed-off-by: royyhuang <roy.y.huang@gmail.com> # Conflicts: # docs/source/mp/observability.rst

Merge branch 'refact/mp_observability_pr3' of github.com:royyhuang/LM…

6f3702a

…Cache into refact/mp_observability_pr3

royyhuang merged commit 64c3ba8 into LMCache:dev Mar 27, 2026
33 checks passed

