[MP][Observability][3/3] Migrate MP server telemetry to EventBus, unify config#2806
Conversation
|
Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits. |
Summary of ChangesHello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request significantly refactors the MP server's observability system by transitioning from a legacy Highlights
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here. Footnotes
|
7bdcf9f to
c2eba00
Compare
There was a problem hiding this comment.
Code Review
This pull request is a significant and well-executed refactoring of the observability system, migrating from a custom TelemetryController to a more robust and extensible EventBus architecture based on OpenTelemetry. The changes are comprehensive, including updating all telemetry call sites, introducing a unified ObservabilityConfig, and adding excellent documentation for the new system (DESIGN.md, EVENTS.md, README.md).
My review focuses on a few areas for improvement to enhance configurability and maintainability:
- Exposing the event bus queue size as a CLI argument.
- Preserving type information for OpenTelemetry span attributes.
- Reducing code duplication in the new logging subscribers.
Overall, this is a high-quality contribution that greatly improves the observability infrastructure.
| def add_observability_args( | ||
| parser: argparse.ArgumentParser, | ||
| ) -> argparse.ArgumentParser: | ||
| """ | ||
| Add Prometheus configuration arguments to an existing parser. | ||
| """Add observability configuration arguments to an existing parser. | ||
|
|
||
| Args: | ||
| parser: The argument parser to add arguments to. | ||
|
|
||
| Returns: | ||
| argparse.ArgumentParser: The same parser with Prometheus arguments added. | ||
| The same parser with observability arguments added. | ||
| """ | ||
| prometheus_group = parser.add_argument_group( | ||
| "Prometheus Observability", "Configuration for Prometheus metrics" | ||
| group = parser.add_argument_group( | ||
| "Observability", "Configuration for metrics, logging, and tracing" | ||
| ) | ||
| group.add_argument( | ||
| "--disable-observability", | ||
| action="store_true", | ||
| default=False, | ||
| help="Disable the observability EventBus entirely.", | ||
| ) | ||
| prometheus_group.add_argument( | ||
| "--disable-prometheus", | ||
| group.add_argument( | ||
| "--disable-metrics", | ||
| action="store_true", | ||
| default=False, | ||
| help="Disable Prometheus metrics collection and HTTP server.", | ||
| help="Disable metrics subscribers (OTel counters).", | ||
| ) | ||
| prometheus_group.add_argument( | ||
| group.add_argument( | ||
| "--disable-logging", | ||
| action="store_true", | ||
| default=False, | ||
| help="Disable logging subscribers.", | ||
| ) | ||
| group.add_argument( | ||
| "--enable-tracing", | ||
| action="store_true", | ||
| default=False, | ||
| help="Enable span subscribers (OTel traces). Disabled by default.", | ||
| ) | ||
| group.add_argument( | ||
| "--otlp-endpoint", | ||
| type=str, | ||
| default=None, | ||
| help=( | ||
| "OTLP gRPC endpoint (e.g. http://localhost:4317). " | ||
| "When set, metrics/traces are pushed to an OTel collector. " | ||
| "When unset, falls back to Prometheus pull mode." | ||
| ), | ||
| ) | ||
| group.add_argument( | ||
| "--prometheus-port", | ||
| type=int, | ||
| default=9090, | ||
| help="Port to expose the Prometheus /metrics endpoint on. Default is 9090.", | ||
| ) | ||
| prometheus_group.add_argument( | ||
| "--prometheus-log-interval", | ||
| type=float, | ||
| default=10.0, | ||
| help="How often (in seconds) to flush stats to Prometheus. Default is 10.0.", | ||
| help=( | ||
| "Port for the Prometheus /metrics endpoint. " | ||
| "Only used when --otlp-endpoint is not set. Default is 9090." | ||
| ), | ||
| ) | ||
| return parser |
There was a problem hiding this comment.
The max_queue_size for the EventBus is defined in ObservabilityConfig but is not configurable via a CLI argument. The previous telemetry system allowed configuring this via --telemetry-max-queue-size. It would be beneficial to expose this configuration option again to allow tuning the event bus behavior under different loads.
Consider adding a --event-bus-queue-size argument to add_observability_args and using it in parse_args_to_observability_config.
| try: | ||
| # Third Party | ||
| from opentelemetry.sdk._logs import LoggingHandler | ||
|
|
||
| _otel_handler = LoggingHandler(level=logging.DEBUG) | ||
| logger.addHandler(_otel_handler) | ||
| except ImportError: | ||
| pass |
There was a problem hiding this comment.
This block of code to set up the OpenTelemetry LoggingHandler is repeated across all three new logging subscribers (l1.py, mp_server.py, and sm.py). To improve maintainability and reduce code duplication, consider refactoring this into a shared helper function, for example in lmcache.logging, that initializes a logger and attaches the OTel handler.
| for k, v in event.metadata.items(): | ||
| span.set_attribute(k, str(v)) |
There was a problem hiding this comment.
Converting all metadata values to strings with str(v) before setting them as span attributes causes loss of type information. For example, integer values like stored_count will be stored as strings, which prevents numerical analysis (e.g., sums, averages) in tracing backends. The OpenTelemetry set_attribute method can handle int, float, and bool types directly. It's better to pass the values as-is and rely on the event producer to adhere to the documented metadata contract.
This logic is also duplicated on lines 94-95. You could extract it to a helper method.
| for k, v in event.metadata.items(): | |
| span.set_attribute(k, str(v)) | |
| for k, v in event.metadata.items(): | |
| span.set_attribute(k, v) |
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: c2eba00b0f
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
|
Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits. |
…nfig, remove old telemetry system - Migrate 6 telemetry call sites in MPCacheEngine (store/retrieve/lookup) from log_telemetry(make_start_event/make_end_event) to EventBus.publish(Event) - Add MPServerLoggingSubscriber for debug logging of all MP server events - Add MPServerTracingSubscriber for OTel spans from START/END event pairs - Replace PrometheusConfig + TelemetryConfig with unified ObservabilityConfig (--disable-observability, --disable-metrics, --disable-logging, --enable-tracing, --otlp-endpoint, --prometheus-port) - Move OTLP endpoint from env var to config option (--otlp-endpoint) - Wire init_otel_tracing() when tracing is enabled - Conditionally register subscribers based on config toggles - Add __all__ exports to all subscriber __init__.py files - Delete entire telemetry/ subdirectory and its tests - Rename REFACT_DESIGN.md to DESIGN.md (architecture doc only) - Split event metadata contracts into EVENTS.md - Update README.md with full config reference - Update METRICS.md with MP server event types Signed-off-by: royyhuang <roy.y.huang@gmail.com>
…onfig Replace DEFAULT_PROMETHEUS_CONFIG / prometheus_config with DEFAULT_OBSERVABILITY_CONFIG / obs_config in test_cache_server, test_blend_server, and test_blend_server_v2. Signed-off-by: royyhuang <roy.y.huang@gmail.com>
ab14d6d to
fadd426
Compare
Signed-off-by: royyhuang <roy.y.huang@gmail.com>
Signed-off-by: royyhuang <roy.y.huang@gmail.com>
Signed-off-by: royyhuang <roy.y.huang@gmail.com> # Conflicts: # lmcache/v1/multiprocess/blend_server.py # tests/v1/multiprocess/test_blend_server.py
- Move OTel LoggingHandler into init_logger so all loggers get OTel forwarding automatically; remove duplicated setup from 3 subscriber files. Log level now respects LMCACHE_LOG_LEVEL instead of being hardcoded to DEBUG. - Add global observability flag (is_observability_enabled) to skip launch_host_func calls in CUDA streams when observability is disabled. - Validate that --enable-tracing requires --otlp-endpoint at startup. - Create AGENTS.override.md for mp_observability module. - Rewrite observability docs to match new EventBus CLI args and document all three modes (metrics, logging, tracing). Signed-off-by: royyhuang <roy.y.huang@gmail.com>
Extract the repeated pattern of checking observability + calling launch_host_func into EventBus.publish_on_stream(stream, event). Callers no longer need to manually check the flag before scheduling host functions on CUDA streams. Signed-off-by: royyhuang <roy.y.huang@gmail.com>
Replace removed add_prometheus_args/add_telemetry_args with add_observability_args, and parse_args_to_prometheus_config/ parse_args_to_telemetry_config with parse_args_to_observability_config. Update test assertions to match new kwarg name (obs_config). Signed-off-by: royyhuang <roy.y.huang@gmail.com>
* add new kernels and unit tests for mp mode Signed-off-by: ApostaC <yihua98@uchicago.edu> * change the block ids to be on gpu tensor Signed-off-by: Yihua Cheng <yihua98@uchicago.edu>
…ity_pr3 Signed-off-by: royyhuang <roy.y.huang@gmail.com> # Conflicts: # lmcache/v1/multiprocess/server.py
…ity_pr3 Signed-off-by: royyhuang <roy.y.huang@gmail.com> # Conflicts: # docs/source/mp/observability.rst
…Cache into refact/mp_observability_pr3
…fy config (LMCache#2806) * [MP][Observability] Migrate MP server telemetry to EventBus, unify config, remove old telemetry system - Migrate 6 telemetry call sites in MPCacheEngine (store/retrieve/lookup) from log_telemetry(make_start_event/make_end_event) to EventBus.publish(Event) - Add MPServerLoggingSubscriber for debug logging of all MP server events - Add MPServerTracingSubscriber for OTel spans from START/END event pairs - Replace PrometheusConfig + TelemetryConfig with unified ObservabilityConfig (--disable-observability, --disable-metrics, --disable-logging, --enable-tracing, --otlp-endpoint, --prometheus-port) - Move OTLP endpoint from env var to config option (--otlp-endpoint) - Wire init_otel_tracing() when tracing is enabled - Conditionally register subscribers based on config toggles - Add __all__ exports to all subscriber __init__.py files - Delete entire telemetry/ subdirectory and its tests - Rename REFACT_DESIGN.md to DESIGN.md (architecture doc only) - Split event metadata contracts into EVENTS.md - Update README.md with full config reference - Update METRICS.md with MP server event types Signed-off-by: royyhuang <roy.y.huang@gmail.com> * [MP][Observability] Fix mypy: update test files to use ObservabilityConfig Replace DEFAULT_PROMETHEUS_CONFIG / prometheus_config with DEFAULT_OBSERVABILITY_CONFIG / obs_config in test_cache_server, test_blend_server, and test_blend_server_v2. Signed-off-by: royyhuang <roy.y.huang@gmail.com> * add back the lost logging subscribers Signed-off-by: royyhuang <roy.y.huang@gmail.com> * add event bus queue size configurable from cli args Signed-off-by: royyhuang <roy.y.huang@gmail.com> * [MP][Observability] Address PR review comments - Move OTel LoggingHandler into init_logger so all loggers get OTel forwarding automatically; remove duplicated setup from 3 subscriber files. Log level now respects LMCACHE_LOG_LEVEL instead of being hardcoded to DEBUG. - Add global observability flag (is_observability_enabled) to skip launch_host_func calls in CUDA streams when observability is disabled. - Validate that --enable-tracing requires --otlp-endpoint at startup. - Create AGENTS.override.md for mp_observability module. - Rewrite observability docs to match new EventBus CLI args and document all three modes (metrics, logging, tracing). Signed-off-by: royyhuang <roy.y.huang@gmail.com> * [MP][Observability] Add publish_on_stream util to EventBus Extract the repeated pattern of checking observability + calling launch_host_func into EventBus.publish_on_stream(stream, event). Callers no longer need to manually check the flag before scheduling host functions on CUDA streams. Signed-off-by: royyhuang <roy.y.huang@gmail.com> * [CLI] Update server command for new observability config API Replace removed add_prometheus_args/add_telemetry_args with add_observability_args, and parse_args_to_prometheus_config/ parse_args_to_telemetry_config with parse_args_to_observability_config. Update test assertions to match new kwarg name (obs_config). Signed-off-by: royyhuang <roy.y.huang@gmail.com> * [MP][Core] Block id based kernel for MP mode (LMCache#2838) * add new kernels and unit tests for mp mode Signed-off-by: ApostaC <yihua98@uchicago.edu> * change the block ids to be on gpu tensor Signed-off-by: Yihua Cheng <yihua98@uchicago.edu> --------- Signed-off-by: royyhuang <roy.y.huang@gmail.com> Signed-off-by: Yihua Cheng <yihua98@uchicago.edu> Co-authored-by: Yihua Cheng <yihua98@uchicago.edu>
…fy config (LMCache#2806) * [MP][Observability] Migrate MP server telemetry to EventBus, unify config, remove old telemetry system - Migrate 6 telemetry call sites in MPCacheEngine (store/retrieve/lookup) from log_telemetry(make_start_event/make_end_event) to EventBus.publish(Event) - Add MPServerLoggingSubscriber for debug logging of all MP server events - Add MPServerTracingSubscriber for OTel spans from START/END event pairs - Replace PrometheusConfig + TelemetryConfig with unified ObservabilityConfig (--disable-observability, --disable-metrics, --disable-logging, --enable-tracing, --otlp-endpoint, --prometheus-port) - Move OTLP endpoint from env var to config option (--otlp-endpoint) - Wire init_otel_tracing() when tracing is enabled - Conditionally register subscribers based on config toggles - Add __all__ exports to all subscriber __init__.py files - Delete entire telemetry/ subdirectory and its tests - Rename REFACT_DESIGN.md to DESIGN.md (architecture doc only) - Split event metadata contracts into EVENTS.md - Update README.md with full config reference - Update METRICS.md with MP server event types Signed-off-by: royyhuang <roy.y.huang@gmail.com> * [MP][Observability] Fix mypy: update test files to use ObservabilityConfig Replace DEFAULT_PROMETHEUS_CONFIG / prometheus_config with DEFAULT_OBSERVABILITY_CONFIG / obs_config in test_cache_server, test_blend_server, and test_blend_server_v2. Signed-off-by: royyhuang <roy.y.huang@gmail.com> * add back the lost logging subscribers Signed-off-by: royyhuang <roy.y.huang@gmail.com> * add event bus queue size configurable from cli args Signed-off-by: royyhuang <roy.y.huang@gmail.com> * [MP][Observability] Address PR review comments - Move OTel LoggingHandler into init_logger so all loggers get OTel forwarding automatically; remove duplicated setup from 3 subscriber files. Log level now respects LMCACHE_LOG_LEVEL instead of being hardcoded to DEBUG. - Add global observability flag (is_observability_enabled) to skip launch_host_func calls in CUDA streams when observability is disabled. - Validate that --enable-tracing requires --otlp-endpoint at startup. - Create AGENTS.override.md for mp_observability module. - Rewrite observability docs to match new EventBus CLI args and document all three modes (metrics, logging, tracing). Signed-off-by: royyhuang <roy.y.huang@gmail.com> * [MP][Observability] Add publish_on_stream util to EventBus Extract the repeated pattern of checking observability + calling launch_host_func into EventBus.publish_on_stream(stream, event). Callers no longer need to manually check the flag before scheduling host functions on CUDA streams. Signed-off-by: royyhuang <roy.y.huang@gmail.com> * [CLI] Update server command for new observability config API Replace removed add_prometheus_args/add_telemetry_args with add_observability_args, and parse_args_to_prometheus_config/ parse_args_to_telemetry_config with parse_args_to_observability_config. Update test assertions to match new kwarg name (obs_config). Signed-off-by: royyhuang <roy.y.huang@gmail.com> * [MP][Core] Block id based kernel for MP mode (LMCache#2838) * add new kernels and unit tests for mp mode Signed-off-by: ApostaC <yihua98@uchicago.edu> * change the block ids to be on gpu tensor Signed-off-by: Yihua Cheng <yihua98@uchicago.edu> --------- Signed-off-by: royyhuang <roy.y.huang@gmail.com> Signed-off-by: Yihua Cheng <yihua98@uchicago.edu> Co-authored-by: Yihua Cheng <yihua98@uchicago.edu>

Summary
TelemetryControllertoEventBus.publish(Event(...))MPServerLoggingSubscriber(debug logs) andMPServerTracingSubscriber(OTel spans from START/END pairs)PrometheusConfig+TelemetryConfigwith unifiedObservabilityConfigcontrolling metrics, logging, and tracing via CLI flagsOTEL_EXPORTER_OTLP_ENDPOINTenv var to--otlp-endpointconfig optiontelemetry/subdirectory (controller, processors, config, event model) and its testsREFACT_DESIGN.md→DESIGN.md(architecture only), newEVENTS.mdfor metadata contractsChanges
New files
subscribers/logging/mp_server.py—MPServerLoggingSubscribersubscribers/tracing/mp_server.py—MPServerTracingSubscriberDESIGN.md— architecture design doc (replacesREFACT_DESIGN.md)EVENTS.md— event metadata contractsModified files
server.py,blend_server.py,blend_server_v2.py,http_server.py— useObservabilityConfig, conditional subscriber registration, OTLP endpoint passthroughconfig.py—ObservabilityConfigwithadd_observability_args()/parse_args_to_observability_config()otel_init.py— remove env var fallback, config-driven OTLP endpointREADME.md— full config reference (CLI flags, config fields, export modes, tracing)METRICS.md— add MP server events, link toEVENTS.md__init__.py— add__all__exportsDeleted files
telemetry/directory (7 source files + design doc)tests/v1/mp_observability/telemetry/(3 test files)Test plan
uv run python -m pytest tests/v1/mp_observability/ --noconftest -v— 55 tests passl1_read_keys,l1_write_keys,l1_evicted_keys,sm_read_requests,sm_read_succeed_keys,sm_read_failed_keys,sm_write_requests,sm_write_succeed_keys,sm_write_failed_keys) received with real vLLM trafficmp.store(115),mp.lookup_prefetch(28),mp.retrieve(3) with correct attributes and sub-ms GPU-accurate timestamps