[MP][Observability] Add telemetry subsystem for multiprocess mode#2696

Merged
ApostaC merged 10 commits into LMCache:dev from ApostaC:local-dev/mp-telemetry
Mar 6, 2026

Conversation

@ApostaC
Contributor

@ApostaC ApostaC commented Mar 5, 2026

What this PR does / why we need it:

Adds an event-based telemetry framework for MP mode observability, complementing the existing Prometheus metrics (aggregated counters/histograms) with per-operation event tracing.

Design: Uses a START/END event model with session IDs. Call sites emit a START event when an operation begins and an END event when it completes, sharing the same session_id to correlate them into spans. This two-event design is necessary because start and end happen at different call sites in LMCache's async submit/check/complete patterns. See DESIGN.md for full architecture details and how to implement new processors.

Key components:

  • TelemetryEvent dataclass (name, event_type, session_id, metadata)
  • TelemetryProcessor / TelemetryProcessorConfig ABCs with a registry pattern (same as L2 adapters)
  • LoggingProcessor built-in processor with configurable log level
  • TelemetryController with lock-free deque, daemon drain thread, tail-drop backpressure
  • Disabled-by-default singleton — no thread or overhead until explicitly enabled
  • make_start_event() / make_end_event() convenience functions for concise call sites

User-facing changes:

  • New CLI args: --enable-telemetry, --telemetry-max-queue-size, --telemetry-processor JSON
  • Example: --enable-telemetry --telemetry-processor '{"type": "logging", "log_level": "INFO"}'
  • Telemetry events emitted for store, retrieve, and lookup operations in server.py
  • Wired into both server.py and blend_server.py startup/shutdown lifecycle

Special notes for your reviewers:

  • The processor registry + factory pattern mirrors lmcache/v1/distributed/l2_adapters/config.py
  • collections.deque is used without a lock — CPython GIL guarantees atomic append()/popleft() with a single consumer thread
  • GPU-stream call sites use launch_host_func with is_enabled() guards to avoid registering unnecessary CUDA callbacks when telemetry is disabled
  • This PR implements the framework only. Adding more call sites (e.g., blend operations, L2 store/retrieve) is a follow-up
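To make the lock-free-deque point above concrete, here is a sketch of the controller design it describes (single consumer thread, atomic `append()`/`popleft()` under the GIL, tail-drop when full, per-processor exception isolation). The `TelemetryController` name is from the PR; the internals and method names below are illustrative assumptions:

```python
import collections
import threading
import time


class TelemetryController:
    """Sketch: non-blocking event ingestion with a daemon drain thread."""

    def __init__(self, max_queue_size: int = 1024):
        # No lock on the queue: CPython guarantees atomic append()/popleft(),
        # and there is exactly one consumer (the drain thread).
        self._queue: collections.deque = collections.deque()
        self._max_queue_size = max_queue_size
        self._processors = []
        self._running = False
        self._thread = None

    def log(self, event) -> bool:
        # Tail-drop backpressure: drop the newest event when full,
        # so the hot path never blocks.
        if len(self._queue) >= self._max_queue_size:
            return False
        self._queue.append(event)
        return True

    def _drain(self) -> None:
        # Keep draining until stopped AND the queue is empty.
        while self._running or self._queue:
            try:
                event = self._queue.popleft()
            except IndexError:
                time.sleep(0.001)
                continue
            for proc in self._processors:
                try:
                    proc(event)
                except Exception:
                    # Exception isolation: one bad processor cannot
                    # take down the drain thread.
                    pass

    def start(self) -> None:
        if self._running:  # idempotent start
            return
        self._running = True
        self._thread = threading.Thread(target=self._drain, daemon=True)
        self._thread.start()

    def stop(self) -> None:
        self._running = False
        if self._thread is not None:
            self._thread.join()
```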

If applicable:

  • this PR contains user-facing changes (docs added)
  • this PR contains unit tests

ApostaC added 2 commits March 5, 2026 06:22
Signed-off-by: ApostaC <yihua98@uchicago.edu>
Signed-off-by: ApostaC <yihua98@uchicago.edu>
@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a robust, event-based telemetry system for LMCache's multiprocess mode, significantly enhancing observability by providing detailed, per-operation tracing. The framework is designed for extensibility, allowing for various processors to consume events, and is integrated seamlessly into the server's lifecycle with minimal performance impact when disabled. This foundational work enables deeper insights into cache behavior and performance characteristics.

Highlights

  • New Telemetry Framework: Introduced an event-based telemetry system for LMCache's multiprocess (MP) mode, designed for per-operation event tracing to complement existing Prometheus metrics.
  • Event Model: Implemented a START/END event model using session_id to correlate operations into spans, addressing the asynchronous nature of LMCache's submit/check/complete patterns.
  • Pluggable Processor Architecture: Designed a flexible architecture with abstract base classes (TelemetryProcessor, TelemetryProcessorConfig) and a registry pattern, allowing for easy integration of new telemetry processors (e.g., OpenTelemetry exporters).
  • High-Performance Event Ingestion: Utilized a lock-free collections.deque for event queuing on the hot path, with a dedicated daemon drain thread for asynchronous processing and tail-drop backpressure to prevent overload.
  • CLI Configuration & Integration: Added new CLI arguments (--enable-telemetry, --telemetry-max-queue-size, --telemetry-processor) for user-facing configuration, and integrated the telemetry system into the startup/shutdown lifecycles of server.py and blend_server.py.
  • Initial Instrumentation: Instrumented store, retrieve, and lookup operations in server.py to emit telemetry events, with is_enabled() guards to minimize overhead when telemetry is disabled.
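The pluggable-processor registry described above (and mirrored from the L2 adapter config, per the PR notes) can be sketched as follows. The class and registry names come from the changelog; the decorator, constructor signatures, and JSON-dispatch helper are assumptions for illustration:

```python
from abc import ABC, abstractmethod

# Maps a "type" string (as used in --telemetry-processor JSON) to a config class.
_PROCESSOR_CONFIG_REGISTRY: dict[str, type] = {}


class TelemetryProcessorConfig(ABC):
    @abstractmethod
    def create_processor(self):
        """Build the processor instance this config describes."""


def register_processor_config(type_name: str):
    # Registry pattern: decorating a config class makes its "type" name
    # resolvable from user-supplied JSON.
    def wrap(cls):
        _PROCESSOR_CONFIG_REGISTRY[type_name] = cls
        return cls
    return wrap


@register_processor_config("logging")
class LoggingProcessorConfig(TelemetryProcessorConfig):
    def __init__(self, log_level: str = "INFO"):
        self.log_level = log_level

    def create_processor(self):
        # Placeholder; the real class would return a LoggingProcessor.
        return ("logging-processor", self.log_level)


def config_from_json(d: dict) -> TelemetryProcessorConfig:
    # Dispatch on the "type" key, pass the rest as keyword arguments.
    cls = _PROCESSOR_CONFIG_REGISTRY[d.pop("type")]
    return cls(**d)
```

A new processor (e.g., an OpenTelemetry exporter) would then only need a decorated config class; no dispatch code changes.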


Changelog
  • lmcache/v1/mp_observability/telemetry/DESIGN.md
    • Documented the design principles, architecture, module layout, key components, and implementation guide for new telemetry processors.
  • lmcache/v1/mp_observability/telemetry/__init__.py
    • Defined the public API for the telemetry system, re-exporting core components for external use.
  • lmcache/v1/mp_observability/telemetry/config.py
    • Implemented the TelemetryConfig dataclass for system configuration.
    • Added add_telemetry_args to integrate telemetry options into argparse.
    • Provided parse_args_to_telemetry_config for constructing configuration from parsed arguments.
  • lmcache/v1/mp_observability/telemetry/controller.py
    • Created the TelemetryController class to manage event ingestion, queuing, and dispatch to processors.
    • Implemented a daemon drain thread for asynchronous event processing.
    • Added a global singleton pattern with init_telemetry_controller and get_telemetry_controller.
    • Provided convenience functions log_telemetry, make_start_event, and make_end_event.
  • lmcache/v1/mp_observability/telemetry/event.py
    • Defined the EventType enum (START, END) for event classification.
    • Created the TelemetryEvent dataclass to structure telemetry data, including name, event_type, session_id, and metadata.
  • lmcache/v1/mp_observability/telemetry/processors/__init__.py
    • Initialized the telemetry processors package.
    • Ensured built-in processor types are registered upon import.
  • lmcache/v1/mp_observability/telemetry/processors/base.py
    • Established TelemetryProcessor as an abstract base class for all telemetry processors.
    • Defined TelemetryProcessorConfig as an abstract base class for processor-specific configurations.
    • Implemented a registry (_PROCESSOR_CONFIG_REGISTRY) for mapping processor type names to their config classes.
  • lmcache/v1/mp_observability/telemetry/processors/logging_processor.py
    • Implemented LoggingProcessorConfig for configuring the built-in logging processor.
    • Created LoggingProcessor to log telemetry events at a configurable level.
  • lmcache/v1/multiprocess/blend_server.py
    • Imported telemetry-related modules and functions.
    • Added telemetry_config parameter to run_cache_server.
    • Initialized and started the global telemetry controller during server startup.
    • Stopped the telemetry controller during server shutdown.
  • lmcache/v1/multiprocess/server.py
    • Imported telemetry-related modules and functions.
    • Added telemetry_config parameter to run_cache_server.
    • Initialized and started the global telemetry controller during server startup.
    • Stopped the telemetry controller during server shutdown.
    • Instrumented the store method with make_start_event and make_end_event calls.
    • Instrumented the retrieve method with make_start_event and make_end_event calls.
    • Instrumented the lookup method with make_start_event and make_end_event calls, including error metadata.
  • tests/v1/mp_observability/telemetry/test_controller.py
    • Added unit tests for TelemetryController lifecycle, event processing, and processor registration.
    • Included tests for exception isolation, backpressure, and the global singleton behavior.
  • tests/v1/mp_observability/telemetry/test_event.py
    • Added unit tests for the EventType enum and TelemetryEvent dataclass, verifying default values and metadata handling.
  • tests/v1/mp_observability/telemetry/test_logging_processor.py
    • Added unit tests for LoggingProcessor and LoggingProcessorConfig, verifying event logging and configuration parsing.
Activity
  • ApostaC created this pull request to introduce a new telemetry subsystem for multiprocess observability.
  • The PR includes a detailed DESIGN.md document outlining the architecture and implementation guidelines.
  • Core components for event definition, configuration, and a central controller have been added.
  • A built-in logging processor is provided as a concrete example.
  • The telemetry system has been integrated into server.py and blend_server.py with initial instrumentation for key cache operations.
  • Comprehensive unit tests have been added for the new telemetry components.

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces a well-designed and comprehensive telemetry subsystem. The architecture is robust, featuring a lock-free queue, a daemon drain thread for processing, and a flexible processor plugin system. The implementation is clean and includes thorough unit tests. I have a couple of minor suggestions for improvement regarding configuration handling to enhance robustness and code clarity.

Comment thread lmcache/v1/mp_observability/telemetry/config.py
Signed-off-by: ApostaC <yihua98@uchicago.edu>
@ApostaC
Contributor Author

ApostaC commented Mar 5, 2026

Example usage command (log the telemetry to terminal)

python3 -m lmcache.v1.multiprocess.server --max-workers 4 --port 6555 \
    --l1-size 70 --eviction-policy LRU \
    --l2-adapter '{"type": "mock", "max_size_gb": 200, "mock_bandwidth_gb": 10}' \
    --enable-telemetry --telemetry-processor '{"type": "logging", "log_level":"warning"}'

Expected output:
(screenshot: telemetry log lines printed to the terminal)

@ApostaC ApostaC requested review from KuntaiDu and Oasis-Git March 5, 2026 07:15
Signed-off-by: ApostaC <yihua98@uchicago.edu>
Contributor

@KuntaiDu KuntaiDu left a comment


Some quick comments:

  1. Previously, IIRC, you preferred to keep lmcache/v1 for pure storage rather than request-level telemetry. Is that still true?
  2. Please also re-implement the request_finished telemetry under the integration folder using this framework, and deprecate the existing one there.
  3. When a request fully ends, it would be really nice to log a "summary" of that request, so people don't have to align the timestamps manually. Example:
Summary of request chat-cmpl-xxx:
Received at (time)
Preprocessing:  ??ms (st->ed: 2:04.025 -> 2:04.124)
Retrieve ?? tokens: ?? ms (st->ed: ???)
LLM computation runs: ??ms (st->ed: ???)
Store ?? tokens: ?? ms (st->ed: ???)

This might be convenient for debugging.
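Such a summary falls naturally out of the START/END model: a processor can pair events by `session_id` and print each span's duration. The `summarize` helper below is hypothetical, a sketch of the idea rather than anything in this PR:

```python
def summarize(events):
    """Pair START/END events by session_id and report per-span durations.

    events: iterable of (name, event_type, session_id, timestamp) tuples,
    where event_type is "start" or "end" and timestamp is in seconds.
    """
    starts = {}   # session_id -> (name, start timestamp)
    lines = []
    for name, etype, sid, ts in events:
        if etype == "start":
            starts[sid] = (name, ts)
        else:
            sname, sts = starts.pop(sid)
            lines.append(f"{sname}: {(ts - sts) * 1000:.1f} ms")
    return lines
```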

Signed-off-by: ApostaC <yihua98@uchicago.edu>
@ApostaC
Contributor Author

ApostaC commented Mar 5, 2026

@KuntaiDu

Previously, IIRC, you preferred to keep lmcache/v1 for pure storage rather than request-level telemetry. Is that still true?

This one is actually designed for request-level telemetry. The core design is a very fast, non-blocking API for submitting a telemetry event (i.e., pushing it onto a queue), with a background thread processing all the events.

Please also re-implement the request_finished telemetry under the integration folder using this framework, and deprecate the one under integration folder.

Unfortunately, this could be pretty difficult, since we would need to track the TP information at the LMCache level, which is quite hard. Additionally, though both modules are named "telemetry", this one is aimed more at debuggability and visualization (e.g., OpenTelemetry support).

When request fully ends, it would be really nice if we can have a "summary" of this request logged out, to avoid people manually align the time. Example:

Will have a follow-up PR for it. There should also be some tools to visualize the spans.

ApostaC added 2 commits March 5, 2026 13:16
Signed-off-by: Yihua Cheng <yihua98@uchicago.edu>
Signed-off-by: ApostaC <yihua98@uchicago.edu>
        with self._lock:
            self._processors.append(processor)

    def start(self) -> None:
Contributor


should this be idempotent? or doesn't matter

Contributor Author


Good point! Let me fix it
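One common way to make `start()` idempotent (a sketch of the general pattern, not this PR's eventual fix) is to guard thread creation behind a lock, so a second call is a no-op instead of spawning a duplicate drain thread:

```python
import threading


class Controller:
    def __init__(self):
        self._lock = threading.Lock()
        self._thread = None

    def start(self) -> None:
        # Idempotent: only the first call creates the worker thread.
        with self._lock:
            if self._thread is not None:
                return
            self._thread = threading.Thread(target=lambda: None, daemon=True)
            self._thread.start()
```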

@ApostaC ApostaC added the full Run comprehensive tests on this PR label Mar 6, 2026
@ApostaC ApostaC enabled auto-merge (squash) March 6, 2026 06:41
Contributor

@KuntaiDu KuntaiDu left a comment


LGTM!

Contributor

@sammshen sammshen left a comment


LGTM!

@ApostaC ApostaC merged commit 509a2f1 into LMCache:dev Mar 6, 2026
26 of 29 checks passed
mauryaavinash95 pushed a commit to mauryaavinash95/LMCache that referenced this pull request Mar 7, 2026
…Cache#2696)

* [Add] the telemetry subsystem for MP mode

Signed-off-by: ApostaC <yihua98@uchicago.edu>

* integrate telemetry system for lookup store and retrieve

Signed-off-by: ApostaC <yihua98@uchicago.edu>
shaoxiawjc pushed a commit to shaoxiawjc/LMCache that referenced this pull request Mar 11, 2026
…Cache#2696)

* [Add] the telemetry subsystem for MP mode

Signed-off-by: ApostaC <yihua98@uchicago.edu>

* integrate telemetry system for lookup store and retrieve

Signed-off-by: ApostaC <yihua98@uchicago.edu>
Signed-off-by: shaoxiawjc <wjc2800@163.com>
@ApostaC ApostaC mentioned this pull request Mar 11, 2026
2 tasks
realAaronWu pushed a commit to realAaronWu/LMCache that referenced this pull request Mar 20, 2026
…Cache#2696)

* [Add] the telemetry subsystem for MP mode

Signed-off-by: ApostaC <yihua98@uchicago.edu>

* integrate telemetry system for lookup store and retrieve

Signed-off-by: ApostaC <yihua98@uchicago.edu>
Signed-off-by: Aaron Wu <aaron.wu@dell.com>
jooho-XCENA pushed a commit to xcena-dev/LMCache that referenced this pull request Apr 2, 2026
…Cache#2696)

* [Add] the telemetry subsystem for MP mode

Signed-off-by: ApostaC <yihua98@uchicago.edu>

* integrate telemetry system for lookup store and retrieve

Signed-off-by: ApostaC <yihua98@uchicago.edu>

Labels

full Run comprehensive tests on this PR
