[MP][Observability] Add telemetry subsystem for multiprocess mode#2696

Merged
ApostaC merged 10 commits into LMCache:dev from ApostaC:local-dev/mp-telemetry
Mar 6, 2026

Conversation

@ApostaC
Contributor

@ApostaC ApostaC commented Mar 5, 2026

What this PR does / why we need it:

Adds an event-based telemetry framework for MP mode observability, complementing the existing Prometheus metrics (aggregated counters/histograms) with per-operation event tracing.

Design: Uses a START/END event model with session IDs. Call sites emit a START event when an operation begins and an END event when it completes, sharing the same session_id to correlate them into spans. This two-event design is necessary because start and end happen at different call sites in LMCache's async submit/check/complete patterns. See DESIGN.md for full architecture details and how to implement new processors.

Key components:

  • TelemetryEvent dataclass (name, event_type, session_id, metadata)
  • TelemetryProcessor / TelemetryProcessorConfig ABCs with a registry pattern (same as L2 adapters)
  • LoggingProcessor built-in processor with configurable log level
  • TelemetryController with lock-free deque, daemon drain thread, tail-drop backpressure
  • Disabled-by-default singleton — no thread or overhead until explicitly enabled
  • make_start_event() / make_end_event() convenience functions for concise call sites

User-facing changes:

  • New CLI args: --enable-telemetry, --telemetry-max-queue-size, --telemetry-processor JSON
  • Example: --enable-telemetry --telemetry-processor '{"type": "logging", "log_level": "INFO"}'
  • Telemetry events emitted for store, retrieve, and lookup operations in server.py
  • Wired into both server.py and blend_server.py startup/shutdown lifecycle

Special notes for your reviewers:

  • The processor registry + factory pattern mirrors lmcache/v1/distributed/l2_adapters/config.py
  • collections.deque is used without a lock — CPython GIL guarantees atomic append()/popleft() with a single consumer thread
  • GPU-stream call sites use launch_host_func with is_enabled() guards to avoid registering unnecessary CUDA callbacks when telemetry is disabled
  • This PR implements the framework only. Adding more call sites (e.g., blend operations, L2 store/retrieve) is a follow-up
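To make the lock-free-deque point above concrete, here is a sketch of the controller design it describes (single consumer thread, atomic `append()`/`popleft()` under the GIL, tail-drop when full, per-processor exception isolation). The `TelemetryController` name is from the PR; the internals and method names below are illustrative assumptions:

```python
import collections
import threading
import time


class TelemetryController:
    """Sketch: non-blocking event ingestion with a daemon drain thread."""

    def __init__(self, max_queue_size: int = 1024):
        # No lock on the queue: CPython guarantees atomic append()/popleft(),
        # and there is exactly one consumer (the drain thread).
        self._queue: collections.deque = collections.deque()
        self._max_queue_size = max_queue_size
        self._processors = []
        self._running = False
        self._thread = None

    def log(self, event) -> bool:
        # Tail-drop backpressure: drop the newest event when full,
        # so the hot path never blocks.
        if len(self._queue) >= self._max_queue_size:
            return False
        self._queue.append(event)
        return True

    def _drain(self) -> None:
        # Keep draining until stopped AND the queue is empty.
        while self._running or self._queue:
            try:
                event = self._queue.popleft()
            except IndexError:
                time.sleep(0.001)
                continue
            for proc in self._processors:
                try:
                    proc(event)
                except Exception:
                    # Exception isolation: one bad processor cannot
                    # take down the drain thread.
                    pass

    def start(self) -> None:
        if self._running:  # idempotent start
            return
        self._running = True
        self._thread = threading.Thread(target=self._drain, daemon=True)
        self._thread.start()

    def stop(self) -> None:
        self._running = False
        if self._thread is not None:
            self._thread.join()
```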

If applicable:

  • this PR contains user-facing changes (docs added)
  • this PR contains unit tests

ApostaC added 2 commits March 5, 2026 06:22
Signed-off-by: ApostaC <yihua98@uchicago.edu>
Signed-off-by: ApostaC <yihua98@uchicago.edu>
@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a robust, event-based telemetry system for LMCache's multiprocess mode, significantly enhancing observability by providing detailed, per-operation tracing. The framework is designed for extensibility, allowing for various processors to consume events, and is integrated seamlessly into the server's lifecycle with minimal performance impact when disabled. This foundational work enables deeper insights into cache behavior and performance characteristics.

Highlights

  • New Telemetry Framework: Introduced an event-based telemetry system for LMCache's multiprocess (MP) mode, designed for per-operation event tracing to complement existing Prometheus metrics.
  • Event Model: Implemented a START/END event model using session_id to correlate operations into spans, addressing the asynchronous nature of LMCache's submit/check/complete patterns.
  • Pluggable Processor Architecture: Designed a flexible architecture with abstract base classes (TelemetryProcessor, TelemetryProcessorConfig) and a registry pattern, allowing for easy integration of new telemetry processors (e.g., OpenTelemetry exporters).
  • High-Performance Event Ingestion: Utilized a lock-free collections.deque for event queuing on the hot path, with a dedicated daemon drain thread for asynchronous processing and tail-drop backpressure to prevent overload.
  • CLI Configuration & Integration: Added new CLI arguments (--enable-telemetry, --telemetry-max-queue-size, --telemetry-processor) for user-facing configuration, and integrated the telemetry system into the startup/shutdown lifecycles of server.py and blend_server.py.
  • Initial Instrumentation: Instrumented store, retrieve, and lookup operations in server.py to emit telemetry events, with is_enabled() guards to minimize overhead when telemetry is disabled.
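The pluggable-processor registry described above (and mirrored from the L2 adapter config, per the PR notes) can be sketched as follows. The class and registry names come from the changelog; the decorator, constructor signatures, and JSON-dispatch helper are assumptions for illustration:

```python
from abc import ABC, abstractmethod

# Maps a "type" string (as used in --telemetry-processor JSON) to a config class.
_PROCESSOR_CONFIG_REGISTRY: dict[str, type] = {}


class TelemetryProcessorConfig(ABC):
    @abstractmethod
    def create_processor(self):
        """Build the processor instance this config describes."""


def register_processor_config(type_name: str):
    # Registry pattern: decorating a config class makes its "type" name
    # resolvable from user-supplied JSON.
    def wrap(cls):
        _PROCESSOR_CONFIG_REGISTRY[type_name] = cls
        return cls
    return wrap


@register_processor_config("logging")
class LoggingProcessorConfig(TelemetryProcessorConfig):
    def __init__(self, log_level: str = "INFO"):
        self.log_level = log_level

    def create_processor(self):
        # Placeholder; the real class would return a LoggingProcessor.
        return ("logging-processor", self.log_level)


def config_from_json(d: dict) -> TelemetryProcessorConfig:
    # Dispatch on the "type" key, pass the rest as keyword arguments.
    cls = _PROCESSOR_CONFIG_REGISTRY[d.pop("type")]
    return cls(**d)
```

A new processor (e.g., an OpenTelemetry exporter) would then only need a decorated config class; no dispatch code changes.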


Changelog
  • lmcache/v1/mp_observability/telemetry/DESIGN.md
    • Documented the design principles, architecture, module layout, key components, and implementation guide for new telemetry processors.
  • lmcache/v1/mp_observability/telemetry/__init__.py
    • Defined the public API for the telemetry system, re-exporting core components for external use.
  • lmcache/v1/mp_observability/telemetry/config.py
    • Implemented the TelemetryConfig dataclass for system configuration.
    • Added add_telemetry_args to integrate telemetry options into argparse.
    • Provided parse_args_to_telemetry_config for constructing configuration from parsed arguments.
  • lmcache/v1/mp_observability/telemetry/controller.py
    • Created the TelemetryController class to manage event ingestion, queuing, and dispatch to processors.
    • Implemented a daemon drain thread for asynchronous event processing.
    • Added a global singleton pattern with init_telemetry_controller and get_telemetry_controller.
    • Provided convenience functions log_telemetry, make_start_event, and make_end_event.
  • lmcache/v1/mp_observability/telemetry/event.py
    • Defined the EventType enum (START, END) for event classification.
    • Created the TelemetryEvent dataclass to structure telemetry data, including name, event_type, session_id, and metadata.
  • lmcache/v1/mp_observability/telemetry/processors/__init__.py
    • Initialized the telemetry processors package.
    • Ensured built-in processor types are registered upon import.
  • lmcache/v1/mp_observability/telemetry/processors/base.py
    • Established TelemetryProcessor as an abstract base class for all telemetry processors.
    • Defined TelemetryProcessorConfig as an abstract base class for processor-specific configurations.
    • Implemented a registry (_PROCESSOR_CONFIG_REGISTRY) for mapping processor type names to their config classes.
  • lmcache/v1/mp_observability/telemetry/processors/logging_processor.py
    • Implemented LoggingProcessorConfig for configuring the built-in logging processor.
    • Created LoggingProcessor to log telemetry events at a configurable level.
  • lmcache/v1/multiprocess/blend_server.py
    • Imported telemetry-related modules and functions.
    • Added telemetry_config parameter to run_cache_server.
    • Initialized and started the global telemetry controller during server startup.
    • Stopped the telemetry controller during server shutdown.
  • lmcache/v1/multiprocess/server.py
    • Imported telemetry-related modules and functions.
    • Added telemetry_config parameter to run_cache_server.
    • Initialized and started the global telemetry controller during server startup.
    • Stopped the telemetry controller during server shutdown.
    • Instrumented the store method with make_start_event and make_end_event calls.
    • Instrumented the retrieve method with make_start_event and make_end_event calls.
    • Instrumented the lookup method with make_start_event and make_end_event calls, including error metadata.
  • tests/v1/mp_observability/telemetry/test_controller.py
    • Added unit tests for TelemetryController lifecycle, event processing, and processor registration.
    • Included tests for exception isolation, backpressure, and the global singleton behavior.
  • tests/v1/mp_observability/telemetry/test_event.py
    • Added unit tests for the EventType enum and TelemetryEvent dataclass, verifying default values and metadata handling.
  • tests/v1/mp_observability/telemetry/test_logging_processor.py
    • Added unit tests for LoggingProcessor and LoggingProcessorConfig, verifying event logging and configuration parsing.
Activity
  • ApostaC created this pull request to introduce a new telemetry subsystem for multiprocess observability.
  • The PR includes a detailed DESIGN.md document outlining the architecture and implementation guidelines.
  • Core components for event definition, configuration, and a central controller have been added.
  • A built-in logging processor is provided as a concrete example.
  • The telemetry system has been integrated into server.py and blend_server.py with initial instrumentation for key cache operations.
  • Comprehensive unit tests have been added for the new telemetry components.

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces a well-designed and comprehensive telemetry subsystem. The architecture is robust, featuring a lock-free queue, a daemon drain thread for processing, and a flexible processor plugin system. The implementation is clean and includes thorough unit tests. I have a couple of minor suggestions for improvement regarding configuration handling to enhance robustness and code clarity.

Comment thread lmcache/v1/mp_observability/telemetry/config.py
Signed-off-by: ApostaC <yihua98@uchicago.edu>
@ApostaC
Contributor Author

ApostaC commented Mar 5, 2026

Example usage command (log the telemetry to terminal)

python3 -m lmcache.v1.multiprocess.server --max-workers 4 --port 6555 \
    --l1-size 70 --eviction-policy LRU \
    --l2-adapter '{"type": "mock", "max_size_gb": 200, "mock_bandwidth_gb": 10}' \
    --enable-telemetry --telemetry-processor '{"type": "logging", "log_level":"warning"}'

Expected output:
(screenshot: telemetry log lines printed to the terminal)

@ApostaC ApostaC requested review from KuntaiDu and Oasis-Git March 5, 2026 07:15
Signed-off-by: ApostaC <yihua98@uchicago.edu>
Contributor

@KuntaiDu KuntaiDu left a comment


Some quick comments:

  1. Previously, IIRC, you preferred to keep lmcache/v1 for pure storage rather than request-level telemetry. Is that still true?
  2. Please also re-implement the request_finished telemetry under the integration folder using this framework, and deprecate the existing one there.
  3. When a request fully ends, it would be really nice to log a "summary" of that request, so people don't have to align the timestamps manually. Example:
Summary of request chat-cmpl-xxx:
Received at (time)
Preprocessing:  ??ms (st->ed: 2:04.025 -> 2:04.124)
Retrieve ?? tokens: ?? ms (st->ed: ???)
LLM computation runs: ??ms (st->ed: ???)
Store ?? tokens: ?? ms (st->ed: ???)

This might be convenient for debugging.
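Such a summary falls naturally out of the START/END model: a processor can pair events by `session_id` and print each span's duration. The `summarize` helper below is hypothetical, a sketch of the idea rather than anything in this PR:

```python
def summarize(events):
    """Pair START/END events by session_id and report per-span durations.

    events: iterable of (name, event_type, session_id, timestamp) tuples,
    where event_type is "start" or "end" and timestamp is in seconds.
    """
    starts = {}   # session_id -> (name, start timestamp)
    lines = []
    for name, etype, sid, ts in events:
        if etype == "start":
            starts[sid] = (name, ts)
        else:
            sname, sts = starts.pop(sid)
            lines.append(f"{sname}: {(ts - sts) * 1000:.1f} ms")
    return lines
```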

Signed-off-by: ApostaC <yihua98@uchicago.edu>
@ApostaC
Contributor Author

ApostaC commented Mar 5, 2026

@KuntaiDu

Previously, IIRC, you preferred to keep lmcache/v1 for pure storage rather than request-level telemetry. Is that still true?

This one is actually designed for request-level telemetry. The core design is a very fast, non-blocking API for submitting a telemetry event (i.e., pushing it onto a queue), with a background thread processing all the events.

Please also re-implement the request_finished telemetry under the integration folder using this framework, and deprecate the one under integration folder.

Unfortunately, this could be pretty difficult, since we would need to track the TP information at the LMCache level, which is quite hard. Additionally, though both modules are named "telemetry", this one is aimed more at debuggability and visualization (e.g., OpenTelemetry support).

When request fully ends, it would be really nice if we can have a "summary" of this request logged out, to avoid people manually align the time. Example:

Will have a follow-up PR for it. There should also be some tools to visualize the spans.

ApostaC added 2 commits March 5, 2026 13:16
Signed-off-by: Yihua Cheng <yihua98@uchicago.edu>
Signed-off-by: ApostaC <yihua98@uchicago.edu>
        with self._lock:
            self._processors.append(processor)

    def start(self) -> None:
Contributor


should this be idempotent? or doesn't matter

Contributor Author


Good point! Let me fix it
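One common way to make `start()` idempotent (a sketch of the general pattern, not this PR's eventual fix) is to guard thread creation behind a lock, so a second call is a no-op instead of spawning a duplicate drain thread:

```python
import threading


class Controller:
    def __init__(self):
        self._lock = threading.Lock()
        self._thread = None

    def start(self) -> None:
        # Idempotent: only the first call creates the worker thread.
        with self._lock:
            if self._thread is not None:
                return
            self._thread = threading.Thread(target=lambda: None, daemon=True)
            self._thread.start()
```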

@ApostaC ApostaC added the full Run comprehensive tests on this PR label Mar 6, 2026
@ApostaC ApostaC enabled auto-merge (squash) March 6, 2026 06:41
Contributor

@KuntaiDu KuntaiDu left a comment


LGTM!

Contributor

@sammshen sammshen left a comment


LGTM!

@ApostaC ApostaC merged commit 509a2f1 into LMCache:dev Mar 6, 2026
26 of 29 checks passed
mauryaavinash95 pushed a commit to mauryaavinash95/LMCache that referenced this pull request Mar 7, 2026
…Cache#2696)

* [Add] the telemetry subsystem for MP mode

Signed-off-by: ApostaC <yihua98@uchicago.edu>

* integrate telemetry system for lookup store and retrieve

Signed-off-by: ApostaC <yihua98@uchicago.edu>
shaoxiawjc pushed a commit to shaoxiawjc/LMCache that referenced this pull request Mar 11, 2026
…Cache#2696)

* [Add] the telemetry subsystem for MP mode

Signed-off-by: ApostaC <yihua98@uchicago.edu>

* integrate telemetry system for lookup store and retrieve

Signed-off-by: ApostaC <yihua98@uchicago.edu>
Signed-off-by: shaoxiawjc <wjc2800@163.com>
@ApostaC ApostaC mentioned this pull request Mar 11, 2026
2 tasks
realAaronWu pushed a commit to realAaronWu/LMCache that referenced this pull request Mar 20, 2026
…Cache#2696)

* [Add] the telemetry subsystem for MP mode

Signed-off-by: ApostaC <yihua98@uchicago.edu>

* integrate telemetry system for lookup store and retrieve

Signed-off-by: ApostaC <yihua98@uchicago.edu>
Signed-off-by: Aaron Wu <aaron.wu@dell.com>
jooho-XCENA pushed a commit to xcena-dev/LMCache that referenced this pull request Apr 2, 2026
…Cache#2696)

* [Add] the telemetry subsystem for MP mode

Signed-off-by: ApostaC <yihua98@uchicago.edu>

* integrate telemetry system for lookup store and retrieve

Signed-off-by: ApostaC <yihua98@uchicago.edu>

Labels

full Run comprehensive tests on this PR
