[MP] Introduce a simple way to register_gauge metrics.#2906
[MP] Introduce a simple way to register_gauge metrics.#2906ApostaC merged 3 commits intoLMCache:devfrom
Conversation
There was a problem hiding this comment.
Code Review
This pull request introduces OTel observable gauges for monitoring multiprocess engine metrics, specifically tracking active prefetch jobs. It adds a register_gauge utility in otel_init.py and integrates it into the MPCacheEngine class. Feedback was provided regarding the use of a broad exception handler in the OTel initialization and the placement of the new private method _setup_metrics, which violates the project's style guide regarding class organization.
|
@ApostaC @sammshen @chunxiaozheng @deng451e Would you like to take a look at this PR? Thanks. |
|
Hi @maobaolong, thanks for your PR. IIUC, this PR bypasses our event bus for observability to create and expose metrics directly from server attribute. Ideally, we would like all metrics/logs/spans to be created following one standard code path for readability and maintainability unless you have really strong reasons against it, e.g., something the event bus systems simply cannot do. In that case, I would also suggest we rework on the event bus to accomodate your specific needs. For the specific metric mentioned in this PR, I believe @Oasis-Git is working on L2 related events. Once those events are merged into upstream, L2 related metrics/logs/spans should be really easy to integrate. |
|
@royyhuang Thanks for the review! You're right that register_gauge does bypass the event bus — I want to explain why I think that's reasonable here. |
…che#2906 Signed-off-by: baoloongmao <baoloongmao@tencent.com>
923e0a6 to
e5b5b2b
Compare
e5b5b2b to
28e5fc4
Compare
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 2 potential issues.
❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.
Reviewed by Cursor Bugbot for commit 28e5fc4. Configure here.
Signed-off-by: baoloongmao <baoloongmao@tencent.com>
28e5fc4 to
f9b4246
Compare
Hi @maobaolong, after discussed offline with @ApostaC, I think this indeed is a valid case and worth an escape hatch. PR itself is straight forward. Thanks for your contribution! |
|
@royyhuang Thanks for inspire me to study the event-bus design, it use a event driven mode, decouple event producer and event subscriber, it looks pretty good and suitable for majority scenarios. And also thanks @ApostaC @sammshen for your review! |
Introduce a simple way to register_gauge metrics. Signed-off-by: baoloongmao <baoloongmao@tencent.com> Co-authored-by: Roy Huang <roy.y.huang@gmail.com> Co-authored-by: Samuel Shen <slshen@tensormesh.ai>
Introduce a simple way to register_gauge metrics. Signed-off-by: baoloongmao <baoloongmao@tencent.com> Co-authored-by: Roy Huang <roy.y.huang@gmail.com> Co-authored-by: Samuel Shen <slshen@tensormesh.ai>
Per review on PR feat/mp-l1-l2-state-metrics: adapt to the existing ``register_gauge`` helper (PR LMCache#2906) instead of creating raw OTel ``UpDownCounter`` instruments inline. All four L1/L2 state metrics are now ObservableGauges registered through the same helper as ``lmcache_mp.active_prefetch_jobs``. Changes: - ``register_gauge``: now accepts either a single-value callback (existing callers unchanged) or a callback returning ``list[(value, attrs)]`` for per-attribute-set observations. An empty list yields no datapoints, which is the correct shape when nothing is in-flight. - StoreController / PrefetchController: replaced the ``UpDownCounter.add(+/-1, attrs)`` calls and the cleanup-decrement loops with single ObservableGauges registered via ``register_gauge``. The callbacks read live state at scrape time via new public snapshot methods (``get_inflight_count_by_adapter`` / ``get_inflight_load_state_by_adapter``) so there is no incremental bookkeeping to keep balanced — when ``_in_flight_*`` is cleared at shutdown the gauges naturally report zero. - Singleton dispatch: each controller class now guards gauge registration with a class-level ``_gauge_registered`` flag and dispatches the callback through ``cls._gauge_target`` (the most recent instance), matching the pattern already used by L1Manager. Required because the OTel SDK only honors the first registration for a given instrument name; without this guard, tests creating multiple controllers see stale data from a stopped instance. - ``InFlightPrefetchRequest.load_bytes_by_adapter`` field is retained as the per-adapter byte cache the gauge callback reads at scrape time. Cleanup decrement code has been removed (no longer needed with ObservableGauge semantics). - Renamed cleanup test to ``test_no_leak_after_shutdown_with_inflight_loads`` to reflect the new "gauge reads zero after dict is cleared" semantics. - Updated user-facing docs (``docs/source/mp/observability.rst``) and design docs (``docs/design/v1/mp_observability/METRICS.md``) to list the four metrics under Observable Gauges with attribute and source-of-truth info. All 6 new tests + 817 in tests/v1/distributed/ + tests/v1/mp_observability/ pass. Signed-off-by: Samuel Shen <slshen@uchciago.edu>

What this PR does / why we need it:
Introduce a simple way to register_gauge metrics
Special notes for your reviewers:
If applicable:
Note
Low Risk
Low risk observability-only change that adds a new OTel observable gauge and a small status-field addition; main concern is correctness of the gauge callback wiring under different OTel versions.
Overview
Adds a
register_gaugehelper inotel_init.pyto simplify creating OTel observable gauges (no-op with debug log whenopentelemetryisn’t installed).Hooks this into
MPCacheEngineto export a newlmcache_mp.active_prefetch_jobsgauge (thread-safe count of_prefetch_jobs) and includes the same value inreport_status. UpdatesMETRICS.mdto document the new gauge and adds unit tests covering gauge creation, callback behavior, and missing-OTel behavior.Reviewed by Cursor Bugbot for commit 3b66f98. Bugbot is set up for automated code reviews on this repo. Configure here.