Skip to content

[MP] Introduce a simple way to register_gauge metrics.#2906

Merged
ApostaC merged 3 commits intoLMCache:devfrom
maobaolong:l2_simple_add_metrics
Apr 12, 2026
Merged

[MP] Introduce a simple way to register_gauge metrics.#2906
ApostaC merged 3 commits intoLMCache:devfrom
maobaolong:l2_simple_add_metrics

Conversation

@maobaolong
Copy link
Copy Markdown
Collaborator

@maobaolong maobaolong commented Mar 30, 2026

What this PR does / why we need it:

Introduce a simple way to register_gauge metrics

Clipboard_Screenshot_1774872681

Special notes for your reviewers:

If applicable:

  • this PR contains user facing changes - docs added
  • this PR contains unit tests

Note

Low Risk
Low risk observability-only change that adds a new OTel observable gauge and a small status-field addition; main concern is correctness of the gauge callback wiring under different OTel versions.

Overview
Adds a register_gauge helper in otel_init.py to simplify creating OTel observable gauges (no-op with debug log when opentelemetry isn’t installed).

Hooks this into MPCacheEngine to export a new lmcache_mp.active_prefetch_jobs gauge (thread-safe count of _prefetch_jobs) and includes the same value in report_status. Updates METRICS.md to document the new gauge and adds unit tests covering gauge creation, callback behavior, and missing-OTel behavior.

Reviewed by Cursor Bugbot for commit 3b66f98. Bugbot is set up for automated code reviews on this repo. Configure here.

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces OTel observable gauges for monitoring multiprocess engine metrics, specifically tracking active prefetch jobs. It adds a register_gauge utility in otel_init.py and integrates it into the MPCacheEngine class. Feedback was provided regarding the use of a broad exception handler in the OTel initialization and the placement of the new private method _setup_metrics, which violates the project's style guide regarding class organization.

Comment thread lmcache/v1/mp_observability/otel_init.py Outdated
Comment thread lmcache/v1/multiprocess/server.py Outdated
@maobaolong
Copy link
Copy Markdown
Collaborator Author

@ApostaC @sammshen @chunxiaozheng @deng451e Would you like to take a look at this PR? Thanks.

@sammshen sammshen requested a review from royyhuang April 5, 2026 02:59
@royyhuang
Copy link
Copy Markdown
Contributor

royyhuang commented Apr 6, 2026

Hi @maobaolong, thanks for your PR. IIUC, this PR bypasses our event bus for observability to create and expose metrics directly from server attribute. Ideally, we would like all metrics/logs/spans to be created following one standard code path for readability and maintainability unless you have really strong reasons against it, e.g., something the event bus systems simply cannot do. In that case, I would also suggest we rework on the event bus to accomodate your specific needs.

For the specific metric mentioned in this PR, I believe @Oasis-Git is working on L2 related events. Once those events are merged into upstream, L2 related metrics/logs/spans should be really easy to integrate.

@maobaolong
Copy link
Copy Markdown
Collaborator Author

@royyhuang Thanks for the review! You're right that register_gauge does bypass the event bus — I want to explain why I think that's reasonable here.
The event bus is push-based, which works great for counters and histograms. But Observable Gauge is pull-based by nature — the OTel SDK invokes a callback at scrape time to read the current value. These two models don't quite fit together.
Metrics like active_prefetch_jobs are really just state snapshots (len(some_dict)) — they don't correspond to a discrete event. Routing them through the event bus would likely mean emitting events on every mutation or adding polling, both of which feel like more complexity than needed for a simple len() call.
I think of register_gauge as a small, scoped escape hatch for the one instrument type that doesn't naturally fit the event bus model. Happy to discuss if you see a cleaner way to handle this though!

@maobaolong maobaolong changed the title Introduce a simple way to register_gauge metrics. [MP] Introduce a simple way to register_gauge metrics. Apr 8, 2026
@maobaolong maobaolong added full Run comprehensive tests on this PR mp Buildkite trigger for multi-processing mode test labels Apr 8, 2026
maobaolong added a commit to maobaolong/LMCache that referenced this pull request Apr 8, 2026
…che#2906

Signed-off-by: baoloongmao <baoloongmao@tencent.com>
@maobaolong maobaolong force-pushed the l2_simple_add_metrics branch from 923e0a6 to e5b5b2b Compare April 8, 2026 12:34
@maobaolong maobaolong requested a review from ApostaC as a code owner April 8, 2026 12:34
Comment thread lmcache/v1/mp_observability/otel_init.py
Comment thread lmcache/v1/mp_observability/otel_init.py
Comment thread lmcache/v1/multiprocess/server.py Outdated
Copy link
Copy Markdown

@cursor cursor Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 2 potential issues.

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit 28e5fc4. Configure here.

Comment thread tests/v1/mp_observability/test_register_gauge.py Outdated
Comment thread tests/v1/mp_observability/test_register_gauge.py Outdated
Signed-off-by: baoloongmao <baoloongmao@tencent.com>
@maobaolong maobaolong force-pushed the l2_simple_add_metrics branch from 28e5fc4 to f9b4246 Compare April 9, 2026 08:39
@royyhuang
Copy link
Copy Markdown
Contributor

I think of register_gauge as a small, scoped escape hatch for the one instrument type that doesn't naturally fit the event bus model.

Hi @maobaolong, after discussed offline with @ApostaC, I think this indeed is a valid case and worth an escape hatch. PR itself is straight forward. Thanks for your contribution!

@ApostaC ApostaC enabled auto-merge (squash) April 11, 2026 00:08
@maobaolong
Copy link
Copy Markdown
Collaborator Author

@royyhuang Thanks for inspire me to study the event-bus design, it use a event driven mode, decouple event producer and event subscriber, it looks pretty good and suitable for majority scenarios. And also thanks @ApostaC @sammshen for your review!

@maobaolong
Copy link
Copy Markdown
Collaborator Author

@ApostaC @deng451e Would you like to help to review this PR this need the module owner's approve, thanks!

Copy link
Copy Markdown
Contributor

@ApostaC ApostaC left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@ApostaC ApostaC merged commit f7ae4e1 into LMCache:dev Apr 12, 2026
36 checks passed
Oasis-Git pushed a commit to Oasis-Git/LMCache that referenced this pull request Apr 13, 2026
Introduce a simple way to register_gauge metrics.

Signed-off-by: baoloongmao <baoloongmao@tencent.com>
Co-authored-by: Roy Huang <roy.y.huang@gmail.com>
Co-authored-by: Samuel Shen <slshen@tensormesh.ai>
ftian1 pushed a commit to ftian1/LMCache that referenced this pull request Apr 20, 2026
Introduce a simple way to register_gauge metrics.

Signed-off-by: baoloongmao <baoloongmao@tencent.com>
Co-authored-by: Roy Huang <roy.y.huang@gmail.com>
Co-authored-by: Samuel Shen <slshen@tensormesh.ai>
sammshen pushed a commit to sammshen/LMCache that referenced this pull request Apr 28, 2026
Per review on PR feat/mp-l1-l2-state-metrics: adapt to the existing
``register_gauge`` helper (PR LMCache#2906) instead of creating raw OTel
``UpDownCounter`` instruments inline.  All four L1/L2 state metrics are
now ObservableGauges registered through the same helper as
``lmcache_mp.active_prefetch_jobs``.

Changes:

- ``register_gauge``: now accepts either a single-value callback (existing
  callers unchanged) or a callback returning ``list[(value, attrs)]`` for
  per-attribute-set observations.  An empty list yields no datapoints,
  which is the correct shape when nothing is in-flight.

- StoreController / PrefetchController: replaced the
  ``UpDownCounter.add(+/-1, attrs)`` calls and the cleanup-decrement
  loops with single ObservableGauges registered via ``register_gauge``.
  The callbacks read live state at scrape time via new public
  snapshot methods (``get_inflight_count_by_adapter`` /
  ``get_inflight_load_state_by_adapter``) so there is no incremental
  bookkeeping to keep balanced — when ``_in_flight_*`` is cleared at
  shutdown the gauges naturally report zero.

- Singleton dispatch: each controller class now guards gauge
  registration with a class-level ``_gauge_registered`` flag and
  dispatches the callback through ``cls._gauge_target`` (the most
  recent instance), matching the pattern already used by L1Manager.
  Required because the OTel SDK only honors the first registration
  for a given instrument name; without this guard, tests creating
  multiple controllers see stale data from a stopped instance.

- ``InFlightPrefetchRequest.load_bytes_by_adapter`` field is retained
  as the per-adapter byte cache the gauge callback reads at scrape
  time.  Cleanup decrement code has been removed (no longer needed
  with ObservableGauge semantics).

- Renamed cleanup test to ``test_no_leak_after_shutdown_with_inflight_loads``
  to reflect the new "gauge reads zero after dict is cleared" semantics.

- Updated user-facing docs (``docs/source/mp/observability.rst``) and
  design docs (``docs/design/v1/mp_observability/METRICS.md``) to list
  the four metrics under Observable Gauges with attribute and
  source-of-truth info.

All 6 new tests + 817 in tests/v1/distributed/ + tests/v1/mp_observability/
pass.

Signed-off-by: Samuel Shen <slshen@uchciago.edu>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

full Run comprehensive tests on this PR mp Buildkite trigger for multi-processing mode test

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants