[2/2] L2 CI: Telemetry Test by Oasis-Git · Pull Request #2913 · LMCache/LMCache

Oasis-Git · 2026-03-30T22:02:33Z

If applicable:

this PR contains user facing changes - docs added
this PR contains unit tests

Note

Medium Risk
Touches L2 store/prefetch controllers to emit new observability events; while intended to be side-effect free, it adds hot-path EventBus.publish() calls that could affect performance or fail if metadata contracts drift.
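One way to limit the risk described above is to keep subscriber failures from propagating into the store/prefetch path. The following is a minimal sketch under stated assumptions: `GuardedEventBus` and `strict_subscriber` are hypothetical stand-ins written for illustration, not LMCache's actual `EventBus` or subscribers, and events are modelled as plain dictionaries.

```python
import logging
from typing import Any, Callable, Dict, List

logger = logging.getLogger("l2_telemetry_sketch")

Event = Dict[str, Any]
Subscriber = Callable[[Event], None]


class GuardedEventBus:
    """Hypothetical stand-in for an event bus; not LMCache's EventBus."""

    def __init__(self) -> None:
        self._subscribers: List[Subscriber] = []
        self.dropped = 0  # number of subscriber calls that raised

    def subscribe(self, subscriber: Subscriber) -> None:
        self._subscribers.append(subscriber)

    def publish(self, event: Event) -> None:
        # Telemetry should not break the hot path, so a failing
        # subscriber is logged and counted instead of re-raised.
        for subscriber in self._subscribers:
            try:
                subscriber(event)
            except Exception:
                self.dropped += 1
                logger.debug(
                    "subscriber failed for %s",
                    event.get("event_type"),
                    exc_info=True,
                )


def strict_subscriber(event: Event) -> None:
    # Raises KeyError if the producer renames or drops "key_count".
    _ = event["metadata"]["key_count"]


bus = GuardedEventBus()
bus.subscribe(strict_subscriber)
# Metadata matches what the subscriber expects.
bus.publish({"event_type": "L2_PREFETCH_LOAD_SUBMITTED", "metadata": {"key_count": 4}})
# Drifted contract: the field was renamed, so the subscriber raises.
bus.publish({"event_type": "L2_PREFETCH_LOAD_SUBMITTED", "metadata": {"keys": 4}})
```

The trade-off is that contract drift then shows up only as a counter and a debug log line, so a test that asserts on `dropped` (or on the exported metrics) is still needed to catch it.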

Overview
Adds first-class L2 telemetry for multiprocess mode by introducing new EventTypes for the L2 store and prefetch (lookup/load) lifecycles and emitting those events from the StoreController and PrefetchController.

Registers new L2MetricsSubscriber (OTel counters exported to Prometheus) and L2LoggingSubscriber (debug logs) in init_observability, and adds unit tests validating the L2 metrics counters via an in-memory OTel reader.
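As a rough illustration of the event-to-counter mapping, the sketch below uses the event names, metadata fields, and counter names quoted in the review comments on this PR. Everything else is an assumption: `CountingSubscriber` is a hypothetical class, a `collections.Counter` stands in for the OTel counters, and the real `L2MetricsSubscriber` may be structured quite differently.

```python
from collections import Counter
from enum import Enum, auto
from typing import Any, Dict


class EventType(Enum):
    # Names taken from the review comments; the real enum may differ.
    L2_PREFETCH_LOOKUP_COMPLETED = auto()
    L2_PREFETCH_LOAD_SUBMITTED = auto()
    L2_PREFETCH_LOAD_COMPLETED = auto()


# (event type, metadata field) -> counter name, as described in the review.
FIELD_TO_COUNTER = {
    (EventType.L2_PREFETCH_LOOKUP_COMPLETED, "prefix_hit_count"):
        "lmcache_mp.l2_prefetch_hit_keys",
    (EventType.L2_PREFETCH_LOAD_SUBMITTED, "key_count"):
        "lmcache_mp.l2_prefetch_load_keys",
    (EventType.L2_PREFETCH_LOAD_COMPLETED, "failed_count"):
        "lmcache_mp.l2_prefetch_failed_keys",
}


class CountingSubscriber:
    """Hypothetical subscriber; a Counter replaces the OTel counters."""

    def __init__(self) -> None:
        self.counters: Counter = Counter()

    def on_event(self, event_type: EventType, metadata: Dict[str, Any]) -> None:
        # Add each mapped metadata field to its counter; ignore the rest.
        for (etype, field), name in FIELD_TO_COUNTER.items():
            if etype is event_type and field in metadata:
                self.counters[name] += metadata[field]


subscriber = CountingSubscriber()
subscriber.on_event(
    EventType.L2_PREFETCH_LOOKUP_COMPLETED,
    {"request_id": "r1", "prefix_hit_count": 3},
)
subscriber.on_event(
    EventType.L2_PREFETCH_LOAD_SUBMITTED,
    {"request_id": "r1", "key_count": 3},
)
subscriber.on_event(
    EventType.L2_PREFETCH_LOAD_COMPLETED,
    {"request_id": "r1", "loaded_count": 2, "failed_count": 1},
)
```

Keeping the mapping in one table makes it easier for a unit test to check that every emitted field lands in the counter the dashboards expect.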

Extends the Buildkite L2 long-doc QA script to fetch Prometheus /metrics and fail the run if expected L1/L2 data-flow counters (store + prefetch) are missing/zero.
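The data-flow check described above can be sketched as a small parser over the Prometheus text exposition format. This is an illustration, not the script added by the PR: the helper names are hypothetical, and the metric names in the sample are assumptions about how the counters might appear after export (the actual exported names are not shown in this conversation).

```python
import re
from typing import Dict, Iterable, List

# metric_name, optional {labels}, then the sample value.
_SAMPLE = re.compile(r"^([A-Za-z_:][A-Za-z0-9_:]*)(\{.*\})?\s+(\S+)")


def sum_samples(metrics_text: str) -> Dict[str, float]:
    """Sum sample values per metric name, ignoring comments and labels."""
    totals: Dict[str, float] = {}
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = _SAMPLE.match(line)
        if match is None:
            continue
        try:
            value = float(match.group(3))
        except ValueError:
            continue
        name = match.group(1)
        totals[name] = totals.get(name, 0.0) + value
    return totals


def missing_or_zero(metrics_text: str, expected: Iterable[str]) -> List[str]:
    """Return the expected counters that are absent or not positive."""
    totals = sum_samples(metrics_text)
    return [name for name in expected if not totals.get(name, 0.0) > 0.0]


# Hypothetical sample; real exported names may differ.
SAMPLE = """\
# TYPE lmcache_mp_l2_prefetch_load_keys_total counter
lmcache_mp_l2_prefetch_load_keys_total{job="lmcache"} 12
lmcache_mp_l2_prefetch_hit_keys_total{job="lmcache"} 0
"""

problems = missing_or_zero(
    SAMPLE,
    [
        "lmcache_mp_l2_prefetch_load_keys_total",
        "lmcache_mp_l2_prefetch_hit_keys_total",
        "example_absent_counter_total",
    ],
)
```

A CI step built on this would exit non-zero whenever `problems` is non-empty, which covers both the "missing" and the "zero" case with one check.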

Written by Cursor Bugbot for commit 8c8321c. This will update automatically on new commits.

Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com>

gemini-code-assist

Code Review

This pull request implements L2 storage observability and performance testing by adding a new Buildkite test script for L2 long-doc QA and integrating event-based logging and OpenTelemetry metrics for L2 store and prefetch operations. The changes also update the multiprocess test suite and storage controllers to use a new event bus. Feedback suggests using python3 for script consistency and correcting the key count logic in prefetch load events to ensure accurate reporting.

gemini-code-assist · 2026-03-30T22:06:15Z

+GPU_DEVICE="${GPU_FOR_VLLM:-0}"
+
+CUDA_VISIBLE_DEVICES="${GPU_DEVICE}" \
+python -m lmcache.v1.multiprocess.server \


For consistency with other parts of the script and to ensure the correct Python environment is used, it is recommended to use python3 instead of python.

Suggested change

-python -m lmcache.v1.multiprocess.server \
+python3 -m lmcache.v1.multiprocess.server \

gemini-code-assist · 2026-03-30T22:06:16Z

+                event_type=EventType.L2_PREFETCH_LOAD_SUBMITTED,
+                metadata={
+                    "request_id": request.request_id,
+                    "key_count": len(reserved_key_set),


The key_count reported in the L2_PREFETCH_LOAD_SUBMITTED event should reflect the number of keys actually being submitted for load, which is prefix_length. Using len(reserved_key_set) may over-report if some keys were successfully reserved in L1 but were excluded from the final load plan because they were beyond a gap in the contiguous prefix.

Suggested change

-                    "key_count": len(reserved_key_set),
+                    "key_count": prefix_length,
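To see why the two values can differ, consider a gap in the reserved keys. The sketch below is a toy model under assumptions: `contiguous_prefix_length` is a hypothetical helper, and the key names and ordering are invented for illustration; the PrefetchController's actual trimming logic is not shown in this review.

```python
from typing import Sequence, Set


def contiguous_prefix_length(keys: Sequence[str], reserved: Set[str]) -> int:
    """Count leading keys that were reserved, stopping at the first gap."""
    length = 0
    for key in keys:
        if key not in reserved:
            break
        length += 1
    return length


# Keys in request order; "k2" failed to reserve, leaving a gap.
keys = ["k0", "k1", "k2", "k3"]
reserved_key_set = {"k0", "k1", "k3"}

prefix_length = contiguous_prefix_length(keys, reserved_key_set)
over_reported = len(reserved_key_set) - prefix_length
```

Here "k3" is reserved but sits beyond the gap, so it is counted by `len(reserved_key_set)` even though it would not be part of a contiguous-prefix load.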

gemini-code-assist · 2026-03-30T22:06:16Z

+
+# Launch LMCache with L1 config
+CUDA_VISIBLE_DEVICES="${GPU_DEVICE}" \
+python -m lmcache.v1.multiprocess.server \


For consistency and to ensure the correct Python version is used, please use python3 instead of python.

Suggested change

-python -m lmcache.v1.multiprocess.server \
+python3 -m lmcache.v1.multiprocess.server \

sammshen

LGTM!

sammshen

LGTM!

royyhuang

LGTM

cursor

Cursor Bugbot has reviewed your changes and found 4 potential issues.

cursor · 2026-03-31T22:19:40Z

+                metadata={
+                    "request_id": request.request_id,
+                    "key_count": len(reserved_key_set),
+                    "adapter_count": len(trimmed_plan),


Prefetch load submission overcounts key totals

Medium Severity

L2_PREFETCH_LOAD_SUBMITTED reports key_count as len(reserved_key_set), but that set includes write-reserved keys later excluded from trimmed_plan. This makes lmcache_mp.l2_prefetch_load_keys count keys that were never submitted to submit_load_task, so L2 load telemetry is inflated.
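One possible fix is to derive the count from the plan that is actually submitted. The sketch assumes, without confirmation from the source, that `trimmed_plan` maps each adapter to the list of keys it will load (the diff only shows that `len(trimmed_plan)` is reported as `adapter_count`); `submitted_key_count` is a hypothetical helper.

```python
from typing import Dict, List


def submitted_key_count(trimmed_plan: Dict[str, List[str]]) -> int:
    """Count the keys present in the plan that is handed to the loader."""
    return sum(len(keys) for keys in trimmed_plan.values())


# Assumed shape: adapter name -> keys to load through that adapter.
trimmed_plan = {"adapter_a": ["k0", "k1"], "adapter_b": ["k2"]}
reserved_key_set = {"k0", "k1", "k2", "k5"}  # "k5" was trimmed away

metadata = {
    "key_count": submitted_key_count(trimmed_plan),
    "adapter_count": len(trimmed_plan),
}
```

Counting from the plan keeps the metric tied to what reaches `submit_load_task`, regardless of how the trimming rules evolve.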

cursor · 2026-03-31T22:19:40Z

+                    "request_id": request.request_id,
+                    "loaded_count": len(loaded_keys),
+                    "failed_count": len(failed_keys),
+                },


Prefetch load failure metric misclassifies dropped keys

Medium Severity

L2_PREFETCH_LOAD_COMPLETED publishes failed_count from failed_keys, which includes write-reserved keys that were never submitted for L2 load after prefix re-trimming. This records non-attempted keys as load failures and distorts lmcache_mp.l2_prefetch_failed_keys.
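A way to separate the two cases is to classify keys against the submitted set rather than the reserved set. This is a sketch with hypothetical names (`LoadOutcome`, `classify_load`, `skipped_count`); the PR itself publishes only `loaded_count` and `failed_count`.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class LoadOutcome:
    loaded_count: int   # submitted and loaded
    failed_count: int   # submitted but not loaded
    skipped_count: int  # reserved but never submitted


def classify_load(
    reserved: Iterable[str],
    submitted: Iterable[str],
    loaded: Iterable[str],
) -> LoadOutcome:
    reserved_set, submitted_set = set(reserved), set(submitted)
    # Only keys that were actually submitted can count as loaded.
    loaded_ok = set(loaded) & submitted_set
    return LoadOutcome(
        loaded_count=len(loaded_ok),
        failed_count=len(submitted_set - loaded_ok),
        skipped_count=len(reserved_set - submitted_set),
    )


outcome = classify_load(
    reserved={"k0", "k1", "k2", "k3"},
    submitted={"k0", "k1", "k2"},
    loaded={"k0", "k1"},
)
```

With this split, "k3" is reported as skipped rather than failed, so a failure counter would reflect only attempted loads.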

cursor · 2026-03-31T22:19:40Z

+curl -sf "http://localhost:${PROMETHEUS_PORT}/metrics" > "$L2_METRICS_FILE" 2>/dev/null || true
+
+if [ ! -s "$L2_METRICS_FILE" ]; then
+    echo "WARNING: Could not fetch metrics from Prometheus (port $PROMETHEUS_PORT). Skipping data flow check."


Metrics fetch failure silently bypasses telemetry validation

Medium Severity

The new telemetry verification step continues when /metrics cannot be fetched, because curl errors are suppressed and the branch only prints a warning. This allows the L2 telemetry test to pass even when observability is broken or unreachable.
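A fail-closed variant of the fetch step could look like the following. It is written in Python rather than shell purely for illustration; `fetch_metrics`, `MetricsUnavailable`, the `opener` parameter, and the port in the example URL are all hypothetical.

```python
import urllib.error
import urllib.request
from typing import Any, Callable


class MetricsUnavailable(RuntimeError):
    """Raised when the /metrics endpoint cannot be read."""


def fetch_metrics(
    url: str,
    opener: Callable[..., Any] = urllib.request.urlopen,
    timeout: float = 5.0,
) -> str:
    """Return the metrics body, or raise instead of silently continuing."""
    try:
        with opener(url, timeout=timeout) as response:
            body = response.read().decode("utf-8")
    except OSError as exc:  # urllib.error.URLError is a subclass of OSError
        raise MetricsUnavailable(f"could not fetch {url}: {exc}") from exc
    if not body.strip():
        raise MetricsUnavailable(f"{url} returned an empty body")
    return body


def _refusing_opener(url: str, timeout: float) -> Any:
    # Simulates an unreachable endpoint without touching the network.
    raise urllib.error.URLError("connection refused")


try:
    fetch_metrics("http://localhost:9090/metrics", opener=_refusing_opener)
    fetch_failed = False
except MetricsUnavailable:
    fetch_failed = True
```

A CI wrapper would let `MetricsUnavailable` propagate (or exit non-zero), so an unreachable endpoint fails the telemetry test instead of skipping it.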

cursor · 2026-03-31T22:19:40Z

+                metadata={
+                    "request_id": request.request_id,
+                    "prefix_hit_count": prefix_length,
+                },


Lookup metric uses post-reservation prefix

Medium Severity

L2_PREFETCH_LOOKUP_COMPLETED publishes prefix_hit_count from prefix_length computed after reserve_write filtering, not from pure L2 lookup results. This makes lmcache_mp.l2_prefetch_hit_keys depend on L1 reservation success and can report zero hits even when L2 lookup found a valid prefix.
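The distinction can be illustrated with two per-key flag lists, one for the L2 lookup and one for the L1 reservation. This is a toy model with hypothetical names and data; it does not reproduce the controller's real data structures.

```python
from typing import Sequence


def prefix_len(flags: Sequence[bool]) -> int:
    """Length of the leading run of True values."""
    count = 0
    for flag in flags:
        if not flag:
            break
        count += 1
    return count


# Per-key results in request order (hypothetical data).
l2_found = [True, True, True, False]      # pure L2 lookup result
l1_reserved = [False, True, True, False]  # reserve_write outcome

# What the lookup itself found, independent of L1 reservation.
lookup_prefix_hits = prefix_len(l2_found)

# What remains loadable once reservation failures are applied.
loadable_prefix = prefix_len(
    [found and reserved for found, reserved in zip(l2_found, l1_reserved)]
)
```

In this example the L2 lookup found a three-key prefix, yet the post-reservation value is zero because the first key could not be reserved. Reporting the two numbers separately keeps the hit metric independent of L1 pressure.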

* performance ci

Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com>

* fix ci

Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com>

* l2 cache ci

Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com>

* fix

Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com>

* lint

Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com>

---------

Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com>
Co-authored-by: Samuel Shen <slshen@tensormesh.ai>
Co-authored-by: Roy Huang <roy.y.huang@gmail.com>

Oasis-Git and others added 7 commits March 26, 2026 21:51

performance ci

0e2cb2d

Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com>

Merge branch 'dev' into l2ci-1

8851585

Merge branch 'dev' into l2ci-1

6bd8af7

fix ci

3024ff3

Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com>

Merge branch 'l2ci-1' of https://github.com/Oasis-Git/LMCache into l2…

b7cc7ed

…ci-1

Merge branch 'dev' into l2ci-1

e2517af

l2 cache ci

7199a13

Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com>

gemini-code-assist Bot reviewed Mar 30, 2026

Oasis-Git added 3 commits March 30, 2026 22:16

merge

79171a5

Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com>

fix

8062adb

Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com>

lint

c646b8a

Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com>

Oasis-Git added the "full" label (Run comprehensive tests on this PR) Mar 30, 2026

sammshen approved these changes Mar 30, 2026

Oasis-Git requested a review from royyhuang March 30, 2026 23:16

royyhuang approved these changes Mar 31, 2026

Merge branch 'dev' into l2ci-2

8c8321c

sammshen enabled auto-merge (squash) March 31, 2026 22:17

cursor Bot reviewed Mar 31, 2026

sammshen merged commit 1106823 into LMCache:dev Apr 1, 2026
35 checks passed

sammshen merged 11 commits into LMCache:dev from Oasis-Git:l2ci-2
