
[2/2] L2 CI: Telemetry Test #2913

Merged
sammshen merged 11 commits into LMCache:dev from Oasis-Git:l2ci-2 on Apr 1, 2026

Conversation

@Oasis-Git (Member) commented Mar 30, 2026

What this PR does / why we need it:

Special notes for your reviewers:

If applicable:

  • This PR contains user-facing changes (docs added)
  • This PR contains unit tests

Note

Medium Risk
Touches L2 store/prefetch controllers to emit new observability events; while intended to be side-effect free, it adds hot-path EventBus.publish() calls that could affect performance or fail if metadata contracts drift.

Overview
Adds first-class L2 telemetry for multiprocess mode by introducing new EventTypes for L2 store and prefetch (lookup/load) lifecycles, emitting those events from the StoreController and PrefetchController.

Registers new L2MetricsSubscriber (OTel counters exported to Prometheus) and L2LoggingSubscriber (debug logs) in init_observability, and adds unit tests validating the L2 metrics counters via an in-memory OTel reader.

Extends the Buildkite L2 long-doc QA script to fetch Prometheus /metrics and fail the run if expected L1/L2 data-flow counters (store + prefetch) are missing/zero.
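The store/prefetch lifecycle above boils down to a publish/subscribe pipeline: controllers publish typed events, and subscribers turn them into counters. A minimal sketch follows; all names below (the EventType members, EventBus, L2MetricsSubscriber, and the counter key) are illustrative stand-ins for the PR's actual classes, and a plain dict replaces the OTel counters:

```python
from collections import defaultdict
from enum import Enum, auto

class EventType(Enum):
    # Hypothetical event names mirroring the L2 lifecycles the PR describes
    L2_STORE_SUBMITTED = auto()
    L2_PREFETCH_LOOKUP_COMPLETED = auto()
    L2_PREFETCH_LOAD_SUBMITTED = auto()
    L2_PREFETCH_LOAD_COMPLETED = auto()

class EventBus:
    """Fans controller events out to registered subscribers."""
    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, event_type, handler):
        self._subscribers[event_type].append(handler)

    def publish(self, event_type, metadata):
        # Hot-path call: handlers should stay cheap and side-effect free
        for handler in self._subscribers[event_type]:
            handler(metadata)

class L2MetricsSubscriber:
    """Accumulates per-event key counts; stands in for the OTel counters."""
    def __init__(self, bus):
        self.counters = defaultdict(int)
        bus.subscribe(EventType.L2_PREFETCH_LOAD_SUBMITTED, self._on_load)

    def _on_load(self, metadata):
        self.counters["l2_prefetch_load_keys"] += metadata["key_count"]

bus = EventBus()
metrics = L2MetricsSubscriber(bus)
bus.publish(
    EventType.L2_PREFETCH_LOAD_SUBMITTED,
    {"request_id": "req-1", "key_count": 3},
)
print(metrics.counters["l2_prefetch_load_keys"])  # prints 3
```

The PR's unit tests make the same kind of assertion, but against counters read back through an in-memory OTel reader rather than a dict.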

Written by Cursor Bugbot for commit 8c8321c.

Oasis-Git and others added 7 commits March 26, 2026 21:51
Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com>
@gemini-code-assist (Bot) left a comment

Code Review

This pull request implements L2 storage observability and performance testing by adding a new Buildkite test script for L2 long-doc QA and integrating event-based logging and OpenTelemetry metrics for L2 store and prefetch operations. The changes also update the multiprocess test suite and storage controllers to use a new event bus. Feedback suggests using python3 for script consistency and correcting the key count logic in prefetch load events to ensure accurate reporting.

GPU_DEVICE="${GPU_FOR_VLLM:-0}"

CUDA_VISIBLE_DEVICES="${GPU_DEVICE}" \
python -m lmcache.v1.multiprocess.server \

medium

For consistency with other parts of the script and to ensure the correct Python environment is used, it is recommended to use python3 instead of python.

Suggested change:
- python -m lmcache.v1.multiprocess.server \
+ python3 -m lmcache.v1.multiprocess.server \

event_type=EventType.L2_PREFETCH_LOAD_SUBMITTED,
metadata={
"request_id": request.request_id,
"key_count": len(reserved_key_set),

medium

The key_count reported in the L2_PREFETCH_LOAD_SUBMITTED event should reflect the number of keys actually being submitted for load, which is prefix_length. Using len(reserved_key_set) may over-report if some keys were successfully reserved in L1 but were excluded from the final load plan because they were beyond a gap in the contiguous prefix.

Suggested change:
- "key_count": len(reserved_key_set),
+ "key_count": prefix_length,


# Launch LMCache with L1 config
CUDA_VISIBLE_DEVICES="${GPU_DEVICE}" \
python -m lmcache.v1.multiprocess.server \

medium

For consistency and to ensure the correct Python version is used, please use python3 instead of python.

Suggested change:
- python -m lmcache.v1.multiprocess.server \
+ python3 -m lmcache.v1.multiprocess.server \

Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com>
@Oasis-Git added the "full" (Run comprehensive tests on this PR) label on Mar 30, 2026
@sammshen (Contributor) left a comment
LGTM!

@Oasis-Git Oasis-Git requested a review from royyhuang March 30, 2026 23:16
@royyhuang (Contributor) left a comment

LGTM

@sammshen sammshen enabled auto-merge (squash) March 31, 2026 22:17
@cursor (Bot) left a comment

Cursor Bugbot has reviewed your changes and found 4 potential issues.


metadata={
"request_id": request.request_id,
"key_count": len(reserved_key_set),
"adapter_count": len(trimmed_plan),

Prefetch load submission overcounts key totals

Medium Severity

L2_PREFETCH_LOAD_SUBMITTED reports key_count as len(reserved_key_set), but that set includes write-reserved keys later excluded from trimmed_plan. This makes lmcache_mp.l2_prefetch_load_keys count keys that were never submitted to submit_load_task, so L2 load telemetry is inflated.


"request_id": request.request_id,
"loaded_count": len(loaded_keys),
"failed_count": len(failed_keys),
},

Prefetch load failure metric misclassifies dropped keys

Medium Severity

L2_PREFETCH_LOAD_COMPLETED publishes failed_count from failed_keys, which includes write-reserved keys that were never submitted for L2 load after prefix re-trimming. This records non-attempted keys as load failures and distorts lmcache_mp.l2_prefetch_failed_keys.


curl -sf "http://localhost:${PROMETHEUS_PORT}/metrics" > "$L2_METRICS_FILE" 2>/dev/null || true

if [ ! -s "$L2_METRICS_FILE" ]; then
echo "WARNING: Could not fetch metrics from Prometheus (port $PROMETHEUS_PORT). Skipping data flow check."

Metrics fetch failure silently bypasses telemetry validation

Medium Severity

The new telemetry verification step continues when /metrics cannot be fetched, because curl errors are suppressed and the branch only prints a warning. This allows the L2 telemetry test to pass even when observability is broken or unreachable.
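One way to harden this is to treat a missing scrape, or a missing/zero counter, as a hard failure rather than a warning. A sketch follows; the helper function, the temp-file demo, and the counter name are assumptions for illustration, not the script's actual code:

```shell
set -u

check_metrics_file() {
    # $1: path to a scraped /metrics dump; $2: counter that must be > 0
    local file="$1" counter="$2"
    if [ ! -s "$file" ]; then
        echo "ERROR: metrics file is empty or missing" >&2
        return 1
    fi
    # Require the counter to appear with a value greater than zero
    if ! awk -v c="$counter" '$1 ~ c && $2+0 > 0 { found=1 } END { exit !found }' "$file"; then
        echo "ERROR: counter $counter missing or zero" >&2
        return 1
    fi
    return 0
}

# Demo against a synthetic metrics dump (counter name is hypothetical)
tmp="$(mktemp)"
printf 'lmcache_mp_l2_prefetch_load_keys_total 12\n' > "$tmp"
check_metrics_file "$tmp" "lmcache_mp_l2_prefetch_load_keys_total" && echo "PASS"
rm -f "$tmp"
```

Wired into the Buildkite step, a non-zero return from such a check would fail the build instead of printing a warning and moving on.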


metadata={
"request_id": request.request_id,
"prefix_hit_count": prefix_length,
},

Lookup metric uses post-reservation prefix

Medium Severity

L2_PREFETCH_LOOKUP_COMPLETED publishes prefix_hit_count from prefix_length computed after reserve_write filtering, not from pure L2 lookup results. This makes lmcache_mp.l2_prefetch_hit_keys depend on L1 reservation success and can report zero hits even when L2 lookup found a valid prefix.


@sammshen sammshen merged commit 1106823 into LMCache:dev Apr 1, 2026
35 checks passed
jooho-XCENA pushed a commit to xcena-dev/LMCache that referenced this pull request Apr 2, 2026
* performance ci
* fix ci
* l2 cache ci
* fix
* lint

---------

Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com>
Co-authored-by: Samuel Shen <slshen@tensormesh.ai>
Co-authored-by: Roy Huang <roy.y.huang@gmail.com>

Labels

full Run comprehensive tests on this PR


3 participants