[2/2] L2 CI: Telemetry Test#2913
Conversation
Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com>
Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com>
There was a problem hiding this comment.
Code Review
This pull request implements L2 storage observability and performance testing by adding a new Buildkite test script for L2 long-doc QA and integrating event-based logging and OpenTelemetry metrics for L2 store and prefetch operations. The changes also update the multiprocess test suite and storage controllers to use a new event bus. Feedback suggests using python3 for script consistency and correcting the key count logic in prefetch load events to ensure accurate reporting.
| GPU_DEVICE="${GPU_FOR_VLLM:-0}" | ||
|
|
||
| CUDA_VISIBLE_DEVICES="${GPU_DEVICE}" \ | ||
| python -m lmcache.v1.multiprocess.server \ |
| event_type=EventType.L2_PREFETCH_LOAD_SUBMITTED, | ||
| metadata={ | ||
| "request_id": request.request_id, | ||
| "key_count": len(reserved_key_set), |
There was a problem hiding this comment.
The key_count reported in the L2_PREFETCH_LOAD_SUBMITTED event should reflect the number of keys actually being submitted for load, which is prefix_length. Using len(reserved_key_set) may over-report if some keys were successfully reserved in L1 but were excluded from the final load plan because they were beyond a gap in the contiguous prefix.
| "key_count": len(reserved_key_set), | |
| "key_count": prefix_length, |
|
|
||
| # Launch LMCache with L1 config | ||
| CUDA_VISIBLE_DEVICES="${GPU_DEVICE}" \ | ||
| python -m lmcache.v1.multiprocess.server \ |
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 4 potential issues.
Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.
| metadata={ | ||
| "request_id": request.request_id, | ||
| "key_count": len(reserved_key_set), | ||
| "adapter_count": len(trimmed_plan), |
There was a problem hiding this comment.
Prefetch load submission overcounts key totals
Medium Severity
L2_PREFETCH_LOAD_SUBMITTED reports key_count as len(reserved_key_set), but that set includes write-reserved keys later excluded from trimmed_plan. This makes lmcache_mp.l2_prefetch_load_keys count keys that were never submitted to submit_load_task, so L2 load telemetry is inflated.
| "request_id": request.request_id, | ||
| "loaded_count": len(loaded_keys), | ||
| "failed_count": len(failed_keys), | ||
| }, |
There was a problem hiding this comment.
Prefetch load failure metric misclassifies dropped keys
Medium Severity
L2_PREFETCH_LOAD_COMPLETED publishes failed_count from failed_keys, which includes write-reserved keys that were never submitted for L2 load after prefix re-trimming. This records non-attempted keys as load failures and distorts lmcache_mp.l2_prefetch_failed_keys.
| curl -sf "http://localhost:${PROMETHEUS_PORT}/metrics" > "$L2_METRICS_FILE" 2>/dev/null || true | ||
|
|
||
| if [ ! -s "$L2_METRICS_FILE" ]; then | ||
| echo "WARNING: Could not fetch metrics from Prometheus (port $PROMETHEUS_PORT). Skipping data flow check." |
There was a problem hiding this comment.
Metrics fetch failure silently bypasses telemetry validation
Medium Severity
The new telemetry verification step continues when /metrics cannot be fetched, because curl errors are suppressed and the branch only prints a warning. This allows the L2 telemetry test to pass even when observability is broken or unreachable.
| metadata={ | ||
| "request_id": request.request_id, | ||
| "prefix_hit_count": prefix_length, | ||
| }, |
There was a problem hiding this comment.
Lookup metric uses post-reservation prefix
Medium Severity
L2_PREFETCH_LOOKUP_COMPLETED publishes prefix_hit_count from prefix_length computed after reserve_write filtering, not from pure L2 lookup results. This makes lmcache_mp.l2_prefetch_hit_keys depend on L1 reservation success and can report zero hits even when L2 lookup found a valid prefix.
* performance ci Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com> * fix ci Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com> * l2 cache ci Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com> * fix Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com> * lint Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com> --------- Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com> Co-authored-by: Samuel Shen <slshen@tensormesh.ai> Co-authored-by: Roy Huang <roy.y.huang@gmail.com>
* performance ci Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com> * fix ci Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com> * l2 cache ci Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com> * fix Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com> * lint Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com> --------- Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com> Co-authored-by: Samuel Shen <slshen@tensormesh.ai> Co-authored-by: Roy Huang <roy.y.huang@gmail.com>


What this PR does / why we need it:
Special notes for your reviewers:
If applicable:
Note
Medium Risk
Touches L2 store/prefetch controllers to emit new observability events; while intended to be side-effect free, it adds hot-path
EventBus.publish()calls that could affect performance or fail if metadata contracts drift.Overview
Adds first-class L2 telemetry for multiprocess mode by introducing new
EventTypes for L2 store and prefetch (lookup/load) lifecycles, emitting those events from theStoreControllerandPrefetchController.Registers new
L2MetricsSubscriber(OTel counters exported to Prometheus) andL2LoggingSubscriber(debug logs) ininit_observability, and adds unit tests validating the L2 metrics counters via an in-memory OTel reader.Extends the Buildkite L2 long-doc QA script to fetch Prometheus
/metricsand fail the run if expected L1/L2 data-flow counters (store + prefetch) are missing/zero.Written by Cursor Bugbot for commit 8c8321c. This will update automatically on new commits. Configure here.