[Feat] L1 Subscriber#2986
Conversation
There was a problem hiding this comment.
Code Review
This pull request introduces GPU (L0) KV cache block lifecycle tracking through the new L0MetricsSubscriber and expands L1MetricsSubscriber to include chunk lifecycle histograms. It also implements the MP_VLLM_END_SESSION event to facilitate accurate session-based block release tracking. Feedback suggests improving the L0 reuse gap metrics by recording them at the moment of a cache hit rather than waiting for eviction, which would prevent data loss for long-lived blocks and capture the initial reuse gap more accurately.
| for i in range(1, len(history)): | ||
| gap = history[i] - history[i - 1] | ||
| self._reuse_gap_hist.record(gap) | ||
|
|
||
| # -- Sampling ---------------------------------------------------------- |
There was a problem hiding this comment.
The current logic for recording reuse gaps in _emit_eviction_metrics misses the gap between the initial allocation (alloc_time) and the first reuse event stored in access_history.
Additionally, recording reuse gaps only at eviction time (Case 5) means that metrics for long-lived blocks that are never evicted will never be reported. Consider recording the reuse gap immediately when a hit is detected in _process_block (Case 4), similar to how it is handled in L1MetricsSubscriber. When implementing this, ensure that distinct variable names are used for each phase's results to prevent accidental overwriting and ensure accurate reporting in summaries.
References
- When performance metrics are collected for different phases of an operation, ensure that distinct variable names are used for each phase's results to prevent accidental overwriting and ensure accurate reporting in summaries.
| else: | ||
| # First time seeing this key — sample or skip. | ||
| if not self._should_sample(): | ||
| self._skipped.add(key) |
royyhuang
left a comment
There was a problem hiding this comment.
The implementation itself is pretty straightforward. I do have a high-level concern on code structure. Currently, we are adding the L1 lifecycle histogram subscription to the original subscriber. However, when the event bus was designed, the goal was to separate and divide up the subscribers so that users can easily disable and enable part of the logging, tracing, and/or metrics to avoid unnecessary overhead.
Given the overhead from housekeeping the shadow map is non-negligible, especially memory even if with sampling, I would suggest we put lifecycle tracking into a separate subscriber so that user can enable this only when needed. This also gives clear structure on the code, avoiding all subscription functionality to be squeezed into a single file making the l1 subscriber unmaintainable.
|
Also please update EVENTS.md or METRICS.md if necessary. |
- Add L0MetricsSubscriber with shadow-based block tracking and OTel histograms (lifetime, idle_before_evict, reuse_gap) - Use _sampled set instead of _skipped for O(N*sample_rate) memory usage - Add MP_VLLM_BLOCK_ALLOCATION and MP_VLLM_END_SESSION events to server.py - Update EVENTS.md and METRICS.md with L0, L2, and MP_VLLM event/metric docs - Add unit tests for L0MetricsSubscriber Signed-off-by: yuwei <yuwei@dev.local>
Signed-off-by: yuwei <yuwei@dev.local>
The _sampled approach re-rolls the dice on every appearance of an untracked block, causing the effective sample rate to inflate over time (e.g. 10% becomes 20%+) and tracking to start mid-life with wrong alloc_time. The _skipped set makes a once-and-final decision. Signed-off-by: yuwei <yuwei@dev.local>
conftest.py is loaded by pytest as a plugin, not as a regular module, so explicit imports create a second instance with a separate reader. Use a regular module (otel_setup.py) instead. Signed-off-by: yuwei <yuwei@dev.local>
- Upgrade L1MetricsSubscriber with chunk lifecycle histograms (lifetime, idle_before_evict, reuse_gap, evict_reuse_gap) - Use _sampled set instead of _skipped for O(N*sample_rate) memory usage - Add --metrics-sample-rate CLI arg to ObservabilityConfig - Pass sample_rate through to L0 and L1 subscribers - Update METRICS.md with L1 chunk lifecycle histograms section - Add comprehensive unit tests for L1MetricsSubscriber Signed-off-by: yuwei <yuwei@dev.local>
Signed-off-by: yuwei <yuwei@dev.local>
Same fix as L0: _sampled re-rolls on every appearance, inflating the effective sample rate and starting tracking mid-life with wrong alloc_time. _skipped makes a once-and-final decision. Signed-off-by: yuwei <yuwei@dev.local>
Signed-off-by: yuwei <yuwei@dev.local>
…additions Signed-off-by: yuwei <yuwei@dev.local>
…config Signed-off-by: yuwei <yuwei@dev.local>
…ubscriber Use hash(key) % prime < threshold instead of maintaining a _skipped set. O(1) memory, deterministic (same key always gets same decision), and no unbounded growth for long-running services. Signed-off-by: yuwei <yuwei@dev.local>
…ubscriber Same approach as L0: hash(key) % prime < threshold. O(1) memory, deterministic, no unbounded set growth for long-running services. Signed-off-by: yuwei <yuwei@dev.local>
…s-sample-rate Signed-off-by: yuwei <yuwei@dev.local>
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.
Reviewed by Cursor Bugbot for commit fb0a56f. Configure here.
hash(block_id) == block_id for small ints in CPython, causing deterministic hash sampling to always pick the lowest-numbered blocks. _skipped set is fine for L0 since it's bounded by the finite number of physical GPU blocks. Signed-off-by: yuwei <yuwei@dev.local>
* [Feat] L0 Subscriber: GPU KV cache block lifecycle metrics Signed-off-by: yuwei <yuwei@dev.local> Co-authored-by: yuwei <yuwei@dev.local>

What this PR does / why we need it:
Special notes for your reviewers:
If applicable:
Note
Medium Risk
Adds a new event-bus subscriber and default-registered histogram tracking for L1 chunk lifecycles, which can introduce overhead and additional memory use (shadow maps) in production if sample rate is misconfigured.
Overview
Adds a new
L1LifecycleSubscriberthat records sampled OTel histograms for L1 chunk lifetime, idle-before-evict, reuse gaps, and evict-to-reuse gaps based on L1 read/write/evict events.Introduces a new
--metrics-sample-rate/metrics_sample_ratesetting (default0.01) and wires it into subscriber registration so bothL0LifecycleSubscriberand the new L1 lifecycle tracking use the same sampling control. Updates observability docs/metric references and adds unit tests validating histogram recording, deterministic sampling, and eviction sweep behavior.Reviewed by Cursor Bugbot for commit a1e30f5. Bugbot is set up for automated code reviews on this repo. Configure here.