[Feat] L0 Subscriber #2974
Conversation
Code Review
This pull request implements the L0MetricsSubscriber for tracking GPU KV cache block lifecycles and reporting OpenTelemetry metrics, including block lifetime, idle time, and reuse gaps. It adds a new MP_VLLM_END_SESSION event and integrates the subscriber into the observability configuration. Feedback suggests refining metric accuracy by explicitly tracking block release times, ensuring the initial access is recorded in the reuse history, and improving type safety for allocation records via a Protocol instead of generic objects.
```python
class _BlockStatus(Enum):
    ACTIVE = "active"  # Owned by at least one live request.
    RELEASED = "released"  # All owning requests have ended.


@dataclass
class _L0BlockState:
    """Per-block lifecycle state in the shadow map."""

    token_ids: list[int]
```
The _L0BlockState dataclass should include a release_time field to accurately track the idle time of a block before it is evicted. Currently, idle_time is calculated using last_access_time, which represents the start of the last usage period, not the end of it.
Suggested change:

```python
class _BlockStatus(Enum):
    ACTIVE = "active"  # Owned by at least one live request.
    RELEASED = "released"  # All owning requests have ended.


@dataclass
class _L0BlockState:
    """Per-block lifecycle state in the shadow map."""

    token_ids: list[int]
    owners: set[str]  # Set of req_ids currently using this block.
    status: _BlockStatus
    alloc_time: float
    last_access_time: float
    release_time: float | None = None
    access_history: deque[float] = field(
        default_factory=lambda: deque(maxlen=_MAX_ACCESS_HISTORY)
    )
```
| """Process a single BlockAllocationRecord.""" | ||
| req_id: str = record.req_id # type: ignore[attr-defined] |
When a block is marked as RELEASED, the current timestamp should be recorded as the release_time to allow for accurate idle time calculation upon eviction.
| """Process a single BlockAllocationRecord.""" | |
| req_id: str = record.req_id # type: ignore[attr-defined] | |
| if not state.owners: | |
| state.status = _BlockStatus.RELEASED | |
| state.release_time = now |
```python
        if not block_ids:
            return

        num_blocks = len(block_ids)
        block_size = (
```
Using object as a type hint for record and then using # type: ignore[attr-defined] is not ideal for type safety. It is better to define a Protocol that specifies the expected attributes of the record object, especially since BlockAllocationRecord is defined in another module and we want to avoid tight coupling or circular dependencies.
```python
from typing import Protocol


class _BlockAllocationRecord(Protocol):
    req_id: str
    new_block_ids: list[int]
    new_token_ids: list[int]


def _process_record(self, record: _BlockAllocationRecord, now: float) -> None:
    """Process a single BlockAllocationRecord."""
    req_id: str = record.req_id
    block_ids: list[int] = record.new_block_ids
    token_ids: list[int] = record.new_token_ids
```
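Because `Protocol` matching is structural, the concrete `BlockAllocationRecord` defined in the other module satisfies `_BlockAllocationRecord` without the subscriber ever importing it, so the `# type: ignore[attr-defined]` comments can be dropped while the modules stay decoupled.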
```python
            )
            self._req_blocks.setdefault(req_id, set()).add(block_id)
            return

        # Block exists in shadow map.
        if existing.token_ids == token_ids:
            # Same content.
```
Initializing access_history with the current timestamp now ensures that the gap between the initial allocation and the first reuse is captured in the lmcache_mp.l0_block_reuse_gap_seconds metric. Without this, the first reuse gap is missed.
Suggested change:

```python
        self._shadow[block_id] = _L0BlockState(
            token_ids=token_ids,
            owners={req_id},
            status=_BlockStatus.ACTIVE,
            alloc_time=now,
            last_access_time=now,
            access_history=deque([now], maxlen=_MAX_ACCESS_HISTORY)
        )
```
```python
        self._emit_eviction_metrics(existing, now)

        # Clear old owners from reverse index.
        for old_req in existing.owners:
```

```python
    def _emit_eviction_metrics(self, state: _L0BlockState, now: float) -> None:
        """Record histogram observations for an evicted block."""
        lifetime = now - state.alloc_time
        idle_time = now - state.last_access_time

        self._lifetime_hist.record(lifetime)
```
Similar to the initial allocation, when starting fresh after an eviction, access_history should be initialized with now to capture the first reuse gap of the new block lifecycle.
Suggested change:

```python
        self._shadow[block_id] = _L0BlockState(
            token_ids=token_ids,
            owners={req_id},
            status=_BlockStatus.ACTIVE,
            alloc_time=now,
            last_access_time=now,
            access_history=deque([now], maxlen=_MAX_ACCESS_HISTORY)
        )
```
```python
        for i in range(1, len(history)):
            gap = history[i] - history[i - 1]
            self._reuse_gap_hist.record(gap)
```
The idle_time should be calculated as the duration between the block's release and its eviction. If the block was evicted while still active (which is rare in vLLM), the idle time can be considered zero or calculated from the last access start.
Suggested change:

```python
        idle_time = now - (
            state.release_time if state.release_time is not None else state.last_access_time
        )
```
```python
                # already existed (e.g., cached request with no new tokens).
                # Not a real allocation event; skip.
                continue
            self._process_block(block_id, chunk_tokens, req_id, now)
```
Token-to-block chunking uses incorrect distribution logic
Medium Severity
_process_record splits new_token_ids across blocks using ceiling division ((len(token_ids) + num_blocks - 1) // num_blocks), but vLLM uses a fixed block size (e.g., 16 tokens per block, with only the last block potentially shorter). If the same physical block appears in a later record with a different num_blocks, the subscriber computes a different token chunk for that block, causing existing.token_ids == token_ids to fail and triggering a false eviction detection (Case 5). This would inflate eviction metrics. A safer approach would be to use the engine's actual block size for chunking rather than inferring it from the record.
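A minimal sketch of what fixed-size chunking could look like; the `BLOCK_SIZE` constant and helper name are illustrative assumptions, not the PR's actual code:

```python
# Illustrative only: assumes the engine's block size is available from config
# (e.g. vLLM's default of 16 tokens per block) rather than inferred per record.
BLOCK_SIZE = 16


def chunk_tokens_by_block(token_ids: list[int], block_ids: list[int]) -> list[list[int]]:
    """Split token_ids into fixed-size chunks, one chunk per physical block.

    Only the last block may be shorter, so a given block always maps to the
    same token chunk no matter how many blocks appear in a later record.
    """
    return [
        token_ids[i * BLOCK_SIZE : (i + 1) * BLOCK_SIZE]
        for i in range(len(block_ids))
    ]
```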
Reviewed by Cursor Bugbot for commit 0f34708.
```python
from lmcache.v1.mp_observability.subscribers.metrics.l0 import (
    L0MetricsSubscriber,
    _BlockStatus,
)
```
Tests rely on private members instead of public interface
Low Severity
The test file imports the module-private _BlockStatus and extensively accesses subscriber._shadow (a private attribute) throughout many test methods. The project rules require no cross-class private member access and that tests verify the public interface and docstring contract rather than implementation details. The OTel histogram assertions already validate behavior through the public output; the _shadow assertions couple tests to internal state.
Triggered by project rule: LMCache Code Review Style Guide
Reviewed by Cursor Bugbot for commit 0f34708.
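A rough sketch of asserting through the exported histograms instead of `subscriber._shadow`; the constructor arguments, event-handler name (`on_event`), record shape, and metric name are assumptions for illustration only:

```python
from types import SimpleNamespace

from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import InMemoryMetricReader

from lmcache.v1.mp_observability.subscribers.metrics.l0 import L0MetricsSubscriber


def test_eviction_emits_lifetime_histogram():
    reader = InMemoryMetricReader()
    provider = MeterProvider(metric_readers=[reader])
    # Constructor signature is assumed; sample everything so the block is tracked.
    subscriber = L0MetricsSubscriber(meter_provider=provider, sample_rate=1.0)

    # Drive the subscriber only through its public event handler (name assumed):
    # the same block reappears with different tokens, which should count as an eviction.
    subscriber.on_event(SimpleNamespace(req_id="r1", new_block_ids=[0], new_token_ids=[1, 2]))
    subscriber.on_event(SimpleNamespace(req_id="r2", new_block_ids=[0], new_token_ids=[3, 4]))

    # Assert on the OTel output rather than reaching into subscriber._shadow.
    data = reader.get_metrics_data()
    metric_names = {
        metric.name
        for rm in (data.resource_metrics if data else [])
        for sm in rm.scope_metrics
        for metric in sm.metrics
    }
    assert "lmcache_mp.l0_block_lifetime_seconds" in metric_names  # metric name assumed
```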
```python
        if block_id in self._skipped:
            return
```
If the sampling rate is very small in production, it would be better to record the non-skipped (sampled) block IDs instead, since that set will be much smaller.
royyhuang left a comment:
In addition to the EVENTS.md, please also update the METRICS.md to describe the new metrics added in this PR. Otherwise, LGTM!
- Add L0MetricsSubscriber with shadow-based block tracking and OTel histograms (lifetime, idle_before_evict, reuse_gap)
- Use _sampled set instead of _skipped for O(N*sample_rate) memory usage
- Add MP_VLLM_BLOCK_ALLOCATION and MP_VLLM_END_SESSION events to server.py
- Update EVENTS.md and METRICS.md with L0, L2, and MP_VLLM event/metric docs
- Add unit tests for L0MetricsSubscriber

Signed-off-by: yuwei <yuwei@dev.local>
The _sampled approach re-rolls the dice on every appearance of an untracked block, causing the effective sample rate to inflate over time (e.g. 10% becomes 20%+) and tracking to start mid-life with a wrong alloc_time. The _skipped set makes a once-and-final decision.

Signed-off-by: yuwei <yuwei@dev.local>
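A small sketch of the two decision policies being contrasted here; the function and variable names are illustrative, not the PR's code:

```python
import random

SAMPLE_RATE = 0.01  # illustrative value


def should_track_reroll(block_id: int, sampled: set[int]) -> bool:
    """_sampled-style: every appearance of an untracked block re-rolls the dice,
    so the chance a block is ever tracked keeps growing and tracking can start
    mid-life with a wrong alloc_time."""
    if block_id in sampled:
        return True
    if random.random() < SAMPLE_RATE:
        sampled.add(block_id)
        return True
    return False


def should_track_once(block_id: int, shadow: dict, skipped: set[int]) -> bool:
    """_skipped-style: the first sighting rolls once and a 'no' is remembered,
    so later appearances cannot flip the decision."""
    if block_id in shadow:
        return True
    if block_id in skipped:
        return False
    if random.random() < SAMPLE_RATE:
        return True  # caller adds the block to the shadow map
    skipped.add(block_id)
    return False
```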
Cursor Bugbot has reviewed your changes and found 1 potential issue.
There are 3 total unresolved issues (including 2 from previous reviews).
Reviewed by Cursor Bugbot for commit af7d40b.
```python
        # Shadow map: physical block_id -> lifecycle state.
        self._shadow: dict[int, _L0BlockState] = {}
        # Set of block_ids we decided NOT to sample.
        self._skipped: set[int] = set()
```
Skipped set stores 99% of block IDs unnecessarily
Low Severity
The _skipped set records every block ID that was not sampled. With the default 1% sample_rate, this set holds ~99% of all unique block IDs, while _shadow (which already tracks sampled blocks) holds ~1%. Since _shadow.keys() already identifies sampled blocks, the _skipped set is an inverted representation that uses ~99× more memory than needed. Checking block_id not in self._shadow instead would eliminate this set entirely.
Triggered by project rule: LMCache Code Review Style Guide
Reviewed by Cursor Bugbot for commit af7d40b.
conftest.py is loaded by pytest as a plugin, not as a regular module, so explicit imports create a second instance with a separate reader. Use a regular module (otel_setup.py) instead.

Signed-off-by: yuwei <yuwei@dev.local>
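A minimal sketch of what such a shared module could look like; the helper name and exact layout are assumptions:

```python
# otel_setup.py -- imported explicitly by metrics tests so they all share one
# reader/provider pair instead of getting a second instance via conftest.py.
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import InMemoryMetricReader

READER = InMemoryMetricReader()
METER_PROVIDER = MeterProvider(metric_readers=[READER])


def histogram_data_points(metric_name: str):
    """Collect current metrics and return the data points for one metric."""
    data = READER.get_metrics_data()
    points = []
    for rm in (data.resource_metrics if data else []):
        for sm in rm.scope_metrics:
            for metric in sm.metrics:
                if metric.name == metric_name:
                    points.extend(metric.data.data_points)
    return points
```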
Signed-off-by: yuwei <yuwei@dev.local>
Co-authored-by: yuwei <yuwei@dev.local>
Signed-off-by: baoloongmao <baoloongmao@tencent.com>


Note

Medium Risk
Adds new event types and publishes an END_SESSION event from the MP server, plus introduces histogram-based lifecycle tracking that maintains in-memory shadow state; errors or overhead here could affect observability performance and correctness.

Overview
Adds L0 (GPU) KV-cache lifecycle observability by introducing L0LifecycleSubscriber, which listens to MP_VLLM_BLOCK_ALLOCATION plus a new MP_VLLM_END_SESSION event and emits OTel histograms for block lifetime, idle time before eviction, and reuse gaps.

Wires the new subscriber into MP observability initialization and publishes MP_VLLM_END_SESSION from MPCacheEngine.end_session(), and updates the event/metrics docs accordingly. Test infrastructure is updated to share a single OTel MeterProvider across metrics tests, and new unit tests validate eviction/reuse/end-session tracking and histogram emission.

Reviewed by Cursor Bugbot for commit d15e5e2.