
[Feat] L0 Subscriber #2974

Merged
ApostaC merged 4 commits into LMCache:dev from Oasis-Git:l0
Apr 14, 2026

Conversation


@Oasis-Git (Member) commented Apr 7, 2026

What this PR does / why we need it:

Special notes for your reviewers:

If applicable:

  • this PR contains user facing changes - docs added
  • this PR contains unit tests

Note

Medium Risk
Adds new event types and publishes an END_SESSION event from the MP server, plus introduces histogram-based lifecycle tracking that maintains in-memory shadow state; errors or overhead here could affect observability performance and correctness.

Overview
Adds L0 (GPU) KV-cache lifecycle observability by introducing L0LifecycleSubscriber, which listens to MP_VLLM_BLOCK_ALLOCATION plus a new MP_VLLM_END_SESSION event and emits OTel histograms for block lifetime, idle time before eviction, and reuse gaps.

Wires the new subscriber into MP observability initialization and publishes MP_VLLM_END_SESSION from MPCacheEngine.end_session(), and updates the event/metrics docs accordingly. Test infrastructure is updated to share a single OTel MeterProvider across metrics tests, and new unit tests validate eviction/reuse/end-session tracking and histogram emission.
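To make the three histogram observations concrete, here is a minimal, self-contained sketch of the shadow-map bookkeeping described above. The names (`BlockState`, `eviction_observations`, `_MAX_ACCESS_HISTORY`) are illustrative stand-ins for the PR's `_L0BlockState` machinery, not the actual implementation:

```python
from collections import deque
from dataclasses import dataclass, field

_MAX_ACCESS_HISTORY = 32  # assumed cap; the real constant lives in the subscriber module


@dataclass
class BlockState:
    """Simplified per-block shadow state (stand-in for _L0BlockState)."""
    alloc_time: float
    last_access_time: float
    access_history: deque = field(
        default_factory=lambda: deque(maxlen=_MAX_ACCESS_HISTORY)
    )


def eviction_observations(state: BlockState, now: float):
    """Compute the three values the subscriber records on eviction."""
    lifetime = now - state.alloc_time          # block lifetime histogram
    idle_time = now - state.last_access_time   # idle-before-evict histogram
    history = list(state.access_history)
    # Each gap between consecutive accesses feeds the reuse-gap histogram.
    reuse_gaps = [b - a for a, b in zip(history, history[1:])]
    return lifetime, idle_time, reuse_gaps


# Simulated lifecycle: allocated at t=0, reused at t=5, evicted at t=12.
s = BlockState(alloc_time=0.0, last_access_time=0.0)
s.access_history.extend([0.0, 5.0])
s.last_access_time = 5.0
lifetime, idle, gaps = eviction_observations(s, now=12.0)
print(lifetime, idle, gaps)  # 12.0 7.0 [5.0]
```

In the real subscriber these values are fed to OTel histograms rather than returned.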

Reviewed by Cursor Bugbot for commit d15e5e2.

@gemini-code-assist (Bot) left a comment

Code Review

This pull request implements the L0MetricsSubscriber for tracking GPU KV cache block lifecycles and reporting OpenTelemetry metrics, including block lifetime, idle time, and reuse gaps. It adds a new MP_VLLM_END_SESSION event and integrates the subscriber into the observability configuration. Feedback suggests refining metric accuracy by explicitly tracking block release times, ensuring the initial access is recorded in the reuse history, and improving type safety for allocation records via a Protocol instead of generic objects.

Comment on lines +53 to +63

class _BlockStatus(Enum):
    ACTIVE = "active"      # Owned by at least one live request.
    RELEASED = "released"  # All owning requests have ended.


@dataclass
class _L0BlockState:
    """Per-block lifecycle state in the shadow map."""

    token_ids: list[int]

medium

The _L0BlockState dataclass should include a release_time field to accurately track the idle time of a block before it is evicted. Currently, idle_time is calculated using last_access_time, which represents the start of the last usage period, not the end of it.

Suggested change
class _BlockStatus(Enum):
    ACTIVE = "active"      # Owned by at least one live request.
    RELEASED = "released"  # All owning requests have ended.


@dataclass
class _L0BlockState:
    """Per-block lifecycle state in the shadow map."""

    token_ids: list[int]

@dataclass
class _L0BlockState:
    """Per-block lifecycle state in the shadow map."""

    token_ids: list[int]
    owners: set[str]  # Set of req_ids currently using this block.
    status: _BlockStatus
    alloc_time: float
    last_access_time: float
    release_time: float | None = None
    access_history: deque[float] = field(
        default_factory=lambda: deque(maxlen=_MAX_ACCESS_HISTORY)
    )

Comment on lines +161 to +162
    """Process a single BlockAllocationRecord."""
    req_id: str = record.req_id  # type: ignore[attr-defined]

medium

When a block is marked as RELEASED, the current timestamp should be recorded as the release_time to allow for accurate idle time calculation upon eviction.

Suggested change
    """Process a single BlockAllocationRecord."""
    req_id: str = record.req_id  # type: ignore[attr-defined]

if not state.owners:
    state.status = _BlockStatus.RELEASED
    state.release_time = now

Comment on lines +166 to +170
if not block_ids:
    return

num_blocks = len(block_ids)
block_size = (

medium

Using object as a type hint for record and then using # type: ignore[attr-defined] is not ideal for type safety. It is better to define a Protocol that specifies the expected attributes of the record object, especially since BlockAllocationRecord is defined in another module and we want to avoid tight coupling or circular dependencies.

from typing import Protocol


class _BlockAllocationRecord(Protocol):
    req_id: str
    new_block_ids: list[int]
    new_token_ids: list[int]


def _process_record(self, record: _BlockAllocationRecord, now: float) -> None:
    """Process a single BlockAllocationRecord."""
    req_id: str = record.req_id
    block_ids: list[int] = record.new_block_ids
    token_ids: list[int] = record.new_token_ids

Comment on lines +211 to +217
    )
    self._req_blocks.setdefault(req_id, set()).add(block_id)
    return

# Block exists in shadow map.
if existing.token_ids == token_ids:
    # Same content.

medium

Initializing access_history with the current timestamp now ensures that the gap between the initial allocation and the first reuse is captured in the lmcache_mp.l0_block_reuse_gap_seconds metric. Without this, the first reuse gap is missed.

Suggested change
)
self._req_blocks.setdefault(req_id, set()).add(block_id)
return
# Block exists in shadow map.
if existing.token_ids == token_ids:
# Same content.
self._shadow[block_id] = _L0BlockState(
token_ids=token_ids,
owners={req_id},
status=_BlockStatus.ACTIVE,
alloc_time=now,
last_access_time=now,
access_history=deque([now], maxlen=_MAX_ACCESS_HISTORY)
)

self._emit_eviction_metrics(existing, now)

# Clear old owners from reverse index.
for old_req in existing.owners:

medium

When a block transitions from RELEASED back to ACTIVE (Case 4: true reuse), the release_time should be cleared.

Suggested change
for old_req in existing.owners:
existing.status = _BlockStatus.ACTIVE
existing.release_time = None

Comment on lines +257 to +263

def _emit_eviction_metrics(self, state: _L0BlockState, now: float) -> None:
    """Record histogram observations for an evicted block."""
    lifetime = now - state.alloc_time
    idle_time = now - state.last_access_time

    self._lifetime_hist.record(lifetime)

medium

Similar to the initial allocation, when starting fresh after an eviction, access_history should be initialized with now to capture the first reuse gap of the new block lifecycle.

Suggested change
def _emit_eviction_metrics(self, state: _L0BlockState, now: float) -> None:
    """Record histogram observations for an evicted block."""
    lifetime = now - state.alloc_time
    idle_time = now - state.last_access_time
    self._lifetime_hist.record(lifetime)

self._shadow[block_id] = _L0BlockState(
    token_ids=token_ids,
    owners={req_id},
    status=_BlockStatus.ACTIVE,
    alloc_time=now,
    last_access_time=now,
    access_history=deque([now], maxlen=_MAX_ACCESS_HISTORY),
)

for i in range(1, len(history)):
    gap = history[i] - history[i - 1]
    self._reuse_gap_hist.record(gap)


medium

The idle_time should be calculated as the duration between the block's release and its eviction. If the block was evicted while still active (which is rare in vLLM), the idle time can be considered zero or calculated from the last access start.

Suggested change
idle_time = now - (state.release_time if state.release_time is not None else state.last_access_time)
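A small self-contained sketch of the suggested fallback logic, with `State` and `idle_before_evict` as hypothetical stand-ins for the real `_L0BlockState` fields:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class State:
    """Minimal stand-in for the relevant _L0BlockState fields."""
    last_access_time: float
    release_time: Optional[float] = None


def idle_before_evict(state: State, now: float) -> float:
    """Prefer release_time; fall back to last_access_time when the
    block was evicted while still active (rare in vLLM)."""
    anchor = (
        state.release_time
        if state.release_time is not None
        else state.last_access_time
    )
    return now - anchor


# Released at t=8, evicted at t=12 -> 4s idle (not 12 - 3 = 9s from last access).
print(idle_before_evict(State(last_access_time=3.0, release_time=8.0), now=12.0))  # 4.0
# No release recorded -> fall back to last access.
print(idle_before_evict(State(last_access_time=3.0), now=12.0))  # 9.0
```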

@Oasis-Git Oasis-Git added the full Run comprehensive tests on this PR label Apr 9, 2026
    # already existed (e.g., cached request with no new tokens).
    # Not a real allocation event; skip.
    continue
self._process_block(block_id, chunk_tokens, req_id, now)

Token-to-block chunking uses incorrect distribution logic

Medium Severity

_process_record splits new_token_ids across blocks using ceiling division ((len(token_ids) + num_blocks - 1) // num_blocks), but vLLM uses a fixed block size (e.g., 16 tokens per block, with only the last block potentially shorter). If the same physical block appears in a later record with a different num_blocks, the subscriber computes a different token chunk for that block, causing existing.token_ids == token_ids to fail and triggering a false eviction detection (Case 5). This would inflate eviction metrics. A safer approach would be to use the engine's actual block size for chunking rather than inferring it from the record.
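The mismatch is easy to reproduce in isolation. A hedged sketch (helper names are illustrative, not from the PR) showing ceiling-division chunking diverging across records while fixed-size chunking stays stable:

```python
def chunk_ceil(token_ids, num_blocks):
    """Current approach: infer chunk size via ceiling division."""
    size = (len(token_ids) + num_blocks - 1) // num_blocks
    return [token_ids[i:i + size] for i in range(0, len(token_ids), size)]


def chunk_fixed(token_ids, block_size=16):
    """Safer: use the engine's fixed block size (e.g., 16 tokens)."""
    return [token_ids[i:i + block_size] for i in range(0, len(token_ids), block_size)]


tokens = list(range(40))  # 40 new tokens sharing a prefix

# Record 1 covers 3 blocks -> chunk size 14; record 2 covers 2 -> chunk size 16.
a = chunk_ceil(tokens, 3)
b = chunk_ceil(tokens[:32], 2)
print(a[0] == b[0])  # False: same physical block, different chunk -> false eviction

# With a fixed 16-token block size the first chunk is identical either way.
print(chunk_fixed(tokens)[0] == chunk_fixed(tokens[:32])[0])  # True
```

The `False` result is exactly the `existing.token_ids == token_ids` failure that would trip Case 5 and inflate the eviction histograms.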


Reviewed by Cursor Bugbot for commit 0f34708.

from lmcache.v1.mp_observability.subscribers.metrics.l0 import (
    L0MetricsSubscriber,
    _BlockStatus,
)

Tests rely on private members instead of public interface

Low Severity

The test file imports the module-private _BlockStatus and extensively accesses subscriber._shadow (a private attribute) throughout many test methods. The project rules require no cross-class private member access and that tests verify the public interface and docstring contract rather than implementation details. The OTel histogram assertions already validate behavior through the public output; the _shadow assertions couple tests to internal state.

Additional Locations (1)

Triggered by project rule: LMCache Code Review Style Guide

Reviewed by Cursor Bugbot for commit 0f34708.

@ApostaC ApostaC requested a review from royyhuang April 10, 2026 22:46
Comment thread lmcache/v1/mp_observability/subscribers/metrics/l0_lifecycle.py
@ApostaC (Contributor) left a comment

Otherwise LGTM!

Comment thread lmcache/v1/mp_observability/event.py
Comment on lines +196 to +197
if block_id in self._skipped:
    return

If the sampling rate is very small in production, it's better to record the non-skipped block IDs instead, since that set will be much smaller.

@royyhuang (Contributor) left a comment

In addition to the EVENTS.md, please also update the METRICS.md to describe the new metrics added in this PR. Otherwise, LGTM!

Comment thread lmcache/v1/mp_observability/subscribers/metrics/l0.py Outdated
Comment thread lmcache/v1/mp_observability/subscribers/metrics/l0_lifecycle.py
Comment thread lmcache/v1/mp_observability/subscribers/metrics/l0_lifecycle.py
Comment thread lmcache/v1/mp_observability/subscribers/metrics/__init__.py Outdated
Comment thread examples/disagg_prefill/disagg_proxy_server.py
- Add L0MetricsSubscriber with shadow-based block tracking and OTel histograms
  (lifetime, idle_before_evict, reuse_gap)
- Use _sampled set instead of _skipped for O(N*sample_rate) memory usage
- Add MP_VLLM_BLOCK_ALLOCATION and MP_VLLM_END_SESSION events to server.py
- Update EVENTS.md and METRICS.md with L0, L2, and MP_VLLM event/metric docs
- Add unit tests for L0MetricsSubscriber

Signed-off-by: yuwei <yuwei@dev.local>
yuwei added 2 commits April 13, 2026 20:21
The _sampled approach re-rolls the dice on every appearance of an
untracked block, causing the effective sample rate to inflate over
time (e.g. 10% becomes 20%+) and tracking to start mid-life with
wrong alloc_time. The _skipped set makes a once-and-final decision.

Signed-off-by: yuwei <yuwei@dev.local>
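The inflation described in this commit message can be verified with a short simulation (function names here are illustrative, not from the PR): re-rolling on every appearance tracks a block if any roll succeeds, while the once-and-final `_skipped` decision rolls only on first sight.

```python
import random


def effective_rate_reroll(appearances=5, n_blocks=100_000, p=0.10, seed=0):
    """Re-roll on every appearance: tracked if ANY of the rolls succeeds."""
    rng = random.Random(seed)
    tracked = sum(
        1 for _ in range(n_blocks)
        if any(rng.random() < p for _ in range(appearances))
    )
    return tracked / n_blocks


def effective_rate_once(appearances=5, n_blocks=100_000, p=0.10, seed=0):
    """Once-and-final decision (the _skipped set): roll only on first sight."""
    rng = random.Random(seed)
    tracked = sum(1 for _ in range(n_blocks) if rng.random() < p)
    return tracked / n_blocks


# With 5 appearances per block, re-rolling drifts toward 1 - 0.9**5 ≈ 0.41,
# while the once-and-final decision stays near the configured 0.10.
print(effective_rate_reroll())
print(effective_rate_once())
```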
@ApostaC (Contributor) left a comment

LGTM!

@cursor (Bot) left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

There are 3 total unresolved issues (including 2 from previous reviews).


Reviewed by Cursor Bugbot for commit af7d40b.

# Shadow map: physical block_id -> lifecycle state.
self._shadow: dict[int, _L0BlockState] = {}
# Set of block_ids we decided NOT to sample.
self._skipped: set[int] = set()

Skipped set stores 99% of block IDs unnecessarily

Low Severity

The _skipped set records every block ID that was not sampled. With the default 1% sample_rate, this set holds ~99% of all unique block IDs, while _shadow (which already tracks sampled blocks) holds ~1%. Since _shadow.keys() already identifies sampled blocks, the _skipped set is an inverted representation that uses ~99× more memory than needed. Checking block_id not in self._shadow instead would eliminate this set entirely.

Additional Locations (1)

Triggered by project rule: LMCache Code Review Style Guide

Reviewed by Cursor Bugbot for commit af7d40b.

conftest.py is loaded by pytest as a plugin, not as a regular module,
so explicit imports create a second instance with a separate reader.
Use a regular module (otel_setup.py) instead.

Signed-off-by: yuwei <yuwei@dev.local>
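The fix described here relies on the fact that a regular module is imported exactly once per interpreter, so every test file shares the same state. A minimal sketch of the pattern with a stand-in reader class (the real `otel_setup.py` presumably wires an OTel `MeterProvider` and in-memory reader; `FakeReader` and `get_shared_reader` are hypothetical):

```python
# otel_setup.py -- a regular module, so every importer shares one instance.
import functools


class FakeReader:
    """Stand-in for an in-memory metric reader."""
    def __init__(self):
        self.records = []


@functools.lru_cache(maxsize=None)
def get_shared_reader() -> FakeReader:
    """Module-level singleton: repeated calls return the same object.
    conftest.py cannot guarantee this, because pytest loads it as a
    plugin and an explicit `import conftest` would create a second,
    separate module instance with its own reader."""
    return FakeReader()


# Two call sites (e.g., two test files) observe the same reader.
r1 = get_shared_reader()
r2 = get_shared_reader()
print(r1 is r2)  # True
```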
@ApostaC ApostaC merged commit a6ee913 into LMCache:dev Apr 14, 2026
37 of 38 checks passed
maobaolong pushed a commit to maobaolong/LMCache that referenced this pull request Apr 14, 2026
Signed-off-by: yuwei <yuwei@dev.local>
Co-authored-by: yuwei <yuwei@dev.local>
Signed-off-by: baoloongmao <baoloongmao@tencent.com>
ekaynar pushed a commit to ekaynar/LMCache that referenced this pull request Apr 15, 2026
Signed-off-by: yuwei <yuwei@dev.local>
Co-authored-by: yuwei <yuwei@dev.local>
ftian1 pushed a commit to ftian1/LMCache that referenced this pull request Apr 20, 2026
Signed-off-by: yuwei <yuwei@dev.local>
Co-authored-by: yuwei <yuwei@dev.local>

Labels

full Run comprehensive tests on this PR
