Add scheduler instance_id and model_name to L0 KV lifecycle tracking #3043
ApostaC merged 2 commits into LMCache:dev from
Conversation
Code Review
This pull request introduces multi-instance and multi-model support for GPU KV cache block lifecycle tracking by adding instance_id and model_name attributes to observability events and metrics. Key changes include updating the shadow map in l0_lifecycle.py to use composite keys and modifying the BlockAllocationRecord to carry instance information. Feedback suggests that server.py should group records by instance_id before publishing events to ensure correct model attribution in mixed batches, and recommends using explicit type hints for BlockAllocationRecord instead of object to improve type safety and maintainability.
```diff
         # Look up model_name from the first record's instance_id.
         model_name = ""
         if records:
             meta = self.gpu_context_meta.get(records[0].instance_id)
             model_name = meta[0] if meta else ""
         self._event_bus.publish(
             Event(
                 event_type=EventType.MP_VLLM_BLOCK_ALLOCATION,
-                metadata={"records": records},
+                metadata={"model_name": model_name, "records": records},
             )
         )
```
Looking up model_name from only the first record in the batch is potentially incorrect. In a multi-instance or multi-model scenario, a single batch of records can contain entries from different instance_ids belonging to different models, which would lead to incorrect metric attribution. Also ensure that all relevant metadata, such as MemoryFormat (fmt), is retrieved and passed correctly during batched allocation so that multi-processing modes remain correct. Consider grouping the records by instance_id and publishing a separate event for each unique instance.
Suggested change:

```diff
-        # Look up model_name from the first record's instance_id.
-        model_name = ""
-        if records:
-            meta = self.gpu_context_meta.get(records[0].instance_id)
-            model_name = meta[0] if meta else ""
-        self._event_bus.publish(
-            Event(
-                event_type=EventType.MP_VLLM_BLOCK_ALLOCATION,
-                metadata={"model_name": model_name, "records": records},
-            )
-        )
+        by_instance = {}
+        for record in records:
+            by_instance.setdefault(record.instance_id, []).append(record)
+        for instance_id, inst_records in by_instance.items():
+            meta = self.gpu_context_meta.get(instance_id)
+            model_name = meta[0] if meta else ""
+            self._event_bus.publish(
+                Event(
+                    event_type=EventType.MP_VLLM_BLOCK_ALLOCATION,
+                    metadata={"model_name": model_name, "records": inst_records},
+                )
+            )
```
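The grouping pattern from the suggestion can be sketched as a self-contained snippet; `Record` here is a hypothetical stand-in for `BlockAllocationRecord`, since only `instance_id` matters for the grouping itself:

```python
from collections import defaultdict
from dataclasses import dataclass


# Hypothetical stand-in for BlockAllocationRecord; only instance_id
# is needed to demonstrate the per-instance grouping.
@dataclass
class Record:
    instance_id: int
    block_id: int


def group_by_instance(records):
    """Split a mixed batch of records into one list per instance_id."""
    by_instance = defaultdict(list)
    for record in records:
        by_instance[record.instance_id].append(record)
    return dict(by_instance)


# A batch mixing records from two instances stays correctly attributed.
batch = [Record(1, 10), Record(2, 20), Record(1, 11)]
groups = group_by_instance(batch)
```

Each group can then be published as its own event, so the `model_name` looked up per `instance_id` is guaranteed to match every record in that event.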
References
- When performing batched memory allocation, ensure the MemoryFormat (fmt) is retrieved from the cache and passed to the allocator for correctness in multi-processing modes.
```python
def _process_record(self, model_name: str, record: object, now: float) -> None:
    """Process a single BlockAllocationRecord."""
    req_id: str = record.req_id  # type: ignore[attr-defined]
    block_ids: list[int] = record.new_block_ids  # type: ignore[attr-defined]
    token_ids: list[int] = record.new_token_ids  # type: ignore[attr-defined]
    instance_id: int = record.instance_id  # type: ignore[attr-defined]
```
The record parameter is typed as object, which forces the use of type: ignore[attr-defined] when accessing its fields. Per the Repository Style Guide (line 24), all new/modified functions should have proper type hints. Please import BlockAllocationRecord from lmcache.v1.multiprocess.custom_types and use it as the type hint for the record parameter to improve maintainability and type safety.
References
- All new functions have type hints (arguments + return values)
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Reviewed by Cursor Bugbot for commit c0f9e17.
```python
    is unhealthy the report is silently dropped.

    Args:
        instance_id: The GPU instance ID (scheduler/worker identity).
```
"GPU instance ID" -> "scheduler instance ID" (this is the scheduler's identity, not a GPU's).
- Adapter sends os.getpid() and self.model_name in report_block_allocations — no vLLM change needed
- Protocol: [int, str, list[BlockAllocationRecord]]
- Server passes instance_id and model_name to EventBus
- L0LifecycleSubscriber keys shadow map by (instance_id, block_id)
- Emit OTel histograms with instance_id and model_name attributes for per-instance, per-model Prometheus metric slicing
- Update EVENTS.md, METRICS.md, and observability.rst docs
- Add test verifying OTel attributes on histogram data points

Signed-off-by: yuwei <yuwei@dev.local>
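The commit message gives the new wire payload shape as `[int, str, list[BlockAllocationRecord]]`. A hedged sketch of building and unpacking that shape, with plain dicts standing in for records and no assumptions about LMCache's actual serialization layer:

```python
import os

# Sketch of the payload shape described in the commit message:
# [instance_id: int, model_name: str, records: list].
# The real serialization format is whatever LMCache's multiprocess
# layer uses; plain dicts stand in for BlockAllocationRecord here.


def build_payload(model_name, records):
    """Adapter side: send the current PID as the scheduler instance_id."""
    return [os.getpid(), model_name, records]


def unpack_payload(payload):
    """Server side: positional unpack of the three-element message."""
    instance_id, model_name, records = payload
    return instance_id, model_name, records


payload = build_payload("llama-3-8b", [{"req_id": "r1", "new_block_ids": [0, 1]}])
instance_id, model_name, records = unpack_payload(payload)
```

Because the list is positional, both ends must agree on the three-element layout; that is the version-skew risk the "Medium Risk" note below calls out.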

Note

Medium Risk: Changes the wire protocol and handler signature for `REPORT_BLOCK_ALLOCATION`, so mismatched client/server versions could break block-allocation reporting; metric attribute additions and shadow-map keying also affect lifecycle metric cardinality and behavior.

Overview

Adds `instance_id` and `model_name` propagation to `MP_VLLM_BLOCK_ALLOCATION` from the vLLM adapter through the multiprocess protocol/server event metadata, so L0 lifecycle tracking can distinguish blocks across multiple scheduler instances. Updates `L0LifecycleSubscriber` to key its shadow state by `(instance_id, block_id)` and to emit eviction/reuse histograms with `instance_id`/`model_name` OTel attributes for Prometheus slicing; tests and observability docs are updated accordingly, including a new assertion that histogram data points carry these attributes.

Reviewed by Cursor Bugbot for commit 2343215. Bugbot is set up for automated code reviews on this repo.
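The composite `(instance_id, block_id)` keying that the overview describes can be sketched with a minimal shadow map. This is a hypothetical simplification (the real `L0LifecycleSubscriber` tracks more state and emits OTel histograms); it only shows why the composite key prevents cross-instance collisions on the same block ID:

```python
# Minimal sketch of a shadow map keyed by (instance_id, block_id):
# record when a block was allocated, and compute its lifetime when freed.
# Assumed semantics; the real subscriber also records histograms with
# instance_id/model_name attributes.
class ShadowMap:
    def __init__(self):
        # (instance_id, block_id) -> allocation timestamp
        self._alloc_ts = {}

    def on_allocate(self, instance_id, block_id, now):
        self._alloc_ts[(instance_id, block_id)] = now

    def on_free(self, instance_id, block_id, now):
        """Return the block's lifetime, or None if it was never tracked."""
        ts = self._alloc_ts.pop((instance_id, block_id), None)
        return None if ts is None else now - ts


shadow = ShadowMap()
shadow.on_allocate(instance_id=1, block_id=42, now=100.0)
# Same block_id on a different scheduler instance: no collision.
shadow.on_allocate(instance_id=2, block_id=42, now=105.0)
lifetime = shadow.on_free(1, 42, now=110.0)
```

With a plain `block_id` key, the second allocation above would have overwritten the first instance's entry and skewed its eviction/reuse histograms.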