[MP] Add tracing functionality for storage manager #3063
Conversation
Signed-off-by: ApostaC <yihua98@uchicago.edu>
Code Review
This pull request introduces a trace recording subsystem for LMCache, enabling the capture of StorageManager API calls into a binary file for later replay. The implementation includes a tracing decorator, type codecs for msgpack serialization, a binary file format, and integration into the cache server. Feedback highlights a critical issue where overwriting the file header in-place with a different size would corrupt the length-prefixed stream, and suggests combining the length prefix and frame body into a single write operation to ensure atomicity and improve performance.
```python
def attach_storage_config(self, config: StorageManagerConfig) -> None:
    """Rewrite the header now that the StorageManagerConfig is known.

    The header is initially written with empty config fields so the
    recorder can be constructed before the StorageManager. The
    server lifecycle calls this method once the config is in hand.

    For long traces this means the on-disk header at offset 0 is
    re-written; the file position advances forward as records
    accumulate. Implementation: seek to 0, overwrite the header
    frame in-place, seek back to end.
    """
    with self._lock:
        if self._closed:
            return
        sm_json = json.dumps(self._safe_config_dict(config), sort_keys=True)
        digest = hashlib.sha256(sm_json.encode("utf-8")).hexdigest()
        cur = self._fd.tell()
        self._fd.seek(0)
        self._write_header(sm_json, digest)
        self._fd.seek(cur)
```
The attach_storage_config method seeks to the beginning of the file and overwrites the header frame. Since the trace file is a sequential length-prefixed msgpack stream, overwriting the header with a frame of a different size (which is guaranteed here as the initial header has empty config strings while the updated one contains the full JSON config) will corrupt the stream. The new, larger header will overwrite the start of the subsequent Record frames, making the file unreadable by the TraceReader or any replay tool.
Instead of overwriting the header in-place, consider using a dedicated 'Configuration' record type that can be appended to the stream whenever the configuration is available or changes.
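For illustration, a minimal sketch of that suggestion, assuming the recorder's existing `_write_frame`/`_safe_config_dict` helpers and msgpack framing; the record schema and `CONFIG_RECORD_KIND` tag are hypothetical, not the PR's actual format:

```python
import hashlib
import json

import msgpack

# Hypothetical record tag; the PR's actual record schema may differ.
CONFIG_RECORD_KIND = "config"

def append_config_record(self, config) -> None:
    """Append the config as an ordinary record instead of rewriting offset 0."""
    sm_json = json.dumps(self._safe_config_dict(config), sort_keys=True)
    digest = hashlib.sha256(sm_json.encode("utf-8")).hexdigest()
    frame = msgpack.packb(
        {"kind": CONFIG_RECORD_KIND, "config": sm_json, "sha256": digest}
    )
    with self._lock:
        if self._closed:
            return
        self._write_frame(frame)  # appends at the current end of the stream
```

Appending keeps every earlier frame intact, so the length-prefixed stream stays valid no matter when (or how often) the configuration becomes available.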
```python
self._fd.write(_LEN_STRUCT.pack(len(frame)))
self._fd.write(frame)
```
The _write_frame method performs two separate write() calls to the unbuffered file descriptor (buffering=0). This breaks the atomicity mentioned in the format.py documentation (line 13), which relies on a single write() call for frames smaller than PIPE_BUF. Additionally, performing two system calls per record is inefficient for high-frequency tracing.
Combine the length prefix and the frame body into a single bytes object before writing.
```diff
-self._fd.write(_LEN_STRUCT.pack(len(frame)))
-self._fd.write(frame)
+self._fd.write(_LEN_STRUCT.pack(len(frame)) + frame)
```
```diff
@@ -37,6 +37,14 @@
 from lmcache.v1.memory_management import MemoryObj
```
```python
# Now that ``PrefetchHandle`` exists in this module's namespace, finish
# wiring its trace codec. Done here (rather than at codec import time)
# to avoid a ``storage_manager → trace.codecs → storage_manager`` cycle.
register_prefetch_handle_codec()
```
This is not good. The correct way to break the circular import is to move PrefetchHandle into lmcache/v1/distributed/api.py, and this makes more sense because it's a module-level API data class.
```python
event_bus = init_observability(obs_config)

# Wire up the trace recorder (no-op when --trace-level is unset).
maybe_register_trace_recorder(event_bus, obs_config, storage_manager_config)
```
Can we rename it to maybe_initialize_trace_recorder?
```python
lmcache_version: str
"""``lmcache.__version__`` of the recording process."""
```
Can you confirm whether LMCache has lmcache.__version__? I don't think so for now. Maybe maintain a separate version in this file? One potential problem is that a package version cannot capture interface changes in the storage manager and the other functions that are going to be captured.
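One hedged option along those lines (a sketch only; the constant and field names are assumptions, not part of the PR):

```python
from dataclasses import dataclass

# Hypothetical: a version owned by the trace module itself, bumped whenever
# the traced StorageManager interface or the record schema changes, so it
# tracks exactly what the trace captures (unlike a package-wide version).
TRACE_FORMAT_VERSION = 1

@dataclass(frozen=True)
class TraceHeaderVersionField:
    trace_format_version: int = TRACE_FORMAT_VERSION
```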
Will the recorder have a close function? (Not sure if it's needed.)
The real question behind this: how do we make sure the output trace file is closed correctly?
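One way to guarantee that, sketched under assumptions (an idempotent `close()` plus an `atexit` fallback; the PR's actual lifecycle may differ):

```python
import atexit

class StorageTraceRecorder:  # simplified sketch, not the PR's full class
    def __init__(self, output_path: str) -> None:
        self._fd = open(output_path, "wb", buffering=0)
        self._closed = False
        # Fallback: close on interpreter exit even if the server's normal
        # shutdown path (EventBus.stop -> shutdown) never runs.
        atexit.register(self.close)

    def close(self) -> None:
        if self._closed:  # idempotent: safe to call from multiple paths
            return
        self._closed = True
        self._fd.close()
```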
Cursor Bugbot has reviewed your changes and found 1 potential issue.
```python
recorder = StorageTraceRecorder(output_path=output_path)
recorder.attach_storage_config(storage_manager_config)
bus.register_subscriber(recorder)
```
**No cleanup on error leaves gate on and fd leaked** (Medium Severity)
If attach_storage_config or bus.register_subscriber raises after the StorageTraceRecorder constructor succeeds, the tracing gate remains permanently ON (set inside __init__) and the file descriptor leaks because close() is never called. Since the recorder is never registered on the bus, EventBus.stop() won't invoke its shutdown() either. Every subsequent traced call then pays argument-binding overhead for events that no subscriber will ever process. A try/except wrapping these calls that calls recorder.close() on failure would prevent the resource leak and stuck gate.
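A sketch of that guard, assuming `StorageTraceRecorder` exposes an idempotent `close()`:

```python
recorder = StorageTraceRecorder(output_path=output_path)
try:
    recorder.attach_storage_config(storage_manager_config)
    bus.register_subscriber(recorder)
except Exception:
    recorder.close()  # flip the tracing gate off and release the fd
    raise
```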
```diff
@@ -0,0 +1,376 @@
+# Trace Recording (`lmcache trace`)
```
High-level question: should we trace MP server IPC calls, or just trace L1 calls?
For me, L1 calls are the easiest to implement, require no GPU, and are useful --- I am OK with keeping them. But IPC calls seem the most relevant for evaluating GPU-related race conditions.
We will do MPCacheEngine and MQ (IPC) level tracing as the next step. For this version, let's just do storage-level tracing
```
All captured calls publish the same `EventType.TRACE_CALL` event:
```
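The fenced example that follows this line in the doc is truncated in this excerpt. Purely as a hedged placeholder, a captured call could look roughly like this (every name here except `EventType.TRACE_CALL` is an assumption, not the PR's actual API):

```python
# Hypothetical sketch -- the real payload/schema is not shown above.
event_bus.publish(
    EventType.TRACE_CALL,
    {
        "method": "StorageManager.contains",  # which public-API call fired
        "args": encoded_args,                 # codec-encoded inputs (entry-only)
    },
)
```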
I'm not familiar with the EventBus impl. Will lmcache trace emit too many events and cause the EventBus to overflow?
Not too many. Usually, a call to a StorageManager top-level function will emit multiple events, so this won't overflow the event bus.
In the worst case, the event bus has some overflow handling so that memory usage won't explode.
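For intuition only, overflow handling of this kind is typically a bounded queue that drops and counts events when full; a generic sketch, not LMCache's actual EventBus:

```python
import queue

class BoundedEventQueue:
    """Generic illustration: memory stays bounded by dropping when full."""

    def __init__(self, maxsize: int = 10_000) -> None:
        self._q: queue.Queue = queue.Queue(maxsize=maxsize)
        self.dropped = 0

    def put(self, event) -> None:
        try:
            self._q.put_nowait(event)
        except queue.Full:
            self.dropped += 1  # count the drop instead of growing unboundedly
```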
```
`attach_storage_config`; gate flip on init / off on close;
publish-via-EventBus end-to-end (codec encode → file → reader → codec
decode); truncated-tail tolerance; bad-magic rejection;
`dropped_count` increments on unencodable args.
```
NIT: ideally the doc should be shorter --- this doc is for both humans and LLMs.
```
~~~~~~

Replaying a recorded trace is delivered separately via the
``lmcache trace`` and ``lmcache bench trace-replay`` CLIs.
```
English-wise, `lmcache trace` sounds like the same thing as `lmcache bench trace-replay`, and it can be a bit confusing.
Maybe `lmcache trace capture` and `lmcache trace replay`, and leave `lmcache bench` for just OpenAI API call benchmarks.
Agree, will update in the next PR where lmcache trace is really implemented.
```python
@dataclass(frozen=True)
class PrefetchHandle:
```
QQ: is this inside the scope of this PR, or is it just a handy refactor?
It's a small refactor. Otherwise there would be a circular import: previously, the trace module needed to import PrefetchHandle from StorageManager to do serde, and StorageManager needed to import the trace module to generate the trace.
This refactor actually should have been done earlier -- PrefetchHandle is a public API data type, so it should live in api.py, not storage_manager.py.
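Roughly what that refactor looks like (the field list is illustrative; only the move to `distributed/api.py` is from the PR):

```python
# lmcache/v1/distributed/api.py
from dataclasses import dataclass

@dataclass(frozen=True)
class PrefetchHandle:
    request_id: str  # illustrative field; the real dataclass may differ

# Both storage_manager.py and trace/codecs.py now import from api.py,
# so neither needs to import the other:
#     from lmcache.v1.distributed.api import PrefetchHandle
```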
```python
register_codec(
    torch.dtype,
    TypeCodec(tag="torch.dtype", encode=_enc_torch_dtype, decode=_dec_torch_dtype),
)
```
QQ: should we use Protobuf here so that we can easily ensure the reader/writer have the same version, or is it too heavy?
Asking this because the codec feels like redoing things in protobuf.
Actually, protobuf is more complicated. Here we just want a safe way to serialize/deserialize the function arguments when tracing.
Pickle is the easiest one to use, but it has a big security problem -- attackers can embed executable code in the trace and have it executed when a user runs lmcache trace replay.
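To make the contrast concrete: the codec path decodes via whitelisted, data-only transforms, while pickle executes reduction callables on load. A hedged guess at what the `_enc_torch_dtype`/`_dec_torch_dtype` functions referenced above might look like (their real bodies aren't shown in this thread):

```python
import torch

def _enc_torch_dtype(dtype: torch.dtype) -> str:
    # str(torch.float16) == "torch.float16" -- a plain, inert string.
    return str(dtype)

def _dec_torch_dtype(s: str) -> torch.dtype:
    # Attribute lookup only; no code execution, unlike pickle loads.
    attr = s.split(".", 1)[1]           # "torch.float16" -> "float16"
    dtype = getattr(torch, attr)
    assert isinstance(dtype, torch.dtype)
    return dtype
```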
**KuntaiDu** left a comment
Left several quick design-related questions. PTAL.


**What this PR does / why we need it:**

Adds a trace-recording subsystem under `lmcache/v1/mp_observability/trace/` that captures StorageManager public-API calls to a binary trace file. This enables offline inspection and replay of cache workloads for debugging and benchmarking.

User-facing additions:

- `--trace-level storage`: enable trace recording at the storage-manager level (only `storage` is supported today).
- `--trace-output <path>`: target trace file. Defaults to a timestamped file under `$TMPDIR` when `--trace-level` is set without an explicit path.

Public API under `lmcache.v1.mp_observability.trace`:

- `enable_tracing` decorator — publishes a `TRACE_CALL` event on each call (entry-only, inputs only). When tracing is disabled (the default), this adds a single boolean attribute load per call.
- `StorageTraceRecorder` — `EventSubscriber` that writes `TRACE_CALL` events to the binary trace file.
- `TraceReader` — streaming reader for inspection.

Special notes for your reviewers:

- Per-type codecs (`trace/codecs.py`) keep the binary format compact and stable across runs.
- The recorder is initialized in the MP server entry points (`server.py`, `blend_server_v2.py`) via `maybe_register_trace_recorder`, called from the observability lifecycle.
**Note**

**Medium Risk**: Adds new file I/O and event publication in core `StorageManager` call paths when enabled, plus new CLI/config surface; disabled-by-default gating limits impact, but regressions could affect performance or shutdown/flush behavior.

**Overview**

Adds a new binary trace-recording path for MP mode that captures `StorageManager` public-API calls as unified `EventType.TRACE_CALL` events and persists them to a length-prefixed msgpack trace file (a new `trace/` package with decorator, codec registry, file format, recorder, and streaming reader).

Wires the feature into `lmcache server` via new observability flags `--trace-level storage` and `--trace-output`, initializes the recorder in MP server entry points, and instruments `StorageManager` methods (including manual enter/exit events for the `read_prefetched_results` context manager). Also moves `PrefetchHandle` into `distributed/api.py`, updates docs/contracts, and adds unit tests covering gating/overhead, codec round-trips, and end-to-end record/write/read behavior.