[CLI] Introduce lmcache trace CLI #3075
Conversation
Signed-off-by: ApostaC <yihua98@uchicago.edu>
Code Review
This pull request introduces the trace replay subsystem for LMCache, enabling users to inspect and reissue recorded storage-level traces via the new lmcache trace CLI. The implementation includes a TraceCommand with info and replay subcommands, alongside a dedicated replay driver that honors recorded timings to maintain asynchronous consistency. Feedback identifies a performance bottleneck in latency statistics collection, a _remove_argparse_group helper that violates the repository's SLF lint discipline by accessing private argparse attributes, and directory validation that should be moved before the main replay loop so permission or path problems fail early.
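On the directory-validation point, a minimal sketch of what fail-fast checking before the replay loop could look like; `validate_output_dir` and the probe-file approach are illustrative assumptions, not the PR's actual code:

```python
# Hypothetical sketch: validate the output directory up front so permission
# or path problems surface before any replay work begins.
from pathlib import Path

def validate_output_dir(path: str) -> Path:
    out = Path(path)
    out.mkdir(parents=True, exist_ok=True)  # raises early on invalid paths
    probe = out / ".write_probe"
    probe.touch()    # raises PermissionError here, not mid-replay
    probe.unlink()
    return out
```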
ApostaC left a comment
Comments are below.
This change is not needed.
All the trace replay-related functionality should be moved under lmcache/cli/commands/trace.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Reviewed by Cursor Bugbot for commit 105fc8f.
```python
t0 = time.monotonic()
try:
    self._dispatcher.dispatch(record.qualname, context, decoded_args)
```
IIUC, even though the underlying dispatched StorageManager method is non-blocking, dispatch itself is blocking. I would recommend putting this on a separate thread to reduce the chance of the replayer machine falling behind the collected trace.
It would be even better if we could have the dispatcher in C/C++, since later we could have the case where the trace is collected and aggregated from multiple nodes and replayed on a local dev machine. Python with the GIL can easily fall behind the trace. (This can also be a long-term goal.)
Oh this makes sense! Will do it in a follow-up PR.
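A minimal sketch of the suggestion above, assuming a dispatcher/record shape like the snippet earlier in the thread; `ThreadedReplayLoop` and its names are hypothetical, not the follow-up PR's design:

```python
# Hypothetical sketch: run each dispatch on a worker thread so the replay
# loop only pays submission cost and can keep pace with the recorded timeline.
from concurrent.futures import ThreadPoolExecutor

class ThreadedReplayLoop:
    def __init__(self, dispatcher, max_workers: int = 4):
        self._dispatcher = dispatcher
        self._pool = ThreadPoolExecutor(max_workers=max_workers)

    def submit(self, record, context, decoded_args):
        # Returns immediately; the blocking dispatch happens off the loop thread.
        return self._pool.submit(
            self._dispatcher.dispatch, record.qualname, context, decoded_args
        )

    def shutdown(self):
        # Wait for in-flight dispatches before reporting replay totals.
        self._pool.shutdown(wait=True)
```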
```python
t0 = time.monotonic()
try:
    self._dispatcher.dispatch(record.qualname, context, decoded_args)
    latency = time.monotonic() - t0
```
We are recording the elapsed time of the non-blocking methods, not the actual I/O time (even though it may reflect the actual performance to some extent). I feel this would be more useful if we reported the actual performance metrics.
The actual I/O performance needs to be monitored via Prometheus metrics / OTel events. The event bus is enabled in the replayer, so it will generate the observability events.
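For context, a self-contained illustration of the distinction being discussed: timing around a non-blocking submit captures submission overhead, while the I/O cost only appears when the work completes. Nothing here is LMCache API; `do_async_io` is a stand-in:

```python
# Illustrative only: submission latency vs. completion latency.
import time
from concurrent.futures import ThreadPoolExecutor

pool = ThreadPoolExecutor(max_workers=1)

def do_async_io():
    time.sleep(0.05)  # pretend this is 50 ms of real I/O

t0 = time.monotonic()
future = pool.submit(do_async_io)         # returns almost immediately
submit_latency = time.monotonic() - t0    # ~microseconds: what the replayer records

future.result()                           # block until the I/O actually finishes
completion_latency = time.monotonic() - t0  # ~50 ms: what Prometheus/OTel should capture

print(f"submit: {submit_latency*1e6:.0f} us, completion: {completion_latency*1e3:.0f} ms")
pool.shutdown()
```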

What this PR does / why we need it:
Introduces a new `lmcache trace` CLI for working with storage-level trace files produced by `lmcache server --trace-level storage`. This enables offline analysis and replay of recorded workloads.

Subcommands:

- `lmcache trace info FILE` — print header metadata and per-qualname record counts.
- `lmcache trace replay FILE ...` — reissue recorded calls against a fresh `StorageManager`, honoring recorded inter-call timings (a pacing sketch follows below). Accepts the standard storage-manager flags plus the full observability flag surface (shared with `lmcache server`), `--verbose`/`--jsonl-out` for per-record output, `--output-dir`/`--no-csv`/`--json` for aggregated summaries, and `-q` to suppress the terminal metrics table.

Trace capture is intentionally not a `trace` subcommand — recording is bound to the live process via `lmcache server --trace-level storage [--trace-output …]`. Surfacing a CLI stub here would only duplicate that flag while suggesting a runtime-capture CLI that does not yet exist.

The CLI, replay driver, dispatcher, and stats collector are colocated under `lmcache/cli/commands/trace/`.
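As a rough illustration of what "honoring recorded inter-call timings" means, a minimal pacing sketch; the `record.ts` field and `replay_paced` helper are assumptions, not the PR's actual replay driver:

```python
# Hypothetical sketch: issue each record at the same offset from replay start
# as it had from recording start. Late records are issued immediately.
import time

def replay_paced(records, dispatch):
    if not records:
        return
    trace_start = records[0].ts           # assumed per-record timestamp field
    replay_start = time.monotonic()
    for record in records:
        target = replay_start + (record.ts - trace_start)
        delay = target - time.monotonic()
        if delay > 0:
            time.sleep(delay)
        dispatch(record)
```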
Special notes for your reviewers:
If applicable:
- Documentation (`docs/source/mp/tracing_and_debugging.rst`, updated `docs/design/v1/mp_observability/trace.md`)
- Tests (`tests/cli/commands/test_trace_command.py`, `tests/cli/commands/trace/`)

Note
Medium Risk
Adds a new CLI entry point that constructs and drives a real `StorageManager` during trace replay, plus new stats/export paths; failures could impact user workflows and produce misleading replay results if edge cases are missed. Also adjusts trace-config digest computation shared with the recorder, which affects mismatch-detection semantics.
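For illustration, one common way such a config digest can be computed (a SHA-256 over canonical JSON); the actual algorithm lives alongside `safe_storage_config_dict()` and may differ:

```python
# Illustrative assumption, not the PR's algorithm: hash the canonical JSON
# form of the safe storage-config dict so recorder and replayer agree.
import hashlib
import json

def config_digest(safe_cfg: dict) -> str:
    canonical = json.dumps(safe_cfg, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()
```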
Overview
- Adds a new top-level `lmcache trace` command with `info` (header + op counts) and `replay` (timing-paced reissue of recorded `StorageManager` calls), including optional per-record streaming (`--verbose`, `--jsonl-out`), aggregated CSV/JSON exports, and a terminal metrics table.
- Implements replay internals (`StorageReplayDriver`, `CallDispatcher`, `ReplayStatsCollector`) to decode trace records, map recorded `qualname`s to live `StorageManager` methods (including FIFO pairing for `read_prefetched_results` enter/exit; see the sketch after this list), honor recorded inter-call timing, and report replay/skip/failure totals plus config-digest mismatches.
- Refactors trace recording to expose `safe_storage_config_dict()` so replay can compute the same header-digest algorithm, adds extensive docs for tracing/replay, and updates CI to skip the new `tests/cli/commands/trace` suite in the non-CUDA test job.
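As referenced in the list above, a minimal sketch of FIFO enter/exit pairing; the record field (`kind`) is an assumption, not the actual trace schema:

```python
# Hypothetical sketch: match each "exit" record to the oldest unmatched
# "enter" record, in first-in-first-out order.
from collections import deque

def pair_enter_exit(records):
    """Yield (enter, exit) pairs from an interleaved record stream."""
    pending = deque()
    for rec in records:
        if rec.kind == "enter":
            pending.append(rec)
        elif rec.kind == "exit" and pending:
            yield pending.popleft(), rec  # FIFO: first enter gets first exit
```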
Reviewed by Cursor Bugbot for commit c693525.