[CLI] Introduce lmcache trace CLI #3075

Merged
ApostaC merged 8 commits into LMCache:dev from ApostaC:claude/silly-bhaskara-replay on Apr 21, 2026
Conversation

Contributor

@ApostaC ApostaC commented Apr 18, 2026

Screenshot

[screenshot image]
What this PR does / why we need it:

Introduces a new lmcache trace CLI for working with storage-level trace files produced by lmcache server --trace-level storage. This enables offline analysis and replay of recorded workloads, which is useful for:

  • Regression hunting — reproduce production bugs offline by replaying captured traces against a build under investigation.
  • Performance characterization — measure L1/L2 latency distributions under realistic access patterns without vLLM or a GPU.
  • Configuration tuning — compare L1 sizes, eviction policies, and L2 adapter choices against identical input.

Subcommands:

  • lmcache trace info FILE — print header metadata and per-qualname record counts.
  • lmcache trace replay FILE ... — reissue recorded calls against a fresh StorageManager, honoring recorded inter-call timings. Accepts the standard storage-manager flags plus the full observability flag surface (shared with lmcache server), --verbose/--jsonl-out for per-record output, --output-dir/--no-csv/--json for aggregated summaries, and -q to suppress the terminal metrics table.

Trace capture is intentionally not a trace subcommand — recording is bound to the live process via lmcache server --trace-level storage [--trace-output …]. Surfacing a CLI stub here would only duplicate that flag while suggesting a runtime-capture CLI that does not yet exist.

The CLI, replay driver, dispatcher, and stats collector are colocated under lmcache/cli/commands/trace/.
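For intuition, the timing-paced reissue described above can be sketched roughly as follows. This is a simplified sketch with a hypothetical record shape and dispatch callable, not the actual implementation (which lives in StorageReplayDriver and CallDispatcher):

```python
import time
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class TraceRecord:
    ts: float      # capture-time monotonic timestamp in seconds (assumed shape)
    qualname: str  # e.g. "StorageManager.put"
    args: tuple


def replay(records: list[TraceRecord], dispatch: Callable[[str, tuple], Any]) -> int:
    """Reissue records, honoring the recorded inter-call gaps."""
    if not records:
        return 0
    start = time.monotonic()
    base = records[0].ts
    for rec in records:
        # Sleep until we reach the same offset from start as at capture time.
        delay = (rec.ts - base) - (time.monotonic() - start)
        if delay > 0:
            time.sleep(delay)
        dispatch(rec.qualname, rec.args)
    return len(records)
```

The key property is that pacing is computed against the replay start time, not per-gap, so small scheduling overshoots do not accumulate across a long trace.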

Special notes for your reviewers:

If applicable:

  • this PR contains user-facing changes: docs added (docs/source/mp/tracing_and_debugging.rst, updated docs/design/v1/mp_observability/trace.md)
  • this PR contains unit tests (tests/cli/commands/test_trace_command.py, tests/cli/commands/trace/)

Note

Medium Risk
Adds a new CLI entry point that constructs and drives a real StorageManager during trace replay, plus new stats/export paths; failures could impact user workflows and produce misleading replay results if edge cases are missed. Also adjusts trace-config digest computation shared with the recorder, which affects mismatch detection semantics.

Overview
Adds a new top-level lmcache trace command with info (header + op counts) and replay (timing-paced reissue of recorded StorageManager calls) including optional per-record streaming (--verbose, --jsonl-out), aggregated CSV/JSON exports, and a terminal metrics table.

Implements replay internals (StorageReplayDriver, CallDispatcher, ReplayStatsCollector) to decode trace records, map recorded qualnames to live StorageManager methods (including FIFO pairing for read_prefetched_results enter/exit), honor recorded inter-call timing, and report replay/skip/failure totals plus config-digest mismatch.
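The FIFO enter/exit pairing mentioned above can be illustrated with a small sketch. The record shape here is assumed for illustration; the real decoding lives in the trace package:

```python
from collections import deque


def pair_enter_exit(records):
    """Pair 'enter' and 'exit' records for the same qualname in FIFO order.

    Each record is a (qualname, phase, payload) tuple, where phase is
    'enter' or 'exit'. Returns (pairs, unmatched_exits).
    """
    pending: dict[str, deque] = {}
    pairs, unmatched = [], []
    for qualname, phase, payload in records:
        if phase == "enter":
            pending.setdefault(qualname, deque()).append(payload)
        else:
            # An 'exit' matches the oldest outstanding 'enter' for this qualname.
            queue = pending.get(qualname)
            if queue:
                pairs.append((qualname, queue.popleft(), payload))
            else:
                unmatched.append((qualname, payload))
    return pairs, unmatched
```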

Refactors trace recording to expose safe_storage_config_dict() so replay can compute the same header digest algorithm, adds extensive docs for tracing/replay, and updates CI to skip the new tests/cli/commands/trace suite in the non-CUDA test job.
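A common recipe for a stable config digest of the kind described above is canonical JSON plus SHA-256. This is a sketch of the general technique, not necessarily the exact algorithm fed by safe_storage_config_dict():

```python
import hashlib
import json


def config_digest(config: dict) -> str:
    """Digest a config dict deterministically.

    Sorting keys and fixing separators makes the digest independent of dict
    insertion order, so recorder and replayer agree when the configs match.
    """
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"), default=str)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```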

Reviewed by Cursor Bugbot for commit c693525.

ApostaC added 5 commits April 18, 2026 01:18
Signed-off-by: ApostaC <yihua98@uchicago.edu>
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces the trace replay subsystem for LMCache, enabling users to inspect and reissue recorded storage-level traces via the new lmcache trace CLI. The implementation includes a TraceCommand with subcommands for info, replay, and recording stubs, alongside a dedicated replay driver that honors recorded timings to maintain asynchronous consistency. Feedback identifies a performance bottleneck in latency statistics collection where $O(N^2)$ insertion complexity should be replaced with a sort-on-summary approach. Additionally, the _remove_argparse_group helper violates the repository's SLF discipline by accessing private argparse attributes, and directory validation should be moved before the main replay loop to ensure early failure on permission or path issues.
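The sort-on-summary approach suggested above keeps ingestion at amortized O(1) and defers a single O(N log N) sort to summary time. A minimal sketch (hypothetical class name, using a nearest-rank percentile):

```python
class LatencyStats:
    """Collect latency samples cheaply; sort once when a summary is requested."""

    def __init__(self):
        self._samples: list[float] = []

    def record(self, latency_s: float) -> None:
        # Plain append: amortized O(1), no sorted-position insertion.
        self._samples.append(latency_s)

    def percentile(self, p: float) -> float:
        if not self._samples:
            raise ValueError("no samples recorded")
        # Single O(N log N) sort at summary time instead of O(N^2) inserts.
        data = sorted(self._samples)
        idx = min(len(data) - 1, max(0, round(p / 100 * (len(data) - 1))))
        return data[idx]
```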

Comment thread lmcache/tools/trace_replay/stats.py Outdated
Comment thread lmcache/tools/trace_replay/stats.py Outdated
Comment thread lmcache/cli/commands/trace/__init__.py Outdated
Comment thread lmcache/cli/commands/trace/__init__.py Outdated
Comment thread lmcache/cli/commands/trace/driver.py
Contributor Author

@ApostaC ApostaC left a comment


Comments are below.

Contributor Author


This change is not needed.

Comment thread lmcache/tools/trace_replay/__init__.py Outdated
Contributor Author


All the trace replay-related functionality should be moved under lmcache/cli/commands/trace.

Signed-off-by: ApostaC <yihua98@uchicago.edu>

@cursor cursor Bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.


Reviewed by Cursor Bugbot for commit 105fc8f.

Comment thread lmcache/cli/commands/trace/driver.py
ApostaC added 2 commits April 18, 2026 02:12
Signed-off-by: ApostaC <yihua98@uchicago.edu>
Signed-off-by: ApostaC <yihua98@uchicago.edu>
Contributor

@KuntaiDu KuntaiDu left a comment


LGTM!

Contributor

@royyhuang royyhuang left a comment


Otherwise, LGTM!


```python
t0 = time.monotonic()
try:
    self._dispatcher.dispatch(record.qualname, context, decoded_args)
```
Contributor


IIUC, even though the underlying dispatched StorageManager method is non-blocking, dispatch itself is blocking. I would recommend putting this on a separate thread to reduce the chance of the replayer machine falling behind the collected trace.
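The suggestion above could be sketched as follows, assuming a generic dispatch callable (the real dispatcher is CallDispatcher): hand each call to a worker pool so the pacing loop pays only the cost of submit(), not of the dispatch itself:

```python
from concurrent.futures import ThreadPoolExecutor


def paced_dispatch(records, dispatch, max_workers=4):
    """Submit each dispatch to a thread pool so the pacing loop stays on schedule.

    `records` is an iterable of (qualname, args) tuples; `dispatch` is the
    potentially blocking call we want off the hot loop.
    """
    futures = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for qualname, args in records:
            # The loop only pays for submit(); dispatch() runs on a worker.
            futures.append(pool.submit(dispatch, qualname, args))
    # Exiting the context manager waits for all workers to finish.
    return [f.result() for f in futures]
```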

Contributor


Even better if we could have the dispatcher in C/C++, since later we may have traces collected and aggregated from multiple nodes and replayed on a local dev machine. Python with the GIL can easily fall behind the trace. (This can also be a longer-term item.)

Contributor Author


Oh this makes sense! Will do it in a follow-up PR.

```python
t0 = time.monotonic()
try:
    self._dispatcher.dispatch(record.qualname, context, decoded_args)
    latency = time.monotonic() - t0
```
Contributor


We are recording the elapsed time of the non-blocking methods, not the actual I/O time (even though it may reflect actual performance to some extent). I feel this would be more useful if we reported the actual performance metrics.

Contributor Author


The actual I/O time needs to be monitored via Prometheus metrics / OTel events. The event bus is enabled in the replayer, so it will emit the observability events.

@ApostaC ApostaC enabled auto-merge (squash) April 21, 2026 20:27
@github-actions github-actions Bot added the `full` (Run comprehensive tests on this PR) label Apr 21, 2026
@ApostaC ApostaC merged commit f4620dd into LMCache:dev Apr 21, 2026
39 of 40 checks passed