[CLI] Introduce lmcache trace CLI #3075

Merged
ApostaC merged 8 commits into LMCache:dev from ApostaC:claude/silly-bhaskara-replay on Apr 21, 2026
Conversation

Contributor

@ApostaC ApostaC commented Apr 18, 2026

Screenshot

[screenshot image]
What this PR does / why we need it:

Introduces a new lmcache trace CLI for working with storage-level trace files produced by lmcache server --trace-level storage. This enables offline analysis and replay of recorded workloads, which is useful for:

  • Regression hunting — reproduce production bugs offline by replaying captured traces against a build under investigation.
  • Performance characterization — measure L1/L2 latency distributions under realistic access patterns without vLLM or a GPU.
  • Configuration tuning — compare L1 sizes, eviction policies, and L2 adapter choices against identical input.

Subcommands:

  • lmcache trace info FILE — print header metadata and per-qualname record counts.
  • lmcache trace replay FILE ... — reissue recorded calls against a fresh StorageManager, honoring recorded inter-call timings. Accepts the standard storage-manager flags plus the full observability flag surface (shared with lmcache server), --verbose/--jsonl-out for per-record output, --output-dir/--no-csv/--json for aggregated summaries, and -q to suppress the terminal metrics table.

Trace capture is intentionally not a trace subcommand — recording is bound to the live process via lmcache server --trace-level storage [--trace-output …]. Surfacing a CLI stub here would only duplicate that flag while suggesting a runtime-capture CLI that does not yet exist.

The CLI, replay driver, dispatcher, and stats collector are colocated under lmcache/cli/commands/trace/.
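For intuition, the timing-paced reissue described above can be sketched roughly as follows. This is a simplified sketch with a hypothetical record shape and dispatch callable, not the actual implementation (which lives in StorageReplayDriver and CallDispatcher):

```python
import time
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class TraceRecord:
    ts: float      # capture-time monotonic timestamp in seconds (assumed shape)
    qualname: str  # e.g. "StorageManager.put"
    args: tuple


def replay(records: list[TraceRecord], dispatch: Callable[[str, tuple], Any]) -> int:
    """Reissue records, honoring the recorded inter-call gaps."""
    if not records:
        return 0
    start = time.monotonic()
    base = records[0].ts
    for rec in records:
        # Sleep until we reach the same offset from start as at capture time.
        delay = (rec.ts - base) - (time.monotonic() - start)
        if delay > 0:
            time.sleep(delay)
        dispatch(rec.qualname, rec.args)
    return len(records)
```

The key property is that pacing is computed against the replay start time, not per-gap, so small scheduling overshoots do not accumulate across a long trace.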

Special notes for your reviewers:

If applicable:

  • this PR contains user-facing changes: docs added (docs/source/mp/tracing_and_debugging.rst, updated docs/design/v1/mp_observability/trace.md)
  • this PR contains unit tests (tests/cli/commands/test_trace_command.py, tests/cli/commands/trace/)

Note

Medium Risk
Adds a new CLI entry point that constructs and drives a real StorageManager during trace replay, plus new stats/export paths; failures could impact user workflows and produce misleading replay results if edge cases are missed. Also adjusts trace-config digest computation shared with the recorder, which affects mismatch detection semantics.

Overview
Adds a new top-level lmcache trace command with info (header + op counts) and replay (timing-paced reissue of recorded StorageManager calls) including optional per-record streaming (--verbose, --jsonl-out), aggregated CSV/JSON exports, and a terminal metrics table.

Implements replay internals (StorageReplayDriver, CallDispatcher, ReplayStatsCollector) to decode trace records, map recorded qualnames to live StorageManager methods (including FIFO pairing for read_prefetched_results enter/exit), honor recorded inter-call timing, and report replay/skip/failure totals plus config-digest mismatch.
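The FIFO enter/exit pairing mentioned above can be illustrated with a small sketch. The record shape here is assumed for illustration; the real decoding lives in the trace package:

```python
from collections import deque


def pair_enter_exit(records):
    """Pair 'enter' and 'exit' records for the same qualname in FIFO order.

    Each record is a (qualname, phase, payload) tuple, where phase is
    'enter' or 'exit'. Returns (pairs, unmatched_exits).
    """
    pending: dict[str, deque] = {}
    pairs, unmatched = [], []
    for qualname, phase, payload in records:
        if phase == "enter":
            pending.setdefault(qualname, deque()).append(payload)
        else:
            # An 'exit' matches the oldest outstanding 'enter' for this qualname.
            queue = pending.get(qualname)
            if queue:
                pairs.append((qualname, queue.popleft(), payload))
            else:
                unmatched.append((qualname, payload))
    return pairs, unmatched
```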

Refactors trace recording to expose safe_storage_config_dict() so replay can compute the same header digest algorithm, adds extensive docs for tracing/replay, and updates CI to skip the new tests/cli/commands/trace suite in the non-CUDA test job.
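A common recipe for a stable config digest of the kind described above is canonical JSON plus SHA-256. This is a sketch of the general technique, not necessarily the exact algorithm fed by safe_storage_config_dict():

```python
import hashlib
import json


def config_digest(config: dict) -> str:
    """Digest a config dict deterministically.

    Sorting keys and fixing separators makes the digest independent of dict
    insertion order, so recorder and replayer agree when the configs match.
    """
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"), default=str)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```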

Reviewed by Cursor Bugbot for commit c693525.

ApostaC added 5 commits April 18, 2026 01:18
Signed-off-by: ApostaC <yihua98@uchicago.edu>
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces the trace replay subsystem for LMCache, enabling users to inspect and reissue recorded storage-level traces via the new lmcache trace CLI. The implementation includes a TraceCommand with subcommands for info, replay, and recording stubs, alongside a dedicated replay driver that honors recorded timings to maintain asynchronous consistency. Feedback identifies a performance bottleneck in latency statistics collection where $O(N^2)$ insertion complexity should be replaced with a sort-on-summary approach. Additionally, the _remove_argparse_group helper violates the repository's SLF discipline by accessing private argparse attributes, and directory validation should be moved before the main replay loop to ensure early failure on permission or path issues.
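The sort-on-summary approach suggested above keeps ingestion at amortized O(1) and defers a single O(N log N) sort to summary time. A minimal sketch (hypothetical class name, using a nearest-rank percentile):

```python
class LatencyStats:
    """Collect latency samples cheaply; sort once when a summary is requested."""

    def __init__(self):
        self._samples: list[float] = []

    def record(self, latency_s: float) -> None:
        # Plain append: amortized O(1), no sorted-position insertion.
        self._samples.append(latency_s)

    def percentile(self, p: float) -> float:
        if not self._samples:
            raise ValueError("no samples recorded")
        # Single O(N log N) sort at summary time instead of O(N^2) inserts.
        data = sorted(self._samples)
        idx = min(len(data) - 1, max(0, round(p / 100 * (len(data) - 1))))
        return data[idx]
```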

Comment thread lmcache/tools/trace_replay/stats.py Outdated
Comment thread lmcache/tools/trace_replay/stats.py Outdated
Comment thread lmcache/cli/commands/trace/__init__.py Outdated
Comment thread lmcache/cli/commands/trace/__init__.py Outdated
Comment thread lmcache/cli/commands/trace/driver.py
Contributor Author

@ApostaC ApostaC left a comment


Comments are below.

Contributor Author


This change is not needed.

Comment thread lmcache/tools/trace_replay/__init__.py Outdated
Contributor Author


All the trace replay-related functionality should be moved under lmcache/cli/commands/trace.

Signed-off-by: ApostaC <yihua98@uchicago.edu>

@cursor cursor Bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.


Reviewed by Cursor Bugbot for commit 105fc8f.

Comment thread lmcache/cli/commands/trace/driver.py
ApostaC added 2 commits April 18, 2026 02:12
Signed-off-by: ApostaC <yihua98@uchicago.edu>
Signed-off-by: ApostaC <yihua98@uchicago.edu>
Contributor

@KuntaiDu KuntaiDu left a comment


LGTM!

Contributor

@royyhuang royyhuang left a comment


Otherwise, LGTM!


```python
t0 = time.monotonic()
try:
    self._dispatcher.dispatch(record.qualname, context, decoded_args)
```
Contributor


IIUC, even though the underlying dispatched StorageManager method is non-blocking, dispatch itself is blocking. I would recommend putting this on a separate thread to reduce the chance of the replayer machine falling behind the collected trace.
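The suggestion above could be sketched as follows, assuming a generic dispatch callable (the real dispatcher is CallDispatcher): hand each call to a worker pool so the pacing loop pays only the cost of submit(), not of the dispatch itself:

```python
from concurrent.futures import ThreadPoolExecutor


def paced_dispatch(records, dispatch, max_workers=4):
    """Submit each dispatch to a thread pool so the pacing loop stays on schedule.

    `records` is an iterable of (qualname, args) tuples; `dispatch` is the
    potentially blocking call we want off the hot loop.
    """
    futures = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for qualname, args in records:
            # The loop only pays for submit(); dispatch() runs on a worker.
            futures.append(pool.submit(dispatch, qualname, args))
    # Exiting the context manager waits for all workers to finish.
    return [f.result() for f in futures]
```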

Contributor


Even better if we could have the dispatcher in C/C++, since later we may have traces collected and aggregated from multiple nodes and replayed on a local dev machine. Python with the GIL can easily fall behind the trace. (This can also be a longer-term item.)

Contributor Author


Oh this makes sense! Will do it in a follow-up PR.

```python
t0 = time.monotonic()
try:
    self._dispatcher.dispatch(record.qualname, context, decoded_args)
    latency = time.monotonic() - t0
```
Contributor


We are recording the elapsed time of the non-blocking methods, not the actual I/O time (even though it may reflect actual performance to some extent). I feel this would be more useful if we reported the actual performance metrics.

Contributor Author


The actual I/O time needs to be monitored via Prometheus metrics / OTel events. The event bus is enabled in the replayer, so it will emit the observability events.

@ApostaC ApostaC enabled auto-merge (squash) April 21, 2026 20:27
@github-actions github-actions Bot added the `full` (Run comprehensive tests on this PR) label Apr 21, 2026
@ApostaC ApostaC merged commit f4620dd into LMCache:dev Apr 21, 2026
39 of 40 checks passed