
[MP] Add tracing functionality for storage manager #3063

Merged
ApostaC merged 6 commits into LMCache:dev from ApostaC:claude/silly-bhaskara
Apr 18, 2026

Conversation


@ApostaC ApostaC commented Apr 16, 2026

What this PR does / why we need it:

Adds a trace-recording subsystem under lmcache/v1/mp_observability/trace/ that captures StorageManager public-API calls to a binary trace file. This enables offline inspection and replay of cache workloads for debugging and benchmarking.

User-facing additions:

  • --trace-level storage: enable trace recording at the storage-manager level (only storage is supported today).
  • --trace-output <path>: target trace file. Defaults to a timestamped file under $TMPDIR when --trace-level is set without an explicit path.

Public API under lmcache.v1.mp_observability.trace:

  • enable_tracing decorator — publishes a TRACE_CALL event on each call (entry-only, inputs only). When tracing is disabled (the default), this adds a single boolean attribute load per call.
  • StorageTraceRecorderEventSubscriber that writes TRACE_CALL events to the binary trace file.
  • TraceReader — streaming reader for inspection.
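
For orientation, here is a minimal sketch of how the decorator gating might look. The module-level gate flag and the `publish_trace_call` helper below are illustrative assumptions; only the names `enable_tracing` and `TRACE_CALL` come from this PR.

```python
# Sketch only: the real decorator lives in lmcache.v1.mp_observability.trace.
# The gate flag, payload shape, and publish helper below are assumptions.
import functools
import inspect

_TRACING_ENABLED = False  # flipped on when --trace-level storage is set


def publish_trace_call(method: str, arguments: dict) -> None:
    """Hypothetical stand-in for the EventBus TRACE_CALL publish."""
    print(f"TRACE_CALL {method}: {arguments}")


def enable_tracing(func):
    sig = inspect.signature(func)

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        # Disabled path: one boolean check, then straight to the call.
        if _TRACING_ENABLED:
            bound = sig.bind(*args, **kwargs)  # entry-only, inputs only
            publish_trace_call(func.__qualname__, dict(bound.arguments))
        return func(*args, **kwargs)

    return wrapper
```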

Special notes for your reviewers:

  • Trace payloads are serialized via custom codecs (trace/codecs.py) to keep the binary format compact and stable across runs.
  • The recorder is wired into MP servers (server.py, blend_server_v2.py) via maybe_register_trace_recorder called from the observability lifecycle.
  • Replay lives in a separate package and is not part of this PR.

If applicable:

  • this PR contains user-facing changes - docs added
  • this PR contains unit tests

Note

Medium Risk
Adds new file I/O and event publication in core StorageManager call paths when enabled, plus new CLI/config surface; disabled-by-default gating limits impact but regressions could affect performance or shutdown/flush behavior.

Overview
Adds a new binary trace-recording path for MP mode that captures StorageManager public-API calls as unified EventType.TRACE_CALL events and persists them to a length-prefixed msgpack trace file (new trace/ package with decorator, codec registry, file format, recorder, and streaming reader).

Wires the feature into lmcache server via new observability flags --trace-level storage and --trace-output, initializes the recorder in MP server entry points, and instruments StorageManager methods (including manual enter/exit events for the read_prefetched_results context manager). Also moves PrefetchHandle into distributed/api.py, updates docs/contracts, and adds unit tests covering gating/overhead, codec round-trips, and end-to-end record/write/read behavior.
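
To make the on-disk format concrete, here is a sketch of a length-prefixed msgpack stream as described above. The 4-byte big-endian prefix and the helper names are assumptions; the PR's actual `_LEN_STRUCT`, magic, and header layout may differ.

```python
# Sketch of a length-prefixed msgpack stream; prefix width/endianness and
# helper names are assumptions, not the PR's exact format.
import struct

import msgpack

LEN_STRUCT = struct.Struct(">I")  # assumed 4-byte big-endian length prefix


def write_frame(fd, payload: dict) -> None:
    frame = msgpack.packb(payload, use_bin_type=True)
    # One write keeps the length prefix and body together in the stream.
    fd.write(LEN_STRUCT.pack(len(frame)) + frame)


def read_frames(fd):
    while True:
        prefix = fd.read(LEN_STRUCT.size)
        if len(prefix) < LEN_STRUCT.size:
            return  # truncated tail: stop cleanly, as the reader tolerates
        (length,) = LEN_STRUCT.unpack(prefix)
        body = fd.read(length)
        if len(body) < length:
            return
        yield msgpack.unpackb(body, raw=False)
```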

Reviewed by Cursor Bugbot for commit 61cf5cb. Bugbot is set up for automated code reviews on this repo.

Signed-off-by: ApostaC <yihua98@uchicago.edu>
Comment thread lmcache/v1/mp_observability/trace/recorder.py Outdated
@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces a trace recording subsystem for LMCache, enabling the capture of StorageManager API calls into a binary file for later replay. The implementation includes a tracing decorator, type codecs for msgpack serialization, a binary file format, and integration into the cache server. Feedback highlights a critical issue where overwriting the file header in-place with a different size would corrupt the length-prefixed stream, and suggests combining the length prefix and frame body into a single write operation to ensure atomicity and improve performance.

Comment on lines +116 to +136
```python
def attach_storage_config(self, config: StorageManagerConfig) -> None:
    """Rewrite the header now that the StorageManagerConfig is known.

    The header is initially written with empty config fields so the
    recorder can be constructed before the StorageManager. The
    server lifecycle calls this method once the config is in hand.

    For long traces this means the on-disk header at offset 0 is
    re-written; the file position advances forward as records
    accumulate. Implementation: seek to 0, overwrite the header
    frame in-place, seek back to end.
    """
    with self._lock:
        if self._closed:
            return
        sm_json = json.dumps(self._safe_config_dict(config), sort_keys=True)
        digest = hashlib.sha256(sm_json.encode("utf-8")).hexdigest()
        cur = self._fd.tell()
        self._fd.seek(0)
        self._write_header(sm_json, digest)
        self._fd.seek(cur)
```

critical

The attach_storage_config method seeks to the beginning of the file and overwrites the header frame. Since the trace file is a sequential length-prefixed msgpack stream, overwriting the header with a frame of a different size (which is guaranteed here as the initial header has empty config strings while the updated one contains the full JSON config) will corrupt the stream. The new, larger header will overwrite the start of the subsequent Record frames, making the file unreadable by the TraceReader or any replay tool.

Instead of overwriting the header in-place, consider using a dedicated 'Configuration' record type that can be appended to the stream whenever the configuration is available or changes.
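
One way to realize that suggestion, sketched only: append a dedicated configuration record to the stream rather than rewriting offset 0. The record-type tag and field names here are assumptions, not the PR's schema.

```python
# Sketch: append a configuration record instead of rewriting the header
# at offset 0. Tag and field names are assumptions.
import hashlib
import json
import struct

import msgpack

_LEN = struct.Struct(">I")  # assumed length-prefix layout


def append_config_record(fd, config_dict: dict) -> None:
    sm_json = json.dumps(config_dict, sort_keys=True)
    record = {
        "kind": "config",
        "storage_manager_config": sm_json,
        "sha256": hashlib.sha256(sm_json.encode("utf-8")).hexdigest(),
    }
    frame = msgpack.packb(record, use_bin_type=True)
    fd.write(_LEN.pack(len(frame)) + frame)  # appended at the current end
```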

Comment on lines +185 to +186
```python
self._fd.write(_LEN_STRUCT.pack(len(frame)))
self._fd.write(frame)
```

medium

The _write_frame method performs two separate write() calls to the unbuffered file descriptor (buffering=0). This breaks the atomicity mentioned in the format.py documentation (line 13), which relies on a single write() call for frames smaller than PIPE_BUF. Additionally, performing two system calls per record is inefficient for high-frequency tracing.

Combine the length prefix and the frame body into a single bytes object before writing.

Suggested change:

```diff
-self._fd.write(_LEN_STRUCT.pack(len(frame)))
-self._fd.write(frame)
+self._fd.write(_LEN_STRUCT.pack(len(frame)) + frame)
```

@@ -37,6 +37,14 @@
from lmcache.v1.memory_management import MemoryObj

Contributor

Design doc plz

Contributor Author

updated

Comment thread lmcache/v1/mp_observability/trace/recorder.py Outdated
Comment thread lmcache/v1/mp_observability/trace/__init__.py
Signed-off-by: ApostaC <yihua98@uchicago.edu>

@ApostaC ApostaC left a comment


PTAL

Comment on lines +545 to +550


```python
# Now that ``PrefetchHandle`` exists in this module's namespace, finish
# wiring its trace codec. Done here (rather than at codec import time)
# to avoid a ``storage_manager → trace.codecs → storage_manager`` cycle.
register_prefetch_handle_codec()
```

Contributor Author

This is not good. The correct way to break the circular import is to move PrefetchHandle into lmcache/v1/distributed/api.py, and this makes more sense because it's a module-level API data class.

```python
event_bus = init_observability(obs_config)

# Wire up the trace recorder (no-op when --trace-level is unset).
maybe_register_trace_recorder(event_bus, obs_config, storage_manager_config)
```

Contributor Author

Can we rename it to maybe_initialize_trace_recorder?

Comment on lines +52 to +53
```python
lmcache_version: str
"""``lmcache.__version__`` of the recording process."""
```

Contributor Author

Can you confirm whether LMCache has lmcache.__version__? I don't think so for now. Maybe maintain a separate version in this file? One potential problem is that it cannot capture interface changes in the storage manager and the other functions that are going to be captured.
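
A sketch of the "separate version" idea; the constant and header field names below are assumptions, not part of the PR:

```python
# Sketch: a trace-format version maintained in the trace package itself,
# bumped whenever the traced StorageManager interface or the frame schema
# changes. Names are assumptions.
TRACE_FORMAT_VERSION = 1


def build_header_fields() -> dict:
    try:
        import lmcache
        lmcache_version = getattr(lmcache, "__version__", "unknown")
    except ImportError:
        lmcache_version = "unknown"
    return {
        "trace_format_version": TRACE_FORMAT_VERSION,
        "lmcache_version": lmcache_version,
    }
```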

Contributor Author

Will the recorder have a close function? (Not sure if it's needed.)
The real question behind this: how do we make sure the output trace file is closed correctly?
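
One possible answer, sketched under the assumption that the recorder exposes close() and that the EventBus calls each subscriber's shutdown() on stop(); the atexit fallback is an illustrative extra, not necessarily what the PR does.

```python
# Sketch: guarantee flush-and-close even if the normal shutdown path is
# skipped. Assumes a close() method and an EventBus that calls shutdown().
import atexit


class RecorderCloseSketch:
    def __init__(self, fd):
        self._fd = fd
        self._closed = False
        atexit.register(self.close)  # fallback if stop() never runs

    def shutdown(self) -> None:
        # Invoked by the EventBus lifecycle on stop(); delegates to close().
        self.close()

    def close(self) -> None:
        if self._closed:
            return
        self._closed = True
        self._fd.flush()
        self._fd.close()
```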

Signed-off-by: ApostaC <yihua98@uchicago.edu>
Comment thread tests/v1/mp_observability/trace/test_codecs.py Outdated
Signed-off-by: ApostaC <yihua98@uchicago.edu>

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.



```python
recorder = StorageTraceRecorder(output_path=output_path)
recorder.attach_storage_config(storage_manager_config)
bus.register_subscriber(recorder)
```

No cleanup on error leaves gate on and fd leaked

Medium Severity

If attach_storage_config or bus.register_subscriber raises after the StorageTraceRecorder constructor succeeds, the tracing gate remains permanently ON (set inside __init__) and the file descriptor leaks because close() is never called. Since the recorder is never registered on the bus, EventBus.stop() won't invoke its shutdown() either. Every subsequent traced call then pays argument-binding overhead for events that no subscriber will ever process. A try/except wrapping these calls that calls recorder.close() on failure would prevent the resource leak and stuck gate.
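
A sketch of the suggested cleanup, reusing the names from the snippet above; that close() flips the gate back off is assumed behavior, and this is not the merged code.

```python
# Sketch of the suggested fix: if wiring fails after construction, close the
# recorder so the tracing gate is turned back off and the fd is released.
recorder = StorageTraceRecorder(output_path=output_path)
try:
    recorder.attach_storage_config(storage_manager_config)
    bus.register_subscriber(recorder)
except Exception:
    recorder.close()  # assumed to flip the gate off and close the fd
    raise
```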



@sammshen sammshen left a comment


LGTM!

@@ -0,0 +1,376 @@
# Trace Recording (`lmcache trace`)

Contributor

High-level question: should we trace MP server IPC calls, or just trace L1 calls?

@KuntaiDu KuntaiDu Apr 17, 2026

For me, L1 calls are the easiest to implement, require no GPU, and are useful --- I am OK with keeping it. But IPC calls seem most relevant for evaluating GPU-related race conditions.

Contributor Author

We will do MPCacheEngine- and MQ (IPC)-level tracing as the next step. For this version, let's just do storage-level tracing.


All captured calls publish the same `EventType.TRACE_CALL` event:

Contributor

I'm not familiar with the EventBus impl. Will lmcache trace invoke too many events and cause the EventBus to overflow?

Contributor Author

Not too many. Usually, a single call to one of StorageManager's top-level functions emits multiple events, but that is still not enough to overflow the event bus.

In the worst case, the event bus has overflow handling, so memory usage won't explode.

`attach_storage_config`; gate flip on init / off on close;
publish-via-EventBus end-to-end (codec encode → file → reader → codec
decode); truncated-tail tolerance; bad-magic rejection;
`dropped_count` increments on unencodable args.

@KuntaiDu KuntaiDu Apr 17, 2026

NIT: ideally the doc should be shorter --- this doc is for both humans and LLMs.


Replaying a recorded trace is delivered separately via the
``lmcache trace`` and ``lmcache bench trace-replay`` CLIs.

Contributor

English-wise, lmcache trace sounds like the same thing as lmcache bench trace-replay, which can be a bit confusing.

Maybe use lmcache trace capture and lmcache trace replay, and leave lmcache bench for just OpenAI API call benchmarks.

Contributor Author

Agree, will update in the next PR where lmcache trace is really implemented.



```python
@dataclass(frozen=True)
class PrefetchHandle:
```

Contributor

QQ: is this inside the scope of this PR, or is it just a handy refactor?

Contributor Author

It's a small refactor. Otherwise there would be a circular import: previously, the trace module needed to import PrefetchHandle from StorageManager to do serde, and StorageManager needs to import the trace module to generate the trace.

This refactor actually should have been done earlier -- PrefetchHandle is a public API data type, so it should live in api.py, not storage_manager.py.

```python
register_codec(
    torch.dtype,
    TypeCodec(tag="torch.dtype", encode=_enc_torch_dtype, decode=_dec_torch_dtype),
)
```

Contributor

QQ: should we use Protobuf here so that we can easily ensure the reader / writer have the same version, or is it too heavy?

Contributor

Asking this because the codec feels like re-doing what protobuf already provides.

Contributor Author

Actually, protobuf is more complicated. Here we just want a safe way to serialize/deserialize the function arguments when tracing.
Pickle would be the easiest to use, but it has a big security problem -- an attacker can embed executable code in the trace, which would run when a user does lmcache trace replay.
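
To illustrate the trade-off, a small sketch of the tag-based codec idea: only explicitly registered types round-trip, and decoding never executes arbitrary code the way unpickling can. The registry shape loosely mirrors the TypeCodec/register_codec snippet quoted above, but the names and field layout here are assumptions.

```python
# Sketch of a whitelisted, tag-based codec registry (loosely mirroring the
# TypeCodec/register_codec API quoted above; names here are assumptions).
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class TypeCodecSketch:
    tag: str
    encode: Callable[[Any], Any]
    decode: Callable[[Any], Any]


_BY_TYPE: dict = {}
_BY_TAG: dict = {}


def register_codec_sketch(typ: type, codec: TypeCodecSketch) -> None:
    _BY_TYPE[typ] = codec
    _BY_TAG[codec.tag] = codec


def encode_value(value: Any) -> Any:
    codec = _BY_TYPE.get(type(value))
    if codec is None:
        # Unencodable arg: the real recorder would bump dropped_count here.
        raise TypeError(f"no codec registered for {type(value)!r}")
    return {"__tag__": codec.tag, "data": codec.encode(value)}


def decode_value(obj: dict) -> Any:
    # Only registered decoders run; nothing in the trace can inject code.
    return _BY_TAG[obj["__tag__"]].decode(obj["data"])


# Example: complex numbers round-trip through a plain list of floats.
register_codec_sketch(
    complex,
    TypeCodecSketch(
        tag="py.complex",
        encode=lambda c: [c.real, c.imag],
        decode=lambda d: complex(d[0], d[1]),
    ),
)
assert decode_value(encode_value(3 + 4j)) == 3 + 4j
```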

@KuntaiDu KuntaiDu left a comment

Left several quick design-related questions. PTAL.

@KuntaiDu KuntaiDu left a comment

LGTM

@ApostaC ApostaC enabled auto-merge (squash) April 17, 2026 22:56
@github-actions Bot added the "full" (Run comprehensive tests on this PR) label Apr 17, 2026
@ApostaC ApostaC merged commit 0c4f03d into LMCache:dev Apr 18, 2026
51 of 54 checks passed