feat: add chunk hashes logger to MP server for offline data analysis#2928
chunxiaozheng merged 11 commits into LMCache:dev
Code Review
This pull request introduces an asynchronous ChunkHashLogger to record chunk hashes to rotating JSONL files during lookups in the multiprocess server, including new configuration options and CLI arguments. The review feedback highlights several areas for improvement: the file rotation logic should be decoupled from model name changes to prevent excessive file creation, the logger should initialize its file list from existing files to maintain retention limits across restarts, and file operations should be hardened with explicit encoding and better handle management.
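The retention point in the review above can be illustrated with a minimal sketch: on startup, rebuild the file list from what is already on disk and prune it, so the limit holds across restarts. The file pattern `chunk_hashes_*.jsonl` and the helpers `discover_existing_logs` and `prune_old_logs` are hypothetical names chosen for illustration, not the PR's code.

```python
import os
import tempfile
from pathlib import Path
from typing import List


def discover_existing_logs(log_dir: Path, pattern: str = "chunk_hashes_*.jsonl") -> List[Path]:
    """Return log files already on disk, oldest first (by modification time)."""
    # The glob pattern is an assumption for this sketch.
    return sorted(log_dir.glob(pattern), key=lambda p: (p.stat().st_mtime, p.name))


def prune_old_logs(files: List[Path], max_files: int) -> List[Path]:
    """Delete the oldest files until at most max_files remain; return the survivors."""
    excess = max(0, len(files) - max_files)
    for stale in files[:excess]:
        stale.unlink(missing_ok=True)
    return files[excess:]


# Simulate a restart: three files from a previous run are already on disk.
with tempfile.TemporaryDirectory() as tmp:
    log_dir = Path(tmp)
    for i in range(3):
        path = log_dir / f"chunk_hashes_{i}.jsonl"
        # Explicit encoding, as the review suggests for file operations.
        path.write_text("{}\n", encoding="utf-8")
        os.utime(path, (1000 + i, 1000 + i))  # force distinct, ordered mtimes
    kept = prune_old_logs(discover_existing_logs(log_dir), max_files=2)
    kept_names = [p.name for p in kept]
    remaining_on_disk = sorted(p.name for p in log_dir.iterdir())
```

Seeding the in-memory list from the directory means a process that restarts often cannot accumulate files beyond the configured limit.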
c842e81 to 5676ad6
@sammshen Would you like to take a look at this PR?
@yoo-kumaneko Thanks for your contribution! A minor question: will this have any performance impact?
@yoo-kumaneko Awesome feature, left some comments. @sammshen Would you like to take another look?
BTW, if you paste some analysis diagrams into the description, it would help reviewers quickly understand the motivation for this PR.
It should have a negligible performance effect. I ran a comparison test, and the hash logger has no visible effect on TTFT or other metrics.
[Figures: hash logger turned on; hash logger turned off]
I've added the diagrams to the PR description.
Quick question: why are we not using the existing observability / prometheus modules?
sammshen
left a comment
blocking until consulting @ApostaC and @royyhuang
IIUC, it's:
sammshen
left a comment
LGTM actually; the code changes to config.py and server.py seem pretty minimal.
Cherry-pick squashed changes from LMCache#2928 which adds a chunk hash file logger to the MP server for offline analysis. Signed-off-by: root <crclq2018@gmail.com> Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: rigginschen <rigginschen@tencent.com>
ApostaC
left a comment
Just one quick comment, please see below.
In the meantime, if we want to add new events or change existing event metadata schema, we should update https://github.com/yoo-kumaneko/LMCache/blob/dev/lmcache/v1/mp_observability/EVENTS.md to reflect the changes. This makes the code more maintainable for other developers (and AI tools)
    self._event_bus.publish(
        Event(
            event_type=EventType.MP_LOOKUP,
            session_id=key.request_id,
            metadata={
                "request_id": key.request_id,
                "chunk_hashes": chunk_hashes,
                "model_name": model_name,
                "chunk_size": self.chunk_size,
                "seq_len": len(key.token_ids),
                "dtypes": [str(d) for d in layout_desc.dtypes],
                "shapes": [list(s) for s in layout_desc.shapes],
            },
        )
    )
Potential alternative: we reuse the EventType.MP_LOOKUP_PREFETCH_START and just add more metadata to it?
@royyhuang Good to have your thoughts here as well.
@ApostaC Yes, we can reuse EventType.MP_LOOKUP_PREFETCH_START. However, we’ll need to move the publication of this event to after the chunk hashes are computed (since they’re required) and after the layout == None check. Does that sound OK?
Btw, I really like the diagrams in the description! Will it be possible to put the analysis script in the
Sure!
…tening Add EventBus.has_subscribers() to cheaply check if any callback is registered for a given EventType. Gate the MP_LOOKUP publish in MPCacheEngine.lookup() behind this check so that the metadata dict (including dtype/shape list comprehensions) is never allocated when the lookup hash logger is disabled. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: rigginschen <rigginschen@tencent.com>
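The gating described in this commit can be sketched as follows. This is a toy, synchronous stand-in written for illustration: the `EventBus`, `Event`, and `lookup` definitions here are assumptions and do not reproduce LMCache's actual classes; only the names `has_subscribers`, `MP_LOOKUP`, `request_id`, and `chunk_hashes` come from the PR.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from enum import Enum, auto
from typing import Any, Callable, DefaultDict, Dict, List


class EventType(Enum):
    MP_LOOKUP = auto()


@dataclass
class Event:
    event_type: EventType
    session_id: str
    metadata: Dict[str, Any] = field(default_factory=dict)


class EventBus:
    """Toy synchronous bus; the real EventBus may dispatch differently."""

    def __init__(self) -> None:
        self._subscribers: DefaultDict[EventType, List[Callable[[Event], None]]] = defaultdict(list)

    def subscribe(self, event_type: EventType, callback: Callable[[Event], None]) -> None:
        self._subscribers[event_type].append(callback)

    def has_subscribers(self, event_type: EventType) -> bool:
        # .get() avoids creating an empty list entry as a side effect.
        return bool(self._subscribers.get(event_type))

    def publish(self, event: Event) -> None:
        for callback in self._subscribers.get(event.event_type, ()):
            callback(event)


def lookup(bus: EventBus, request_id: str, chunk_hashes: List[int]) -> bool:
    """Stand-in for a lookup path; returns whether the metadata dict was built."""
    if not bus.has_subscribers(EventType.MP_LOOKUP):
        return False  # nothing is listening, so skip all allocation
    metadata = {"request_id": request_id, "chunk_hashes": list(chunk_hashes)}
    bus.publish(Event(EventType.MP_LOOKUP, request_id, metadata))
    return True


bus = EventBus()
built_without_subscriber = lookup(bus, "req-0", [1, 2])
received: List[Event] = []
bus.subscribe(EventType.MP_LOOKUP, received.append)
built_with_subscriber = lookup(bus, "req-1", [3, 4])
```

The check costs one dictionary lookup per request, which is why the metadata construction can stay out of the hot path when the logger is disabled.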
Head branch was pushed to by a user without write access
…LMCache#12)" This reverts commit ee037db. Signed-off-by: rigginschen <rigginschen@tencent.com>
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: rigginschen <rigginschen@tencent.com>
EVENTS.md updated
@yoo-kumaneko Please fix the UT, thanks!
* Revert "feat: cherry-pick chunk hash file logger from PR LMCache#2928 (#12)"
  This reverts commit ee037db.
  Signed-off-by: rigginschen <rigginschen@tencent.com>
* feat: add chunk hash logger as EventBus subscriber
  Add JSONL-based chunk hash logging to the multiprocess server for offline analysis of KV cache behavior. Implemented as a ChunkHashLoggingSubscriber on the EventBus; no extra queue or worker thread needed. Includes configurable log rotation, chunk metadata (chunk_size, seq_len, dtypes, shapes), and CLI args.
  Signed-off-by: Ryan <crclq2018@gmail.com>
  Signed-off-by: rigginschen <rigginschen@tencent.com>
* refactor: rename ChunkHashLogger to LookupHashLogger
  Rename the chunk hash logging subscriber to lookup hash logger to better reflect that it logs hashes observed during lookup operations.
  - chunk_hash.py → lookup_hash.py
  - ChunkHashLogConfig → LookupHashLogConfig
  - ChunkHashLoggingSubscriber → LookupHashLoggingSubscriber
  - --chunk-hash-log-* CLI args → --lookup-hash-log-*
  - lookup_hashes_*.jsonl file name pattern
  - Update docs and tests accordingly
  Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
  Signed-off-by: rigginschen <rigginschen@tencent.com>
* Use tell to get the accurate file size
  Signed-off-by: rigginschen <rigginschen@tencent.com>
---------
Signed-off-by: rigginschen <rigginschen@tencent.com>
Signed-off-by: Ryan <crclq2018@gmail.com>
Co-authored-by: rigginschen <rigginschen@tencent.com>
Co-authored-by: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: kumaneko <71458228+yoo-kumaneko@users.noreply.github.com>
Cursor Bugbot has reviewed your changes and found 1 potential issue.
There are 2 total unresolved issues (including 1 from previous review).
Reviewed by Cursor Bugbot for commit d816b9c.
Move tests/v1/multiprocess/test_lookup_hash_logger.py to tests/v1/mp_observability/subscribers/logging/ to match the source file structure and ensure tests run under the standard CI suite. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: kumaneko <crclq2018@gmail.com>
…MCache#2928)
* feat: add chunk hash logger as EventBus subscriber
  Add JSONL-based chunk hash logging to the multiprocess server for offline analysis of KV cache behavior. Implemented as a ChunkHashLoggingSubscriber on the EventBus; no extra queue or worker thread needed. Includes configurable log rotation, chunk metadata (chunk_size, seq_len, dtypes, shapes), and CLI args.
  Signed-off-by: Ryan <crclq2018@gmail.com>
  Signed-off-by: rigginschen <rigginschen@tencent.com>
* refactor: rename ChunkHashLogger to LookupHashLogger
  Rename the chunk hash logging subscriber to lookup hash logger to better reflect that it logs hashes observed during lookup operations.
  - chunk_hash.py → lookup_hash.py
  - ChunkHashLogConfig → LookupHashLogConfig
  - ChunkHashLoggingSubscriber → LookupHashLoggingSubscriber
  - --chunk-hash-log-* CLI args → --lookup-hash-log-*
  - lookup_hashes_*.jsonl file name pattern
  - Update docs and tests accordingly
  Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
  Signed-off-by: rigginschen <rigginschen@tencent.com>
* Use tell to get the accurate file size
  Signed-off-by: rigginschen <rigginschen@tencent.com>
* perf(mp): skip MP_LOOKUP event construction when no subscriber is listening
  Add EventBus.has_subscribers() to cheaply check if any callback is registered for a given EventType. Gate the MP_LOOKUP publish in MPCacheEngine.lookup() behind this check so that the metadata dict (including dtype/shape list comprehensions) is never allocated when the lookup hash logger is disabled.
  Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
  Signed-off-by: rigginschen <rigginschen@tencent.com>
* docs(mp): document MP_LOOKUP event metadata contract in EVENTS.md
  Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
  Signed-off-by: rigginschen <rigginschen@tencent.com>
* test(mp): move lookup hash logger tests to correct directory
  Move tests/v1/multiprocess/test_lookup_hash_logger.py to tests/v1/mp_observability/subscribers/logging/ to match the source file structure and ensure tests run under the standard CI suite.
  Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
  Signed-off-by: kumaneko <crclq2018@gmail.com>
---------
Signed-off-by: Ryan <crclq2018@gmail.com>
Signed-off-by: rigginschen <rigginschen@tencent.com>
Signed-off-by: kumaneko <71458228+yoo-kumaneko@users.noreply.github.com>
Signed-off-by: kumaneko <crclq2018@gmail.com>
Co-authored-by: rigginschen <rigginschen@tencent.com>
Co-authored-by: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

Summary
Record chunk hashes computed during lookup() to rotating JSONL files for offline analysis. Uses an async background thread to avoid adding I/O latency to the hot path. Files rotate on a configurable time interval (default 6h) and include human-readable timestamps and model names. Disabled by default; pass --chunk-hash-log-dir to enable.
Motivation
Collecting lookup chunk hashes allows us to analyze data distribution patterns.
These insights help guide infrastructure decisions, such as storage selection and capacity planning.
Offline Data Analysis Diagrams
As the figure below shows, we can get a precise estimate of the hit rate for a given cache capacity.
We can also get other useful analysis results from the collected chunk data, like the rolling hit rate.
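One way such an estimate could be computed offline is to replay the logged hashes through a simulated cache. The sketch below assumes an LRU eviction policy, one JSON object per line with a `chunk_hashes` list, and string-valued hashes; these are illustrative assumptions, and the helpers `lru_hit_rate` and `hashes_from_jsonl` are hypothetical, not the PR's analysis script.

```python
import json
from collections import OrderedDict
from typing import Iterable, List


def lru_hit_rate(chunk_hashes: Iterable[str], capacity: int) -> float:
    """Replay chunk hashes through an LRU cache that holds `capacity` chunks."""
    cache: "OrderedDict[str, None]" = OrderedDict()
    hits = 0
    total = 0
    for chunk_hash in chunk_hashes:
        total += 1
        if chunk_hash in cache:
            hits += 1
            cache.move_to_end(chunk_hash)  # mark as most recently used
        else:
            cache[chunk_hash] = None
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict the least recently used chunk
    return hits / total if total else 0.0


def hashes_from_jsonl(lines: Iterable[str]) -> List[str]:
    """Flatten the chunk_hashes field of each JSONL record into one trace."""
    trace: List[str] = []
    for line in lines:
        if line.strip():
            trace.extend(json.loads(line)["chunk_hashes"])
    return trace


# Made-up sample records; real logs would be read from the rotated files.
sample_lines = [
    '{"request_id": "r1", "chunk_hashes": ["a", "b", "c"]}',
    '{"request_id": "r2", "chunk_hashes": ["a", "b", "d"]}',
    '{"request_id": "r3", "chunk_hashes": ["a", "b", "c"]}',
]
trace = hashes_from_jsonl(sample_lines)
rate_small = lru_hit_rate(trace, capacity=2)
rate_large = lru_hit_rate(trace, capacity=4)
```

Sweeping `capacity` over a range of values and plotting the resulting rates gives a hit-rate-versus-capacity curve; a rolling hit rate would apply the same replay over a sliding window of requests.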
Note
Medium Risk
Adds new request-level telemetry and file I/O in the MP server path (though gated by has_subscribers() and disabled by default), so regressions could impact lookup performance or disk usage when enabled.
Overview
Adds a new MP_LOOKUP observability event emitted during MPCacheEngine.lookup() containing per-request chunk hashes plus model/layout metadata, guarded by EventBus.has_subscribers() to avoid hot-path overhead when unused.
Introduces LookupHashLoggingSubscriber with LookupHashLogConfig to write these events to rotating JSONL files (time/size based rotation with max-file retention), wires it into mp_observability/config.py via new CLI flags, and documents the new options/metadata contract. Includes a new test suite covering enable/disable behavior, rotation, retention, and JSON formatting.
Reviewed by Cursor Bugbot for commit 1ced3e0.
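A minimal sketch of time/size-based rotation with max-file retention follows. The class `RotatingJsonlWriter`, its parameters, and the sequence-numbered file names are hypothetical and do not reproduce `LookupHashLoggingSubscriber`; only the `lookup_hashes_*.jsonl` prefix and the use of `tell()` for the file size come from the PR. Opening the file in binary append mode is a choice made here so that `tell()` returns a byte offset.

```python
import json
import tempfile
import time
from collections import deque
from pathlib import Path


class RotatingJsonlWriter:
    """Hypothetical sketch of a rotating JSONL writer with retention."""

    def __init__(self, log_dir, max_bytes, max_age_s, max_files, clock=time.time):
        self._dir = Path(log_dir)
        self._dir.mkdir(parents=True, exist_ok=True)
        self._max_bytes = max_bytes
        self._max_age_s = max_age_s
        self._max_files = max_files
        self._clock = clock  # injectable so time-based rotation can be tested
        self._files = deque()
        self._seq = 0
        self._fh = None
        self._opened_at = 0.0

    def _open_new(self):
        if self._fh is not None:
            self._fh.close()
        # Sequence-numbered names are an assumption made for this sketch.
        path = self._dir / f"lookup_hashes_{self._seq:06d}.jsonl"
        self._seq += 1
        self._fh = open(path, "ab")  # binary mode: tell() is a byte offset
        self._opened_at = self._clock()
        self._files.append(path)
        while len(self._files) > self._max_files:
            self._files.popleft().unlink(missing_ok=True)

    def write(self, record):
        if self._fh is None:
            self._open_new()
        else:
            too_old = self._clock() - self._opened_at >= self._max_age_s
            too_big = self._fh.tell() >= self._max_bytes
            if too_old or too_big:
                self._open_new()
        self._fh.write((json.dumps(record) + "\n").encode("utf-8"))
        self._fh.flush()

    def close(self):
        if self._fh is not None:
            self._fh.close()
            self._fh = None


with tempfile.TemporaryDirectory() as tmp:
    writer = RotatingJsonlWriter(tmp, max_bytes=64, max_age_s=3600.0, max_files=2)
    for i in range(10):
        writer.write({"request_id": f"req-{i}", "chunk_hashes": [i, i + 1]})
    writer.close()
    surviving = sorted(p.name for p in Path(tmp).glob("lookup_hashes_*.jsonl"))
    last_lines = (Path(tmp) / surviving[-1]).read_text(encoding="utf-8").splitlines()
    last_record = json.loads(last_lines[-1])
```

Because the size check runs before each write, a file can exceed `max_bytes` by at most one record, and the file currently being written counts toward `max_files`.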