feat: add chunk hashes logger to MP server for offline data analysis #2928

Merged
chunxiaozheng merged 11 commits into LMCache:dev from yoo-kumaneko:feature/chunk-hash-logger on Apr 14, 2026

Conversation

@yoo-kumaneko
Contributor

@yoo-kumaneko yoo-kumaneko commented Apr 1, 2026

Summary

Record chunk hashes computed during lookup() to rotating JSONL files for offline analysis. An async background thread does the writing, so no I/O latency is added to the hot path. Files rotate on a configurable time interval (default 6h) and include human-readable timestamps and model names. The logger is disabled by default; pass --chunk-hash-log-dir to enable it.

Motivation

Collecting lookup chunk hashes allows us to analyze data distribution patterns.
These insights help guide infrastructure decisions, such as storage selection and capacity planning.

Offline Data Analysis Diagrams

As the figure below shows, the collected data gives a precise estimate of the hit rate for a given cache capacity.

[Figure: estimated hit rate versus cache capacity (Clipboard_Screenshot_1775544609)]
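
A capacity-versus-hit-rate estimate of this kind can be approximated by replaying the logged chunk hashes through a simulated cache. The following is a minimal sketch, assuming an LRU eviction policy and a capacity counted in chunks. The function `lru_hit_rate` and the sample trace are hypothetical, and the analysis behind the figure may work differently.

```python
from collections import OrderedDict
from typing import Hashable, Iterable


def lru_hit_rate(trace: Iterable[Hashable], capacity: int) -> float:
    """Replay chunk hashes through an LRU cache that holds `capacity` chunks."""
    cache: OrderedDict[Hashable, None] = OrderedDict()
    hits = total = 0
    for chunk_hash in trace:
        total += 1
        if chunk_hash in cache:
            hits += 1
            cache.move_to_end(chunk_hash)  # mark as most recently used
        else:
            cache[chunk_hash] = None
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict the least recently used chunk
    return hits / total if total else 0.0


# Hypothetical trace; in practice the hashes would be read from the JSONL logs.
trace = ["a", "b", "a", "c", "b", "a", "d", "a"]
for capacity in (1, 2, 3):
    print(capacity, lru_hit_rate(trace, capacity))
```

Sweeping `capacity` over a range of values and plotting the results gives a curve of hit rate against cache size for the recorded workload.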

We can also get other useful analysis results from the collected chunk data, like the rolling hit rate.

[Figure: rolling hit rate (Clipboard_Screenshot_1775544531)]
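
A rolling hit rate can be derived from the same hash trace. The sketch below makes two simplifying assumptions that are not taken from the PR: the cache is unbounded (a lookup counts as a hit if its hash was seen earlier), and the window is a fixed number of lookups rather than a time span. The name `rolling_hit_rate` is hypothetical.

```python
from collections import deque
from typing import Hashable, Iterable, Iterator


def rolling_hit_rate(trace: Iterable[Hashable], window: int) -> Iterator[float]:
    """Yield the hit rate over the most recent `window` lookups."""
    if window < 1:
        raise ValueError("window must be at least 1")
    seen: set[Hashable] = set()
    recent: deque[bool] = deque(maxlen=window)
    hits_in_window = 0
    for chunk_hash in trace:
        hit = chunk_hash in seen
        seen.add(chunk_hash)
        if len(recent) == window:
            # The oldest outcome is about to fall out of the window.
            hits_in_window -= recent[0]
        recent.append(hit)
        hits_in_window += hit
        yield hits_in_window / len(recent)


rates = list(rolling_hit_rate(["a", "b", "a", "a", "c", "b"], window=3))
print(rates)
```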

Note

Medium Risk
Adds new request-level telemetry and file I/O in the MP server path (though gated by has_subscribers() and disabled by default), so regressions could impact lookup performance or disk usage when enabled.

Overview
Adds a new MP_LOOKUP observability event emitted during MPCacheEngine.lookup() containing per-request chunk hashes plus model/layout metadata, guarded by EventBus.has_subscribers() to avoid hot-path overhead when unused.

Introduces LookupHashLoggingSubscriber with LookupHashLogConfig to write these events to rotating JSONL files (time/size based rotation with max-file retention), wires it into mp_observability/config.py via new CLI flags, and documents the new options/metadata contract. Includes a new test suite covering enable/disable behavior, rotation, retention, and JSON formatting.

Reviewed by Cursor Bugbot for commit 1ced3e0. Bugbot is set up for automated code reviews on this repo. Configure here.

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces an asynchronous ChunkHashLogger to record chunk hashes to rotating JSONL files during lookups in the multiprocess server, including new configuration options and CLI arguments. The review feedback highlights several areas for improvement: the file rotation logic should be decoupled from model name changes to prevent excessive file creation, the logger should initialize its file list from existing files to maintain retention limits across restarts, and file operations should be hardened with explicit encoding and better handle management.
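
The review points above can be illustrated with a small sketch: rotation that depends only on elapsed time, a retention list seeded from files already on disk, and explicit UTF-8 encoding. This is not the PR's implementation. The class name `RotatingJsonlWriter`, the `hashes_*.jsonl` file name pattern, and the `clock` parameter are all hypothetical and exist only for illustration.

```python
import json
import time
from pathlib import Path
from typing import Callable, Optional, TextIO


class RotatingJsonlWriter:
    """Hypothetical time-rotated JSONL writer with max-file retention."""

    def __init__(self, log_dir: str, rotate_interval_s: float = 6 * 3600,
                 max_files: int = 10,
                 clock: Callable[[], float] = time.time) -> None:
        self._dir = Path(log_dir)
        self._dir.mkdir(parents=True, exist_ok=True)
        self._interval = rotate_interval_s
        self._max_files = max_files
        self._clock = clock
        # Seed the retention list from disk so the limit survives restarts.
        self._files = sorted(self._dir.glob("hashes_*.jsonl"))
        self._fh: Optional[TextIO] = None
        self._opened_at = 0.0

    def _rotate(self, now: float) -> None:
        self.close()
        # Zero-padded millisecond timestamp keeps name order chronological.
        path = self._dir / f"hashes_{int(now * 1000):015d}.jsonl"
        self._fh = open(path, "a", encoding="utf-8")
        self._opened_at = now
        if path not in self._files:
            self._files.append(path)
        while len(self._files) > self._max_files:
            self._files.pop(0).unlink(missing_ok=True)  # drop the oldest file

    def write(self, record: dict) -> None:
        now = self._clock()
        # Rotation depends only on elapsed time, not on record contents.
        if self._fh is None or now - self._opened_at >= self._interval:
            self._rotate(now)
        self._fh.write(json.dumps(record) + "\n")

    def close(self) -> None:
        if self._fh is not None:
            self._fh.close()
            self._fh = None
```

Because the model name travels inside each record rather than in the rotation condition, a workload that alternates between models does not create a new file per request.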

Comment thread lmcache/v1/multiprocess/chunk_hash_logger.py Outdated
Comment thread lmcache/v1/multiprocess/chunk_hash_logger.py Outdated
Comment thread lmcache/v1/multiprocess/chunk_hash_logger.py Outdated
Comment thread lmcache/v1/multiprocess/chunk_hash_logger.py Outdated
Comment thread lmcache/v1/multiprocess/chunk_hash_logger.py Outdated
Comment thread lmcache/v1/multiprocess/chunk_hash_logger.py Outdated
Comment thread lmcache/v1/multiprocess/chunk_hash_logger.py Outdated
Comment thread lmcache/v1/multiprocess/chunk_hash_logger.py Outdated
@yoo-kumaneko yoo-kumaneko force-pushed the feature/chunk-hash-logger branch from c842e81 to 5676ad6 Compare April 1, 2026 13:09
Comment thread lmcache/v1/multiprocess/chunk_hash_logger.py Outdated
@yoo-kumaneko
Contributor Author

@sammshen Would you like to take a look at this PR?
I added a lookup logger that writes the chunk hashes being looked up to a file.
If the logger is None, logging is a no-op, so there is no performance overhead.

@yoo-kumaneko yoo-kumaneko requested a review from maobaolong April 3, 2026 05:15
@chunxiaozheng
Collaborator

@yoo-kumaneko Thanks for your contribution! A minor question: will this have any performance impact?

Comment thread lmcache/v1/multiprocess/chunk_hash_logger.py Outdated
Comment thread lmcache/v1/multiprocess/chunk_hash_logger.py Outdated
Comment thread lmcache/v1/multiprocess/server.py Outdated
Collaborator

@maobaolong maobaolong left a comment

@yoo-kumaneko Awesome feature, left some comments. @sammshen Would you like to take another look?

BTW, if you paste some analysis diagrams into the description, it would help reviewers quickly understand the motivation for this PR.

@yoo-kumaneko
Contributor Author

yoo-kumaneko commented Apr 7, 2026

> @yoo-kumaneko Thanks for your contribution! A minor question: will this have any performance impact?

It should have a negligible performance effect. I ran a comparison test. As shown below, the hash logger has no visible effect on TTFT or the other metrics.

Hash logger turned on

============ Serving Benchmark Result ============
Successful requests:                     100       
Failed requests:                         0         
Maximum request concurrency:             4         
Request rate configured (RPS):           4.00      
Benchmark duration (s):                  105.11    
Total input tokens:                      1100000   
Total generated tokens:                  100       
Request throughput (req/s):              0.95      
Output token throughput (tok/s):         0.95      
Peak output token throughput (tok/s):    4.00      
Peak concurrent requests:                8.00      
Total token throughput (tok/s):          10466.40  
---------------Time to First Token----------------
Mean TTFT (ms):                          4170.34   
Median TTFT (ms):                        4562.69   
P99 TTFT (ms):                           7586.79   
==================================================

Hash logger turned off

============ Serving Benchmark Result ============
Successful requests:                     100       
Failed requests:                         0         
Maximum request concurrency:             4         
Request rate configured (RPS):           4.00      
Benchmark duration (s):                  105.32    
Total input tokens:                      1100000   
Total generated tokens:                  100       
Request throughput (req/s):              0.95      
Output token throughput (tok/s):         0.95      
Peak output token throughput (tok/s):    4.00      
Peak concurrent requests:                8.00      
Total token throughput (tok/s):          10444.90  
---------------Time to First Token----------------
Mean TTFT (ms):                          4182.33   
Median TTFT (ms):                        4560.56   
P99 TTFT (ms):                           7711.41   
==================================================
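
To put the two runs side by side, the relative differences can be computed directly from the reported numbers. The snippet below is only an arithmetic sketch; the dictionary keys and the helper `relative_diff_pct` are hypothetical names, and the values are copied from the benchmark output above.

```python
# Values copied from the two benchmark runs above.
logger_on = {"mean_ttft_ms": 4170.34, "median_ttft_ms": 4562.69,
             "p99_ttft_ms": 7586.79, "total_tok_per_s": 10466.40}
logger_off = {"mean_ttft_ms": 4182.33, "median_ttft_ms": 4560.56,
              "p99_ttft_ms": 7711.41, "total_tok_per_s": 10444.90}


def relative_diff_pct(on: float, off: float) -> float:
    """Difference of the logger-on run relative to the logger-off run, in percent."""
    return (on - off) / off * 100.0


diffs = {name: relative_diff_pct(logger_on[name], logger_off[name])
         for name in logger_on}
for name, pct in diffs.items():
    print(f"{name}: {pct:+.2f}%")
```

Every metric differs by less than 2% between the runs, and the largest gap (P99 TTFT) favors the run with the logger on, which is consistent with run-to-run noise. Each configuration was measured once, so these figures are indicative rather than conclusive.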

@yoo-kumaneko
Contributor Author

> @yoo-kumaneko Awesome feature, left some comments. @sammshen Would you like to take another look?
>
> BTW, if you paste some analysis diagrams into the description, it would help reviewers quickly understand the motivation for this PR.

I've added the diagrams to the PR description

Collaborator

@maobaolong maobaolong left a comment

LGTM. Thanks!

Comment thread lmcache/v1/multiprocess/config.py Outdated
Collaborator

@chunxiaozheng chunxiaozheng left a comment

LGTM!

@maobaolong maobaolong added the full Run comprehensive tests on this PR label Apr 7, 2026
@sammshen
Contributor

sammshen commented Apr 8, 2026

quick question, why are we not using the existing observability / prometheus modules?

Contributor

@sammshen sammshen left a comment

blocking until consulting @ApostaC and @royyhuang

@sammshen
Contributor

sammshen commented Apr 8, 2026

IIUC, it's:

  1. persistence for later analysis
  2. remember evicted chunks

Contributor

@sammshen sammshen left a comment

LGTM actually, the code changes to config.py and server.py seem pretty minimal

yoo-kumaneko pushed a commit to yoo-kumaneko/LMCache that referenced this pull request Apr 8, 2026
Cherry-pick squashed changes from LMCache#2928 which adds
a chunk hash file logger to the MP server for offline analysis.

Signed-off-by: root <crclq2018@gmail.com>

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: rigginschen <rigginschen@tencent.com>
Contributor

@ApostaC ApostaC left a comment

Just one quick comment, please see below.

In the meantime, if we want to add new events or change an existing event's metadata schema, we should update https://github.com/yoo-kumaneko/LMCache/blob/dev/lmcache/v1/mp_observability/EVENTS.md to reflect the changes. This makes the code more maintainable for other developers (and AI tools).

Comment thread lmcache/v1/multiprocess/server.py Outdated
Comment on lines +644 to +658
self._event_bus.publish(
    Event(
        event_type=EventType.MP_LOOKUP,
        session_id=key.request_id,
        metadata={
            "request_id": key.request_id,
            "chunk_hashes": chunk_hashes,
            "model_name": model_name,
            "chunk_size": self.chunk_size,
            "seq_len": len(key.token_ids),
            "dtypes": [str(d) for d in layout_desc.dtypes],
            "shapes": [list(s) for s in layout_desc.shapes],
        },
    )
)
Contributor

Potential alternative: we reuse the EventType.MP_LOOKUP_PREFETCH_START and just add more metadata to it?
@royyhuang Good to have your thoughts here as well.

Contributor Author

@ApostaC Yes, we can reuse EventType.MP_LOOKUP_PREFETCH_START. However, we’ll need to move the publication of this event to after the chunk hashes are computed (since they’re required) and after the layout == None check. Does that sound OK?

@ApostaC
Contributor

ApostaC commented Apr 13, 2026

Btw, I really like the diagrams in the description! Will it be possible to put the analysis script in the lmcache/tools/ folder?

@yoo-kumaneko
Contributor Author

> Btw, I really like the diagrams in the description! Will it be possible to put the analysis script in the lmcache/tools/ folder?

Sure!

@yoo-kumaneko yoo-kumaneko changed the title from "feat: add chunk hash file logger to MP server for offline analysis" to "feat: add chunk hashes logger to MP server for offline data analysis" Apr 13, 2026
…tening

Add EventBus.has_subscribers() to cheaply check if any callback is
registered for a given EventType. Gate the MP_LOOKUP publish in
MPCacheEngine.lookup() behind this check so that the metadata dict
(including dtype/shape list comprehensions) is never allocated when
the lookup hash logger is disabled.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: rigginschen <rigginschen@tencent.com>
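
The gating pattern described in this commit message can be sketched as follows. The classes here are simplified, hypothetical stand-ins that only borrow the names `EventBus`, `Event`, and `EventType` from the PR; they are not LMCache's actual implementations, and the `lookup` function is a toy placeholder for the real lookup path.

```python
from dataclasses import dataclass, field
from enum import Enum, auto
from typing import Any, Callable


class EventType(Enum):
    MP_LOOKUP = auto()


@dataclass
class Event:
    event_type: EventType
    metadata: dict[str, Any] = field(default_factory=dict)


class EventBus:
    """Simplified stand-in for an event bus (not LMCache's class)."""

    def __init__(self) -> None:
        self._subscribers: dict[EventType, list[Callable[[Event], None]]] = {}

    def subscribe(self, event_type: EventType,
                  callback: Callable[[Event], None]) -> None:
        self._subscribers.setdefault(event_type, []).append(callback)

    def has_subscribers(self, event_type: EventType) -> bool:
        # Cheap check: a dict lookup and a truthiness test.
        return bool(self._subscribers.get(event_type))

    def publish(self, event: Event) -> None:
        for callback in self._subscribers.get(event.event_type, ()):
            callback(event)


def lookup(bus: EventBus, chunk_hashes: list[int]) -> None:
    # Build the metadata dict only when someone is listening, so the
    # disabled case pays for nothing beyond the has_subscribers() call.
    if bus.has_subscribers(EventType.MP_LOOKUP):
        bus.publish(Event(EventType.MP_LOOKUP,
                          {"chunk_hashes": list(chunk_hashes)}))


bus = EventBus()
received: list[Event] = []
lookup(bus, [1, 2])  # no subscriber yet: nothing is built or published
bus.subscribe(EventType.MP_LOOKUP, received.append)
lookup(bus, [3, 4])  # now the event is constructed and delivered
```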
auto-merge was automatically disabled April 13, 2026 15:33

Head branch was pushed to by a user without write access

Comment thread lmcache/v1/mp_observability/event.py
@github-actions github-actions Bot removed the full Run comprehensive tests on this PR label Apr 13, 2026
yoo-kumaneko pushed a commit to yoo-kumaneko/LMCache that referenced this pull request Apr 13, 2026
…LMCache#12)"

This reverts commit ee037db.

Signed-off-by: rigginschen <rigginschen@tencent.com>
rigginschen and others added 2 commits April 14, 2026 00:53
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: rigginschen <rigginschen@tencent.com>
@yoo-kumaneko
Contributor Author

EVENTS.md updated

Contributor

@ApostaC ApostaC left a comment

LGTM!

@ApostaC ApostaC enabled auto-merge (squash) April 13, 2026 17:42
@ApostaC
Contributor

ApostaC commented Apr 13, 2026

@yoo-kumaneko Please fix the UT, thanks!

@ApostaC ApostaC disabled auto-merge April 13, 2026 17:43
@github-actions github-actions Bot added full Run comprehensive tests on this PR and removed full Run comprehensive tests on this PR labels Apr 13, 2026
maobaolong pushed a commit to maobaolong/LMCache that referenced this pull request Apr 13, 2026
* Revert "feat: cherry-pick chunk hash file logger from PR LMCache#2928 (#12)"

This reverts commit ee037db.

Signed-off-by: rigginschen <rigginschen@tencent.com>

* feat: add chunk hash logger as EventBus subscriber

Add JSONL-based chunk hash logging to the multiprocess server for
offline analysis of KV cache behavior. Implemented as a
ChunkHashLoggingSubscriber on the EventBus — no extra queue or
worker thread needed. Includes configurable log rotation, chunk
metadata (chunk_size, seq_len, dtypes, shapes), and CLI args.

Signed-off-by: Ryan <crclq2018@gmail.com>
Signed-off-by: rigginschen <rigginschen@tencent.com>

* refactor: rename ChunkHashLogger to LookupHashLogger

Rename the chunk hash logging subscriber to lookup hash logger to better
reflect that it logs hashes observed during lookup operations.

- chunk_hash.py → lookup_hash.py
- ChunkHashLogConfig → LookupHashLogConfig
- ChunkHashLoggingSubscriber → LookupHashLoggingSubscriber
- --chunk-hash-log-* CLI args → --lookup-hash-log-*
- lookup_hashes_*.jsonl file name pattern
- Update docs and tests accordingly

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: rigginschen <rigginschen@tencent.com>

* Use tell to get the accurate file size

Signed-off-by: rigginschen <rigginschen@tencent.com>

---------

Signed-off-by: rigginschen <rigginschen@tencent.com>
Signed-off-by: Ryan <crclq2018@gmail.com>
Co-authored-by: rigginschen <rigginschen@tencent.com>
Co-authored-by: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: kumaneko <71458228+yoo-kumaneko@users.noreply.github.com>
Comment thread lmcache/v1/mp_observability/subscribers/logging/lookup_hash.py
Signed-off-by: kumaneko <71458228+yoo-kumaneko@users.noreply.github.com>
@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

There are 2 total unresolved issues (including 1 from previous review).


Reviewed by Cursor Bugbot for commit d816b9c. Configure here.

yoo-kumaneko and others added 2 commits April 14, 2026 15:57
Move tests/v1/multiprocess/test_lookup_hash_logger.py to
tests/v1/mp_observability/subscribers/logging/ to match the source
file structure and ensure tests run under the standard CI suite.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: kumaneko <crclq2018@gmail.com>
@chunxiaozheng chunxiaozheng enabled auto-merge (squash) April 14, 2026 11:01
@github-actions github-actions Bot added the full Run comprehensive tests on this PR label Apr 14, 2026
@chunxiaozheng chunxiaozheng merged commit cfb5c52 into LMCache:dev Apr 14, 2026
39 checks passed
ekaynar pushed a commit to ekaynar/LMCache that referenced this pull request Apr 15, 2026
…MCache#2928)

* feat: add chunk hash logger as EventBus subscriber

Add JSONL-based chunk hash logging to the multiprocess server for
offline analysis of KV cache behavior. Implemented as a
ChunkHashLoggingSubscriber on the EventBus — no extra queue or
worker thread needed. Includes configurable log rotation, chunk
metadata (chunk_size, seq_len, dtypes, shapes), and CLI args.

Signed-off-by: Ryan <crclq2018@gmail.com>
Signed-off-by: rigginschen <rigginschen@tencent.com>

* refactor: rename ChunkHashLogger to LookupHashLogger

Rename the chunk hash logging subscriber to lookup hash logger to better
reflect that it logs hashes observed during lookup operations.

- chunk_hash.py → lookup_hash.py
- ChunkHashLogConfig → LookupHashLogConfig
- ChunkHashLoggingSubscriber → LookupHashLoggingSubscriber
- --chunk-hash-log-* CLI args → --lookup-hash-log-*
- lookup_hashes_*.jsonl file name pattern
- Update docs and tests accordingly

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: rigginschen <rigginschen@tencent.com>

* Use tell to get the accurate file size

Signed-off-by: rigginschen <rigginschen@tencent.com>

* perf(mp): skip MP_LOOKUP event construction when no subscriber is listening

Add EventBus.has_subscribers() to cheaply check if any callback is
registered for a given EventType. Gate the MP_LOOKUP publish in
MPCacheEngine.lookup() behind this check so that the metadata dict
(including dtype/shape list comprehensions) is never allocated when
the lookup hash logger is disabled.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: rigginschen <rigginschen@tencent.com>

* docs(mp): document MP_LOOKUP event metadata contract in EVENTS.md

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: rigginschen <rigginschen@tencent.com>

* test(mp): move lookup hash logger tests to correct directory

Move tests/v1/multiprocess/test_lookup_hash_logger.py to
tests/v1/mp_observability/subscribers/logging/ to match the source
file structure and ensure tests run under the standard CI suite.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: kumaneko <crclq2018@gmail.com>

---------

Signed-off-by: Ryan <crclq2018@gmail.com>
Signed-off-by: rigginschen <rigginschen@tencent.com>
Signed-off-by: kumaneko <71458228+yoo-kumaneko@users.noreply.github.com>
Signed-off-by: kumaneko <crclq2018@gmail.com>
Co-authored-by: rigginschen <rigginschen@tencent.com>
Co-authored-by: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
ftian1 pushed a commit to ftian1/LMCache that referenced this pull request Apr 20, 2026
…MCache#2928)

Labels: full (Run comprehensive tests on this PR), mp_mode
