[CLI] Implementation of lmcache bench engine#2889
Conversation
Signed-off-by: ApostaC <yihua98@uchicago.edu>
Code Review
This pull request introduces the lmcache bench engine command, a comprehensive benchmarking tool for inference engines. It features a modular architecture supporting multiple workloads, an interactive configuration mode, real-time progress monitoring, and detailed performance statistics. The implementation also extends the LMCache server to report KV cache size metrics. Review feedback identifies a violation of the project's encapsulation policy regarding private member access and suggests implementing a public callback registration method in the RequestSender class to improve modularity.
```python
request_sender._on_finished.extend(
    [
        lambda result, _text: stats_collector.on_request_finished(result),
        lambda result, _text: progress_monitor.on_request_finished(
            result.request_id,
            result.successful,
        ),
        workload.request_finished,
    ]
)
```
Accessing the private member _on_finished directly violates the project's style guide, which prohibits accessing _-prefixed attributes across class boundaries. This can lead to maintenance issues as the internal implementation of RequestSender becomes tightly coupled with this orchestrator.
To improve encapsulation and adhere to the style guide, please add a public method to the RequestSender class for adding callbacks.
For example, you could add the following method to `lmcache/cli/commands/bench/engine_bench/request_sender.py`:

```python
def add_on_finished_callback(self, callback: OnFinishedCallback) -> None:
    """Register a callback to be invoked when a request finishes."""
    self._on_finished.append(callback)
```

Suggested change:

```diff
-request_sender._on_finished.extend(
-    [
-        lambda result, _text: stats_collector.on_request_finished(result),
-        lambda result, _text: progress_monitor.on_request_finished(
-            result.request_id,
-            result.successful,
-        ),
-        workload.request_finished,
-    ]
-)
+# 5. Wire callbacks on sender
+request_sender.add_on_finished_callback(
+    lambda result, _text: stats_collector.on_request_finished(result)
+)
+request_sender.add_on_finished_callback(
+    lambda result, _text: progress_monitor.on_request_finished(
+        result.request_id,
+        result.successful,
+    )
+)
+request_sender.add_on_finished_callback(workload.request_finished)
```
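As a self-contained illustration of the pattern the review asks for (the class and callback names mirror the PR, but the surrounding scaffolding is a hypothetical stand-in, not the real implementation):

```python
from typing import Callable, List

# Minimal sketch of the public-registration pattern; RequestSender here is a
# stand-in for the PR's class, not its real implementation.
OnFinishedCallback = Callable[[object, str], None]

class RequestSender:
    def __init__(self) -> None:
        self._on_finished: List[OnFinishedCallback] = []

    def add_on_finished_callback(self, callback: OnFinishedCallback) -> None:
        """Register a callback to be invoked when a request finishes."""
        self._on_finished.append(callback)

    def _finish(self, result: object, text: str) -> None:
        # Internal hook: fan the completed request out to every callback.
        for cb in self._on_finished:
            cb(result, text)

# Orchestrator code now goes through the public method only.
sender = RequestSender()
seen: List[object] = []
sender.add_on_finished_callback(lambda result, _text: seen.append(result))
sender._finish("req-1", "done")  # driving the internal hook just for the demo
```

Because callers never touch `_on_finished`, `RequestSender` stays free to change its internal storage without breaking the orchestrator.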
References
- The code directly accesses a private member (`_on_finished`) of the `RequestSender` class from another module. The style guide explicitly forbids accessing `_`-prefixed attributes across class boundaries to maintain encapsulation and modularity.
good catch, addressed
```shell
# Launch the LMCache server (ZMQ + HTTP)
lmcache server --host 0.0.0.0 --port 5555 --l1-size-gb 100 --eviction-policy LRU

# Run a benchmark against the engine
```
High-level comment: can we have a non-interactive mode (like `docker run` with vs. without `-it`) for scripted bash usage?
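One possible shape for such a flag, sketched with `argparse` (the flag names and behavior here are assumptions for illustration, not the PR's actual CLI surface):

```python
import argparse

# Hypothetical sketch of a non-interactive switch: when set, the tool would
# skip all prompts and rely entirely on flags or a saved config file.
parser = argparse.ArgumentParser(prog="lmcache bench engine")
parser.add_argument(
    "--non-interactive",
    action="store_true",
    help="Skip interactive prompts; fail on missing options instead of asking.",
)
parser.add_argument("--config", help="Path to a saved benchmark config file.")

args = parser.parse_args(["--non-interactive", "--config", "bench.yaml"])
```

This mirrors the `docker -it` analogy: interactivity becomes opt-in (or opt-out) rather than the only mode, so the command can run unattended in shell scripts and CI.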
```diff
@@ -0,0 +1,400 @@
+lmcache bench engine
+====================
```
@ApostaC can you also add your GIF at the top of this file?
```rst
lmcache bench engine
====================

The ``lmcache bench engine`` command runs sustained performance benchmarks
```
It would be great if we could also print the full CLI command being executed before the benchmark starts, so that users can copy-paste it and rerun it later.
we can do it in the future, added to my backlog
```text
P90 TTFT (ms):        587.21
P99 TTFT (ms):        837.32
------------------ Decoding Speed ---------------------
Mean decode (tok/s):  48.23
```
Maybe let's use ITL here? Throughput metrics (e.g. tokens/s) have no standard definition for percentiles such as P95.
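To make the distinction concrete, here is a sketch of how per-token inter-token latency (ITL) percentiles are well-defined: each gap between consecutive output tokens is one sample, so ordinary percentile math applies. The timestamps and the nearest-rank percentile helper below are illustrative, not the PR's code:

```python
import statistics

# Illustrative per-token arrival times (milliseconds) for one request.
token_timestamps_ms = [0, 20, 50, 70, 120]

# One ITL sample per gap between consecutive tokens.
itls_ms = [b - a for a, b in zip(token_timestamps_ms, token_timestamps_ms[1:])]

def percentile(values, p):
    """Nearest-rank percentile of a non-empty list of samples."""
    ordered = sorted(values)
    k = max(0, min(len(ordered) - 1, round(p / 100 * len(ordered)) - 1))
    return ordered[k]

mean_itl = statistics.mean(itls_ms)  # the reciprocal view of decode tok/s
p90_itl = percentile(itls_ms, 90)    # well-defined, unlike "P90 tok/s"
```

A "P90 decode tok/s" would require picking an arbitrary aggregation window first; P90 ITL needs no such choice, which is why ITL is the more standard tail-latency metric.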
```diff
@@ -0,0 +1,281 @@
+# SPDX-License-Identifier: Apache-2.0
```
It would be great to have a quick doc on how to contribute workloads.
It's in `docs/design/cli/bench-engine.md`
* add lmcache bench engine command Signed-off-by: ApostaC <yihua98@uchicago.edu>
What this PR does / why we need it:
Adds the `lmcache bench engine` command: a sustained performance benchmarking tool for inference engines. It supports three workload types (long-doc-qa, multi-round-chat, random-prefill), interactive configuration, config file save/load, and real-time progress display.

Key features:

- `long-doc-qa` (KV cache reuse testing), `multi-round-chat` (stateful QPS-controlled sessions), `random-prefill` (concurrent prefill-only)
- `--config` for reproducible benchmarks
- `--lmcache-url` auto-resolves `tokens_per_gb_kvcache` from a running LMCache server

Special notes for your reviewers:

The implementation follows a bottom-up modular architecture (config → stats → request_sender → progress → workloads → orchestrator → interactive). Each module has its own test file. Adding a new workload requires: (1) a config dataclass with `resolve()`, (2) a `BaseWorkload` subclass, (3) `ConfigItem` entries in `schema.py`, (4) a dispatch branch in `create_workload()`. See `docs/design/cli/bench-engine.md` for the full design doc.

If applicable:
Demo
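The workload-extension steps from the reviewer notes can be sketched roughly like this. All class and function bodies here are illustrative stand-ins for the PR's real APIs, and step (3), the `ConfigItem` entries in `schema.py`, is omitted since it is pure registration:

```python
from dataclasses import dataclass

@dataclass
class MyWorkloadConfig:  # step (1): a config dataclass with resolve()
    num_requests: int = 100

    def resolve(self) -> "MyWorkloadConfig":
        # Real configs may fill derived defaults here (e.g. from the server).
        return self

class BaseWorkload:  # stand-in for the PR's base class
    def generate_requests(self):
        raise NotImplementedError

class MyWorkload(BaseWorkload):  # step (2): a BaseWorkload subclass
    def __init__(self, config: MyWorkloadConfig) -> None:
        self.config = config.resolve()

    def generate_requests(self):
        return [f"request-{i}" for i in range(self.config.num_requests)]

def create_workload(name: str) -> BaseWorkload:  # step (4): dispatch branch
    if name == "my-workload":
        return MyWorkload(MyWorkloadConfig(num_requests=3))
    raise ValueError(f"unknown workload: {name}")

requests = create_workload("my-workload").generate_requests()
```

The dispatch-by-name shape keeps each workload self-contained behind the common base class, which is what lets the orchestrator stay workload-agnostic.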