[RFC] Design of LMCache CLI #2748
Conversation
Signed-off-by: KuntaiDu <kuntai@uchicago.edu>
Code Review
This pull request introduces a comprehensive design document for the LMCache CLI. The design is well-structured and covers a wide range of functionalities from server management to benchmarking. My feedback focuses on a few areas where the design could be clarified or improved for better usability and implementation, such as the `--url` flag's auto-detection logic, configuration for the HTTP server port, endpoint discovery for the `describe` command, and the implications of client-side tokenization.
> All client commands use a unified `--url` flag:
> - KV cache targets: `--url localhost:5555` (ZMQ)
> - Engine targets: `--url http://localhost:8000` (HTTP)
The auto-detection logic for the `--url` flag could be more explicit. The document states `localhost:5555` implies ZMQ and `http://localhost:8000` implies HTTP. This suggests the detection is based on the presence of the `http://` scheme.

To improve clarity and consistency, I suggest two things:

- Explicitly document the detection rule (e.g., "if the URL starts with `http://` or `https://`, it's treated as an HTTP target; otherwise, it's assumed to be a `host:port` for a ZMQ target").
- Consider supporting the `tcp://` scheme for ZMQ targets (e.g., `--url tcp://localhost:5555`), which would align with the ZMQ endpoint format shown in the `describe kvcache` output.
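For concreteness, the detection rule suggested above (including the optional `tcp://` scheme) could be sketched like this; this is an illustration of the proposed rule, not LMCache's actual implementation:

```python
from urllib.parse import urlparse


def detect_target(url: str) -> tuple:
    """Classify a --url value as an HTTP or ZMQ target.

    Returns (kind, address). http(s):// means HTTP; tcp:// or a
    bare host:port means ZMQ.
    """
    parsed = urlparse(url)
    if parsed.scheme in ("http", "https"):
        return ("http", url)
    if parsed.scheme == "tcp":
        # Normalize tcp://host:port to the host:port form a ZMQ client expects.
        return ("zmq", parsed.netloc)
    # No recognized scheme: assume a bare host:port ZMQ target.
    # (urlparse treats "localhost:5555" as scheme "localhost", so we
    # deliberately fall through to here for bare host:port strings.)
    return ("zmq", url)
```

Centralizing the rule in one helper would also make it trivial to document and unit-test.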
> Replaces `python3 -m lmcache.v1.multiprocess.http_server`. Runs in foreground,
> Ctrl-C to stop. HTTP frontend is enabled by default; use `--no-http` to run
> ZMQ-only.
The design for `lmcache server` mentions that the HTTP frontend is enabled by default, and the `describe kvcache` example shows it running on `http://localhost:8000`. However, the `lmcache server` command example doesn't show an argument to configure this port.

The design should clarify how the HTTP frontend's port is determined. Is there a default value (e.g., 8000)? Can it be configured via a command-line argument (e.g., `--http-port`)? This detail seems to be missing from the `add_http_frontend_args()` composition.
> `describe kvcache` gathers data from multiple ZMQ request types (`NOOP` for debug
> info, `GET_CHUNK_SIZE` for chunk size) and `/api/status` (HTTP) to build a
> consolidated view.
The document states that `lmcache describe kvcache` gathers data from both ZMQ and HTTP (`/api/status`) endpoints to provide a consolidated view. However, the command example only accepts a single `--url` flag, which points to the ZMQ endpoint.

The design should specify how the CLI discovers the corresponding HTTP endpoint's address. A likely mechanism is that the ZMQ server provides the HTTP endpoint URL as part of its debug/status response. Explicitly stating this discovery mechanism would make the design clearer.
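One possible shape of that discovery logic, as a sketch; the field names (`http_url`, `http_enabled`) and the default port are assumptions for illustration, not the actual debug-response schema:

```python
from typing import Optional

# Assumed default; the design doc does not pin a default HTTP port.
DEFAULT_HTTP_PORT = 8000


def discover_http_endpoint(zmq_debug_info: dict, zmq_host: str) -> Optional[str]:
    """Find the HTTP frontend address from a ZMQ debug/status response.

    Returns None when the HTTP frontend is disabled (--no-http), letting
    `describe kvcache` degrade gracefully to a ZMQ-only view.
    """
    # Prefer an explicit advertisement from the server, if present.
    url = zmq_debug_info.get("http_url")
    if url:
        return url
    # If the server only reports that the frontend is enabled, fall back
    # to the conventional port on the same host as the ZMQ endpoint.
    if zmq_debug_info.get("http_enabled"):
        return f"http://{zmq_host}:{DEFAULT_HTTP_PORT}"
    return None
```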
```
$ lmcache query kvcache --url localhost:5555 \
    --prompt "{ffmpeg} What is the example usage of ffmpeg?" \
    --model meta-llama/Llama-3.1-8B-Instruct
```
The design for `lmcache query kvcache` states that the client will tokenize the `--prompt` to perform a cache lookup (lines 299-300). This implies that the CLI tool will have a dependency on a tokenizer library (like `transformers`) and may need to download model-specific tokenizer data.

This could make the CLI client quite "heavy" for what is often expected to be a lightweight tool.

Have you considered an alternative where the raw prompt is sent to the server, and the server (which already has the tokenizer and model context) performs the tokenization and lookup? This would keep the client thin and avoid versioning issues between the client's and server's tokenizers.
maobaolong
left a comment
@KuntaiDu I like this design; it is much better than my naive CLI implementation.
initial design of LMCache CLI Signed-off-by: KuntaiDu <kuntai@uchicago.edu> Signed-off-by: Aaron Wu <aaron.wu@dell.com>
What this PR does / why we need it:
This PR outlines the initial design of LMCache CLI. Looking for feedback.
# LMCache CLI Design
Status: Proposal | Date: 2026-03-11
## Why

Today users must remember `python3 -m lmcache.v1.multiprocess.http_server ...` and
similar module paths. We need a single `lmcache` command as the front door to all
LMCache functionality.
## Command Overview

Top-level verbs: `ping`, `query`, `describe`, `bench`, `kvcache`.

All client commands use a unified `--url` flag:

- `--url localhost:5555` (ZMQ)
- `--url http://localhost:8000` (HTTP)

## Commands in Detail
### `lmcache server`

Replaces `python3 -m lmcache.v1.multiprocess.http_server`. Runs in foreground,
Ctrl-C to stop. HTTP frontend is enabled by default; use `--no-http` to run
ZMQ-only.

```
lmcache server \
  --engine-type blend --host 0.0.0.0 --port 5555 \
  --l1-size-gb 60 --eviction-policy LRU \
  --no-http  # opt out of HTTP frontend
```

Server args are composed from existing helpers: `add_mp_server_args()`,
`add_storage_manager_args()`, `add_prometheus_args()`, `add_telemetry_args()`,
`add_http_frontend_args()`.

### `lmcache describe`

`describe kvcache` gathers data from multiple ZMQ request types (`NOOP` for debug
info, `GET_CHUNK_SIZE` for chunk size) and `/api/status` (HTTP) to build a
consolidated view.
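The consolidation step could be as simple as merging the three sources into one dict; a sketch, where every field name is illustrative rather than the real `NOOP`/`/api/status` schema:

```python
from typing import Optional


def consolidate_kvcache_view(debug_info: dict, chunk_size: int,
                             http_status: Optional[dict]) -> dict:
    """Merge the data behind `describe kvcache` into a single view.

    debug_info comes from the NOOP ZMQ request, chunk_size from
    GET_CHUNK_SIZE, and http_status from /api/status (None if the
    HTTP frontend is disabled).
    """
    view = {"chunk_size": chunk_size, **debug_info}
    if http_status is not None:
        # Prefix HTTP-derived fields so their origin stays visible
        # in the consolidated output.
        view.update({f"http.{k}": v for k, v in http_status.items()})
    return view
```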
### `lmcache ping`

Pure liveness check for both targets. Returns OK/FAIL with round-trip time,
measuring only the network round-trip excluding local Python overhead.

`ping kvcache` -- single `NOOP` round-trip over ZMQ:

```
$ lmcache ping kvcache --url localhost:5555
======= Ping KV Cache =======
Status: OK
Round trip time (ms): 0.42
==============================
```

`ping engine` -- single `/api/healthcheck` round-trip over HTTP:

```
$ lmcache ping engine --url http://localhost:8000
======== Ping Engine =========
Status: OK
Round trip time (ms): 12.3
==============================
```

### `lmcache query`

Single-shot query with detailed metrics. Use this to test a specific request
and see what happened.

`query engine` -- single inference request with TTFT/TPOT. Supports `{corpus}`
templates for realistic long-context prompts:

```
$ lmcache query engine --url http://localhost:8000 \
    --prompt "{ffmpeg} What is the example usage of ffmpeg?" --max-tokens 128
========== Query Engine Result ==========
Prompt tokens: 8192
  Corpus 'ffmpeg': 8186
  Query: 6
Output tokens: 128
-----------Latency Metrics---------------
TTFT (ms): 892.3
TPOT (ms/token): 11.8
Total latency (ms): 2403.7
Throughput (tokens/s): 53.2
=========================================
```

`query kvcache` -- query KV cache state for specific keys or tokens:

### `lmcache bench`

`bench kvcache` -- exercises store/retrieve/lookup over ZMQ. Includes a
correctness check: each retrieved KV cache chunk is checksummed against the
original stored data to verify integrity under load.
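The checksum comparison described above could look like this; a sketch, assuming SHA-256 over the raw chunk bytes (the design doc does not name a hash):

```python
import hashlib


def chunk_digest(chunk: bytes) -> str:
    # A stable digest of the raw KV chunk bytes.
    return hashlib.sha256(chunk).hexdigest()


def verify_roundtrip(stored: bytes, retrieved: bytes) -> bool:
    # Compare digests rather than raw buffers so the benchmark can discard
    # the stored copy and keep only its digest while the load runs.
    return chunk_digest(stored) == chunk_digest(retrieved)
```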
Use `--verify-only` to run the correctness check without reporting throughput
(useful in CI), or `--no-verify` to skip checksums for pure throughput
measurement.

`bench engine` -- superset of `vllm bench serve`. Same CLI args, same output
format, plus an extra LMCache KV cache metrics section:

LMCache-specific additions on top of vLLM args: `--url` (replaces `--port`),
`--prompt` with `{corpus}` templates, `--corpus name=path` for custom corpora.

### `lmcache kvcache`

## Prompt Corpora
`query engine`, `bench engine`, and `query kvcache` support `{name}` in
`--prompt` to expand built-in text corpora (e.g., `{paul_graham}` ~12k tokens,
`{ffmpeg}` ~8k tokens). Custom corpora: `--corpus my_doc=./file.txt`. Built-in
corpora ship in `lmcache/cli/corpora/`.

## Implementation Notes
### Architecture

- Plugin discovery: `pkgutil.iter_modules()` on the `commands/` package. Drop a
  new file in `commands/`, define `register_command(subparsers)`, done.
- `CommandRegistrar` protocol: Each command module exposes
  `register_command(subparsers)`, which adds a subparser and sets
  `parser.set_defaults(func=handler)`.
- `send_request()` helper: Creates a temporary `MessageQueueClient`, submits
  a ZMQ request, waits with timeout (default 5s), tears down. All ZMQ commands
  use this. Extended to handle HTTP targets alongside ZMQ.
- `argparse` with subparsers (no new deps). Reuses existing `add_*_args()`
  helpers.
- `--url` flag: Unified connection flag with auto-detection
  (`localhost:5555` → ZMQ, `http://localhost:8000` → HTTP).

### File layout
### Other notes

- Entry point: `lmcache = "lmcache.cli.main:main"` in `pyproject.toml`.
- `bench engine`: Wraps `vllm.benchmarks`, then queries `/api/status` for
  cache metrics.
- `query kvcache`: Tokenizes `--prompt` using the model's tokenizer, then
  performs a lookup over ZMQ to check which chunks are cached.
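The tokenize-then-lookup step for `query kvcache` implies splitting the token IDs into fixed-size chunks before querying. A sketch of that chunking, under the assumption that only full chunks are cacheable (the real chunking and hashing scheme lives in LMCache's lookup protocol):

```python
def chunk_token_ids(token_ids: list, chunk_size: int) -> list:
    """Split a tokenized prompt into the fixed-size chunks used for lookup.

    Trailing tokens that do not fill a whole chunk are dropped, on the
    assumption that partial chunks are never cached.
    """
    return [
        tuple(token_ids[i:i + chunk_size])
        for i in range(0, len(token_ids) - chunk_size + 1, chunk_size)
    ]
```

Each chunk would then be hashed into a cache key and checked over ZMQ; `chunk_size` itself comes from the server's `GET_CHUNK_SIZE` response.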
### Phasing

1. `server`, `ping kvcache`, `kvcache clear`, `kvcache end-session`,
   `describe kvcache`, entry point
2. `ping engine`, `query engine`, `query kvcache`, `bench engine`,
   `bench kvcache`, `describe engine`, corpora
3. `kvcache evict` (future)

Existing `lmcache_server` entry point kept as a deprecated alias for 2 minor
releases.