
[RFC] Design of LMCache CLI #2748

Merged
maobaolong merged 1 commit into LMCache:dev from KuntaiDu:kuntai-cli on Mar 12, 2026

Conversation

@KuntaiDu
Contributor

What this PR does / why we need it:

This PR outlines the initial design of the LMCache CLI. Looking for feedback.

LMCache CLI Design

Status: Proposal | Date: 2026-03-11

Why

Today users must remember python3 -m lmcache.v1.multiprocess.http_server ... and
similar module paths. We need a single lmcache command as the front door to all
LMCache functionality.

Command Overview

lmcache
├── server                          # Launch LMCache server (ZMQ + HTTP)
├── describe {kvcache,engine}       # Rich status view of a running endpoint
├── ping     {kvcache,engine}       # Pure liveness check (OK/FAIL)
├── query    {kvcache,engine}       # Single-shot query with metrics
├── bench    {kvcache,engine}       # Sustained performance benchmarking
└── kvcache  {clear,end-session}    # KV cache management actions
| Verb | Question it answers | Weight |
|----------|--------------------------------------|--------------------------------------|
| ping | Is it alive? | Single-shot, instant (OK/FAIL) |
| query | What happens when I send one request? | Single-shot, with metrics |
| describe | What is this thing? | Rich status dashboard |
| bench | How fast is it? | Multi-iteration, metrics-heavy |
| kvcache | Mutate cache state | Clear, end-session, evict (future) |

All client commands use a unified --url flag:

  • KV cache targets: --url localhost:5555 (ZMQ)
  • Engine targets: --url http://localhost:8000 (HTTP)
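
A minimal sketch of the auto-detection rule this implies (the helper name and the tcp:// handling are assumptions, not part of this proposal):

```python
def classify_url(url: str) -> str:
    """Hypothetical helper: classify a --url value as an HTTP or ZMQ target.

    Assumed rule: an explicit http:// or https:// scheme marks an HTTP
    (engine) target; anything else is treated as host:port for ZMQ.
    """
    if url.startswith(("http://", "https://")):
        return "http"
    if url.startswith("tcp://"):  # tolerate the ZMQ endpoint form as well
        return "zmq"
    return "zmq"                  # bare host:port defaults to ZMQ
```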

Commands in Detail

lmcache server

Replaces python3 -m lmcache.v1.multiprocess.http_server. Runs in foreground,
Ctrl-C to stop. HTTP frontend is enabled by default; use --no-http to run
ZMQ-only.

lmcache server \
    --engine-type blend --host 0.0.0.0 --port 5555 \
    --l1-size-gb 60 --eviction-policy LRU \
    --no-http  # opt out of HTTP frontend

Server args are composed from existing helpers: add_mp_server_args(),
add_storage_manager_args(), add_prometheus_args(), add_telemetry_args(),
add_http_frontend_args().
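
A sketch of how the server subcommand might compose these helpers under the CommandRegistrar protocol described later (the import location, helper signatures, and run_server handler are assumptions):

```python
# commands/server.py -- sketch only; assumes each add_*_args helper
# takes an argparse parser, and that the import location below is right.
from lmcache.v1.multiprocess.http_server import (
    add_mp_server_args,
    add_storage_manager_args,
    add_prometheus_args,
    add_telemetry_args,
    add_http_frontend_args,
)

def run_server(args):
    ...  # hypothetical: build config from args, start ZMQ (+ HTTP) servers

def register_command(subparsers):
    parser = subparsers.add_parser("server", help="Launch LMCache server")
    # Compose the existing argument groups onto the subparser.
    for add_args in (
        add_mp_server_args,
        add_storage_manager_args,
        add_prometheus_args,
        add_telemetry_args,
        add_http_frontend_args,
    ):
        add_args(parser)
    parser.set_defaults(func=run_server)
```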

lmcache describe

$ lmcache describe kvcache --url localhost:5555

============ LMCache KV Cache Service ============
Health:                                  OK
ZMQ endpoint:                            tcp://localhost:5555
HTTP endpoint:                           http://localhost:8000
Engine type:                             blend
Chunk size:                              256
L1 capacity (GB):                        60.0
L1 used (GB):                            42.3 (70.5%)
Eviction policy:                         LRU
Cached objects:                          1024
Uptime:                                  2h 14m 32s
==================================================

$ lmcache describe engine --url http://localhost:8000

================ Inference Engine ================
Model:                                   meta-llama/Llama-3.1-70B-Instruct
Max context (tokens):                    131072
Status:                                  healthy
Running requests:                        3
==================================================

describe kvcache gathers data from multiple ZMQ request types (NOOP for debug
info, GET_CHUNK_SIZE for chunk size) and /api/status (HTTP) to build a
consolidated view.

lmcache ping

Pure liveness check for both targets. Returns OK/FAIL with the round-trip time,
measuring only the network round trip and excluding local Python overhead.
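
A sketch of the timing placement this implies: client construction and teardown stay outside the timed region so only the socket round trip is measured (the send/recv API shown is an assumption):

```python
import time

def timed_round_trip(client, request) -> float:
    """One request/reply round trip, in milliseconds.

    `client` is assumed to be an already-connected MessageQueueClient
    with blocking send/recv; setup cost is excluded from the measurement.
    """
    start = time.perf_counter()
    client.send(request)  # assumed API; for ping this is a NOOP message
    client.recv()
    return (time.perf_counter() - start) * 1000.0
```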

ping kvcache -- single NOOP round-trip over ZMQ:

$ lmcache ping kvcache --url localhost:5555

======= Ping KV Cache =======
Status:                  OK
Round trip time (ms):    0.42
==============================

ping engine -- single /api/healthcheck round-trip over HTTP:

$ lmcache ping engine --url http://localhost:8000

======== Ping Engine =========
Status:                  OK
Round trip time (ms):    12.3
==============================

lmcache query

Single-shot query with detailed metrics. Use this to test a specific request
and see what happened.

query engine -- single inference request with TTFT/TPOT. Supports {corpus}
templates for realistic long-context prompts:

$ lmcache query engine --url http://localhost:8000 \
    --prompt "{ffmpeg} What is the example usage of ffmpeg?" --max-tokens 128

========== Query Engine Result ==========
Prompt tokens:                           8192
  Corpus 'ffmpeg':                       8186
  Query:                                 6
Output tokens:                           128
-----------Latency Metrics---------------
TTFT (ms):                               892.3
TPOT (ms/token):                         11.8
Total latency (ms):                      2403.7
Throughput (tokens/s):                   53.2
=========================================

query kvcache -- query KV cache state for specific keys or tokens:

# Check if a specific token sequence is cached (lookup)
$ lmcache query kvcache --url localhost:5555 \
    --prompt "{ffmpeg} What is the example usage of ffmpeg?" \
    --model meta-llama/Llama-3.1-8B-Instruct

======== Query KV Cache Result ==========
Prompt tokens:                           8192
Cached chunks:                           30/32 (93.8%)
Cached tokens:                           7680/8192
Cache status:                            HIT (partial)
=========================================

# Store-retrieve round-trip with latency and correctness
$ lmcache query kvcache --url localhost:5555 --round-trip

==== Query KV Cache Result (round-trip) ====
Store latency (ms):                      1.23
Retrieve latency (ms):                   0.87
Checksum:                                OK
============================================

lmcache bench

bench kvcache -- exercises store/retrieve/lookup over ZMQ. Includes a
correctness check: each retrieved KV cache chunk is checksummed against the original
stored data to verify integrity under load.

$ lmcache bench kvcache --url localhost:5555 --duration 30

========= Bench KV Cache Result (30s) =========
--------------Operations (ops/s)----------------
Store:                                   41.3
Retrieve:                                127.3
Lookup:                                  281.7
-----------------Hit Rate-----------------------
L1:                                      92.3%
L2:                                      67.8%
---------------Bandwidth (GB/s)-----------------
L1 read:                                 12.4
L1 write:                                8.7
L2 read:                                 2.1
L2 write:                                1.4
--------------Correctness-----------------------
Checksums:                               5060/5060 OK
================================================

Use --verify-only to run the correctness check without reporting throughput
(useful in CI), or --no-verify to skip checksums for pure throughput measurement.
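
One way the per-chunk verification could work, with SHA-256 as an assumed digest (any fast hash would do):

```python
import hashlib

# Hypothetical bench-loop bookkeeping: record a digest per chunk key at
# store time, then re-check it whenever that chunk comes back from retrieve.
digests = {}

def on_store(key, payload):
    digests[key] = hashlib.sha256(payload).hexdigest()

def on_retrieve(key, payload):
    # Feeds the "Checksums: N/N OK" line in the report above.
    return digests.get(key) == hashlib.sha256(payload).hexdigest()
```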

bench engine -- superset of vllm bench serve. Same CLI args, same output
format, plus an extra LMCache KV cache metrics section:

# vllm bench serve compatible -- just swap the command name
$ lmcache bench engine \
    --url http://localhost:8000 \
    --model meta-llama/Llama-3.1-8B-Instruct \
    --dataset-name random --random-input-len 7500 --random-output-len 200 \
    --num-prompts 30 --request-rate 1 --ignore-eos

============ Serving Benchmark Result ============
Successful requests:                     30
Benchmark duration (s):                  31.34
Total input tokens:                      224970
Total generated tokens:                  6000
Request throughput (req/s):              0.96
Output token throughput (tok/s):         191.44
Total Token throughput (tok/s):          7369.36
---------------Time to First Token----------------
Mean TTFT (ms):                          313.41
Median TTFT (ms):                        272.83
P99 TTFT (ms):                           837.32
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          8.84
Median TPOT (ms):                        8.72
P99 TPOT (ms):                           11.35
----------LMCache KV Cache Performance------------
KV cache hit rate (L1):                  92.3%
KV cache hit rate (L2):                  67.8%
L1 read bandwidth:                       12.4 GB/s
L1 write bandwidth:                      8.7 GB/s
Avg tokens saved by cache (per req):     6420
Cache-assisted TTFT savings (est.):      58.2%
==================================================

LMCache-specific additions on top of vLLM args: --url (replaces --port),
--prompt with {corpus} templates, --corpus name=path for custom corpora.

lmcache kvcache

$ lmcache kvcache clear --url localhost:5555

========== KV Cache Clear ==========
Status:                              OK
Objects removed:                     1024
====================================

$ lmcache kvcache end-session --url localhost:5555 <request_id>

======== KV Cache End Session ========
Status:                              OK
Request ID:                          <request_id>
======================================

Prompt Corpora

query engine, bench engine, and query kvcache support {name} in --prompt
to expand built-in text corpora (e.g., {paul_graham} ~12k tokens, {ffmpeg}
~8k tokens). Custom corpora: --corpus my_doc=./file.txt. Built-in corpora ship
in lmcache/cli/corpora/.
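
A sketch of how the {name} expansion could work (the function name, registry layout, and .txt extension for built-ins are assumptions):

```python
import re
from pathlib import Path
from typing import Dict, Optional

CORPORA_DIR = Path(__file__).parent / "corpora"  # lmcache/cli/corpora/

def expand_prompt(prompt: str,
                  custom: Optional[Dict[str, Path]] = None) -> str:
    """Replace each {name} placeholder with the text of that corpus.

    Custom corpora (--corpus my_doc=./file.txt) take precedence over
    the built-ins shipped with the package.
    """
    custom = custom or {}

    def lookup(match: re.Match) -> str:
        name = match.group(1)
        path = custom.get(name, CORPORA_DIR / f"{name}.txt")
        return path.read_text()

    return re.sub(r"\{(\w+)\}", lookup, prompt)
```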

Implementation Notes

Architecture

  • Auto-discovery: Commands are discovered via pkgutil.iter_modules() on the
    commands/ package. Drop a new file in commands/, define
    register_command(subparsers), done (see the sketch after this list).
  • CommandRegistrar protocol: Each command module exposes
    register_command(subparsers) which adds a subparser and sets
    parser.set_defaults(func=handler).
  • send_request() helper: Creates a temporary MessageQueueClient, submits
    a ZMQ request, waits with timeout (default 5s), tears down. All ZMQ commands
    use this. Extended to handle HTTP targets alongside ZMQ.
  • Framework: argparse with subparsers (no new deps). Reuses existing
    add_*_args() helpers.
  • --url flag: Unified connection flag with auto-detection
    (localhost:5555 → ZMQ, http://localhost:8000 → HTTP).
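
A minimal sketch of the discovery-and-dispatch loop (error handling and the per-command handlers are elided):

```python
# lmcache/cli/__main__.py -- sketch of auto-discovery + dispatch
import argparse
import importlib
import pkgutil

from lmcache.cli import commands  # the commands/ package shown below

def main() -> None:
    parser = argparse.ArgumentParser(prog="lmcache")
    subparsers = parser.add_subparsers(dest="command", required=True)

    # Every module dropped into commands/ registers its own subparser.
    for mod_info in pkgutil.iter_modules(commands.__path__):
        module = importlib.import_module(
            f"{commands.__name__}.{mod_info.name}")
        module.register_command(subparsers)

    args = parser.parse_args()
    args.func(args)  # handler installed via parser.set_defaults(func=...)

if __name__ == "__main__":
    main()
```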

File layout

lmcache/cli/
├── __init__.py
├── __main__.py          # Auto-discovery + dispatch
├── base.py              # send_request(), add_url_arg(), CommandRegistrar
├── commands/
│   ├── server.py        # lmcache server
│   ├── describe.py      # lmcache describe {kvcache,engine}
│   ├── ping.py          # lmcache ping {kvcache,engine}
│   ├── query.py         # lmcache query {kvcache,engine}
│   ├── bench.py         # lmcache bench {kvcache,engine}
│   └── kvcache.py       # lmcache kvcache {clear,end-session}
└── corpora/             # Built-in prompt corpora

Other notes

  • Entry point: lmcache = "lmcache.cli.main:main" in pyproject.toml.
  • bench engine: Wraps vllm.benchmarks, then queries /api/status for
    cache metrics.
  • query kvcache: Tokenizes --prompt using the model's tokenizer, then
    performs a lookup over ZMQ to check which chunks are cached (sketched
    below).
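
A sketch of the client-side token-to-chunk math this implies (the transformers dependency is implied by the design; the function name is hypothetical):

```python
import math
from transformers import AutoTokenizer

def count_prompt_chunks(prompt: str, model: str, chunk_size: int) -> int:
    """Tokenize the prompt and compute how many fixed-size chunks it spans.

    chunk_size would come from a GET_CHUNK_SIZE request (256 in the
    describe example above).
    """
    tokenizer = AutoTokenizer.from_pretrained(model)
    num_tokens = len(tokenizer.encode(prompt))
    return math.ceil(num_tokens / chunk_size)

# 8192 tokens at chunk_size=256 -> 32 chunks; a lookup reporting 30
# cached chunks yields "30/32 (93.8%)" and 30 * 256 = 7680 cached tokens,
# as in the query kvcache example above.
```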

Phasing

| Phase | Scope |
|-------|-------|
| 1 | server, ping kvcache, kvcache clear, kvcache end-session, describe kvcache, entry point |
| 2 | ping engine, query engine, query kvcache, bench engine, bench kvcache, describe engine, corpora |
| 3 | kvcache evict (future) |

The existing lmcache_server entry point is kept as a deprecated alias for two minor releases.

Signed-off-by: KuntaiDu <kuntai@uchicago.edu>
@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a comprehensive design proposal for a new lmcache command-line interface. The primary goal is to streamline user interaction with LMCache by providing a single, intuitive entry point for all functionalities, from launching servers to querying and benchmarking. This design aims to enhance usability and provide robust tools for managing and monitoring LMCache components, moving away from fragmented python3 -m commands.

Highlights

  • Unified CLI Introduction: Proposed a new lmcache command-line interface to consolidate various LMCache functionalities, replacing cumbersome python3 -m commands.
  • Command Design: Outlined detailed designs for several subcommands including server, describe, ping, query, bench, and kvcache, each serving specific operational or diagnostic purposes.
  • Prompt Corpora Support: Introduced support for built-in and custom prompt corpora, enhancing the query engine, bench engine, and query kvcache commands.
  • Architectural Principles: Detailed the underlying architecture, including auto-discovery of commands, a CommandRegistrar protocol, a unified send_request() helper, and argparse framework usage.
  • Phased Implementation Plan: Presented a three-phase plan for implementing the CLI, starting with core server and KV cache management commands, then expanding to engine interactions and benchmarking.


Changelog
  • docs/design/cli-design.md
    • Added a detailed design document outlining the proposed LMCache CLI structure, commands, and implementation notes.
Activity
  • Pull request opened, proposing the initial design for the LMCache CLI.

@gemini-code-assist (Bot) left a comment

Code Review

This pull request introduces a comprehensive design document for the LMCache CLI. The design is well-structured and covers a wide range of functionalities from server management to benchmarking. My feedback focuses on a few areas where the design could be clarified or improved for better usability and implementation, such as the --url flag's auto-detection logic, configuration for the HTTP server port, endpoint discovery for the describe command, and the implications of client-side tokenization.

Comment thread: docs/design/cli-design.md, lines +31 to +33

> All client commands use a unified `--url` flag:
> - KV cache targets: `--url localhost:5555` (ZMQ)
> - Engine targets: `--url http://localhost:8000` (HTTP)

Severity: medium

The auto-detection logic for the --url flag could be more explicit. The document states localhost:5555 implies ZMQ and http://localhost:8000 implies HTTP. This suggests the detection is based on the presence of the http:// scheme.

To improve clarity and consistency, I suggest two things:

  1. Explicitly document the detection rule (e.g., "if the URL starts with http:// or https://, it's treated as an HTTP target; otherwise, it's assumed to be a host:port for a ZMQ target").
  2. Consider supporting the tcp:// scheme for ZMQ targets (e.g., --url tcp://localhost:5555), which would align with the ZMQ endpoint format shown in the describe kvcache output.

Comment thread: docs/design/cli-design.md, lines +41 to +43

> Replaces `python3 -m lmcache.v1.multiprocess.http_server`. Runs in foreground,
> Ctrl-C to stop. HTTP frontend is enabled by default; use `--no-http` to run
> ZMQ-only.

Severity: medium

The design for lmcache server mentions that the HTTP frontend is enabled by default, and the describe kvcache example shows it running on http://localhost:8000. However, the lmcache server command example doesn't show an argument to configure this port.

The design should clarify how the HTTP frontend's port is determined. Is there a default value (e.g., 8000)? Can it be configured via a command-line argument (e.g., --http-port)? This detail seems to be missing from the add_http_frontend_args() composition.

Comment thread: docs/design/cli-design.md, lines +84 to +86

> `describe kvcache` gathers data from multiple ZMQ request types (`NOOP` for debug
> info, `GET_CHUNK_SIZE` for chunk size) and `/api/status` (HTTP) to build a
> consolidated view.

Severity: medium

The document states that lmcache describe kvcache gathers data from both ZMQ and HTTP (/api/status) endpoints to provide a consolidated view. However, the command example only accepts a single --url flag, which points to the ZMQ endpoint.

The design should specify how the CLI discovers the corresponding HTTP endpoint's address. A likely mechanism is that the ZMQ server provides the HTTP endpoint URL as part of its debug/status response. Explicitly stating this discovery mechanism would make the design clearer.

Comment thread: docs/design/cli-design.md, lines +141 to +143

> $ lmcache query kvcache --url localhost:5555 \
>     --prompt "{ffmpeg} What is the example usage of ffmpeg?" \
>     --model meta-llama/Llama-3.1-8B-Instruct

Severity: medium

The design for lmcache query kvcache states that the client will tokenize the --prompt to perform a cache lookup (lines 299-300). This implies that the CLI tool will have a dependency on a tokenizer library (like transformers) and may need to download model-specific tokenizer data.

This could make the CLI client quite "heavy" for what is often expected to be a lightweight tool.

Have you considered an alternative where the raw prompt is sent to the server, and the server (which already has the tokenizer and model context) performs the tokenization and lookup? This would keep the client thin and avoid versioning issues between the client's and server's tokenizers.

@maobaolong (Collaborator) left a comment

@KuntaiDu I like this design; this is much better than my naive CLI implementation.

@maobaolong enabled auto-merge (squash) March 12, 2026 07:23
@github-actions (Bot) added the `full` (Run comprehensive tests on this PR) label Mar 12, 2026
@chunxiaozheng (Collaborator) left a comment

Good idea! LGTM!

@maobaolong merged commit 40ac3e1 into LMCache:dev on Mar 12, 2026
24 of 25 checks passed
realAaronWu pushed a commit to realAaronWu/LMCache that referenced this pull request Mar 20, 2026
initial design of LMCache CLI

Signed-off-by: KuntaiDu <kuntai@uchicago.edu>
Signed-off-by: Aaron Wu <aaron.wu@dell.com>
jooho-XCENA pushed a commit to xcena-dev/LMCache that referenced this pull request Apr 2, 2026
initial design of LMCache CLI

Signed-off-by: KuntaiDu <kuntai@uchicago.edu>
