
[RFC] Design of LMCache CLI #2748

Merged
maobaolong merged 1 commit into LMCache:dev from KuntaiDu:kuntai-cli on Mar 12, 2026

Conversation

@KuntaiDu
Contributor

What this PR does / why we need it:

This PR outlines the initial design of the LMCache CLI. Looking for feedback.

LMCache CLI Design

Status: Proposal | Date: 2026-03-11

Why

Today users must remember python3 -m lmcache.v1.multiprocess.http_server ... and
similar module paths. We need a single lmcache command as the front door to all
LMCache functionality.

Command Overview

lmcache
├── server                          # Launch LMCache server (ZMQ + HTTP)
├── describe {kvcache,engine}       # Rich status view of a running endpoint
├── ping     {kvcache,engine}       # Pure liveness check (OK/FAIL)
├── query    {kvcache,engine}       # Single-shot query with metrics
├── bench    {kvcache,engine}       # Sustained performance benchmarking
└── kvcache  {clear,end-session}    # KV cache management actions
| Verb | Question it answers | Weight |
|----------|--------------------------------------|--------------------------------------|
| ping | Is it alive? | Single-shot, instant (OK/FAIL) |
| query | What happens when I send one request? | Single-shot, with metrics |
| describe | What is this thing? | Rich status dashboard |
| bench | How fast is it? | Multi-iteration, metrics-heavy |
| kvcache | Mutate cache state | Clear, end-session, evict (future) |

All client commands use a unified --url flag:

  • KV cache targets: --url localhost:5555 (ZMQ)
  • Engine targets: --url http://localhost:8000 (HTTP)
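
A minimal sketch of the auto-detection rule this implies (the helper name and the tcp:// handling are assumptions, not part of this proposal):

```python
def classify_url(url: str) -> str:
    """Hypothetical helper: classify a --url value as an HTTP or ZMQ target.

    Assumed rule: an explicit http:// or https:// scheme marks an HTTP
    (engine) target; anything else is treated as host:port for ZMQ.
    """
    if url.startswith(("http://", "https://")):
        return "http"
    if url.startswith("tcp://"):  # tolerate the ZMQ endpoint form as well
        return "zmq"
    return "zmq"                  # bare host:port defaults to ZMQ
```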

Commands in Detail

lmcache server

Replaces python3 -m lmcache.v1.multiprocess.http_server. Runs in foreground,
Ctrl-C to stop. HTTP frontend is enabled by default; use --no-http to run
ZMQ-only.

lmcache server \
    --engine-type blend --host 0.0.0.0 --port 5555 \
    --l1-size-gb 60 --eviction-policy LRU \
    --no-http  # opt out of HTTP frontend

Server args are composed from existing helpers: add_mp_server_args(),
add_storage_manager_args(), add_prometheus_args(), add_telemetry_args(),
add_http_frontend_args().
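
A sketch of how the server subcommand might compose these helpers under the CommandRegistrar protocol described later (the import location, helper signatures, and run_server handler are assumptions):

```python
# commands/server.py -- sketch only; assumes each add_*_args helper
# takes an argparse parser, and that the import location below is right.
from lmcache.v1.multiprocess.http_server import (
    add_mp_server_args,
    add_storage_manager_args,
    add_prometheus_args,
    add_telemetry_args,
    add_http_frontend_args,
)

def run_server(args):
    ...  # hypothetical: build config from args, start ZMQ (+ HTTP) servers

def register_command(subparsers):
    parser = subparsers.add_parser("server", help="Launch LMCache server")
    # Compose the existing argument groups onto the subparser.
    for add_args in (
        add_mp_server_args,
        add_storage_manager_args,
        add_prometheus_args,
        add_telemetry_args,
        add_http_frontend_args,
    ):
        add_args(parser)
    parser.set_defaults(func=run_server)
```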

lmcache describe

$ lmcache describe kvcache --url localhost:5555

============ LMCache KV Cache Service ============
Health:                                  OK
ZMQ endpoint:                            tcp://localhost:5555
HTTP endpoint:                           http://localhost:8000
Engine type:                             blend
Chunk size:                              256
L1 capacity (GB):                        60.0
L1 used (GB):                            42.3 (70.5%)
Eviction policy:                         LRU
Cached objects:                          1024
Uptime:                                  2h 14m 32s
==================================================

$ lmcache describe engine --url http://localhost:8000

================ Inference Engine ================
Model:                                   meta-llama/Llama-3.1-70B-Instruct
Max context (tokens):                    131072
Status:                                  healthy
Running requests:                        3
==================================================

describe kvcache gathers data from multiple ZMQ request types (NOOP for debug
info, GET_CHUNK_SIZE for chunk size) and /api/status (HTTP) to build a
consolidated view.

lmcache ping

Pure liveness check for both targets. Returns OK/FAIL with the round-trip time,
measuring only the network round trip and excluding local Python overhead.
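
A sketch of the timing placement this implies: client construction and teardown stay outside the timed region so only the socket round trip is measured (the send/recv API shown is an assumption):

```python
import time

def timed_round_trip(client, request) -> float:
    """One request/reply round trip, in milliseconds.

    `client` is assumed to be an already-connected MessageQueueClient
    with blocking send/recv; setup cost is excluded from the measurement.
    """
    start = time.perf_counter()
    client.send(request)  # assumed API; for ping this is a NOOP message
    client.recv()
    return (time.perf_counter() - start) * 1000.0
```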

ping kvcache -- single NOOP round-trip over ZMQ:

$ lmcache ping kvcache --url localhost:5555

======= Ping KV Cache =======
Status:                  OK
Round trip time (ms):    0.42
==============================

ping engine -- single /api/healthcheck round-trip over HTTP:

$ lmcache ping engine --url http://localhost:8000

======== Ping Engine =========
Status:                  OK
Round trip time (ms):    12.3
==============================

lmcache query

Single-shot query with detailed metrics. Use this to test a specific request
and see what happened.

query engine -- single inference request with TTFT/TPOT. Supports {corpus}
templates for realistic long-context prompts:

$ lmcache query engine --url http://localhost:8000 \
    --prompt "{ffmpeg} What is the example usage of ffmpeg?" --max-tokens 128

========== Query Engine Result ==========
Prompt tokens:                           8192
  Corpus 'ffmpeg':                       8186
  Query:                                 6
Output tokens:                           128
-----------Latency Metrics---------------
TTFT (ms):                               892.3
TPOT (ms/token):                         11.8
Total latency (ms):                      2403.7
Throughput (tokens/s):                   53.2
=========================================

query kvcache -- query KV cache state for specific keys or tokens:

# Check if a specific token sequence is cached (lookup)
$ lmcache query kvcache --url localhost:5555 \
    --prompt "{ffmpeg} What is the example usage of ffmpeg?" \
    --model meta-llama/Llama-3.1-8B-Instruct

======== Query KV Cache Result ==========
Prompt tokens:                           8192
Cached chunks:                           30/32 (93.8%)
Cached tokens:                           7680/8192
Cache status:                            HIT (partial)
=========================================

# Store-retrieve round-trip with latency and correctness
$ lmcache query kvcache --url localhost:5555 --round-trip

==== Query KV Cache Result (round-trip) ====
Store latency (ms):                      1.23
Retrieve latency (ms):                   0.87
Checksum:                                OK
============================================

lmcache bench

bench kvcache -- exercises store/retrieve/lookup over ZMQ. Includes a
correctness check: each retrieved KV cache chunk is checksummed against the original
stored data to verify integrity under load.

$ lmcache bench kvcache --url localhost:5555 --duration 30

========= Bench KV Cache Result (30s) =========
--------------Operations (ops/s)----------------
Store:                                   41.3
Retrieve:                                127.3
Lookup:                                  281.7
-----------------Hit Rate-----------------------
L1:                                      92.3%
L2:                                      67.8%
---------------Bandwidth (GB/s)-----------------
L1 read:                                 12.4
L1 write:                                8.7
L2 read:                                 2.1
L2 write:                                1.4
--------------Correctness-----------------------
Checksums:                               5060/5060 OK
================================================

Use --verify-only to run the correctness check without reporting throughput
(useful in CI), or --no-verify to skip checksums for pure throughput measurement.
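
One way the per-chunk verification could work, with SHA-256 as an assumed digest (any fast hash would do):

```python
import hashlib

# Hypothetical bench-loop bookkeeping: record a digest per chunk key at
# store time, then re-check it whenever that chunk comes back from retrieve.
digests = {}

def on_store(key, payload):
    digests[key] = hashlib.sha256(payload).hexdigest()

def on_retrieve(key, payload):
    # Feeds the "Checksums: N/N OK" line in the report above.
    return digests.get(key) == hashlib.sha256(payload).hexdigest()
```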

bench engine -- superset of vllm bench serve. Same CLI args, same output
format, plus an extra LMCache KV cache metrics section:

# vllm bench serve compatible -- just swap the command name
$ lmcache bench engine \
    --url http://localhost:8000 \
    --model meta-llama/Llama-3.1-8B-Instruct \
    --dataset-name random --random-input-len 7500 --random-output-len 200 \
    --num-prompts 30 --request-rate 1 --ignore-eos

============ Serving Benchmark Result ============
Successful requests:                     30
Benchmark duration (s):                  31.34
Total input tokens:                      224970
Total generated tokens:                  6000
Request throughput (req/s):              0.96
Output token throughput (tok/s):         191.44
Total Token throughput (tok/s):          7369.36
---------------Time to First Token----------------
Mean TTFT (ms):                          313.41
Median TTFT (ms):                        272.83
P99 TTFT (ms):                           837.32
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          8.84
Median TPOT (ms):                        8.72
P99 TPOT (ms):                           11.35
----------LMCache KV Cache Performance------------
KV cache hit rate (L1):                  92.3%
KV cache hit rate (L2):                  67.8%
L1 read bandwidth:                       12.4 GB/s
L1 write bandwidth:                      8.7 GB/s
Avg tokens saved by cache (per req):     6420
Cache-assisted TTFT savings (est.):      58.2%
==================================================

LMCache-specific additions on top of vLLM args: --url (replaces --port),
--prompt with {corpus} templates, --corpus name=path for custom corpora.

lmcache kvcache

$ lmcache kvcache clear --url localhost:5555

========== KV Cache Clear ==========
Status:                              OK
Objects removed:                     1024
====================================

$ lmcache kvcache end-session --url localhost:5555 <request_id>

======== KV Cache End Session ========
Status:                              OK
Request ID:                          <request_id>
======================================

Prompt Corpora

query engine, bench engine, and query kvcache support {name} in --prompt
to expand built-in text corpora (e.g., {paul_graham} ~12k tokens, {ffmpeg}
~8k tokens). Custom corpora: --corpus my_doc=./file.txt. Built-in corpora ship
in lmcache/cli/corpora/.
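
A sketch of how the {name} expansion could work (the function name, registry layout, and .txt extension for built-ins are assumptions):

```python
import re
from pathlib import Path
from typing import Dict, Optional

CORPORA_DIR = Path(__file__).parent / "corpora"  # lmcache/cli/corpora/

def expand_prompt(prompt: str,
                  custom: Optional[Dict[str, Path]] = None) -> str:
    """Replace each {name} placeholder with the text of that corpus.

    Custom corpora (--corpus my_doc=./file.txt) take precedence over
    the built-ins shipped with the package.
    """
    custom = custom or {}

    def lookup(match: re.Match) -> str:
        name = match.group(1)
        path = custom.get(name, CORPORA_DIR / f"{name}.txt")
        return path.read_text()

    return re.sub(r"\{(\w+)\}", lookup, prompt)
```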

Implementation Notes

Architecture

  • Auto-discovery: Commands are discovered via pkgutil.iter_modules() on the
    commands/ package. Drop a new file in commands/, define
    register_command(subparsers), done (see the sketch after this list).
  • CommandRegistrar protocol: Each command module exposes
    register_command(subparsers) which adds a subparser and sets
    parser.set_defaults(func=handler).
  • send_request() helper: Creates a temporary MessageQueueClient, submits
    a ZMQ request, waits with timeout (default 5s), tears down. All ZMQ commands
    use this. Extended to handle HTTP targets alongside ZMQ.
  • Framework: argparse with subparsers (no new deps). Reuses existing
    add_*_args() helpers.
  • --url flag: Unified connection flag with auto-detection
    (localhost:5555 → ZMQ, http://localhost:8000 → HTTP).
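
A minimal sketch of the discovery-and-dispatch loop (error handling and the per-command handlers are elided):

```python
# lmcache/cli/__main__.py -- sketch of auto-discovery + dispatch
import argparse
import importlib
import pkgutil

from lmcache.cli import commands  # the commands/ package shown below

def main() -> None:
    parser = argparse.ArgumentParser(prog="lmcache")
    subparsers = parser.add_subparsers(dest="command", required=True)

    # Every module dropped into commands/ registers its own subparser.
    for mod_info in pkgutil.iter_modules(commands.__path__):
        module = importlib.import_module(
            f"{commands.__name__}.{mod_info.name}")
        module.register_command(subparsers)

    args = parser.parse_args()
    args.func(args)  # handler installed via parser.set_defaults(func=...)

if __name__ == "__main__":
    main()
```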

File layout

lmcache/cli/
├── __init__.py
├── __main__.py          # Auto-discovery + dispatch
├── base.py              # send_request(), add_url_arg(), CommandRegistrar
├── commands/
│   ├── server.py        # lmcache server
│   ├── describe.py      # lmcache describe {kvcache,engine}
│   ├── ping.py          # lmcache ping {kvcache,engine}
│   ├── query.py         # lmcache query {kvcache,engine}
│   ├── bench.py         # lmcache bench {kvcache,engine}
│   └── kvcache.py       # lmcache kvcache {clear,end-session}
└── corpora/             # Built-in prompt corpora

Other notes

  • Entry point: lmcache = "lmcache.cli.main:main" in pyproject.toml.
  • bench engine: Wraps vllm.benchmarks, then queries /api/status for
    cache metrics.
  • query kvcache: Tokenizes --prompt using the model's tokenizer, then
    performs a lookup over ZMQ to check which chunks are cached (sketched
    below).
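
A sketch of the client-side token-to-chunk math this implies (the transformers dependency is implied by the design; the function name is hypothetical):

```python
import math
from transformers import AutoTokenizer

def count_prompt_chunks(prompt: str, model: str, chunk_size: int) -> int:
    """Tokenize the prompt and compute how many fixed-size chunks it spans.

    chunk_size would come from a GET_CHUNK_SIZE request (256 in the
    describe example above).
    """
    tokenizer = AutoTokenizer.from_pretrained(model)
    num_tokens = len(tokenizer.encode(prompt))
    return math.ceil(num_tokens / chunk_size)

# 8192 tokens at chunk_size=256 -> 32 chunks; a lookup reporting 30
# cached chunks yields "30/32 (93.8%)" and 30 * 256 = 7680 cached tokens,
# as in the query kvcache example above.
```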

Phasing

| Phase | Scope |
|-------|-------|
| 1 | server, ping kvcache, kvcache clear, kvcache end-session, describe kvcache, entry point |
| 2 | ping engine, query engine, query kvcache, bench engine, bench kvcache, describe engine, corpora |
| 3 | kvcache evict (future) |

The existing lmcache_server entry point is kept as a deprecated alias for two minor releases.

Signed-off-by: KuntaiDu <kuntai@uchicago.edu>
@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a comprehensive design proposal for a new lmcache command-line interface. The primary goal is to streamline user interaction with LMCache by providing a single, intuitive entry point for all functionalities, from launching servers to querying and benchmarking. This design aims to enhance usability and provide robust tools for managing and monitoring LMCache components, moving away from fragmented python3 -m commands.

Highlights

  • Unified CLI Introduction: Proposed a new lmcache command-line interface to consolidate various LMCache functionalities, replacing cumbersome python3 -m commands.
  • Command Design: Outlined detailed designs for several subcommands including server, describe, ping, query, bench, and kvcache, each serving specific operational or diagnostic purposes.
  • Prompt Corpora Support: Introduced support for built-in and custom prompt corpora, enhancing the query engine, bench engine, and query kvcache commands.
  • Architectural Principles: Detailed the underlying architecture, including auto-discovery of commands, a CommandRegistrar protocol, a unified send_request() helper, and argparse framework usage.
  • Phased Implementation Plan: Presented a three-phase plan for implementing the CLI, starting with core server and KV cache management commands, then expanding to engine interactions and benchmarking.


Changelog
  • docs/design/cli-design.md
    • Added a detailed design document outlining the proposed LMCache CLI structure, commands, and implementation notes.
Activity
  • Pull request opened, proposing the initial design for the LMCache CLI.

@gemini-code-assist (Bot) left a comment

Code Review

This pull request introduces a comprehensive design document for the LMCache CLI. The design is well-structured and covers a wide range of functionalities from server management to benchmarking. My feedback focuses on a few areas where the design could be clarified or improved for better usability and implementation, such as the --url flag's auto-detection logic, configuration for the HTTP server port, endpoint discovery for the describe command, and the implications of client-side tokenization.

Comment thread: docs/design/cli-design.md, lines +31 to +33

> All client commands use a unified `--url` flag:
> - KV cache targets: `--url localhost:5555` (ZMQ)
> - Engine targets: `--url http://localhost:8000` (HTTP)

Severity: medium

The auto-detection logic for the --url flag could be more explicit. The document states localhost:5555 implies ZMQ and http://localhost:8000 implies HTTP. This suggests the detection is based on the presence of the http:// scheme.

To improve clarity and consistency, I suggest two things:

  1. Explicitly document the detection rule (e.g., "if the URL starts with http:// or https://, it's treated as an HTTP target; otherwise, it's assumed to be a host:port for a ZMQ target").
  2. Consider supporting the tcp:// scheme for ZMQ targets (e.g., --url tcp://localhost:5555), which would align with the ZMQ endpoint format shown in the describe kvcache output.

Comment thread: docs/design/cli-design.md, lines +41 to +43

> Replaces `python3 -m lmcache.v1.multiprocess.http_server`. Runs in foreground,
> Ctrl-C to stop. HTTP frontend is enabled by default; use `--no-http` to run
> ZMQ-only.

Severity: medium

The design for lmcache server mentions that the HTTP frontend is enabled by default, and the describe kvcache example shows it running on http://localhost:8000. However, the lmcache server command example doesn't show an argument to configure this port.

The design should clarify how the HTTP frontend's port is determined. Is there a default value (e.g., 8000)? Can it be configured via a command-line argument (e.g., --http-port)? This detail seems to be missing from the add_http_frontend_args() composition.

Comment thread: docs/design/cli-design.md, lines +84 to +86

> `describe kvcache` gathers data from multiple ZMQ request types (`NOOP` for debug
> info, `GET_CHUNK_SIZE` for chunk size) and `/api/status` (HTTP) to build a
> consolidated view.

Severity: medium

The document states that lmcache describe kvcache gathers data from both ZMQ and HTTP (/api/status) endpoints to provide a consolidated view. However, the command example only accepts a single --url flag, which points to the ZMQ endpoint.

The design should specify how the CLI discovers the corresponding HTTP endpoint's address. A likely mechanism is that the ZMQ server provides the HTTP endpoint URL as part of its debug/status response. Explicitly stating this discovery mechanism would make the design clearer.

Comment thread: docs/design/cli-design.md, lines +141 to +143

> $ lmcache query kvcache --url localhost:5555 \
>     --prompt "{ffmpeg} What is the example usage of ffmpeg?" \
>     --model meta-llama/Llama-3.1-8B-Instruct

Severity: medium

The design for lmcache query kvcache states that the client will tokenize the --prompt to perform a cache lookup (lines 299-300). This implies that the CLI tool will have a dependency on a tokenizer library (like transformers) and may need to download model-specific tokenizer data.

This could make the CLI client quite "heavy" for what is often expected to be a lightweight tool.

Have you considered an alternative where the raw prompt is sent to the server, and the server (which already has the tokenizer and model context) performs the tokenization and lookup? This would keep the client thin and avoid versioning issues between the client's and server's tokenizers.

@maobaolong (Collaborator) left a comment

@KuntaiDu I like this design; this is much better than my naive CLI implementation.

@maobaolong enabled auto-merge (squash) March 12, 2026 07:23
@github-actions (Bot) added the `full` (Run comprehensive tests on this PR) label Mar 12, 2026
@chunxiaozheng (Collaborator) left a comment

Good idea! LGTM!

@maobaolong merged commit 40ac3e1 into LMCache:dev on Mar 12, 2026
24 of 25 checks passed
realAaronWu pushed a commit to realAaronWu/LMCache that referenced this pull request Mar 20, 2026
initial design of LMCache CLI

Signed-off-by: KuntaiDu <kuntai@uchicago.edu>
Signed-off-by: Aaron Wu <aaron.wu@dell.com>
jooho-XCENA pushed a commit to xcena-dev/LMCache that referenced this pull request Apr 2, 2026
initial design of LMCache CLI

Signed-off-by: KuntaiDu <kuntai@uchicago.edu>
