[CLI] lmcache query engine subcommand by deng451e · Pull Request #2846 · LMCache/LMCache

deng451e · 2026-03-22T19:31:58Z

Summary

Add lmcache query engine: expand {name} placeholders via --documents, send one OpenAI-compatible request, and report a token breakdown plus latency metrics (including input tokens from usage).
Implement helpers:
- prompt.py: placeholder expansion + tokenizer-based token estimation
- request.py: streaming request + usage/latency metrics
Document in:
- lmcache/cli/documents/lmcache.txt: built-in text document for testing
- docs/design/cli/query-command.md
- docs/source/getting_started/cli.rst
- docs/source/developer_guide/cli.rst
Add tests: tests/cli/commands/test_query.py (args, mocked request, output, error handling)
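
As a rough illustration of the latency metrics mentioned in the summary, the sketch below times an arbitrary iterator of streamed text chunks. The function name `measure_stream`, the metric keys, and the injectable clock are hypothetical and are not taken from request.py; the real helper may measure things differently.

```python
import time
from typing import Callable, Iterable, Optional


def measure_stream(
    chunks: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> dict:
    """Consume streamed text chunks and collect simple latency metrics.

    Hypothetical helper: metric names are illustrative only.
    """
    start = clock()
    first_chunk_at: Optional[float] = None
    pieces = []
    for chunk in chunks:
        if not chunk:
            continue  # ignore empty deltas so they do not count as the first token
        if first_chunk_at is None:
            first_chunk_at = clock()  # time of the first non-empty chunk
        pieces.append(chunk)
    end = clock()
    return {
        "text": "".join(pieces),
        "num_chunks": len(pieces),
        "ttft_s": None if first_chunk_at is None else first_chunk_at - start,
        "total_s": end - start,
    }
```

Passing the clock as a parameter keeps the function testable without a live server: a test can feed a list of strings and a fake clock instead of a real HTTP stream.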

Test Plan

query engine
query kvcache (not implemented)

gemini-code-assist · 2026-03-22T19:33:11Z

Summary of Changes

Hello, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a new lmcache query command-line interface (CLI) subcommand for querying both the serving engine and KV cache worker. It provides a single-shot query interface with metrics output, supports prompt expansion using external files, and includes comprehensive documentation and unit tests.

Highlights

- New CLI Subcommand: Introduced the lmcache query engine subcommand, enabling prompt expansion using external files and reporting latency and token-level metrics.
- Documentation: Provided documentation for the new subcommand in docs/design/cli/query-command.md, docs/source/getting_started/cli.rst, and docs/source/developer_guide/cli.rst.
- Unit Tests: Added unit tests in tests/cli/commands/test_query.py, covering token-count helper contracts.

Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩

gemini-code-assist

Code Review

This pull request introduces a new lmcache query engine subcommand, providing functionality for single-shot inference queries with detailed performance metrics. The implementation is comprehensive, including prompt expansion, API endpoint fallback logic, and integration with the existing metrics framework. The changes are well-documented with a design document, updates to user guides, and tests for the main command class.

My review focuses on improving code quality, maintainability, and test coverage. I have identified a few areas for improvement: a missing docstring in violation of the style guide, an opportunity to refactor duplicated logic for better maintainability, and a minor grammatical correction in the design document. Most critically, the complex helper functions for token counting and prompt manipulation lack unit tests, which is a significant gap given their complexity and the project's testing standards. Addressing these points will enhance the robustness and long-term health of this new feature.

KuntaiDu · 2026-03-23T23:42:08Z

+_BUILTIN_CORPORA = {
+    "ffmpeg": (
+        "ffmpeg — multimedia framework. Example: ffmpeg -i in.mp4 "
+        "-c:v libx264 out.mp4\n"
+    ),
+}


Can we have a designated document folder for this?
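
One way to read this suggestion, sketched under assumptions: built-in documents live as plain text files in a single folder, and each file's stem becomes its placeholder name. The function name `discover_builtin_documents` and the `*.txt` convention are hypothetical, not taken from the PR.

```python
from pathlib import Path


def discover_builtin_documents(folder: Path) -> dict[str, Path]:
    """Map each *.txt file in `folder` to a placeholder name (the file stem).

    Hypothetical helper: returns an empty mapping when the folder is missing.
    """
    if not folder.is_dir():
        return {}
    # Sort for a deterministic order regardless of filesystem listing order.
    return {path.stem: path for path in sorted(folder.glob("*.txt"))}
```

With this layout, adding a new built-in corpus would only require dropping a file into the folder rather than editing a dictionary in code.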

KuntaiDu · 2026-03-23T23:42:40Z

+        misc.pop(0)
+
+
+def _stream(


It would be great if you could move the prompt-related logic into a separate file.

KuntaiDu · 2026-03-23T23:44:37Z

+    corpus_args: list[str],
+    model_id: str,
+) -> None:
+    metrics.add("prompt_tokens", "Prompt tokens", prompt_tokens)


When calculating prompt tokens, maybe let's use a default tokenizer (e.g. openai/gpt-oss-120b), and in the section title maybe we can write:
--- Prompt tokens (est by gpt-oss, w/o chat template) ----
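
A minimal sketch of what this estimate could look like, assuming the Hugging Face `transformers` package is available. `AutoTokenizer.from_pretrained` and `encode(..., add_special_tokens=False)` are real `transformers` calls; the function names `load_default_tokenizer`, `estimate_prompt_tokens`, and `section_title` are hypothetical and not taken from prompt.py.

```python
from typing import Any


def load_default_tokenizer(name: str = "openai/gpt-oss-120b") -> Any:
    """Load a Hugging Face tokenizer (needs `transformers` plus network or a local cache)."""
    from transformers import AutoTokenizer  # lazy import: optional dependency

    return AutoTokenizer.from_pretrained(name)


def estimate_prompt_tokens(prompt: str, tokenizer: Any) -> int:
    """Count tokens in the raw prompt text, without applying a chat template."""
    return len(tokenizer.encode(prompt, add_special_tokens=False))


def section_title(label: str = "gpt-oss") -> str:
    """Build a heading that makes clear the count is only an estimate."""
    return f"--- Prompt tokens (est by {label}, w/o chat template) ---"
```

Because the serving engine may use a different tokenizer and will apply its own chat template, this count is only an estimate; the input token count reported in the response usage remains the authoritative number.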

KuntaiDu

Otherwise LGTM

maobaolong

LGTM

maobaolong · 2026-03-25T03:34:11Z

@@ -0,0 +1,11 @@
+LMCache is a high-performance key–value (KV) cache management system designed to accelerate large language model (LLM) inference by efficiently storing, transferring, and reusing intermediate attention states. As modern LLM serving increasingly becomes bottlenecked by memory bandwidth, redundant computation, and cross-device communication, LMCache provides a system-level solution that decouples KV cache storage from the model execution pipeline and enables scalable, low-latency reuse across requests, processes, and even distributed nodes.


Do we really need to maintain this in lmcache code repository? @deng451e

This built-in test doc was suggested by Kuntai.

Basically, I found that a common subroutine in LMCache development is to compile a long request to trigger LMCache KV cache offloading. So I think it is good for us to have some long docs in LMCache, so that we can easily construct and send a long prompt and check whether the LLM response makes sense.

KuntaiDu · 2026-03-25T07:01:03Z

+# Single inference query
+$ lmcache query engine --url http://localhost:8000/v1 \
+  --prompt "{ctx} What is the example usage of lmcache?" \
+  --documents ctx=LMCache/lmcache/cli/documents/lmcache.txt  \


Just want to double check -- can we make sure that the lmcache document you put in can be directly referenced inside the prompt without supplying --documents parameter? For example

lmcache query engine --prompt "{lmcache} Summarize with lmcache" --url http://localhost:8000/v1

Ideally should just work.

Right now {lmcache} still needs a file path via --documents to work. I can add support for built-in docs so it works out of the box.
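
The behavior discussed here could be sketched as follows: placeholders resolve first against user-supplied NAME=PATH pairs, then fall back to a table of built-in documents. The names `parse_documents` and `expand_prompt`, the placeholder grammar, and the precedence rule are assumptions for illustration, not the PR's actual implementation.

```python
import re
from pathlib import Path

# Assumed placeholder grammar: {name} with an identifier-like name.
_PLACEHOLDER = re.compile(r"\{([A-Za-z_][A-Za-z0-9_]*)\}")


def parse_documents(pairs: list[str]) -> dict[str, Path]:
    """Turn ["ctx=path/to/file.txt", ...] into {"ctx": Path("path/to/file.txt")}."""
    docs: dict[str, Path] = {}
    for pair in pairs:
        name, sep, path = pair.partition("=")
        if not sep or not name or not path:
            raise ValueError(f"expected NAME=PATH, got {pair!r}")
        docs[name] = Path(path)
    return docs


def expand_prompt(
    prompt: str,
    user_docs: dict[str, Path],
    builtin_docs: dict[str, Path],
) -> str:
    """Replace each {name} with file contents; user documents override built-ins."""

    def substitute(match: re.Match) -> str:
        name = match.group(1)
        path = user_docs.get(name, builtin_docs.get(name))
        if path is None:
            raise KeyError(f"no document for placeholder {{{name}}}")
        return path.read_text(encoding="utf-8")

    return _PLACEHOLDER.sub(substitute, prompt)
```

Letting user-supplied documents win over built-ins keeps existing --documents invocations working unchanged if a built-in with the same name is added later.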

* Add query-command.md Signed-off-by: deng451e <838677410@qq.com>
* lmcache query CLI Command Design Signed-off-by: deng451e <838677410@qq.com>
* add cli query commnad Signed-off-by: deng451e <838677410@qq.com>
* split query module Signed-off-by: deng451e <838677410@qq.com>
* decompose query engine module Signed-off-by: deng451e <838677410@qq.com>
* cli lmcache query engine Signed-off-by: lmcache-ci-bot <lmcache-ci@example.com>
* Fix command syntax in CLI usage example Updated CLI documentation to reflect changes in command syntax. Signed-off-by: deng451e <57919305+deng451e@users.noreply.github.com>
* Update query command documentation for lmcache Signed-off-by: deng451e <57919305+deng451e@users.noreply.github.com>
---------
Signed-off-by: deng451e <838677410@qq.com>
Signed-off-by: lmcache-ci-bot <lmcache-ci@example.com>
Signed-off-by: deng451e <57919305+deng451e@users.noreply.github.com>
Co-authored-by: lmcache-ci-bot <lmcache-ci@example.com>

gemini-code-assist Bot reviewed Mar 22, 2026

Comment thread lmcache/cli/commands/query.py Outdated

Comment thread tests/cli/commands/test_query.py Outdated

Comment thread docs/design/cli/query-command.md

Comment thread lmcache/cli/commands/query.py Outdated

deng451e requested review from KuntaiDu and sammshen March 23, 2026 22:43

KuntaiDu reviewed Mar 23, 2026

deng451e changed the title from "[CLI] Implement lmcache query kvcache subcommand" to "[CLI] lmcache query engine subcommand" Mar 23, 2026

deng451e and others added 6 commits March 24, 2026 19:13

- 86152a1 Add query-command.md (Signed-off-by: deng451e <838677410@qq.com>)
- 0f5ffc8 lmcache query CLI Command Design (Signed-off-by: deng451e <838677410@qq.com>)
- e893173 add cli query commnad (Signed-off-by: deng451e <838677410@qq.com>)
- cc164c4 split query module (Signed-off-by: deng451e <838677410@qq.com>)
- ad55ba4 decompose query engine module (Signed-off-by: deng451e <838677410@qq.com>)
- b06b55d cli lmcache query engine (Signed-off-by: lmcache-ci-bot <lmcache-ci@example.com>)

deng451e force-pushed the cli_lmcache_query branch from 6ab1525 to b06b55d Compare March 24, 2026 19:22

- a9281b1 Merge branch 'dev' into cli_lmcache_query

KuntaiDu approved these changes Mar 24, 2026

Comment thread docs/source/getting_started/cli.rst Outdated

deng451e added 2 commits March 24, 2026 17:26

- b15ce2e Fix command syntax in CLI usage example. Updated CLI documentation to reflect changes in command syntax. (Signed-off-by: deng451e <57919305+deng451e@users.noreply.github.com>)
- 9e3cba3 Update query command documentation for lmcache (Signed-off-by: deng451e <57919305+deng451e@users.noreply.github.com>)

maobaolong approved these changes Mar 25, 2026

maobaolong reviewed Mar 25, 2026

sammshen enabled auto-merge (squash) March 25, 2026 06:49

github-actions Bot added the "full" label (Run comprehensive tests on this PR) Mar 25, 2026

KuntaiDu reviewed Mar 25, 2026

sammshen merged commit e2cc19b into LMCache:dev Mar 25, 2026
33 of 35 checks passed
