[V1] Add num_cached_tokens stats for request output #17519
simon-mo wants to merge 5 commits into vllm-project:main
Conversation
Signed-off-by: simon-mo <xmo@berkeley.edu>
…okens Signed-off-by: simon-mo <xmo@berkeley.edu>
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a small, essential subset of tests runs automatically. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. 🚀
One caveat is that …
WoosukKwon
left a comment
btw why do we want to have this feature?
self._all_token_ids: list[int] = self.prompt_token_ids.copy()
self.spec_token_ids: list[int] = []
self.num_computed_tokens = 0
self.num_cached_tokens = 0
I think this is confusing.
@simon-mo Please add a comment. I think people will be confused between num_computed_tokens and num_cached_tokens otw.
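The clarifying comment being requested might look like the sketch below. The attribute names come from the diff above; the comment wording and the Request class scaffolding are my assumptions, not the merged code:

```python
class Request:
    """Minimal sketch of vLLM V1's Request state (illustrative only)."""

    def __init__(self, prompt_token_ids: list[int]) -> None:
        self.prompt_token_ids = prompt_token_ids
        self._all_token_ids: list[int] = self.prompt_token_ids.copy()
        self.spec_token_ids: list[int] = []
        # Tokens whose KV-cache entries have been computed so far for this
        # request; grows as prefill and decoding proceed.
        self.num_computed_tokens = 0
        # Prompt tokens that were served from the prefix cache when the
        # request was scheduled; set once, does not grow during decoding.
        self.num_cached_tokens = 0
```

Distinguishing the two in comments matters because both counters start at 0 and both relate to the KV cache, yet one is a progress counter and the other a one-time cache-hit statistic.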
I think this exposes implementation details to the API, which is not recommended unless we have a clear use case.
This needs to be piped to the API as part of the protocol here.
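For illustration, a hedged sketch of how such a statistic can surface in an OpenAI-style usage payload. The class and field names are modeled on the OpenAI usage schema (prompt_tokens_details.cached_tokens), not necessarily vLLM's actual protocol classes, and the numbers are made up:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PromptTokensDetails:
    # Prompt tokens that hit the prefix cache.
    cached_tokens: int = 0


@dataclass
class UsageInfo:
    prompt_tokens: int = 0
    completion_tokens: int = 0
    total_tokens: int = 0
    prompt_tokens_details: Optional[PromptTokensDetails] = None


# A response where 96 of 128 prompt tokens were served from the cache.
usage = UsageInfo(
    prompt_tokens=128,
    completion_tokens=16,
    total_tokens=144,
    prompt_tokens_details=PromptTokensDetails(cached_tokens=96),
)
```

Because the OpenAI-compatible server already exposes a cached-token field like this, the per-request statistic has to flow from the scheduler through RequestOutput up to the protocol layer.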
WoosukKwon
left a comment
@simon-mo Oh I see, I didn't know that the prompt caching api had this.
This pull request has merge conflicts that must be resolved before it can be merged.
@simon-mo Also, I think we can initialize …
I didn't notice this fix. I also submitted a PR to address this issue. #18192 😅
Closing as superseded by #18149
V1 never supported this field in the request output, so output.num_cached_tokens from LLM.generate is always None. This PR adds support for it.
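The before/after behavior can be sketched with a toy stand-in for RequestOutput (the real class lives in vLLM; this minimal dataclass and the example value 96 are assumptions for illustration):

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RequestOutput:
    """Toy stand-in for vLLM's RequestOutput (illustrative only)."""

    # Prompt tokens served from the prefix cache, or None if unreported.
    num_cached_tokens: Optional[int] = None


# Before this PR, V1 never populated the field:
before = RequestOutput()

# After this PR, the scheduler's prefix-cache hit count is propagated:
after = RequestOutput(num_cached_tokens=96)
```

Callers that previously special-cased None on V1 can now read the field uniformly, e.g. to monitor prefix-cache effectiveness per request.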