[V1] Add num_cached_tokens stats for request output #17519
simon-mo wants to merge 5 commits into vllm-project:main
Conversation
Signed-off-by: simon-mo <xmo@berkeley.edu>
…okens Signed-off-by: simon-mo <xmo@berkeley.edu>
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a small, essential subset of tests runs automatically. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. 🚀
One caveat is that …
WoosukKwon
left a comment
btw why do we want to have this feature?
self._all_token_ids: list[int] = self.prompt_token_ids.copy()
self.spec_token_ids: list[int] = []
self.num_computed_tokens = 0
self.num_cached_tokens = 0
I think this is confusing.
@simon-mo Please add a comment. I think people will be confused between num_computed_tokens and num_cached_tokens otw.
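The clarifying comment being requested might look like the sketch below. The attribute names come from the diff above; the comment wording and the Request class scaffolding are my assumptions, not the merged code:

```python
class Request:
    """Minimal sketch of vLLM V1's Request state (illustrative only)."""

    def __init__(self, prompt_token_ids: list[int]) -> None:
        self.prompt_token_ids = prompt_token_ids
        self._all_token_ids: list[int] = self.prompt_token_ids.copy()
        self.spec_token_ids: list[int] = []
        # Tokens whose KV-cache entries have been computed so far for this
        # request; grows as prefill and decoding proceed.
        self.num_computed_tokens = 0
        # Prompt tokens that were served from the prefix cache when the
        # request was scheduled; set once, does not grow during decoding.
        self.num_cached_tokens = 0
```

Distinguishing the two in comments matters because both counters start at 0 and both relate to the KV cache, yet one is a progress counter and the other a one-time cache-hit statistic.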
I think this exposes implementation details to the API, which is not recommended unless we have a clear use case.
This needs to be piped to the API as part of the protocol here.
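For illustration, a hedged sketch of how such a statistic can surface in an OpenAI-style usage payload. The class and field names are modeled on the OpenAI usage schema (prompt_tokens_details.cached_tokens), not necessarily vLLM's actual protocol classes, and the numbers are made up:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PromptTokensDetails:
    # Prompt tokens that hit the prefix cache.
    cached_tokens: int = 0


@dataclass
class UsageInfo:
    prompt_tokens: int = 0
    completion_tokens: int = 0
    total_tokens: int = 0
    prompt_tokens_details: Optional[PromptTokensDetails] = None


# A response where 96 of 128 prompt tokens were served from the cache.
usage = UsageInfo(
    prompt_tokens=128,
    completion_tokens=16,
    total_tokens=144,
    prompt_tokens_details=PromptTokensDetails(cached_tokens=96),
)
```

Because the OpenAI-compatible server already exposes a cached-token field like this, the per-request statistic has to flow from the scheduler through RequestOutput up to the protocol layer.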
WoosukKwon
left a comment
@simon-mo Oh I see, I didn't know that the prompt caching api had this.
This pull request has merge conflicts that must be resolved before it can be merged.
@simon-mo Also, I think we can initialize …
I didn't notice this fix. I also submitted a PR to address this issue. #18192 😅
Closing as superseded by #18149
V1 never supported this field in the request output, so output.num_cached_tokens from LLM.generate is always None. This PR adds support for it.
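The before/after behavior can be sketched with a toy stand-in for RequestOutput (the real class lives in vLLM; this minimal dataclass and the example value 96 are assumptions for illustration):

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RequestOutput:
    """Toy stand-in for vLLM's RequestOutput (illustrative only)."""

    # Prompt tokens served from the prefix cache, or None if unreported.
    num_cached_tokens: Optional[int] = None


# Before this PR, V1 never populated the field:
before = RequestOutput()

# After this PR, the scheduler's prefix-cache hit count is propagated:
after = RequestOutput(num_cached_tokens=96)
```

Callers that previously special-cased None on V1 can now read the field uniformly, e.g. to monitor prefix-cache effectiveness per request.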