[Observability]: Fix vllm cached and prompt tokens #2576
kobe0938 merged 3 commits into LMCache:dev from
Conversation
Signed-off-by: Samuel Shen <slshen@uchciago.edu>
Summary of Changes

Hello @sammshen, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request resolves an observability gap in the vLLM integration where metrics for cached and prompt tokens were only being captured for requests explicitly marked for loading. The changes ensure that these crucial performance metrics are consistently reported for all requests, significantly enhancing the accuracy and completeness of monitoring data within the system.
Code Review
This pull request fixes an observability issue in the vLLM integration, ensuring that cached and prompt tokens are correctly accounted for even when can_load=False. No vulnerabilities or critical issues were found, and the changes appear to be well-implemented and adhere to best practices. The fix correctly preserves the load_spec and updates metric collection logic. A suggestion has been made to refactor a section of the code to improve efficiency by consolidating two loops into a single pass.
```diff
 for idx, request in enumerate(metadata.requests):
-    if request.load_spec is None:
+    if request.load_spec is None or not request.load_spec.can_load:
         continue
     last_idx = idx
```

```diff
 for idx, request in enumerate(metadata.requests):
-    if request.load_spec is None:
+    # Update metrics for all requests that have a load_spec
+    if request.load_spec is not None:
+        self._stats_monitor.update_interval_vllm_hit_tokens(
+            request.load_spec.vllm_cached_tokens
+        )
+        self._stats_monitor.update_interval_prompt_tokens(
+            len(request.token_ids)
+        )
+
+    if request.load_spec is None or not request.load_spec.can_load:
         continue
```
While the logic is correct, iterating over metadata.requests twice is inefficient. The first loop finds last_idx, and the second loop updates metrics and then processes loadable requests. This can be optimized into a single pass over metadata.requests to collect loadable requests and update metrics, followed by a loop over only the loadable requests. This would improve performance, especially when metadata.requests is large.
Suggested change (replacing the two loops above with a single pass):

```python
loadable_requests = []
for idx, request in enumerate(metadata.requests):
    # Update metrics for all requests that have a load_spec
    if request.load_spec is not None:
        self._stats_monitor.update_interval_vllm_hit_tokens(
            request.load_spec.vllm_cached_tokens
        )
        self._stats_monitor.update_interval_prompt_tokens(
            len(request.token_ids)
        )
    if request.load_spec and request.load_spec.can_load:
        loadable_requests.append((idx, request))

if not loadable_requests:
    return

last_idx = loadable_requests[-1][0]

for idx, request in loadable_requests:
```
ApostaC left a comment:
LGTM! Do we want to add the Prometheus metrics curl request to the comprehensive test so that this behavior is covered?
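If such a check is added, a minimal sketch of what it could look like is shown below. It assumes a Prometheus-style text exposition at a `/metrics` endpoint; the URL and the metric names (`lmcache_prompt_tokens`, `lmcache_vllm_hit_tokens`) are placeholders, not necessarily the names LMCache actually exports:

```python
import urllib.request

# Hypothetical endpoint and metric names -- substitute whatever the
# LMCache/vLLM deployment under test actually exposes.
METRICS_URL = "http://localhost:8000/metrics"


def scrape_metric(name: str) -> float:
    """Fetch the Prometheus text exposition and return the first sample of `name`."""
    body = urllib.request.urlopen(METRICS_URL).read().decode()
    for line in body.splitlines():
        if line.startswith(name):
            # Exposition format: "<name>{labels} <value>"
            return float(line.rsplit(" ", 1)[-1])
    raise AssertionError(f"metric {name!r} not found at {METRICS_URL}")


# After sending prompts that are fully covered by vLLM's prefix cache
# (i.e. can_load=False), both counters should still have been updated.
assert scrape_metric("lmcache_prompt_tokens") > 0
assert scrape_metric("lmcache_vllm_hit_tokens") > 0
```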
* add hacky fix

Signed-off-by: Samuel Shen <slshen@uchciago.edu>

* more elegant fix

Signed-off-by: Samuel Shen <slshen@uchciago.edu>

---------

Signed-off-by: Samuel Shen <slshen@uchciago.edu>
Co-authored-by: Samuel Shen <slshen@uchciago.edu>
FIX #2574
vLLM cached tokens and prompt tokens were previously only recorded for requests with can_load=True (i.e., LMCache cached tokens >= vLLM cached tokens). With vLLM prefix caching enabled, requests whose prompts are already covered by vLLM's own cache end up with can_load=False, so their hit-token and prompt-token metrics were dropped.
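For illustration, a simplified sketch of the condition described above; the `lmcache_cached_tokens` field and the exact `can_load` computation are assumptions based on this description, not the actual LMCache implementation:

```python
from dataclasses import dataclass


@dataclass
class LoadSpec:
    vllm_cached_tokens: int     # tokens already covered by vLLM's prefix cache
    lmcache_cached_tokens: int  # tokens LMCache can serve (assumed field name)

    @property
    def can_load(self) -> bool:
        # Loading from LMCache only helps when it covers at least as many
        # tokens as vLLM's own prefix cache already does.
        return self.lmcache_cached_tokens >= self.vllm_cached_tokens


# With vLLM prefix caching warm, vLLM may already cover most of the prompt:
spec = LoadSpec(vllm_cached_tokens=512, lmcache_cached_tokens=256)
assert not spec.can_load

# Old behavior: hit/prompt token metrics were only updated when can_load was
# True, so this request contributed nothing to the stats.
# New behavior: the metrics are updated for every request with a load_spec,
# regardless of can_load, so spec.vllm_cached_tokens is still reported.
```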
Logs: