[NPU] Fix Qwen3-AWQ npu_prefetch error #14884
cqchou wants to merge 12 commits into sgl-project:main
Summary of Changes

Hello @cqchou, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request refactors the prefetch cache mechanism within the Qwen3 model, specifically targeting NPU environments. It introduces a new helper method that encapsulates the logic for retrieving model weights for caching, so the lookup also works for other weight representations such as quantized weights, improving the efficiency and correctness of NPU operations.
Code Review
This pull request refactors the logic for getting the prefetch cache on NPU into a new method, `_get_prefetch_cache_for_npu`. This change improves code organization and readability. More importantly, the new implementation is more robust: it now correctly handles quantized models by checking for `qweight` in addition to `weight`, which resolves a potential issue on NPU platforms. The overall change is a good improvement. I have provided one minor suggestion to further refine the implementation of the new helper function.
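The fallback described in this review can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: the function name `get_prefetch_weight` and the stand-in layer objects are hypothetical, and the real helper in `qwen3.py` may differ in signature and in which module it inspects.

```python
from types import SimpleNamespace
from typing import Any, Optional


def get_prefetch_weight(layer: Any) -> Optional[Any]:
    """Return the tensor to prefetch for `layer`, or None if it has none.

    Hypothetical helper: mirrors the weight -> qweight -> None lookup order
    described in the PR, without depending on sglang or torch.
    """
    # Unquantized linear layers expose their parameters as `weight`.
    weight = getattr(layer, "weight", None)
    if weight is not None:
        return weight
    # AWQ-quantized layers store packed weights as `qweight` instead.
    # If that is missing too, getattr falls back to None.
    return getattr(layer, "qweight", None)


# Stand-in layers (assumptions for illustration only).
dense_layer = SimpleNamespace(weight="dense-weight")
awq_layer = SimpleNamespace(qweight="packed-awq-weight")
bare_layer = SimpleNamespace()

print(get_prefetch_weight(dense_layer))
print(get_prefetch_weight(awq_layer))
print(get_prefetch_weight(bare_layer))
```

The explicit `is not None` comparison is deliberate: with real tensors, a plain truthiness check on a multi-element tensor would raise an error rather than fall through to `qweight`.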
@ping1jing2 Hello, could you help review this PR? Thanks!
…/prefetch_cache
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
…/prefetch_cache
…into fix/prefetch_cache
/tag-and-rerun-ci
Motivation

AWQ uses `qweight` to represent the quantized weights. Therefore, when using an AWQ model, directly accessing the `weight` attribute, as the previous code did, results in an error.

Modifications

python/sglang/srt/models/qwen3.py

I modified the code to first check whether the `weight` attribute exists; if not, it then checks the `qweight` attribute; if neither exists, it returns `None`.

Accuracy Tests
start server
test result
Benchmarking and Profiling
few_shot_gsm8k
bench_one_batch , with server
bench_one_batch, without server
bench_serving
Checklist