
[NPU] Fix Qwen3-AWQ npu_prefetch error#14884

Open
cqchou wants to merge 12 commits into sgl-project:main from cqchou:fix/prefetch_cache

Conversation


@cqchou commented Dec 11, 2025

Motivation

AWQ stores its quantized weights in a qweight attribute rather than a weight attribute. As a result, the previous code, which accessed the weight attribute directly, raised an error in the NPU prefetch path when running an AWQ model.

Modifications

python/sglang/srt/models/qwen3.py

I modified the code to first check whether the weight attribute exists; if not, it falls back to the qweight attribute; if neither exists, it returns None (see the sketch below).
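
For illustration, a minimal sketch of that check, written as a method on the model class. The name _get_prefetch_cache_for_npu comes from the review summary further down; the exact signature and the module it is applied to are assumptions, not necessarily the code in this PR.

from typing import Optional

import torch


def _get_prefetch_cache_for_npu(self, module: torch.nn.Module) -> Optional[torch.Tensor]:
    # Dense (unquantized) layers expose their parameters as `weight`,
    # while AWQ layers pack the quantized weights into `qweight`.
    # Try both attributes in that order; return None when the module
    # has neither, so callers can skip the prefetch.
    weight = getattr(module, "weight", None)
    if weight is not None:
        return weight
    return getattr(module, "qweight", None)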

Accuracy Tests

start server

python3 -m sglang.launch_server --model-path /root/.cache/Qwen3-14B-AWQ --device npu --attention-backend ascend --host 0.0.0.0 --port 30000 --trust-remote-code --mem-fraction-static 0.8 --disable-cuda-graph --tp-size 4 --quantization awq

test result

curl http://127.0.0.1:30000/v1/chat/completions -H "Content-Type: application/json" -d '{
  "temperature": 0.7,
  "max_tokens": 512,
  "top_p": 0.8,
  "frequency_penalty": 0,
  "presence_penalty": 0,
  "messages": [
    {
      "role": "system",
      "content": "You are Qwen"
    },
    {
      "role": "user",
      "content": "What is the capital of France?"
    }
  ]
}
'
{"id":"cb87ebbbf1e84aecbee93b8a67d1bcd1","object":"chat.completion","created":1766066085,"model":"default","choices":[{"index":0,"message":{"role":"assistant","content":"<think>\nOkay, the user is asking for the capital of France. I need to make sure I provide the correct answer. Let me recall... The capital of France is Paris. That's a well-known fact, but I should double-check to be certain. Paris is a major city in the Île-de-France region and is famous for landmarks like the Eiffel Tower and the Louvre. I don't think there's any confusion here. The user might be testing basic geography knowledge, or they might need it for a specific purpose liketravel or a school project. Either way, the answer is straightforward. I'll confirm that Paris is indeed the capital and maybe add a bit more context to be helpful. Let me make sure there are no other cities that could be considered capitals, but I'm pretty sure Paris is the only one. Alright, time to respond.\n</think>\n\nThecapital of France is **Paris**. It is a major global city known for its cultural landmarks, such as the Eiffel Tower, the Louvre Museum, and Notre-Dame Cathedral. Paris is also the political, economic, and cultural heart of France.","reasoning_content":null,"tool_calls":null},"logprobs":null,"finish_reason":"stop","matched_stop":151645}],"usage":{"prompt_tokens":24,"total_tokens":257,"completion_tokens":233,"prompt_tokens_details":null,"reasoning_tokens":0},"metadata":{"weight_version":"default"}}

Benchmarking and Profiling

few_shot_gsm8k

100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 200/200 [00:38<00:00,  5.21it/s]
Accuracy: 0.835
Invalid: 0.000
Latency: 38.644 s
Output throughput: 835.136 token/s

bench_one_batch, with server

python -m sglang.bench_one_batch_server --base-url http://127.0.0.1:30000 --model-path /root/.cache/Qwen3-14B-AWQ --batch-size 32 --input-len 256 --output-len 32

======== Warmup Begin ========
Warmup with batch_size=[32]
#Input tokens: 32768
#Output tokens: 512
batch size: 32
input_len: 1024
output_len: 16
latency: 4.34 s
input throughput: 9274.56 tok/s
output throughput: 637.13 tok/s
last_ttft: 3.53 s
last generation throughput: 157.03 tok/s
======== Warmup End   ========

#Input tokens: 8192
#Output tokens: 1024
batch size: 32
input_len: 256
output_len: 32
latency: 3.38 s
input throughput: 4610.07 tok/s
output throughput: 640.20 tok/s
last_ttft: 1.78 s
last generation throughput: 157.03 tok/s

Results are saved to result.jsonl

bench_one_batch, without server

python -m sglang.bench_one_batch --model-path /root/.cache/Qwen3-14B-AWQ --batch 32 --input-len 256 --output-len 32 --mem-fraction-static 0.8


[2025-12-18 13:59:32 TP0] Load weight end. type=Qwen3ForCausalLM, dtype=torch.float16, avail mem=50.15 GB, mem usage=10.42 GB.
[2025-12-18 13:59:32 TP0] Using KV cache dtype: torch.float16
[2025-12-18 13:59:32 TP0] KV Cache is allocated. #tokens: 249216, K size: 19.02 GB, V size: 19.02 GB
[2025-12-18 13:59:32 TP0] Memory pool end. avail mem=11.63 GB
[2025-12-18 13:59:32 TP0] Capture cuda graph begin. This can take up to several minutes. avail mem=11.63 GB
[2025-12-18 13:59:32 TP0] Capture cuda graph bs [1, 2, 4, 8, 12, 16, 24, 32, 40, 48, 56, 64]
Capturing batches (bs=64 avail_mem=11.62 GB):   0%|                                                                                                                                                                                                                                                               | 0/12 [00:00<?, ?it/s]/usr/local/python3.11.13/lib/python3.11/site-packages/torch_npu/utils/storage.py:88: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  if self.device.type != 'cpu':
Capturing batches (bs=1 avail_mem=11.14 GB): 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 12/12 [00:05<00:00,  2.17it/s]
[2025-12-18 13:59:39 TP0] Capture cuda graph end. Time elapsed: 6.85 s. mem usage=0.49 GB. avail mem=11.14 GB.
max_total_num_tokens=249216
Warmup ...
[rank0]:[W1218 13:59:40.648104640 compiler_depend.ts:117] Warning: Driver Version: 25.0.rc1.b050 is invalid or not supported yet. (function operator())
Prefill. latency: 1.05804 s, throughput:   7742.60 token/s
Decode 0. Batch size: 32, latency: 0.09417 s, throughput:    339.80 token/s
Decode 1. Batch size: 32, latency: 0.04829 s, throughput:    662.61 token/s
Decode 2. Batch size: 32, latency: 0.04782 s, throughput:    669.23 token/s
Decode 3. Batch size: 32, latency: 0.04780 s, throughput:    669.49 token/s
Decode 4. Batch size: 32, latency: 0.04765 s, throughput:    671.50 token/s
Decode.  median latency: 0.04748 s, median throughput:    673.98 token/s
Total. latency:  2.577 s, throughput:   3575.64 token/s
Benchmark ...
Prefill. latency: 0.94029 s, throughput:   8712.17 token/s
Decode 0. Batch size: 32, latency: 0.04835 s, throughput:    661.78 token/s
Decode 1. Batch size: 32, latency: 0.04851 s, throughput:    659.63 token/s
Decode 2. Batch size: 32, latency: 0.04723 s, throughput:    677.53 token/s
Decode 3. Batch size: 32, latency: 0.04710 s, throughput:    679.37 token/s
Decode 4. Batch size: 32, latency: 0.04709 s, throughput:    679.48 token/s
Decode.  median latency: 0.04700 s, median throughput:    680.81 token/s
Total. latency:  2.402 s, throughput:   3837.28 token/s
sys:1: DeprecationWarning: builtin type swigvarlink has no __module__ attribute

bench_serving

python3 -m sglang.bench_serving --backend sglang --num-prompt 10

#Input tokens: 1997
#Output tokens: 2798
Starting warmup with 1 sequences...
Warmup completed with 1 sequences. Starting main benchmark run...
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:28<00:00,  2.89s/it]

============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    inf
Max request concurrency:                 not set
Successful requests:                     10
Benchmark duration (s):                  28.88
Total input tokens:                      1997
Total input text tokens:                 1997
Total input vision tokens:               0
Total generated tokens:                  2798
Total generated tokens (retokenized):    2798
Request throughput (req/s):              0.35
Input token throughput (tok/s):          69.14
Output token throughput (tok/s):         96.87
Peak output token throughput (tok/s):    180.00
Peak concurrent requests:                10
Total token throughput (tok/s):          166.01
Concurrency:                             5.62
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   16226.11
Median E2E Latency (ms):                 18004.99
---------------Time to First Token----------------
Mean TTFT (ms):                          1614.04
Median TTFT (ms):                        1638.22
P99 TTFT (ms):                           1639.33
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          51.63
Median TPOT (ms):                        52.39
P99 TPOT (ms):                           54.24
---------------Inter-Token Latency----------------
Mean ITL (ms):                           52.41
Median ITL (ms):                         51.79
P95 ITL (ms):                            59.79
P99 ITL (ms):                            69.04
Max ITL (ms):                            289.93
==================================================


@gemini-code-assist
Contributor

Summary of Changes

Hello @cqchou, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request refactors the prefetch cache mechanism within the Qwen3 model, specifically targeting NPU environments. It introduces a new helper method to encapsulate the logic for retrieving model weights for caching, making the process more robust and adaptable to various weight representations like quantized weights, thereby improving the efficiency and correctness of NPU operations.

Highlights

  • Refactored Prefetch Cache Logic: The logic for determining prefetch cache weights for NPU environments has been extracted into a dedicated private method, _get_prefetch_cache_for_npu.
  • Flexible Weight Retrieval: The new _get_prefetch_cache_for_npu method now dynamically checks for both weight and qweight attributes, improving compatibility with different weight formats, such as quantized weights, for MLP layers.
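
A hypothetical call-site sketch of how such a helper guards the prefetch. The gate_up_proj attribute, the dependency tensor, and max_prefetch_size below are illustrative assumptions, and the torch_npu.npu_prefetch arguments shown are assumed rather than taken from this PR:

# Illustrative only: fetch whichever weight tensor is available
# (dense `weight` or AWQ `qweight`) and skip prefetching when the
# module exposes neither.
prefetch_weight = self._get_prefetch_cache_for_npu(self.mlp.gate_up_proj)
if prefetch_weight is not None:
    # The dependency tensor and size budget here are assumed; consult
    # the torch_npu documentation for npu_prefetch's exact signature.
    torch_npu.npu_prefetch(prefetch_weight, hidden_states, max_prefetch_size)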

@cqchou changed the title from "Fix/prefetch cache" to "Fix Qwen3-AWQ npu_prefetch error" on Dec 11, 2025
Contributor

@gemini-code-assist bot left a comment


Code Review

This pull request refactors the logic for getting the prefetch cache on NPU into a new method _get_prefetch_cache_for_npu. This change improves code organization and readability. More importantly, the new implementation is more robust as it now correctly handles quantized models by checking for qweight in addition to weight, which resolves a potential issue on NPU platforms. The overall change is a good improvement. I have provided one minor suggestion to further refine the implementation of the new helper function.

@cqchou changed the title from "Fix Qwen3-AWQ npu_prefetch error" to "[NPU] Fix Qwen3-AWQ npu_prefetch error" on Dec 11, 2025
@TheKonka
Contributor

@ping1jing2 Hello, can you help review this PR? Thanks!

@ping1jing2 self-assigned this Dec 13, 2025
@ping1jing2 added the npu label Dec 13, 2025
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
@ping1jing2
Collaborator

/tag-and-rerun-ci
