
[NPU] Fix Qwen3-AWQ npu_prefetch error#14884

Open
cqchou wants to merge 12 commits into sgl-project:main from cqchou:fix/prefetch_cache

Conversation


@cqchou commented Dec 11, 2025

Motivation

AWQ stores its quantized weights in a qweight attribute rather than a weight attribute. As a result, the previous code, which accessed the weight attribute directly, raised an error in the NPU prefetch path when running an AWQ model.

Modifications

python/sglang/srt/models/qwen3.py

I modified the code to first check whether the weight attribute exists; if not, it falls back to the qweight attribute; if neither exists, it returns None (see the sketch below).
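
For illustration, a minimal sketch of that check, written as a method on the model class. The name _get_prefetch_cache_for_npu comes from the review summary further down; the exact signature and the module it is applied to are assumptions, not necessarily the code in this PR.

from typing import Optional

import torch


def _get_prefetch_cache_for_npu(self, module: torch.nn.Module) -> Optional[torch.Tensor]:
    # Dense (unquantized) layers expose their parameters as `weight`,
    # while AWQ layers pack the quantized weights into `qweight`.
    # Try both attributes in that order; return None when the module
    # has neither, so callers can skip the prefetch.
    weight = getattr(module, "weight", None)
    if weight is not None:
        return weight
    return getattr(module, "qweight", None)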

Accuracy Tests

start server

python3 -m sglang.launch_server --model-path /root/.cache/Qwen3-14B-AWQ --device npu --attention-backend ascend --host 0.0.0.0 --port 30000 --trust-remote-code --mem-fraction-static 0.8 --disable-cuda-graph --tp-size 4 --quantization awq

test result

curl http://127.0.0.1:30000/v1/chat/completions -H "Content-Type: application/json" -d '{
  "temperature": 0.7,
  "max_tokens": 512,
  "top_p": 0.8,
  "frequency_penalty": 0,
  "presence_penalty": 0,
  "messages": [
    {
      "role": "system",
      "content": "You are Qwen"
    },
    {
      "role": "user",
      "content": "What is the capital of France?"
    }
  ]
}
'
{"id":"cb87ebbbf1e84aecbee93b8a67d1bcd1","object":"chat.completion","created":1766066085,"model":"default","choices":[{"index":0,"message":{"role":"assistant","content":"<think>\nOkay, the user is asking for the capital of France. I need to make sure I provide the correct answer. Let me recall... The capital of France is Paris. That's a well-known fact, but I should double-check to be certain. Paris is a major city in the Île-de-France region and is famous for landmarks like the Eiffel Tower and the Louvre. I don't think there's any confusion here. The user might be testing basic geography knowledge, or they might need it for a specific purpose liketravel or a school project. Either way, the answer is straightforward. I'll confirm that Paris is indeed the capital and maybe add a bit more context to be helpful. Let me make sure there are no other cities that could be considered capitals, but I'm pretty sure Paris is the only one. Alright, time to respond.\n</think>\n\nThecapital of France is **Paris**. It is a major global city known for its cultural landmarks, such as the Eiffel Tower, the Louvre Museum, and Notre-Dame Cathedral. Paris is also the political, economic, and cultural heart of France.","reasoning_content":null,"tool_calls":null},"logprobs":null,"finish_reason":"stop","matched_stop":151645}],"usage":{"prompt_tokens":24,"total_tokens":257,"completion_tokens":233,"prompt_tokens_details":null,"reasoning_tokens":0},"metadata":{"weight_version":"default"}}

Benchmarking and Profiling

few_shot_gsm8k

100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 200/200 [00:38<00:00,  5.21it/s]
Accuracy: 0.835
Invalid: 0.000
Latency: 38.644 s
Output throughput: 835.136 token/s

bench_one_batch, with server

python -m sglang.bench_one_batch_server --base-url http://127.0.0.1:30000 --model-path /root/.cache/Qwen3-14B-AWQ --batch-size 32 --input-len 256 --output-len 32

======== Warmup Begin ========
Warmup with batch_size=[32]
#Input tokens: 32768
#Output tokens: 512
batch size: 32
input_len: 1024
output_len: 16
latency: 4.34 s
input throughput: 9274.56 tok/s
output throughput: 637.13 tok/s
last_ttft: 3.53 s
last generation throughput: 157.03 tok/s
======== Warmup End   ========

#Input tokens: 8192
#Output tokens: 1024
batch size: 32
input_len: 256
output_len: 32
latency: 3.38 s
input throughput: 4610.07 tok/s
output throughput: 640.20 tok/s
last_ttft: 1.78 s
last generation throughput: 157.03 tok/s

Results are saved to result.jsonl

bench_one_batch, without server

python -m sglang.bench_one_batch --model-path /root/.cache/Qwen3-14B-AWQ --batch 32 --input-len 256 --output-len 32 --mem-fraction-static 0.8


[2025-12-18 13:59:32 TP0] Load weight end. type=Qwen3ForCausalLM, dtype=torch.float16, avail mem=50.15 GB, mem usage=10.42 GB.
[2025-12-18 13:59:32 TP0] Using KV cache dtype: torch.float16
[2025-12-18 13:59:32 TP0] KV Cache is allocated. #tokens: 249216, K size: 19.02 GB, V size: 19.02 GB
[2025-12-18 13:59:32 TP0] Memory pool end. avail mem=11.63 GB
[2025-12-18 13:59:32 TP0] Capture cuda graph begin. This can take up to several minutes. avail mem=11.63 GB
[2025-12-18 13:59:32 TP0] Capture cuda graph bs [1, 2, 4, 8, 12, 16, 24, 32, 40, 48, 56, 64]
Capturing batches (bs=64 avail_mem=11.62 GB):   0%|                                                                                                                                                                                                                                                               | 0/12 [00:00<?, ?it/s]/usr/local/python3.11.13/lib/python3.11/site-packages/torch_npu/utils/storage.py:88: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  if self.device.type != 'cpu':
Capturing batches (bs=1 avail_mem=11.14 GB): 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 12/12 [00:05<00:00,  2.17it/s]
[2025-12-18 13:59:39 TP0] Capture cuda graph end. Time elapsed: 6.85 s. mem usage=0.49 GB. avail mem=11.14 GB.
max_total_num_tokens=249216
Warmup ...
[rank0]:[W1218 13:59:40.648104640 compiler_depend.ts:117] Warning: Driver Version: 25.0.rc1.b050 is invalid or not supported yet. (function operator())
Prefill. latency: 1.05804 s, throughput:   7742.60 token/s
Decode 0. Batch size: 32, latency: 0.09417 s, throughput:    339.80 token/s
Decode 1. Batch size: 32, latency: 0.04829 s, throughput:    662.61 token/s
Decode 2. Batch size: 32, latency: 0.04782 s, throughput:    669.23 token/s
Decode 3. Batch size: 32, latency: 0.04780 s, throughput:    669.49 token/s
Decode 4. Batch size: 32, latency: 0.04765 s, throughput:    671.50 token/s
Decode.  median latency: 0.04748 s, median throughput:    673.98 token/s
Total. latency:  2.577 s, throughput:   3575.64 token/s
Benchmark ...
Prefill. latency: 0.94029 s, throughput:   8712.17 token/s
Decode 0. Batch size: 32, latency: 0.04835 s, throughput:    661.78 token/s
Decode 1. Batch size: 32, latency: 0.04851 s, throughput:    659.63 token/s
Decode 2. Batch size: 32, latency: 0.04723 s, throughput:    677.53 token/s
Decode 3. Batch size: 32, latency: 0.04710 s, throughput:    679.37 token/s
Decode 4. Batch size: 32, latency: 0.04709 s, throughput:    679.48 token/s
Decode.  median latency: 0.04700 s, median throughput:    680.81 token/s
Total. latency:  2.402 s, throughput:   3837.28 token/s
sys:1: DeprecationWarning: builtin type swigvarlink has no __module__ attribute

bench_serving

python3 -m sglang.bench_serving --backend sglang --num-prompt 10

#Input tokens: 1997
#Output tokens: 2798
Starting warmup with 1 sequences...
Warmup completed with 1 sequences. Starting main benchmark run...
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:28<00:00,  2.89s/it]

============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    inf
Max request concurrency:                 not set
Successful requests:                     10
Benchmark duration (s):                  28.88
Total input tokens:                      1997
Total input text tokens:                 1997
Total input vision tokens:               0
Total generated tokens:                  2798
Total generated tokens (retokenized):    2798
Request throughput (req/s):              0.35
Input token throughput (tok/s):          69.14
Output token throughput (tok/s):         96.87
Peak output token throughput (tok/s):    180.00
Peak concurrent requests:                10
Total token throughput (tok/s):          166.01
Concurrency:                             5.62
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   16226.11
Median E2E Latency (ms):                 18004.99
---------------Time to First Token----------------
Mean TTFT (ms):                          1614.04
Median TTFT (ms):                        1638.22
P99 TTFT (ms):                           1639.33
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          51.63
Median TPOT (ms):                        52.39
P99 TPOT (ms):                           54.24
---------------Inter-Token Latency----------------
Mean ITL (ms):                           52.41
Median ITL (ms):                         51.79
P95 ITL (ms):                            59.79
P99 ITL (ms):                            69.04
Max ITL (ms):                            289.93
==================================================


@gemini-code-assist
Contributor

Summary of Changes

Hello @cqchou, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request refactors the prefetch cache mechanism within the Qwen3 model, specifically targeting NPU environments. It introduces a new helper method to encapsulate the logic for retrieving model weights for caching, making the process more robust and adaptable to various weight representations like quantized weights, thereby improving the efficiency and correctness of NPU operations.

Highlights

  • Refactored Prefetch Cache Logic: The logic for determining prefetch cache weights for NPU environments has been extracted into a dedicated private method, _get_prefetch_cache_for_npu.
  • Flexible Weight Retrieval: The new _get_prefetch_cache_for_npu method now dynamically checks for both weight and qweight attributes, improving compatibility with different weight formats, such as quantized weights, for MLP layers.
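
A hypothetical call-site sketch of how such a helper guards the prefetch. The gate_up_proj attribute, the dependency tensor, and max_prefetch_size below are illustrative assumptions, and the torch_npu.npu_prefetch arguments shown are assumed rather than taken from this PR:

# Illustrative only: fetch whichever weight tensor is available
# (dense `weight` or AWQ `qweight`) and skip prefetching when the
# module exposes neither.
prefetch_weight = self._get_prefetch_cache_for_npu(self.mlp.gate_up_proj)
if prefetch_weight is not None:
    # The dependency tensor and size budget here are assumed; consult
    # the torch_npu documentation for npu_prefetch's exact signature.
    torch_npu.npu_prefetch(prefetch_weight, hidden_states, max_prefetch_size)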

@cqchou changed the title from "Fix/prefetch cache" to "Fix Qwen3-AWQ npu_prefetch error" on Dec 11, 2025
Contributor

@gemini-code-assist bot left a comment


Code Review

This pull request refactors the logic for getting the prefetch cache on NPU into a new method _get_prefetch_cache_for_npu. This change improves code organization and readability. More importantly, the new implementation is more robust as it now correctly handles quantized models by checking for qweight in addition to weight, which resolves a potential issue on NPU platforms. The overall change is a good improvement. I have provided one minor suggestion to further refine the implementation of the new helper function.

@cqchou changed the title from "Fix Qwen3-AWQ npu_prefetch error" to "[NPU] Fix Qwen3-AWQ npu_prefetch error" on Dec 11, 2025
@TheKonka
Contributor

@ping1jing2 Hello, can you help review this PR? Thanks!

@ping1jing2 self-assigned this Dec 13, 2025
@ping1jing2 added the npu label Dec 13, 2025
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
@ping1jing2
Collaborator

/tag-and-rerun-ci
