[NPU] support GPTQ quantization on npu #15203
Conversation
Summary of Changes: Hello @22dimensions, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request enhances SGLang's capabilities by integrating GPTQ quantization support for Ascend NPU hardware. It introduces a specialized linear method (GPTQLinearAscendMethod).
Code Review
This pull request introduces GPTQ quantization support for Ascend NPUs, which is a great addition for improving performance on this hardware. The changes include a new GPTQLinearAscendMethod, NPU-specific kernels, and a test case. My review focuses on correctness and maintainability. I've found a couple of critical issues, including a syntax error and a bug in tensor reshaping that could lead to runtime errors. I've also included some suggestions to improve code clarity and test portability. Overall, this is a solid contribution, and after addressing the critical issues, it should be in good shape.
Hi! The accuracy of the int4 model looks very low. Have you tested it on the GPU, and does it show the same accuracy?
No, I will paste the GPU precision later; maybe 4-bit quantization is a little too low for Qwen3-1.7B.
Could you please provide more background for "Fix qwen3 cache error in quantization case on npu"? For example, showing the status before and after the fix would be sufficient.
It might be worth running a test on another dataset or model; from my point of view the accuracy looks strange. It should be lower than int8, but is it really that low? Especially when activations are not quantized.
Sorry for the unclear description. I found that PR #14884 fixes the same issue I encountered; I think I can update my branch after it is merged.
I just tested the GLM-4-9B-0414 series models; here are the results:

```
python3 -m sglang.launch_server --model-path ZhipuAI/GLM-4-9B-0414 --device npu --attention-backend ascend --port 30000 --mem-fraction-static 0.8
python ./python/sglang/test/few_shot_gsm8k.py
Accuracy: 0.790
Invalid: 0.000
Latency: 40.490 s
Output throughput: 543.412 token/s
```

```
SGLANG_USE_MODELSCOPE=true python3 -m sglang.launch_server --model-path tclf90/glm-4-9b-0414-gptq-int4 --device npu --attention-backend ascend --port 30000 --mem-fraction-static 0.8 --quantization gptq
python ./python/sglang/test/few_shot_gsm8k.py
Accuracy: 0.750
Invalid: 0.000
Latency: 27.975 s
Output throughput: 1000.675 token/s
```

This data looks reasonable.
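As a side note for anyone reproducing these numbers: once the server is up, the output can be sanity-checked through SGLang's OpenAI-compatible endpoint. A minimal sketch follows; the port comes from the commands above, while the prompt and the `"default"` model name are just illustrative values, not something this PR defines:

```python
# Quick sanity check against a server launched as above; assumes
# SGLang's OpenAI-compatible /v1/completions endpoint on port 30000.
import requests

resp = requests.post(
    "http://localhost:30000/v1/completions",
    json={
        "model": "default",  # model name as commonly shown in SGLang examples; adjust if needed
        "prompt": "Question: What is 15 * 7? Answer:",
        "max_tokens": 32,
        "temperature": 0,
    },
    timeout=60,
)
print(resp.json()["choices"][0]["text"])
```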
cc: @ping1jing2
cc: @iforgetmyname
/rerun-failed-ci |
Motivation
This PR follows #15202 and the roadmap of NPU support (#13664).
Modifications
- Add a `GPTQLinearAscendMethod` class
- Convert `qweight` and `qzeros` to a supported dtype
- Use the `torch_npu.npu_weight_quant_batchmatmul` kernel in the linear layer's forward method
- Use `qweight` to represent the quantized weight, not `weight`; `weight` is handled in `process_weights_after_loading`

(A hedged sketch of these pieces follows below.)
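For readers unfamiliar with the Ascend path, here is a minimal, non-authoritative sketch of how these pieces could fit together. It assumes the `torch_npu.npu_weight_quant_batchmatmul` signature described in the Ascend docs; the class name, the `unpack_gptq_int4` helper, the assumed packing layout, and the layer attribute names are illustrative, not the PR's actual code:

```python
# Hedged sketch only -- not the PR's actual implementation.
import torch
import torch_npu


def unpack_gptq_int4(packed: torch.Tensor) -> torch.Tensor:
    """Unpack int32 tensors holding eight 4-bit values each into int8.

    Assumes GPTQ's usual little-endian nibble packing along the last dim.
    """
    shifts = torch.arange(0, 32, 4, dtype=torch.int32, device=packed.device)
    nibbles = (packed.unsqueeze(-1) >> shifts) & 0xF
    return nibbles.flatten(-2).to(torch.int8)


class GPTQLinearAscendSketch:
    """Illustrative stand-in for the PR's GPTQLinearAscendMethod."""

    def process_weights_after_loading(self, layer: torch.nn.Module) -> None:
        # Convert GPTQ's packed int32 qweight/qzeros into a dtype the NPU
        # kernel accepts, replacing the packed parameters in place.
        layer.qweight = torch.nn.Parameter(
            unpack_gptq_int4(layer.qweight.data), requires_grad=False
        )
        layer.qzeros = torch.nn.Parameter(
            unpack_gptq_int4(layer.qzeros.data), requires_grad=False
        )

    def apply(self, layer: torch.nn.Module, x: torch.Tensor) -> torch.Tensor:
        # The NPU kernel fuses dequantization ("antiquant") with the matmul,
        # so the full-precision weight never needs to be materialized.
        return torch_npu.npu_weight_quant_batchmatmul(
            x,
            layer.qweight,
            antiquant_scale=layer.scales,
            antiquant_offset=layer.qzeros,
        )
```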
Accuracy Tests and Benchmarking and Profiling
Same model but with different data types, to see the difference:
Checklist