[NPU] Support GGUF quantization for Ascend NPU (dense + MoE) #17883
sglang-npu-bot merged 20 commits into sgl-project:main from …
Conversation
Summary of Changes

Hello @TheKonka, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request integrates GGUF quantization support for models running on Ascend NPUs. It introduces specialized methods for handling GGUF weights, including pre-dequantization and optimized MoE layer operations, to ensure efficient and accurate inference on Ascend hardware. The changes also include updates to the GGUF model loader to correctly interpret and process MoE expert weights.
Code Review
This pull request adds support for GGUF quantization on Ascend NPUs. The overall approach of pre-dequantizing weights during model loading and then using NPU-accelerated kernels for inference is a solid strategy, trading increased memory usage for better performance. The changes are well-structured, introducing NPU-specific methods for linear, embedding, and MoE layers, and adapting the GGUF weight loader for MoE models. I've identified a few areas where duplicated code can be refactored to enhance maintainability. Otherwise, the implementation looks good.
| if "w13" in name: | ||
| # w13 is gate+up fused | ||
| weight_list = [] | ||
| for e in range(num_experts): | ||
| if e in expert_weights: | ||
| w1 = expert_weights[e].get("w1") | ||
| w3 = expert_weights[e].get("w3") | ||
|
|
||
| if w1 is not None and w3 is not None: | ||
| fused = torch.cat([w1, w3], dim=0) | ||
| weight_list.append(fused) | ||
|
|
||
| if weight_list: | ||
| stacked = torch.stack(weight_list, dim=0) | ||
| param.materialize(stacked.shape, dtype=stacked.dtype) | ||
| param.data.copy_(stacked) | ||
| elif "w2" in name: | ||
| # w2 is down projection | ||
| weight_list = [] | ||
| for e in range(num_experts): | ||
| if e in expert_weights and "w2" in expert_weights[e]: | ||
| w2_weight = expert_weights[e]["w2"] | ||
| weight_list.append(w2_weight) | ||
|
|
||
| if weight_list: | ||
| stacked = torch.stack(weight_list, dim=0) | ||
| param.materialize(stacked.shape, dtype=stacked.dtype) | ||
| param.data.copy_(stacked) |
There's significant code duplication in how w13 and w2 weights are materialized. The logic for collecting weights into a list, stacking them, and then materializing the parameter is nearly identical in both the `if "w13" in name:` and `elif "w2" in name:` blocks. This could be refactored into a helper function to improve code clarity and maintainability.
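One possible shape for that helper (a sketch only, not from the PR; `_stack_expert_weights` and its `extract` callback are hypothetical names, while `param.materialize` follows the loader code above):

```python
import torch


def _stack_expert_weights(param, expert_weights, num_experts, extract):
    """Stack per-expert tensors and materialize them into `param`.

    `extract` maps one expert's weight dict to the tensor to stack,
    returning None when that expert should be skipped.
    """
    weight_list = []
    for e in range(num_experts):
        if e in expert_weights:
            w = extract(expert_weights[e])
            if w is not None:
                weight_list.append(w)
    if weight_list:
        stacked = torch.stack(weight_list, dim=0)
        param.materialize(stacked.shape, dtype=stacked.dtype)
        param.data.copy_(stacked)
```

Both branches would then reduce to a single call each, e.g. `_stack_expert_weights(param, expert_weights, num_experts, lambda w: w.get("w2"))` for the down projection, with a `torch.cat`-based callback for the fused gate+up case.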
```python
# Pre-dequantize w13 weights (gate+up projections)
if w13_qtype not in UNQUANTIZED_TYPES:
    num_experts = w13_qweight.shape[0]
    w13_dequant_list = []

    block_size, type_size = gguf.GGML_QUANT_SIZES[w13_qtype]

    for e in range(num_experts):
        qweight_cpu = w13_qweight[e].cpu().numpy()
        rows = w13_qweight[e].shape[0]
        cols = w13_qweight[e].shape[1] // type_size * block_size

        dequant_np = gguf_dequantize(qweight_cpu.flatten(), w13_qtype)
        dequant = (
            torch.from_numpy(dequant_np)
            .to(dtype=self.params_dtype, device=w13_qweight.device)
            .reshape(rows, cols)
            .transpose(-1, -2)
            .contiguous()
        )
        w13_dequant_list.append(dequant)

    w13_full = torch.stack(w13_dequant_list, dim=0)

    layer.register_buffer("w13_dequant", w13_full, persistent=False)
else:
    layer.register_buffer("w13_dequant", w13_qweight.data, persistent=False)

# Pre-dequantize w2 weights (down projection)
w2_qweight = layer.w2_qweight
w2_qtype = layer.w2_qweight_type.weight_type

if w2_qtype not in UNQUANTIZED_TYPES:
    num_experts = w2_qweight.shape[0]
    w2_dequant_list = []

    block_size, type_size = gguf.GGML_QUANT_SIZES[w2_qtype]

    for e in range(num_experts):
        qweight_cpu = w2_qweight[e].cpu().numpy()
        rows = w2_qweight[e].shape[0]
        cols = w2_qweight[e].shape[1] // type_size * block_size

        dequant_np = gguf_dequantize(qweight_cpu.flatten(), w2_qtype)
        dequant = (
            torch.from_numpy(dequant_np)
            .to(dtype=self.params_dtype, device=w2_qweight.device)
            .reshape(rows, cols)
            .transpose(-1, -2)
            .contiguous()
        )
        w2_dequant_list.append(dequant)

    w2_full = torch.stack(w2_dequant_list, dim=0)

    layer.register_buffer("w2_dequant", w2_full, persistent=False)
else:
    layer.register_buffer("w2_dequant", w2_qweight.data, persistent=False)
```
The logic for pre-dequantizing `w13` and `w2` weights is very similar and largely duplicated. This can be refactored into a private helper method to reduce redundancy and improve maintainability. For example, a method like `_dequantize_expert_weights(self, qweight, qtype)` could encapsulate the common logic for both.
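Following that suggestion, the common path could collapse into something like the following (a sketch only; `_dequantize_expert_weights` is the name proposed above, and `UNQUANTIZED_TYPES`, `gguf_dequantize`, `gguf.GGML_QUANT_SIZES`, and `self.params_dtype` are taken from the surrounding code):

```python
def _dequantize_expert_weights(self, qweight, qtype):
    """Dequantize a stacked [num_experts, rows, cols] GGUF tensor,
    returning transposed, contiguous weights in params_dtype."""
    if qtype in UNQUANTIZED_TYPES:
        return qweight.data
    block_size, type_size = gguf.GGML_QUANT_SIZES[qtype]
    dequant_list = []
    for e in range(qweight.shape[0]):
        rows = qweight[e].shape[0]
        cols = qweight[e].shape[1] // type_size * block_size
        dequant_np = gguf_dequantize(qweight[e].cpu().numpy().flatten(), qtype)
        dequant_list.append(
            torch.from_numpy(dequant_np)
            .to(dtype=self.params_dtype, device=qweight.device)
            .reshape(rows, cols)
            .transpose(-1, -2)
            .contiguous()
        )
    return torch.stack(dequant_list, dim=0)
```

Each call site then becomes a one-liner, e.g. `layer.register_buffer("w13_dequant", self._dequantize_expert_weights(layer.w13_qweight, w13_qtype), persistent=False)`.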
```python
if is_moe_weight:
    # MoE weights need special handling - extract layer_id and weight type
    # Format: blk.{layer_id}.ffn_gate_exps.weight
    import re

    match = re.match(r"blk\.(\d+)\.(ffn_\w+_exps)\.weight", tensor_name)
    if match:
        layer_id = int(match.group(1))
        weight_pattern = match.group(2)
        hf_weight_name = MOE_WEIGHT_PATTERNS.get(weight_pattern)
```
There are a couple of improvements that can be made here and in the second loop for handling MoE weights:

- The `import re` statement is inside the loop (here and on line 1012). It should be moved to the top of the `gguf_quant_weights_iterator` function.
- The logic for parsing the MoE tensor name using a regular expression is duplicated in both loops. This could be extracted into a small helper function to improve maintainability and reduce redundancy.
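Applied together, the two suggestions could look roughly like this (a sketch only; `_parse_moe_tensor_name` and `_MOE_TENSOR_RE` are hypothetical names, while `MOE_WEIGHT_PATTERNS` comes from the diff above):

```python
import re

_MOE_TENSOR_RE = re.compile(r"blk\.(\d+)\.(ffn_\w+_exps)\.weight")


def _parse_moe_tensor_name(tensor_name):
    """Return (layer_id, hf_weight_name) for a GGUF MoE expert tensor,
    or None when the name does not match the expected pattern."""
    match = _MOE_TENSOR_RE.match(tensor_name)
    if match is None:
        return None
    layer_id = int(match.group(1))
    return layer_id, MOE_WEIGHT_PATTERNS.get(match.group(2))
```

Both loops in `gguf_quant_weights_iterator` could then share this helper, with `re` imported once at module level.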
@ping1jing2 @iforgetmyname Hello, can you check this PR? Thanks!

ok

@OrangeRedeng @TamirBaydasov please review this PR, thanks

Hi! Could you please add a GGUF test to CI? We are planning to refactor the whole quantization folder at some point, so quantization tests will help a lot in preserving functionality going forward.

…t/npu_gguf (Conflicts: python/sglang/srt/layers/quantization/gguf.py, test/srt/run_suite.py)

/tag-and-rerun-ci

/rerun-failed-ci

/rerun-failed-ci

/rerun-failed-ci

/rerun-failed-ci

/rerun-failed-ci

Hi! Could you please update the documentation to include information about GGUF on NPU? https://github.com/sgl-project/sglang/blob/main/docs/advanced_features/quantization.md and https://github.com/sgl-project/sglang/blob/main/docs/platforms/ascend/ascend_npu_quantization.md

…t/npu_gguf (Conflicts: test/srt/run_suite.py)
…ject#17883) Co-authored-by: ronnie_zheng <zl19940307@163.com>


Motivation
Enable GGUF-quantized models (e.g., Q4_K_M, Q8_0, Q5_K_M) to run on Ascend NPU hardware. GGUF is a popular format for quantized LLMs, and this PR adds native NPU support with optimized performance.
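For context, GGUF quantization types are block formats, and the (block_size, type_size) bookkeeping is what the dequantization code in this PR relies on. A minimal sketch using the `gguf` Python package (the same `gguf.GGML_QUANT_SIZES` lookup the PR uses; note that a Q4_K_M file stores most tensors in the Q4_K block format):

```python
import gguf

for qtype in (gguf.GGMLQuantizationType.Q4_K, gguf.GGMLQuantizationType.Q8_0):
    block_size, type_size = gguf.GGML_QUANT_SIZES[qtype]
    # A quantized row of n_bytes raw bytes decodes to
    # n_bytes // type_size * block_size float elements, the same
    # arithmetic the MoE dequantization path uses to compute `cols`.
    print(qtype.name, "block_size:", block_size, "type_size:", type_size)
```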
Modifications
Accuracy Tests
gsm8k
Qwen3-14B-Q4_K_M.gguf
Qwen3-30B-A3B-Q4_K_M.gguf
Benchmarking and Profiling
Qwen3-14B-Q4_K_M.gguf
Qwen3-30B-A3B-Q4_K_M.gguf
Checklist
Review Process
/tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci