[NPU] Support GGUF quantization for Ascend NPU (dense + MoE) #17883
sglang-npu-bot merged 20 commits into sgl-project:main from …
Conversation
Summary of Changes

Hello @TheKonka, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request integrates GGUF quantization support for models running on Ascend NPUs. It introduces specialized methods for handling GGUF weights, including pre-dequantization and optimized MoE layer operations, to ensure efficient and accurate inference on Ascend hardware. The changes also include updates to the GGUF model loader to correctly interpret and process MoE expert weights.
Code Review
This pull request adds support for GGUF quantization on Ascend NPUs. The overall approach of pre-dequantizing weights during model loading and then using NPU-accelerated kernels for inference is a solid strategy, trading increased memory usage for better performance. The changes are well-structured, introducing NPU-specific methods for linear, embedding, and MoE layers, and adapting the GGUF weight loader for MoE models. I've identified a few areas where duplicated code can be refactored to enhance maintainability. Otherwise, the implementation looks good.
| if "w13" in name: | ||
| # w13 is gate+up fused | ||
| weight_list = [] | ||
| for e in range(num_experts): | ||
| if e in expert_weights: | ||
| w1 = expert_weights[e].get("w1") | ||
| w3 = expert_weights[e].get("w3") | ||
|
|
||
| if w1 is not None and w3 is not None: | ||
| fused = torch.cat([w1, w3], dim=0) | ||
| weight_list.append(fused) | ||
|
|
||
| if weight_list: | ||
| stacked = torch.stack(weight_list, dim=0) | ||
| param.materialize(stacked.shape, dtype=stacked.dtype) | ||
| param.data.copy_(stacked) | ||
| elif "w2" in name: | ||
| # w2 is down projection | ||
| weight_list = [] | ||
| for e in range(num_experts): | ||
| if e in expert_weights and "w2" in expert_weights[e]: | ||
| w2_weight = expert_weights[e]["w2"] | ||
| weight_list.append(w2_weight) | ||
|
|
||
| if weight_list: | ||
| stacked = torch.stack(weight_list, dim=0) | ||
| param.materialize(stacked.shape, dtype=stacked.dtype) | ||
| param.data.copy_(stacked) |
There's significant code duplication in how w13 and w2 weights are materialized. The logic for collecting weights into a list, stacking them, and then materializing the parameter is nearly identical in both the `if "w13" in name:` and `elif "w2" in name:` blocks. This could be refactored into a helper function to improve code clarity and maintainability.
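One possible shape for that helper (a sketch only, not from the PR; `_stack_expert_weights` and its `extract` callback are hypothetical names, while `param.materialize` follows the loader code above):

```python
import torch


def _stack_expert_weights(param, expert_weights, num_experts, extract):
    """Stack per-expert tensors and materialize them into `param`.

    `extract` maps one expert's weight dict to the tensor to stack,
    returning None when that expert should be skipped.
    """
    weight_list = []
    for e in range(num_experts):
        if e in expert_weights:
            w = extract(expert_weights[e])
            if w is not None:
                weight_list.append(w)
    if weight_list:
        stacked = torch.stack(weight_list, dim=0)
        param.materialize(stacked.shape, dtype=stacked.dtype)
        param.data.copy_(stacked)
```

Both branches would then reduce to a single call each, e.g. `_stack_expert_weights(param, expert_weights, num_experts, lambda w: w.get("w2"))` for the down projection, with a `torch.cat`-based callback for the fused gate+up case.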
```python
# Pre-dequantize w13 weights (gate+up projections)
if w13_qtype not in UNQUANTIZED_TYPES:
    num_experts = w13_qweight.shape[0]
    w13_dequant_list = []

    block_size, type_size = gguf.GGML_QUANT_SIZES[w13_qtype]

    for e in range(num_experts):
        qweight_cpu = w13_qweight[e].cpu().numpy()
        rows = w13_qweight[e].shape[0]
        cols = w13_qweight[e].shape[1] // type_size * block_size

        dequant_np = gguf_dequantize(qweight_cpu.flatten(), w13_qtype)
        dequant = (
            torch.from_numpy(dequant_np)
            .to(dtype=self.params_dtype, device=w13_qweight.device)
            .reshape(rows, cols)
            .transpose(-1, -2)
            .contiguous()
        )
        w13_dequant_list.append(dequant)

    w13_full = torch.stack(w13_dequant_list, dim=0)

    layer.register_buffer("w13_dequant", w13_full, persistent=False)
else:
    layer.register_buffer("w13_dequant", w13_qweight.data, persistent=False)

# Pre-dequantize w2 weights (down projection)
w2_qweight = layer.w2_qweight
w2_qtype = layer.w2_qweight_type.weight_type

if w2_qtype not in UNQUANTIZED_TYPES:
    num_experts = w2_qweight.shape[0]
    w2_dequant_list = []

    block_size, type_size = gguf.GGML_QUANT_SIZES[w2_qtype]

    for e in range(num_experts):
        qweight_cpu = w2_qweight[e].cpu().numpy()
        rows = w2_qweight[e].shape[0]
        cols = w2_qweight[e].shape[1] // type_size * block_size

        dequant_np = gguf_dequantize(qweight_cpu.flatten(), w2_qtype)
        dequant = (
            torch.from_numpy(dequant_np)
            .to(dtype=self.params_dtype, device=w2_qweight.device)
            .reshape(rows, cols)
            .transpose(-1, -2)
            .contiguous()
        )
        w2_dequant_list.append(dequant)

    w2_full = torch.stack(w2_dequant_list, dim=0)

    layer.register_buffer("w2_dequant", w2_full, persistent=False)
else:
    layer.register_buffer("w2_dequant", w2_qweight.data, persistent=False)
```
The logic for pre-dequantizing `w13` and `w2` weights is very similar and largely duplicated. This can be refactored into a private helper method to reduce redundancy and improve maintainability. For example, a method like `_dequantize_expert_weights(self, qweight, qtype)` could encapsulate the common logic for both.
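Following that suggestion, the common path could collapse into something like the following (a sketch only; `_dequantize_expert_weights` is the name proposed above, and `UNQUANTIZED_TYPES`, `gguf_dequantize`, `gguf.GGML_QUANT_SIZES`, and `self.params_dtype` are taken from the surrounding code):

```python
def _dequantize_expert_weights(self, qweight, qtype):
    """Dequantize a stacked [num_experts, rows, cols] GGUF tensor,
    returning transposed, contiguous weights in params_dtype."""
    if qtype in UNQUANTIZED_TYPES:
        return qweight.data
    block_size, type_size = gguf.GGML_QUANT_SIZES[qtype]
    dequant_list = []
    for e in range(qweight.shape[0]):
        rows = qweight[e].shape[0]
        cols = qweight[e].shape[1] // type_size * block_size
        dequant_np = gguf_dequantize(qweight[e].cpu().numpy().flatten(), qtype)
        dequant_list.append(
            torch.from_numpy(dequant_np)
            .to(dtype=self.params_dtype, device=qweight.device)
            .reshape(rows, cols)
            .transpose(-1, -2)
            .contiguous()
        )
    return torch.stack(dequant_list, dim=0)
```

Each call site then becomes a one-liner, e.g. `layer.register_buffer("w13_dequant", self._dequantize_expert_weights(layer.w13_qweight, w13_qtype), persistent=False)`.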
```python
if is_moe_weight:
    # MoE weights need special handling - extract layer_id and weight type
    # Format: blk.{layer_id}.ffn_gate_exps.weight
    import re

    match = re.match(r"blk\.(\d+)\.(ffn_\w+_exps)\.weight", tensor_name)
    if match:
        layer_id = int(match.group(1))
        weight_pattern = match.group(2)
        hf_weight_name = MOE_WEIGHT_PATTERNS.get(weight_pattern)
```
There are a couple of improvements that can be made here and in the second loop for handling MoE weights:

- The `import re` statement is inside the loop (here and on line 1012). It should be moved to the top of the `gguf_quant_weights_iterator` function.
- The logic for parsing the MoE tensor name using a regular expression is duplicated in both loops. This could be extracted into a small helper function to improve maintainability and reduce redundancy.
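Applied together, the two suggestions could look roughly like this (a sketch only; `_parse_moe_tensor_name` and `_MOE_TENSOR_RE` are hypothetical names, while `MOE_WEIGHT_PATTERNS` comes from the diff above):

```python
import re

_MOE_TENSOR_RE = re.compile(r"blk\.(\d+)\.(ffn_\w+_exps)\.weight")


def _parse_moe_tensor_name(tensor_name):
    """Return (layer_id, hf_weight_name) for a GGUF MoE expert tensor,
    or None when the name does not match the expected pattern."""
    match = _MOE_TENSOR_RE.match(tensor_name)
    if match is None:
        return None
    layer_id = int(match.group(1))
    return layer_id, MOE_WEIGHT_PATTERNS.get(match.group(2))
```

Both loops in `gguf_quant_weights_iterator` could then share this helper, with `re` imported once at module level.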
@ping1jing2 @iforgetmyname Hello, can you check this PR? Thanks!

ok

@OrangeRedeng @TamirBaydasov please review this PR, thanks

Hi! Could you please add a GGUF test to CI? We are planning to refactor the whole quantization folder at some point, so quantization tests will help a lot in preserving functionality going forward.

…t/npu_gguf (Conflicts: python/sglang/srt/layers/quantization/gguf.py, test/srt/run_suite.py)

/tag-and-rerun-ci

/rerun-failed-ci

/rerun-failed-ci

/rerun-failed-ci

/rerun-failed-ci

/rerun-failed-ci

Hi! Could you please update the documentation to include information about GGUF on NPU? https://github.com/sgl-project/sglang/blob/main/docs/advanced_features/quantization.md and https://github.com/sgl-project/sglang/blob/main/docs/platforms/ascend/ascend_npu_quantization.md

…t/npu_gguf (Conflicts: test/srt/run_suite.py)
…ject#17883) Co-authored-by: ronnie_zheng <zl19940307@163.com>


Motivation
Enable GGUF-quantized models (e.g., Q4_K_M, Q8_0, Q5_K_M) to run on Ascend NPU hardware. GGUF is a popular format for quantized LLMs, and this PR adds native NPU support with optimized performance.
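For context, GGUF quantization types are block formats, and the (block_size, type_size) bookkeeping is what the dequantization code in this PR relies on. A minimal sketch using the `gguf` Python package (the same `gguf.GGML_QUANT_SIZES` lookup the PR uses; note that a Q4_K_M file stores most tensors in the Q4_K block format):

```python
import gguf

for qtype in (gguf.GGMLQuantizationType.Q4_K, gguf.GGMLQuantizationType.Q8_0):
    block_size, type_size = gguf.GGML_QUANT_SIZES[qtype]
    # A quantized row of n_bytes raw bytes decodes to
    # n_bytes // type_size * block_size float elements, the same
    # arithmetic the MoE dequantization path uses to compute `cols`.
    print(qtype.name, "block_size:", block_size, "type_size:", type_size)
```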
Modifications
Accuracy Tests
gsm8k
Qwen3-14B-Q4_K_M.gguf
Qwen3-30B-A3B-Q4_K_M.gguf
Benchmarking and Profiling
Qwen3-14B-Q4_K_M.gguf
Qwen3-30B-A3B-Q4_K_M.gguf
Checklist
Review Process
/tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci