
[NPU] Support GGUF quantization for Ascend NPU (dense + MoE)#17883

Merged
sglang-npu-bot merged 20 commits into sgl-project:main from TheKonka:feat/npu_gguf
Apr 25, 2026
Conversation

@TheKonka (Contributor) commented Jan 28, 2026

Motivation

Enable GGUF-quantized models (e.g., Q4_K_M, Q8_0, Q5_K_M) to run on Ascend NPU hardware. GGUF is a popular format for quantized LLMs, and this PR adds native NPU support with optimized performance.

Modifications

  1. Add GGUF quantization methods for NPU (python/sglang/srt/layers/quantization/gguf.py)
  • GGUFLinearAscendMethod: Linear layer support with pre-dequantization at load time
  • GGUFMoEAscendMethod: MoE layer support using npu_grouped_matmul for high-performance expert computation
  • GGUFEmbeddingAscendMethod: Embedding layer support
  • ggml_dequantize_ascend(): NPU-specific dequantization function with chunked processing to avoid OOM (see the sketch after this list)
  2. Support GGUF weight loading in FusedMoE (python/sglang/srt/layers/moe/fused_moe_triton/layer.py)
  • Add materialize_gguf_weights() to assemble expert weights from data containers
  • Handle TP sharding for MoE expert weights at load time
  • Support fused w13 (gate+up) weight assembly
  3. Minor fixes
  • Fix weight loading path for GGUF in model loader
  • Support TP sharding for w1/w2/w3 expert weights
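
The chunked pre-dequantization idea behind ggml_dequantize_ascend() can be sketched roughly as follows. This is only an illustration, not the PR's code: dequantize_gguf_chunked, the default chunk size, and the device string are placeholders, while gguf.GGML_QUANT_SIZES and the gguf-py reference dequantizer gguf.quants.dequantize are existing APIs that stand in here for whatever kernel the PR actually calls.

```python
import gguf
import torch


def dequantize_gguf_chunked(
    qweight: torch.Tensor,                 # packed GGUF blocks, shape [rows, bytes_per_row], dtype uint8
    qtype: "gguf.GGMLQuantizationType",
    out_dtype: torch.dtype = torch.float16,
    device: str = "npu",                   # assumes torch_npu is installed and registered
    rows_per_chunk: int = 4096,            # placeholder chunk size
) -> torch.Tensor:
    """Dequantize GGUF blocks chunk by chunk so the full-precision intermediate
    never has to exist for the whole tensor at once (the OOM-avoidance idea)."""
    block_size, type_size = gguf.GGML_QUANT_SIZES[qtype]
    rows, packed_cols = qweight.shape
    cols = packed_cols // type_size * block_size        # logical width after dequantization

    out = torch.empty((rows, cols), dtype=out_dtype, device=device)
    for start in range(0, rows, rows_per_chunk):
        end = min(start + rows_per_chunk, rows)
        chunk = qweight[start:end].cpu().numpy().flatten()
        # Reference dequantization on CPU, then move only this slice to the device.
        dequant = gguf.quants.dequantize(chunk, qtype).reshape(end - start, cols)
        out[start:end] = torch.from_numpy(dequant).to(dtype=out_dtype, device=device)
    return out
```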

Accuracy Tests

gsm8k

Qwen3-14B-Q4_K_M.gguf

python3 -m sglang.launch_server --model-path /root/.cache/Qwen3-14B-GGUF/Qwen3-14B-Q4_K_M.gguf --device npu --attention-backend ascend --host 0.0.0.0 --port 30000 --mem-fraction-static 0.7 --tp-size 2
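
Once the server above is up, a quick sanity check against sglang's /generate HTTP endpoint (same host and port as in the launch command) looks roughly like this; the prompt is just an illustrative gsm8k-style question, not part of the benchmark:

```python
import requests

# One-off request against the server started above (default sglang HTTP API on port 30000).
resp = requests.post(
    "http://127.0.0.1:30000/generate",
    json={
        "text": (
            "Question: Natalia sold clips to 48 of her friends in April, and then she sold "
            "half as many clips in May. How many clips did Natalia sell altogether?\nAnswer:"
        ),
        "sampling_params": {"temperature": 0.0, "max_new_tokens": 128},
    },
    timeout=120,
)
print(resp.json()["text"])
```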

python bench_sglang.py
100% | 200/200 [00:22<00:00, 9.01it/s]
Accuracy: 0.905
Invalid: 0.000
Latency: 22.275 s
Output throughput: 1214.871 token/s

Qwen3-30B-A3B-Q4_K_M.gguf

python3 -m sglang.launch_server --model-path /root/.cache/Qwen3-30B-A3B-GGUF/Qwen3-30B-A3B-Q4_K_M.gguf --device npu --attention-backend ascend --host 0.0.0.0 --port 30000 --mem-fraction-static 0.8 --tp-size 2

python bench_sglang.py
100% | 200/200 [00:40<00:00, 4.95it/s]
Accuracy: 0.900
Invalid: 0.000
Latency: 40.654 s
Output throughput: 665.395 token/s

Benchmarking and Profiling

Qwen3-14B-Q4_K_M.gguf

python3 -m sglang.launch_server --model-path /root/.cache/Qwen3-14B-GGUF/Qwen3-14B-Q4_K_M.gguf --device npu --attention-backend ascend --host 0.0.0.0 --port 30000 --mem-fraction-static 0.7 --tp-size 2

python3 -m sglang.bench_serving --backend sglang --num-prompts 1000 --dataset-name random --random-input-len 1024 --random-output-len 256 --request-rate inf --dataset-path /tmp/ShareGPT_V3_unfiltered_cleaned_split.json --tokenizer /root/.cache/Qwen3-14B/
benchmark_args=Namespace(backend='sglang', base_url=None, host='0.0.0.0', port=None, dataset_name='random', dataset_path='/tmp/ShareGPT_V3_unfiltered_cleaned_split.json', model=None, served_model_name=None, tokenizer='/root/.cache/Qwen3-14B/', num_prompts=1000, sharegpt_output_len=None, sharegpt_context_len=None, random_input_len=1024, random_output_len=256, random_range_ratio=0.0, image_count=1, image_resolution='1080p', random_image_count=False, image_format='jpeg', image_content='random', request_rate=inf, use_trace_timestamps=False, max_concurrency=None, output_file=None, output_details=False, print_requests=False, disable_tqdm=False, disable_stream=False, return_logprob=False, return_routed_experts=False, seed=1, disable_ignore_eos=False, extra_request_body=None, apply_chat_template=False, profile=False, plot_throughput=False, profile_activities=['CPU', 'GPU'], profile_num_steps=None, profile_by_stage=False, profile_stages=None, lora_name=None, lora_request_distribution='uniform', lora_zipf_alpha=1.5, prompt_suffix='', pd_separated=False, profile_prefill_url=None, profile_decode_url=None, flush_cache=False, warmup_requests=1, tokenize_prompt=False, gsp_num_groups=64, gsp_prompts_per_group=16, gsp_system_prompt_len=2048, gsp_question_len=128, gsp_output_len=256, gsp_range_ratio=1.0, gsp_fast_prepare=False, gsp_send_routing_key=False, gsp_num_turns=1, gsp_ordered=False, mooncake_slowdown_factor=1.0, mooncake_num_rounds=1, mooncake_workload='conversation', tag=None, header=None)
Fail to load tokenizer config with error=Repo id must be in the form 'repo_name' or 'namespace/repo_name': '/root/.cache/Qwen3-14B-GGUF/Qwen3-14B-Q4_K_M.gguf'. Use `repo_type` argument if needed.

WARNING It is recommended to use the `Chat` or `Instruct` model for benchmarking.
Because when the tokenizer counts the output tokens, if there is gibberish, it might count incorrectly.

Namespace(backend='sglang', base_url=None, host='0.0.0.0', port=30000, dataset_name='random', dataset_path='/tmp/ShareGPT_V3_unfiltered_cleaned_split.json', model='/root/.cache/Qwen3-14B-GGUF/Qwen3-14B-Q4_K_M.gguf', served_model_name=None, tokenizer='/root/.cache/Qwen3-14B/', num_prompts=1000, sharegpt_output_len=None, sharegpt_context_len=None, random_input_len=1024, random_output_len=256, random_range_ratio=0.0, image_count=1, image_resolution='1080p', random_image_count=False, image_format='jpeg', image_content='random', request_rate=inf, use_trace_timestamps=False, max_concurrency=None, output_file=None, output_details=False, print_requests=False,disable_tqdm=False, disable_stream=False, return_logprob=False, return_routed_experts=False, seed=1, disable_ignore_eos=False, extra_request_body=None, apply_chat_template=False, profile=False, plot_throughput=False, profile_activities=['CPU', 'GPU'], profile_num_steps=None, profile_by_stage=False, profile_stages=None, lora_name=None, lora_request_distribution='uniform', lora_zipf_alpha=1.5, prompt_suffix='', pd_separated=False, profile_prefill_url=None, profile_decode_url=None, flush_cache=False, warmup_requests=1, tokenize_prompt=False, gsp_num_groups=64, gsp_prompts_per_group=16, gsp_system_prompt_len=2048, gsp_question_len=128, gsp_output_len=256,gsp_range_ratio=1.0, gsp_fast_prepare=False, gsp_send_routing_key=False, gsp_num_turns=1, gsp_ordered=False, mooncake_slowdown_factor=1.0, mooncake_num_rounds=1, mooncake_workload='conversation', tag=None, header=None)

#Input tokens: 512842
#Output tokens: 128903
Starting warmup with 1 sequences...
Warmup completed with 1 sequences. Starting main benchmark run...
100% | 1000/1000 [01:00<00:00, 16.62it/s]

============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    inf
Max request concurrency:                 not set
Successful requests:                     1000
Benchmark duration (s):                  60.20
Total input tokens:                      512842
Total input text tokens:                 512842
Total generated tokens:                  128903
Total generated tokens (retokenized):    128898
Request throughput (req/s):              16.61
Input token throughput (tok/s):          8519.64
Output token throughput (tok/s):         2141.41
Peak output token throughput (tok/s):    6758.00
Peak concurrent requests:                1000
Total token throughput (tok/s):          10661.05
Concurrency:                             693.59
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   41750.91
Median E2E Latency (ms):                 45693.65
P90 E2E Latency (ms):                    57402.67
P99 E2E Latency (ms):                    59126.95
---------------Time to First Token----------------
Mean TTFT (ms):                          21012.26
Median TTFT (ms):                        15791.00
P99 TTFT (ms):                           48139.90
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          238.81
Median TPOT (ms):                        168.06
P99 TPOT (ms):                           1304.34
---------------Inter-Token Latency----------------
Mean ITL (ms):                           162.15
Median ITL (ms):                         140.85
P95 ITL (ms):                            287.97
P99 ITL (ms):                            579.08
Max ITL (ms):                            11751.39
==================================================

Qwen3-30B-A3B-Q4_K_M.gguf

python3 -m sglang.launch_server --model-path /root/.cache/Qwen3-30B-A3B-GGUF/Qwen3-30B-A3B-Q4_K_M.gguf --device npu --attention-backend ascend --host 0.0.0.0 --port 30000 --mem-fraction-static 0.8 --tp-size 2

python3 -m sglang.bench_serving --backend sglang --num-prompts 1000 --dataset-name random --random-input-len 1024 --random-output-len 256 --request-rate inf --dataset-path /tmp/ShareGPT_V3_unfiltered_cleaned_split.json --tokenizer /root/.cache/Qwen3-30B-A3B/
benchmark_args=Namespace(backend='sglang', base_url=None, host='0.0.0.0', port=None, dataset_name='random', dataset_path='/tmp/ShareGPT_V3_unfiltered_cleaned_split.json', model=None, served_model_name=None, tokenizer='/root/.cache/Qwen3-30B-A3B/', num_prompts=1000, sharegpt_output_len=None, sharegpt_context_len=None, random_input_len=1024, random_output_len=256, random_range_ratio=0.0, image_count=1, image_resolution='1080p', random_image_count=False, image_format='jpeg', image_content='random', request_rate=inf, use_trace_timestamps=False, max_concurrency=None, output_file=None, output_details=False, print_requests=False, disable_tqdm=False, disable_stream=False, return_logprob=False, return_routed_experts=False, seed=1, disable_ignore_eos=False, extra_request_body=None, apply_chat_template=False, profile=False, plot_throughput=False, profile_activities=['CPU', 'GPU'], profile_num_steps=None, profile_by_stage=False, profile_stages=None, lora_name=None, lora_request_distribution='uniform', lora_zipf_alpha=1.5, prompt_suffix='', pd_separated=False, profile_prefill_url=None, profile_decode_url=None, flush_cache=False, warmup_requests=1, tokenize_prompt=False, gsp_num_groups=64, gsp_prompts_per_group=16, gsp_system_prompt_len=2048, gsp_question_len=128, gsp_output_len=256, gsp_range_ratio=1.0, gsp_fast_prepare=False, gsp_send_routing_key=False, gsp_num_turns=1, gsp_ordered=False, mooncake_slowdown_factor=1.0, mooncake_num_rounds=1, mooncake_workload='conversation', tag=None, header=None)
Fail to load tokenizer config with error=Repo id must be in the form 'repo_name' or 'namespace/repo_name': '/root/.cache/Qwen3-30B-A3B-GGUF/Qwen3-30B-A3B-Q4_K_M.gguf'. Use `repo_type` argument if needed.

WARNING It is recommended to use the `Chat` or `Instruct` model for benchmarking.
Because when the tokenizer counts the output tokens, if there is gibberish, it might count incorrectly.

Namespace(backend='sglang', base_url=None, host='0.0.0.0', port=30000, dataset_name='random', dataset_path='/tmp/ShareGPT_V3_unfiltered_cleaned_split.json', model='/root/.cache/Qwen3-30B-A3B-GGUF/Qwen3-30B-A3B-Q4_K_M.gguf', served_model_name=None, tokenizer='/root/.cache/Qwen3-30B-A3B/', num_prompts=1000, sharegpt_output_len=None, sharegpt_context_len=None, random_input_len=1024, random_output_len=256, random_range_ratio=0.0, image_count=1, image_resolution='1080p', random_image_count=False, image_format='jpeg', image_content='random', request_rate=inf, use_trace_timestamps=False, max_concurrency=None, output_file=None, output_details=False, print_requests=False, disable_tqdm=False, disable_stream=False, return_logprob=False, return_routed_experts=False, seed=1, disable_ignore_eos=False, extra_request_body=None, apply_chat_template=False, profile=False, plot_throughput=False, profile_activities=['CPU', 'GPU'], profile_num_steps=None, profile_by_stage=False, profile_stages=None, lora_name=None, lora_request_distribution='uniform', lora_zipf_alpha=1.5, prompt_suffix='', pd_separated=False, profile_prefill_url=None, profile_decode_url=None, flush_cache=False, warmup_requests=1, tokenize_prompt=False, gsp_num_groups=64, gsp_prompts_per_group=16, gsp_system_prompt_len=2048, gsp_question_len=128, gsp_output_len=256, gsp_range_ratio=1.0, gsp_fast_prepare=False, gsp_send_routing_key=False, gsp_num_turns=1, gsp_ordered=False, mooncake_slowdown_factor=1.0, mooncake_num_rounds=1, mooncake_workload='conversation', tag=None, header=None)

#Input tokens: 512842
#Output tokens: 128903
Starting warmup with 1 sequences...
Warmup completed with 1 sequences. Starting main benchmark run...
100% | 1000/1000 [01:30<00:00, 11.05it/s]

============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    inf
Max request concurrency:                 not set
Successful requests:                     1000
Benchmark duration (s):                  90.55
Total input tokens:                      512842
Total input text tokens:                 512842
Total generated tokens:                  128903
Total generated tokens (retokenized):    128897
Request throughput (req/s):              11.04
Input token throughput (tok/s):          5663.82
Output token throughput (tok/s):         1423.60
Peak output token throughput (tok/s):    2989.00
Peak concurrent requests:                1000
Total token throughput (tok/s):          7087.42
Concurrency:                             558.75
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   50593.37
Median E2E Latency (ms):                 52226.80
P90 E2E Latency (ms):                    84576.60
P99 E2E Latency (ms):                    89176.30
---------------Time to First Token----------------
Mean TTFT (ms):                          25895.55
Median TTFT (ms):                        19940.53
P99 TTFT (ms):                           66017.52
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          212.05
Median TPOT (ms):                        192.30
P99 TPOT (ms):                           485.94
---------------Inter-Token Latency----------------
Mean ITL (ms):                           193.10
Median ITL (ms):                         170.50
P95 ITL (ms):                            339.69
P99 ITL (ms):                            382.97
Max ITL (ms):                            4153.15
==================================================

Checklist

Review Process

  1. Ping Merge Oncalls to start the PR flow. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • /tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci
  4. After green CI and required approvals, ask Merge Oncalls to merge.

@gemini-code-assist (Bot)

Summary of Changes

Hello @TheKonka, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request integrates GGUF quantization support for models running on Ascend NPUs. It introduces specialized methods for handling GGUF weights, including pre-dequantization and optimized MoE layer operations, to ensure efficient and accurate inference on Ascend hardware. The changes also include updates to the GGUF model loader to correctly interpret and process MoE expert weights.

Highlights

  • Ascend NPU Support for GGUF: Introduced comprehensive support for GGUF quantized models on Ascend NPUs, enabling efficient inference on this hardware platform.
  • NPU-Specific Quantization Methods: Implemented dedicated GGUFLinearAscendMethod, GGUFMoEAscendMethod, and GGUFEmbeddingAscendMethod to tailor GGUF quantization handling specifically for Ascend NPUs.
  • Enhanced MoE Weight Handling: Improved GGUF weight loading mechanisms to correctly parse and materialize packed Mixture of Experts (MoE) weights, ensuring proper initialization for NPU operations.
  • Pre-Dequantization for Performance: Integrated a pre-dequantization step for GGUF weights, performed on the CPU during model loading, to reduce runtime overhead and accelerate inference on the NPU.
  • Optimized MoE Operations: Leveraged NPU-specific grouped matrix multiplication and MoE routing kernels to achieve highly efficient forward passes for Mixture of Experts layers on Ascend hardware.
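
As a plain-PyTorch mental model of what the grouped-matmul expert path computes, the sketch below loops over experts explicitly. It is only a reference illustration with assumed tensor layouts (fused gate+up weights as [num_experts, hidden, 2 * intermediate], down projection as [num_experts, intermediate, hidden]); the actual NPU path dispatches all experts through one grouped kernel instead of a Python loop.

```python
import torch
import torch.nn.functional as F


def moe_forward_reference(
    hidden: torch.Tensor,         # [num_tokens, hidden_size]
    w13_dequant: torch.Tensor,    # [num_experts, hidden_size, 2 * intermediate_size], pre-dequantized
    w2_dequant: torch.Tensor,     # [num_experts, intermediate_size, hidden_size]
    topk_ids: torch.Tensor,       # [num_tokens, top_k] expert indices from the router
    topk_weights: torch.Tensor,   # [num_tokens, top_k] routing weights
) -> torch.Tensor:
    """Per-expert loop with the same input/output semantics as a fused grouped-matmul path."""
    out = torch.zeros_like(hidden)
    for e in range(w13_dequant.shape[0]):
        token_idx, slot = (topk_ids == e).nonzero(as_tuple=True)
        if token_idx.numel() == 0:
            continue
        x = hidden[token_idx]                       # tokens routed to expert e
        gate, up = (x @ w13_dequant[e]).chunk(2, dim=-1)
        y = (F.silu(gate) * up) @ w2_dequant[e]     # SwiGLU, then down projection
        out.index_add_(
            0, token_idx, y * topk_weights[token_idx, slot].unsqueeze(-1).to(y.dtype)
        )
    return out
```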

@gemini-code-assist (Bot) left a comment

Code Review

This pull request adds support for GGUF quantization on Ascend NPUs. The overall approach of pre-dequantizing weights during model loading and then using NPU-accelerated kernels for inference is a solid strategy, trading increased memory usage for better performance. The changes are well-structured, introducing NPU-specific methods for linear, embedding, and MoE layers, and adapting the GGUF weight loader for MoE models. I've identified a few areas where code duplication can be refactored to enhance maintainability. Otherwise, the implementation looks good.

Comment on lines +1163 to +1190
if "w13" in name:
# w13 is gate+up fused
weight_list = []
for e in range(num_experts):
if e in expert_weights:
w1 = expert_weights[e].get("w1")
w3 = expert_weights[e].get("w3")

if w1 is not None and w3 is not None:
fused = torch.cat([w1, w3], dim=0)
weight_list.append(fused)

if weight_list:
stacked = torch.stack(weight_list, dim=0)
param.materialize(stacked.shape, dtype=stacked.dtype)
param.data.copy_(stacked)
elif "w2" in name:
# w2 is down projection
weight_list = []
for e in range(num_experts):
if e in expert_weights and "w2" in expert_weights[e]:
w2_weight = expert_weights[e]["w2"]
weight_list.append(w2_weight)

if weight_list:
stacked = torch.stack(weight_list, dim=0)
param.materialize(stacked.shape, dtype=stacked.dtype)
param.data.copy_(stacked)
Severity: medium

There's significant code duplication in how w13 and w2 weights are materialized. The logic for collecting weights into a list, stacking them, and then materializing the parameter is nearly identical in both if "w13" in name: and elif "w2" in name: blocks. This could be refactored into a helper function to improve code clarity and maintainability.
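
A possible shape for that refactor, reusing the identifiers from the diff above (param, name, expert_weights, num_experts); the helper name _stack_and_materialize is a placeholder, not something in the PR:

```python
def _stack_and_materialize(param, weight_list):
    """Stack the per-expert tensors and copy them into the lazily materialized parameter."""
    if not weight_list:
        return
    stacked = torch.stack(weight_list, dim=0)
    param.materialize(stacked.shape, dtype=stacked.dtype)
    param.data.copy_(stacked)


if "w13" in name:
    # w13 is gate+up fused
    weight_list = []
    for e in range(num_experts):
        w1 = expert_weights.get(e, {}).get("w1")
        w3 = expert_weights.get(e, {}).get("w3")
        if w1 is not None and w3 is not None:
            weight_list.append(torch.cat([w1, w3], dim=0))
    _stack_and_materialize(param, weight_list)
elif "w2" in name:
    # w2 is down projection
    weight_list = [
        expert_weights[e]["w2"]
        for e in range(num_experts)
        if e in expert_weights and "w2" in expert_weights[e]
    ]
    _stack_and_materialize(param, weight_list)
```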

Comment on lines +839 to +896
# Pre-dequantize w13 weights (gate+up projections)
if w13_qtype not in UNQUANTIZED_TYPES:
    num_experts = w13_qweight.shape[0]
    w13_dequant_list = []

    block_size, type_size = gguf.GGML_QUANT_SIZES[w13_qtype]

    for e in range(num_experts):
        qweight_cpu = w13_qweight[e].cpu().numpy()
        rows = w13_qweight[e].shape[0]
        cols = w13_qweight[e].shape[1] // type_size * block_size

        dequant_np = gguf_dequantize(qweight_cpu.flatten(), w13_qtype)
        dequant = (
            torch.from_numpy(dequant_np)
            .to(dtype=self.params_dtype, device=w13_qweight.device)
            .reshape(rows, cols)
            .transpose(-1, -2)
            .contiguous()
        )
        w13_dequant_list.append(dequant)

    w13_full = torch.stack(w13_dequant_list, dim=0)

    layer.register_buffer("w13_dequant", w13_full, persistent=False)
else:
    layer.register_buffer("w13_dequant", w13_qweight.data, persistent=False)

# Pre-dequantize w2 weights (down projection)
w2_qweight = layer.w2_qweight
w2_qtype = layer.w2_qweight_type.weight_type

if w2_qtype not in UNQUANTIZED_TYPES:
    num_experts = w2_qweight.shape[0]
    w2_dequant_list = []

    block_size, type_size = gguf.GGML_QUANT_SIZES[w2_qtype]

    for e in range(num_experts):
        qweight_cpu = w2_qweight[e].cpu().numpy()
        rows = w2_qweight[e].shape[0]
        cols = w2_qweight[e].shape[1] // type_size * block_size

        dequant_np = gguf_dequantize(qweight_cpu.flatten(), w2_qtype)
        dequant = (
            torch.from_numpy(dequant_np)
            .to(dtype=self.params_dtype, device=w2_qweight.device)
            .reshape(rows, cols)
            .transpose(-1, -2)
            .contiguous()
        )
        w2_dequant_list.append(dequant)

    w2_full = torch.stack(w2_dequant_list, dim=0)

    layer.register_buffer("w2_dequant", w2_full, persistent=False)
else:
    layer.register_buffer("w2_dequant", w2_qweight.data, persistent=False)
Severity: medium

The logic for pre-dequantizing w13 and w2 weights is very similar and largely duplicated. This can be refactored into a private helper method to reduce redundancy and improve maintainability. For example, a method like _dequantize_expert_weights(self, qweight, qtype) could encapsulate the common logic for both.
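
A sketch of the suggested helper, assembled only from identifiers already visible in this diff (gguf_dequantize, UNQUANTIZED_TYPES, gguf.GGML_QUANT_SIZES, self.params_dtype); it is illustrative rather than the PR's actual code:

```python
def _dequantize_expert_weights(self, qweight, qtype):
    """Dequantize a stacked [num_experts, ...] GGUF tensor; shared by the w13 and w2 paths."""
    if qtype in UNQUANTIZED_TYPES:
        return qweight.data

    block_size, type_size = gguf.GGML_QUANT_SIZES[qtype]
    dequant_list = []
    for e in range(qweight.shape[0]):
        rows = qweight[e].shape[0]
        cols = qweight[e].shape[1] // type_size * block_size
        dequant_np = gguf_dequantize(qweight[e].cpu().numpy().flatten(), qtype)
        dequant_list.append(
            torch.from_numpy(dequant_np)
            .to(dtype=self.params_dtype, device=qweight.device)
            .reshape(rows, cols)
            .transpose(-1, -2)
            .contiguous()
        )
    return torch.stack(dequant_list, dim=0)


# The two branches above then reduce to:
layer.register_buffer(
    "w13_dequant", self._dequantize_expert_weights(w13_qweight, w13_qtype), persistent=False
)
layer.register_buffer(
    "w2_dequant",
    self._dequantize_expert_weights(layer.w2_qweight, layer.w2_qweight_type.weight_type),
    persistent=False,
)
```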

Comment on lines +973 to +982
if is_moe_weight:
    # MoE weights need special handling - extract layer_id and weight type
    # Format: blk.{layer_id}.ffn_gate_exps.weight
    import re

    match = re.match(r"blk\.(\d+)\.(ffn_\w+_exps)\.weight", tensor_name)
    if match:
        layer_id = int(match.group(1))
        weight_pattern = match.group(2)
        hf_weight_name = MOE_WEIGHT_PATTERNS.get(weight_pattern)
Severity: medium

There are a couple of improvements that can be made here and in the second loop for handling MoE weights:

  1. The import re statement is inside the loop (here and on line 1012). It should be moved to the top of the gguf_quant_weights_iterator function.
  2. The logic for parsing the MoE tensor name using a regular expression is duplicated in both loops. This could be extracted into a small helper function to improve maintainability and reduce redundancy.
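
For illustration, the extraction could look roughly like this; _parse_moe_tensor_name and _MOE_TENSOR_RE are placeholder names, and MOE_WEIGHT_PATTERNS is the mapping already used in the diff:

```python
import re

# Compiled once at module level instead of matching inside each loop iteration.
_MOE_TENSOR_RE = re.compile(r"blk\.(\d+)\.(ffn_\w+_exps)\.weight")


def _parse_moe_tensor_name(tensor_name: str):
    """Return (layer_id, hf_weight_name) for a packed MoE expert tensor, else None."""
    match = _MOE_TENSOR_RE.match(tensor_name)
    if match is None:
        return None
    hf_weight_name = MOE_WEIGHT_PATTERNS.get(match.group(2))
    if hf_weight_name is None:
        return None
    return int(match.group(1)), hf_weight_name
```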

TheKonka changed the title from "[NPU] Support GGUF quantization for Ascend NPU" to "[NPU] Support GGUF quantization for Ascend NPU (dense + MoE)" on Jan 30, 2026
@TheKonka (Contributor, Author) commented Feb 2, 2026

@ping1jing2 @iforgetmyname Hello, can you check this PR? Thanks!

@ping1jing2 (Collaborator), replying to the request above:

ok

Review thread: python/sglang/srt/layers/moe/fused_moe_triton/layer.py (outdated)
Review thread: python/sglang/srt/layers/moe/fused_moe_triton/layer.py (outdated)
Review thread: python/sglang/srt/models/qwen2_moe.py
@ping1jing2 (Collaborator)

@OrangeRedeng @TamirBaydasov please review this PR, thanks

@TamirBaydasov (Contributor)

Hi! Could you please add a GGUF test to CI? We are planning to refactor the whole quantization folder at some point, so quantization tests will help a lot in preserving functionality going forward.

Review thread: python/sglang/srt/model_loader/weight_utils.py
…t/npu_gguf

# Conflicts:
#	python/sglang/srt/layers/quantization/gguf.py
#	test/srt/run_suite.py
@ping1jing2 (Collaborator)

/tag-and-rerun-ci

ping1jing2 requested a review from b8zhong as a code owner on March 17, 2026 14:20
@ping1jing2 (Collaborator)

/rerun-failed-ci

@ping1jing2 (Collaborator)

/rerun-failed-ci

@ping1jing2 (Collaborator)

/rerun-failed-ci

@ping1jing2 (Collaborator)

/rerun-failed-ci

@ping1jing2 (Collaborator)

/rerun-failed-ci

@OrangeRedeng (Contributor)

Hi! Could you please update the documentation to include information about GGUF on NPU? https://github.com/sgl-project/sglang/blob/main/docs/advanced_features/quantization.md and https://github.com/sgl-project/sglang/blob/main/docs/platforms/ascend/ascend_npu_quantization.md

github-actions (Bot) added the documentation and quant labels on Apr 2, 2026
ping1jing2 requested a review from wisclmy0611 as a code owner on April 23, 2026 18:13
@ping1jing2 (Collaborator)

As for the failed CI:

ERROR: test_mmlu (__main__.TestUnifiedSWARadixCache)
Simple-evals MMLU multi-task accuracy.
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/actions-runner/_work/sglang/sglang/python/sglang/srt/utils/common.py", line 2645, in retry
    return fn()
  File "/actions-runner/_work/sglang/sglang/python/sglang/test/test_utils.py", line 2151, in <lambda>
    lambda: super(CustomTestCase, self)._callTestMethod(method),
AssertionError: 0.71875 not greater than or equal to 0.75

Our test results on H100 are attached as screenshots.

sglang-npu-bot merged commit 046c14a into sgl-project:main on Apr 25, 2026
195 of 244 checks passed
vguduruTT pushed a commit to vguduruTT/sglang that referenced this pull request May 2, 2026

Labels

documentation, npu, quant, run-ci

5 participants