Checklist
Describe the bug
The unquant MoE NZ feature introduced by PR #15904, together with the early transpose of the weights, causes the GPT-OSS model to fail to run.
The model runs successfully after the following modification to unquant.py. However, this change may cause regressions in other models, so I am looking for an approach that is compatible with both:
diff --git a/python/sglang/srt/layers/quantization/unquant.py b/python/sglang/srt/layers/quantization/unquant.py
--- a/python/sglang/srt/layers/quantization/unquant.py (revision cb8105fe282fc373b5baed63d5df38682418a373)
+++ b/python/sglang/srt/layers/quantization/unquant.py (date 1774253773674)
@@ -313,13 +313,13 @@
layer.num_local_experts, *new_shape_w2
)
- if _is_npu:
- for weight_name in ["w13_weight", "w2_weight"]:
- weight = getattr(layer, weight_name)
- weight.data = weight.data.transpose(1, 2)
- weight.data = npu_format_cast(
- weight.data,
- )
+ # if _is_npu:
+ # for weight_name in ["w13_weight", "w2_weight"]:
+ # weight = getattr(layer, weight_name)
+ # weight.data = weight.data.transpose(1, 2)
+ # weight.data = npu_format_cast(
+ # weight.data,
+ # )
return
@@ -587,7 +587,7 @@
# gmm1: gate_up_proj
hidden_states = torch.ops.npu.npu_grouped_matmul(
x=[hidden_states],
- weight=[layer.w13_weight],
+ weight=[layer.w13_weight.transpose(1, 2)],
bias=w13_bias,
split_item=2,
group_list_type=1,
@@ -611,7 +611,7 @@
# gmm2: down_proj
hidden_states = torch.ops.npu.npu_grouped_matmul(
x=[hidden_states],
- weight=[layer.w2_weight],
+ weight=[layer.w2_weight.transpose(1, 2)],
bias=w2_bias,
split_item=2,
group_list_type=1,
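One possible direction for a compatible fix (just a sketch, not a tested patch) would be to keep the early transpose + NZ cast only for weights whose per-expert shape satisfies the alignment rule quoted in the first error below, and fall back to the call-time transpose otherwise. The helper name and its placement are hypothetical:
import torch

def _nz_alignment_ok(weight: torch.Tensor) -> bool:
    """Hypothetical helper: True if this BF16/FP16 weight satisfies the
    aclnnGroupedMatmulWeightNz alignment rule quoted in the first error
    below (k and n divisible by 16, i.e. inner axis a multiple of 32 B)."""
    _, n, k = weight.shape  # per-expert (n, k) layout before transpose(1, 2)
    if weight.dtype in (torch.bfloat16, torch.float16):
        return k % 16 == 0 and n % 16 == 0
    return False  # stay conservative for other dtypes in this sketch

# In process_weights_after_loading, roughly where the block removed in the
# diff above sits:
# if _is_npu:
#     for weight_name in ["w13_weight", "w2_weight"]:
#         weight = getattr(layer, weight_name)
#         if _nz_alignment_ok(weight):
#             weight.data = weight.data.transpose(1, 2)
#             weight.data = npu_format_cast(weight.data, ...)
#         # else: keep the ND layout and transpose at call time inside
#         #       forward_npu, as in the workaround diff above.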
Reproduction
Server launch script:
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=500000000
# export TIKTOKEN_ENCODINGS_BASE=/home/encoding
export TASK_QUEUE_ENABLE=0
source /usr/local/Ascend/ascend-toolkit/set_env.sh
export HCCL_BUFFSIZE=2000
export STREAMS_PER_DEVICE=32
export HCCL_SOCKET_IFNAME=lo
export GLOO_SOCKET_IFNAME=lo
#export ENABLE_PROFILING=1
#export ASCEND_RT_VISIBLE_DEVICES=8
export USE_VLLM_CUSTOM_ALLREDUCE=1
python -m sglang.launch_server \
--model-path /home/weights/gpt-oss-120b-bf16 \
--trust-remote-code \
--mem-fraction-static 0.7 \
--host 127.0.0.1 \
--port 9989 \
--device npu \
--attention-backend ascend \
--nnodes 1 \
--node-rank 0 \
--max-running-requests 32 \
--chunked-prefill-size 3276800 \
--max-prefill-tokens 32768 \
--watchdog-timeout 9000 \
--tp-size 8 \
--sampling-backend ascend \
--disable-hybrid-swa-memory \
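Once the server is up, any request that reaches the MoE layers reproduces the failure. A minimal example against sglang's native /generate endpoint (prompt text and token count are arbitrary; only the host and port from the script above are assumed):
import requests

# Minimal request to drive the MoE forward pass on the server launched above.
resp = requests.post(
    "http://127.0.0.1:9989/generate",
    json={
        "text": "Hello, how are you?",
        "sampling_params": {"max_new_tokens": 16},
    },
    timeout=60,
)
print(resp.status_code, resp.text)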
The error message about MoE NZ is as follows:
File "/home/Todobe/sgl-sglang/python/sglang/srt/layers/quantization/unquant.py", line 612, in forward_npu
hidden_states = torch.ops.npu.npu_grouped_matmul(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/_ops.py", line 1243, in __call__
return self._op(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: npu_grouped_matmul:build/CMakeFiles/torch_npu.dir/compiler_depend.ts:252 NPU function error: call aclnnGroupedMatmulWeightNz failed, error code is 561103
[ERROR] 2026-03-23-08:51:39 (PID:375494, Device:4, RankID:-1) ERR00100 PTA call acl api failed.
[PID: 375494] 2026-03-23-08:51:39.140.094 AclNN_Parameter_Error(EZ1001): When weight(BF16) format is FRACTAL_NZ, weight'shape(k[360], n[2880]) should be divisible by the following shape: INT8:[16, 32],BF16/FP16[16, 16],INT4[16, 64],FP4[64,64]). If the weight is transposed,the k/n need to be reversed.
TraceBack (most recent call last):
the inner axis size of nz weight is expected to be a multiple of 32B, but now the inner axis size is 360.[FUNC:CheckWeightNZShape][FILE:grouped_matmul_tiling.cpp][LINE:112]
the shape of nz weight is invaild.[FUNC:PrepareTilingData][FILE:grouped_matmul_tiling.cpp][LINE:176]
GMM PrepareTilingData failed.[FUNC:Init][FILE:grouped_matmul_tiling.cpp][LINE:385]
GMM tiling init failed[FUNC:TilingGMM][FILE:grouped_matmul_tiling.cpp][LINE:1841]
Tiling failed
Tiling Failed.
Kernel GetWorkspace failed. opType: 40
ADD_TO_LAUNCHER_LIST_AICORE failed.
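The rejected shape can be checked against the rule quoted in the message above: for a BF16/FP16 weight, k and n must be divisible by 16, i.e. the inner axis must span a multiple of 32 bytes. The reported k of 360 (consistent with a 2880 MoE intermediate dimension split across tp-size 8, assuming GPT-OSS uses 2880) violates both forms of the rule:
# Values taken directly from the error message above (BF16 w2, k=360, n=2880).
k, n = 360, 2880
elem_bytes = 2                     # BF16
print(k % 16, n % 16)              # -> 8 0: k breaks the BF16/FP16 [16, 16] rule
print((k * elem_bytes) % 32)       # -> 16: inner axis is 720 B, not a multiple of 32 B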
The error message about early transpose is as follows:
File "/home/Todobe/sgl-sglang/python/sglang/srt/layers/quantization/unquant.py", line 612, in forward_npu
hidden_states = torch.ops.npu.npu_grouped_matmul(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/_ops.py", line 1243, in __call__
return self._op(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: npu_grouped_matmul:build/CMakeFiles/torch_npu.dir/compiler_depend.ts:252 NPU function error: call aclnnGroupedMatmulWeightNz failed, error code is 161002
[ERROR] 2026-03-23-08:47:57 (PID:368290, Device:7, RankID:-1) ERR00100 PTA call acl api failed.
[PID: 368290] 2026-03-23-08:47:57.552.039 AclNN_Parameter_Error(EZ1001): Dim 1 value of x[0] should be equal with dim 1 value of weight[0], but now is 1440 and 360 respectively.
TraceBack (most recent call last):
k dim value of x and weight is not matched.
Split m, single x, single weight, single y case failed.
Invalid inputs!
Environment
sglang:v0.5.6-ascend-a3