[Bug] [NPU] GPT-OSS run failed because of NZ #21201

@Todobe

Checklist

  • I searched related issues but found no solution.
  • The bug persists in the latest version.
  • Issues without environment info and a minimal reproducible demo are hard to resolve and may receive no feedback.
  • If this is not a bug report but a general question, please start a discussion at https://github.com/sgl-project/sglang/discussions. Otherwise, it will be closed.
  • Please use English. Otherwise, it will be closed.

Describe the bug

The unquantized MoE NZ feature introduced by PR #15904, together with the early transpose of the weights, causes the GPT-OSS model to fail to run.
The model runs successfully after the modification to unquant.py shown in the diff below.
However, this change might cause degradation in other models, so I am looking for a fix that stays compatible with them.

diff --git a/python/sglang/srt/layers/quantization/unquant.py b/python/sglang/srt/layers/quantization/unquant.py
--- a/python/sglang/srt/layers/quantization/unquant.py	(revision cb8105fe282fc373b5baed63d5df38682418a373)
+++ b/python/sglang/srt/layers/quantization/unquant.py	(date 1774253773674)
@@ -313,13 +313,13 @@
                 layer.num_local_experts, *new_shape_w2
             )
 
-        if _is_npu:
-            for weight_name in ["w13_weight", "w2_weight"]:
-                weight = getattr(layer, weight_name)
-                weight.data = weight.data.transpose(1, 2)
-                weight.data = npu_format_cast(
-                    weight.data,
-                )
+        # if _is_npu:
+            # for weight_name in ["w13_weight", "w2_weight"]:
+            #     weight = getattr(layer, weight_name)
+            #     weight.data = weight.data.transpose(1, 2)
+            #     weight.data = npu_format_cast(
+            #         weight.data,
+            #     )
 
         return
 
@@ -587,7 +587,7 @@
         # gmm1: gate_up_proj
         hidden_states = torch.ops.npu.npu_grouped_matmul(
             x=[hidden_states],
-            weight=[layer.w13_weight],
+            weight=[layer.w13_weight.transpose(1, 2)],
             bias=w13_bias,
             split_item=2,
             group_list_type=1,
@@ -611,7 +611,7 @@
         # gmm2: down_proj
         hidden_states = torch.ops.npu.npu_grouped_matmul(
             x=[hidden_states],
-            weight=[layer.w2_weight],
+            weight=[layer.w2_weight.transpose(1, 2)],
             bias=w2_bias,
             split_item=2,
             group_list_type=1,

Reproduction

Server launch script:

echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sysctl -w vm.swappiness=0
sysctl -w kernel.numa_balancing=0
sysctl -w kernel.sched_migration_cost_ns=500000000

# export TIKTOKEN_ENCODINGS_BASE=/home/encoding
export TASK_QUEUE_ENABLE=0
source /usr/local/Ascend/ascend-toolkit/set_env.sh

export HCCL_BUFFSIZE=2000
export STREAMS_PER_DEVICE=32
export HCCL_SOCKET_IFNAME=lo
export GLOO_SOCKET_IFNAME=lo
#export ENABLE_PROFILING=1
#export ASCEND_RT_VISIBLE_DEVICES=8

export USE_VLLM_CUSTOM_ALLREDUCE=1
python -m sglang.launch_server \
    --model-path /home/weights/gpt-oss-120b-bf16 \
    --trust-remote-code \
    --mem-fraction-static 0.7 \
    --host 127.0.0.1 \
    --port 9989 \
    --device npu \
    --attention-backend ascend \
    --nnodes 1 \
    --node-rank 0 \
    --max-running-requests 32 \
    --chunked-prefill-size 3276800 \
    --max-prefill-tokens 32768 \
    --watchdog-timeout 9000 \
    --tp-size 8 \
    --sampling-backend ascend \
    --disable-hybrid-swa-memory

The error message about MoE NZ is as follows:

 File "/home/Todobe/sgl-sglang/python/sglang/srt/layers/quantization/unquant.py", line 612, in forward_npu
    hidden_states = torch.ops.npu.npu_grouped_matmul(
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/_ops.py", line 1243, in __call__
    return self._op(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: npu_grouped_matmul:build/CMakeFiles/torch_npu.dir/compiler_depend.ts:252 NPU function error: call aclnnGroupedMatmulWeightNz failed, error code is 561103
[ERROR] 2026-03-23-08:51:39 (PID:375494, Device:4, RankID:-1) ERR00100 PTA call acl api failed.
[PID: 375494] 2026-03-23-08:51:39.140.094 AclNN_Parameter_Error(EZ1001): When weight(BF16) format is FRACTAL_NZ, weight'shape(k[360], n[2880]) should be divisible by the following shape: INT8:[16, 32],BF16/FP16[16, 16],INT4[16, 64],FP4[64,64]). If the weight is transposed,the k/n need to be reversed.
        TraceBack (most recent call last):
        the inner axis size of nz weight is expected to be a multiple of 32B, but now the inner axis size is 360.[FUNC:CheckWeightNZShape][FILE:grouped_matmul_tiling.cpp][LINE:112]
        the shape of nz weight is invaild.[FUNC:PrepareTilingData][FILE:grouped_matmul_tiling.cpp][LINE:176]
        GMM PrepareTilingData failed.[FUNC:Init][FILE:grouped_matmul_tiling.cpp][LINE:385]
        GMM tiling init failed[FUNC:TilingGMM][FILE:grouped_matmul_tiling.cpp][LINE:1841]
        Tiling failed
        Tiling Failed.
        Kernel GetWorkspace failed. opType: 40
        ADD_TO_LAUNCHER_LIST_AICORE failed.
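
The alignment rule quoted in the error above can be checked before casting a weight to FRACTAL_NZ. The following is a minimal sketch of such a check, not code from sglang or torch_npu: the table is copied from the error message, while the names `NZ_ALIGNMENT` and `nz_shape_ok` are hypothetical, and swapping k and n for a transposed weight is my reading of "the k/n need to be reversed".

```python
# Minimum (k, n) alignment per dtype for FRACTAL_NZ weights, copied from the
# AclNN_Parameter_Error message. The dict and function names are hypothetical.
NZ_ALIGNMENT = {
    "int8": (16, 32),
    "bf16": (16, 16),
    "fp16": (16, 16),
    "int4": (16, 64),
    "fp4": (64, 64),
}


def nz_shape_ok(k, n, dtype, transposed=False):
    """Return True if a (k, n) weight meets the NZ alignment for `dtype`."""
    if transposed:
        # Assumption: for a transposed weight the roles of k and n swap.
        k, n = n, k
    align_k, align_n = NZ_ALIGNMENT[dtype]
    return k % align_k == 0 and n % align_n == 0


# Shape reported in the log: k=360, n=2880, BF16.
print(nz_shape_ok(360, 2880, "bf16"))   # False, because 360 % 16 == 8
print(nz_shape_ok(2880, 2880, "bf16"))  # True
```

A guard of this kind could decide per layer whether to apply the NZ cast or keep the default format, which would leave models with aligned shapes unaffected. Whether that fits the existing code path in unquant.py is untested.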

The error message about early transpose is as follows:

  File "/home/Todobe/sgl-sglang/python/sglang/srt/layers/quantization/unquant.py", line 612, in forward_npu
    hidden_states = torch.ops.npu.npu_grouped_matmul(
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/python3.11.14/lib/python3.11/site-packages/torch/_ops.py", line 1243, in __call__
    return self._op(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: npu_grouped_matmul:build/CMakeFiles/torch_npu.dir/compiler_depend.ts:252 NPU function error: call aclnnGroupedMatmulWeightNz failed, error code is 161002
[ERROR] 2026-03-23-08:47:57 (PID:368290, Device:7, RankID:-1) ERR00100 PTA call acl api failed.
[PID: 368290] 2026-03-23-08:47:57.552.039 AclNN_Parameter_Error(EZ1001): Dim 1 value of x[0] should be equal with dim 1 value of weight[0], but now is 1440 and 360 respectively.
        TraceBack (most recent call last):
        k dim value of x and weight is not matched.
        Split m, single x, single weight, single y case failed.
        Invalid inputs!
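
The mismatch above means the second dimension of x (k) does not match dim 1 of the weight. The sketch below shows one way to pick the layout from shapes alone. It is an illustration under assumptions: `pick_weight_layout` is a hypothetical helper, and the example weight shapes are made up, since the log only reports the values 1440 and 360.

```python
def pick_weight_layout(x_shape, weight_shape):
    """Decide how an [E, a, b] expert weight lines up with x of shape [m, k].

    Returns "as_is" when dim 1 of the weight already equals k,
    "transpose_1_2" when dim 2 equals k, and raises otherwise.
    """
    k = x_shape[1]
    _, dim1, dim2 = weight_shape
    if dim1 == k:
        return "as_is"
    if dim2 == k:
        return "transpose_1_2"
    raise ValueError(
        f"k={k} matches neither weight dim 1 ({dim1}) nor dim 2 ({dim2})"
    )


# Hypothetical shapes: only k=1440 and weight dim 1 = 360 come from the log.
print(pick_weight_layout((4, 1440), (8, 1440, 360)))  # as_is
print(pick_weight_layout((4, 1440), (8, 360, 1440)))  # transpose_1_2
```

When dim 1 and dim 2 are equal, shapes cannot tell the two layouts apart, so a real fix would need to track explicitly whether the early transpose was applied.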

Environment

sglang:v0.5.6-ascend-a3
