Checklist
Motivation
Share our optimization methods for Qwen
Qwen3-32B
decode optimization
Fused OPs:
torch_npu.npu_add_rms_norm
decode attention: torch_npu._npu_paged_attention
Summary of other key features
W8A8 quantization
see [feature] Ascend quantization support
enable ACLGraph
Notice: the tiling of torch_npu.npu_fused_infer_attention_score (IFA) depends on the host value of actual_seq_lengths, which is the root of the issue, so actual_seq_lengths must be updated during npu_graph_runner.py::NPUGraphRunner::replay. torch_npu._npu_paged_attention is faster than torch_npu.npu_fused_infer_attention_score but could not be captured into ACLGraph, so after discussion with the torch_npu team, PR24572 was raised to support it; see that PR for details. In this case, the context_lens argument needs special handling.
[CMO weight prefetch] Prefetch the weight of matmul when running the AIV kernels
Using torch_npu.npu_prefetch to prefetch the matmul weights (gate_up and down proj) while other AIV kernels are running, aiming to overlap the memory-access time: the right-hand weight matrix is preloaded into the L2 cache before the matmul runs, which reduces memory-access overhead during the compute phase and shortens the matmul execution time.
prefill optimization -- sequence parallelism
see #10519 for details
host optimization
Will update later.
Qwen2.5-VL (Vision part only)
see #9189, #10556, #11047
VisionAttention
Use torch_npu._npu_flash_attention_unpad for attention acceleration, together with a sin/cos cache and torch_npu.npu_rotary_mul.
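The sin/cos-cache idea behind torch_npu.npu_rotary_mul can be sketched in plain numpy. This is a minimal illustration assuming the standard rotate-half RoPE formulation; the function names are hypothetical, not the vLLM-Ascend code:

```python
import numpy as np

def build_sin_cos_cache(max_pos, dim, base=10000.0):
    # Precompute the rotary sin/cos tables once; runtime lookups are cheap.
    inv_freq = 1.0 / base ** (np.arange(0, dim, 2) / dim)      # (dim/2,)
    angles = np.arange(max_pos)[:, None] * inv_freq[None, :]   # (max_pos, dim/2)
    angles = np.concatenate([angles, angles], axis=-1)         # (max_pos, dim)
    return np.sin(angles), np.cos(angles)

def rotate_half(x):
    x1, x2 = np.split(x, 2, axis=-1)
    return np.concatenate([-x2, x1], axis=-1)

def apply_rotary(x, sin, cos, positions):
    # x: (seq_len, head_dim); gather the cached rows by token position.
    return x * cos[positions] + rotate_half(x) * sin[positions]

sin, cos = build_sin_cos_cache(max_pos=1024, dim=64)
q = np.random.default_rng(0).standard_normal((8, 64))
q_rot = apply_rotary(q, sin, cos, np.arange(8))
```

Caching sin/cos up front removes the per-step trigonometric computation from the hot path; the fused NPU op then only performs the multiply-add.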
VisionPatchEmbed
Use matmul instead of Conv3D to accelerate the patch embedding.
How does it work?
feature_map.shape = (N, C, D, H, W)
kernel.shape = (Cout, C, d, h, w)
For Qwen2.5-VL, kernel_size == stride and D=d, H=h, W=w,
so the number of sliding-window positions is S = (D/d) * (H/h) * (W/w) = 1.
Conv3D result: hidden_state = (N, S, C*d*h*w) x (Cout, C*d*h*w)^T = (N, 1, C*d*h*w) x (Cout, C*d*h*w)^T
which is equal to: hidden_state = (N, C*d*h*w) x (Cout, C*d*h*w)^T
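The equivalence above can be checked numerically. A minimal numpy sketch (illustrative shapes, not the actual Qwen2.5-VL dimensions): with kernel_size == stride and the input spatial size equal to the kernel size, the single conv window reduces to one dot product per output channel, which is exactly a matmul over the flattened patch:

```python
import numpy as np

N, C, Cout = 2, 3, 8
d, h, w = 2, 14, 14          # kernel == stride == input spatial size
x = np.random.default_rng(0).standard_normal((N, C, d, h, w)).astype(np.float32)
k = np.random.default_rng(1).standard_normal((Cout, C, d, h, w)).astype(np.float32)

# Conv3D with one sliding-window position: each output channel is the
# full dot product of the window with its kernel -> shape (N, Cout).
conv = np.einsum('ncdhw,ocdhw->no', x, k)

# Equivalent matmul: flatten both operands to (N, C*d*h*w) and (Cout, C*d*h*w).
mm = x.reshape(N, -1) @ k.reshape(Cout, -1).T
```

Both paths produce the same (N, Cout) hidden state, so the patch embedding can be dispatched to the much faster matmul kernel.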
VisionTransformer
attention padding
Because the Ascend Cube unit achieves its highest performance when the input shape is divisible by 16, we pad the attn_head from [40, 40] to [64, 64] in Qwen2_5_VLForConditionalGeneration.load_weights.
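Zero-padding the head dimension does not change the attention output, because zero columns in q/k contribute nothing to the scores and the padded part of v only yields zero output columns. A numpy sketch of this invariance (illustrative 40→64 padding, not the actual Ascend kernel; note the softmax scale is kept at the original head_dim, which holds in practice since attention kernels take the scale as an explicit argument):

```python
import numpy as np

def attention(q, k, v, scale):
    # Plain softmax attention; `scale` is passed explicitly, so it stays
    # 1/sqrt(40) even after padding the head dimension.
    scores = (q @ k.transpose(0, 2, 1)) * scale
    scores -= scores.max(axis=-1, keepdims=True)
    probs = np.exp(scores)
    probs /= probs.sum(axis=-1, keepdims=True)
    return probs @ v

B, T, D, D_pad = 1, 5, 40, 64
rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((B, T, D)) for _ in range(3))
scale = 1.0 / np.sqrt(D)

pad = ((0, 0), (0, 0), (0, D_pad - D))   # zero-pad only the head_dim axis
out_ref = attention(q, k, v, scale)
out_pad = attention(np.pad(q, pad), np.pad(k, pad), np.pad(v, pad), scale)[..., :D]
# out_pad matches out_ref: the extra zero columns are sliced away.
```

This is why the padding can be applied once at weight-loading time without any accuracy impact.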
Qwen3-30B-A3B
see #12078