
[CPU] Optimize small oc GEMM for Qwen3-next on CPU #12446

Merged
Fridge003 merged 14 commits into sgl-project:main from jianan-gu:fma_linear_opt on Dec 4, 2025

Conversation

@jianan-gu (Contributor) commented Oct 31, 2025

This PR adds optimizations for BF16 GEMMs with small oc shapes (oc < 16), covering:

  1. weight_packed_linear, the common path for all GEMMs with oc < 16
  2. fused_linear_sigmoid_mul, which is specific to a pattern found in Qwen3-next (shared_expert_gate, where N = 1 and the GEMM is followed by sigmoid + mul post-ops)
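For clarity, the intended semantics of the two ops can be sketched in plain Python. The function names follow the PR, but the code below is an illustrative reference, not the actual FMA kernel; in particular, the second operand `y` of the fused op is assumed to be the shared-expert output that the sigmoid gate multiplies.

```python
import math

def weight_packed_linear(x, w):
    # Reference GEMM: x is M x K activations, w is N x K weights
    # (N = oc < 16 on this PR's fast path). Returns an M x N result.
    return [[sum(a * b for a, b in zip(row, wrow)) for wrow in w] for row in x]

def fused_linear_sigmoid_mul(x, w, y):
    # Qwen3-next shared_expert_gate pattern: N == 1, so the GEMM produces an
    # M x 1 gate, which is passed through sigmoid and multiplied elementwise
    # into y (M x K). Fusing avoids materializing the intermediate gate tensor.
    gate = weight_packed_linear(x, w)  # M x 1
    return [[yi * (1.0 / (1.0 + math.exp(-g[0]))) for yi in yrow]
            for yrow, g in zip(y, gate)]
```

This only pins down the math; the kernel's value is doing it in one pass with FMA instructions instead of three separate ops.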

@mingfeima (Collaborator) left a comment


Make as few changes at the Python level as possible; make the changes at the C++ level.

That is to say, we still use weight_packed_linear, but we use a different implementation for it under certain conditions.
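The suggestion can be sketched as a single dispatching entry point. In the real PR the dispatch lives in the C++ kernel; the Python below is only an illustration, and `_small_oc_fma_gemm` and `_brgemm_linear` are hypothetical stand-ins.

```python
SMALL_OC_THRESHOLD = 16  # assumption: mirrors the PR's "oc < 16" condition

def _small_oc_fma_gemm(x, w):
    # Stand-in for the new small-oc FMA path added by this PR.
    return [[sum(a * b for a, b in zip(row, wrow)) for wrow in w] for row in x]

def _brgemm_linear(x, w):
    # Stand-in for the pre-existing brgemm-based path.
    return [[sum(a * b for a, b in zip(row, wrow)) for wrow in w] for row in x]

def weight_packed_linear(x, w):
    # Callers keep using one entry point; the implementation is selected
    # internally by output-channel count, so no call sites need to change.
    oc = len(w)
    impl = _small_oc_fma_gemm if oc < SMALL_OC_THRESHOLD else _brgemm_linear
    return impl(x, w)
```

Keeping the branch inside the kernel is what lets the Python layer stay untouched.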

Outdated review threads: python/sglang/srt/layers/amx_utils.py, python/sglang/srt/models/qwen3_next.py, sgl-kernel/csrc/cpu/gemm.cpp (4 threads)
@mingfeima added the cpu, performance, intel, sgl-kernel, and run-ci labels on Nov 10, 2025
@mingfeima (Collaborator) commented

@jianan-gu please put a more detailed description in the PR message.

@mingfeima (Collaborator) commented

One more thing: using an F32 brgemm might not be a good idea from a performance point of view, but since this kernel contributes little to the end-to-end profiling results, we can keep it as is for now.
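As background on the precision-versus-speed tradeoff behind this remark: accumulating a long dot product directly in bf16 loses low-order bits quickly, which is one reason a slower F32 path can still be worth keeping. A minimal pure-Python illustration, emulating bf16 by truncating the low 16 bits of the float32 bit pattern (real hardware rounds to nearest; this sketch truncates):

```python
import struct

def to_bf16(x: float) -> float:
    # Emulate bfloat16 by zeroing the low 16 mantissa bits of float32.
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    (y,) = struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))
    return y

def dot_bf16_accum(a, b):
    # Accumulate in bf16: once the running sum grows large, small products
    # fall below one ulp of the accumulator and are dropped entirely.
    acc = 0.0
    for x, y in zip(a, b):
        acc = to_bf16(acc + to_bf16(x) * to_bf16(y))
    return acc

def dot_f32_accum(a, b):
    # Keep the accumulator in higher precision (Python float here),
    # as an F32-accumulating GEMM would.
    return sum(to_bf16(x) * to_bf16(y) for x, y in zip(a, b))
```

For example, with `a = [0.1] * 1024` and `b = [1.0] * 1024`, the higher-precision accumulator stays at the true sum of the rounded inputs while the bf16 accumulator stalls well below it.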

The github-actions bot added the quant and deepseek labels on Nov 18, 2025
@jianan-gu (Contributor, Author) commented

> one more thing, using F32 brgemm might not be a good idea from performance point of view, but this kernel doesn't contribute too much in the end to end profiling results, we can keep it as it is right now.

Noted; I have added comments in the code.


@jianan-gu changed the title from "[Draft] [CPU] Optimize small oc/ic GEMM for Qwen3-next on CPU" to "[CPU] Optimize small oc/ic GEMM for Qwen3-next on CPU" on Nov 18, 2025
@jianan-gu changed the title from "[CPU] Optimize small oc/ic GEMM for Qwen3-next on CPU" to "[CPU] Optimize small oc GEMM for Qwen3-next on CPU" on Nov 18, 2025
@jianan-gu jianan-gu requested a review from mingfeima November 18, 2025 08:56
@jianan-gu (Contributor, Author) commented

Test commit: 0b97584

  • Test 1: Linear

| M | N | K | Ref (torch), ms | This PR, ms | Speedup |
|------|----|------|-------|-------|-------|
| 1    | 12 | 1024 | 0.101 | 0.042 | 2.40x |
| 64   | 12 | 1024 | 0.137 | 0.076 | 1.80x |
| 128  | 12 | 1024 | 0.142 | 0.116 | 1.22x |
| 1024 | 12 | 1024 | 0.256 | 0.233 | 1.10x |

  • Test 2: Linear + sigmoid + mul (M×K)

| M | N | K | Ref (torch), ms | This PR, ms | Speedup |
|------|---|------|-------|-------|-------|
| 1    | 1 | 1024 | 0.034 | 0.033 | 1.03x |
| 64   | 1 | 1024 | 0.175 | 0.068 | 2.57x |
| 128  | 1 | 1024 | 0.201 | 0.074 | 2.71x |
| 1024 | 1 | 1024 | 0.313 | 0.241 | 1.29x |

@jianan-gu jianan-gu marked this pull request as ready for review December 1, 2025 05:24
@FlamingoPg (Collaborator) commented

Hi, could you please fix the conflicts? Thanks!

@jianan-gu (Contributor, Author) commented

> Hi, could you please fix conflicts? thx!

Thanks, fixed.

@mingfeima (Collaborator) commented

@jianan-gu please rebase.

@Fridge003 merged commit 70d2587 into sgl-project:main on Dec 4, 2025 (180 of 185 checks passed)
tom-jerr pushed a commit to tom-jerr/sglang that referenced this pull request Dec 4, 2025
Co-authored-by: Zheng, Beilei <beilei.zheng@intel.com>
yingluosanqian pushed a commit to yingluosanqian/sglang that referenced this pull request Dec 4, 2025
Co-authored-by: Zheng, Beilei <beilei.zheng@intel.com>
tonyluj pushed a commit to openanolis/sglang that referenced this pull request Dec 5, 2025
Co-authored-by: Zheng, Beilei <beilei.zheng@intel.com>
yuchengz816-bot pushed a commit to yuchengz816-bot/sglang that referenced this pull request Dec 8, 2025
Co-authored-by: Zheng, Beilei <beilei.zheng@intel.com>
Kevin-XiongC pushed a commit to novitalabs/sglang that referenced this pull request Dec 9, 2025
Co-authored-by: Zheng, Beilei <beilei.zheng@intel.com>
blzheng added a commit to blzheng/sglang that referenced this pull request Jan 30, 2026
Co-authored-by: Zheng, Beilei <beilei.zheng@intel.com>
blzheng added a commit to blzheng/sglang that referenced this pull request Feb 2, 2026
* [CPU]  Optimize small oc GEMM for Qwen3-next on CPU (sgl-project#12446)

Co-authored-by: Zheng, Beilei <beilei.zheng@intel.com>

* port linear_gelu_linear kernel

* apply linear_gelu_linear for TP=1

* fix numa memory bind

* apply parallel partition patch

---------

Co-authored-by: jianan-gu <jianan.gu@intel.com>
jianan-gu added a commit to jianan-gu/sglang that referenced this pull request Feb 25, 2026
sywangyi added a commit to sywangyi/sglang that referenced this pull request Feb 27, 2026
* port layernorm 3d

* apply layernorm

* support for bias

* fix

* intf fix

* add support for CPU

* fix tp=3/6 padding issue in encoder vision

* fix tp=3/6 padding issue in qwen3-omni

* refactor code

* add mrope

* change attention_mask shape to use flash attn

* add kernel apply_rotary_pos_emb_cpu

* replace nn.Linear with ReplicatedLinear

* enable torch.compile

* construct mask using query.dtype instead of bool on CPU

* add fast path for sparse attention

* fix double free segfault by wrong setting of BLOCK_M

* improve extend kernel performance for long context length

* update test_extend.py

* update comment

* fix topk softmax performance issue

* port optimization for image preprocessor in Qwen2VLImageProcessorFast

* apply optimization for image preprocessor

* update docker file

* optimize conv3d used in patch embedding

* resolve conflict

* apply optimized conv3d

* apply optimization for flash_attn_varlen_func (sgl-project#19)

* port optimization for flash_attn_varlen_func

* apply flash_attn_varlen_func

* remove contiguous before rope (sgl-project#20)

* Revert "resolve conflict"

This reverts commit 7622f6d.

* fix after rebase

* Update pyproject_cpu.toml

* Update xeon.Dockerfile

* minor fix after rebase

* rope: add support for bf16 sincos (sgl-project#102)

* format

* Update xeon.Dockerfile

* odd tp for cpu

* Apply linear_gelu_linear and fix numa memory bind (sgl-project#22)


* Revert "Fix: test_vlm_offline_throughput output throughput (sgl-project#13279)" (sgl-project#101)

This reverts commit 7ee3e36.

* fix input dtype mismatch issue

* apply optimized layernorm

---------

Co-authored-by: Zheng, Beilei <beilei.zheng@intel.com>
Co-authored-by: ZailiWang <zaili.wang@intel.com>
Co-authored-by: mingfeima <mingfei.ma@intel.com>
Co-authored-by: jianan-gu <jianan.gu@intel.com>

Labels

cpu, performance, deepseek, intel, quant, run-ci, sgl-kernel

5 participants