[CPU] Optimize small oc GEMM for Qwen3-next on CPU #12446
Fridge003 merged 14 commits into sgl-project:main
Conversation
mingfeima left a comment:
Make as few changes as possible at the Python level; make the changes happen at the C++ level.
That is to say, we still use weight_packed_linear, but we use a different implementation for it under certain conditions.
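A minimal C++ sketch of the kind of dispatch being suggested, assuming a hypothetical `SMALL_OC_THRESHOLD` cutoff and hypothetical `*_kernel_impl` helper names (the real sgl-kernel entry points may differ):

```cpp
#include <torch/extension.h>
#include <optional>

// Hypothetical helpers, defined elsewhere in the extension:
at::Tensor weight_packed_linear_kernel_impl(
    const at::Tensor& input, const at::Tensor& packed_weight,
    const std::optional<at::Tensor>& bias);
at::Tensor small_oc_gemm_kernel_impl(
    const at::Tensor& input, const at::Tensor& packed_weight,
    const std::optional<at::Tensor>& bias);

// Assumed cutoff below which the specialized small-oc path is taken.
constexpr int64_t SMALL_OC_THRESHOLD = 16;

// The Python-facing op keeps its name and signature; only the C++
// implementation branches on the output-channel count.
at::Tensor weight_packed_linear(
    const at::Tensor& input,          // [M, K] BF16 activations
    const at::Tensor& packed_weight,  // prepacked weight, logical shape [OC, K]
    const std::optional<at::Tensor>& bias) {
  const int64_t oc = packed_weight.size(0);
  if (oc < SMALL_OC_THRESHOLD) {
    // small-oc path: GEMV-style kernel instead of the default brgemm tiles
    return small_oc_gemm_kernel_impl(input, packed_weight, bias);
  }
  // default path: existing brgemm-based implementation
  return weight_packed_linear_kernel_impl(input, packed_weight, bias);
}
```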
@jianan-gu, please put more detailed descriptions in the PR message.
One more thing: using FP32 brgemm might not be a good idea from a performance point of view, but this kernel doesn't contribute much to the end-to-end profiling results, so we can keep it as it is for now.
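For context, a standalone reference of what the small-oc path computes: BF16 inputs accumulated in FP32. This sketch uses plain scalar loops only to illustrate the numerics, not the actual brgemm-based kernel:

```cpp
#include <cstdint>
#include <cstring>

// Convert a bf16 value (stored as uint16_t) to float by widening it
// into the upper 16 bits of an IEEE-754 binary32.
static inline float bf16_to_fp32(uint16_t x) {
  uint32_t bits = static_cast<uint32_t>(x) << 16;
  float f;
  std::memcpy(&f, &bits, sizeof(f));
  return f;
}

// out[m][n] = sum_k a[m][k] * w[n][k], accumulated in FP32.
// With oc (= N) < 16 the GEMM is close to a GEMV, so the FP32
// compute path is acceptable here even if it is not the fastest option.
void small_oc_gemm_ref(const uint16_t* a, const uint16_t* w, float* out,
                       int64_t M, int64_t N, int64_t K) {
  for (int64_t m = 0; m < M; ++m) {
    for (int64_t n = 0; n < N; ++n) {
      float acc = 0.f;
      for (int64_t k = 0; k < K; ++k) {
        acc += bf16_to_fp32(a[m * K + k]) * bf16_to_fp32(w[n * K + k]);
      }
      out[m * N + n] = acc;
    }
  }
}
```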
Noted, and added comments in the code.
Test commit: 0b97584
Hi, could you please fix the conflicts? Thanks!
Thanks, have fixed.
@jianan-gu rebase |
Co-authored-by: Zheng, Beilei <beilei.zheng@intel.com>
* [CPU] Optimize small oc GEMM for Qwen3-next on CPU (sgl-project#12446)
* port linear_gelu_linear kernel
* apply linear_gelu_linear for TP=1
* fix numa memory bind
* apply parallel partition patch

Co-authored-by: Zheng, Beilei <beilei.zheng@intel.com>
Co-authored-by: jianan-gu <jianan.gu@intel.com>
* port layernorm 3d
* apply layernorm
* support for bias
* fix
* intf fix
* add support for CPU
* fix tp=3/6 padding issue in encoder vision
* fix tp=3/6 padding issue in qwen3-omni
* refactor code
* add mrope
* change attention_mask shape to use flash attn
* add kernel apply_rotary_pos_emb_cpu
* replace nn.Linear with ReplicatedLinear
* enable torch.compile
* construct mask using query.dtype instead of bool on CPU
* add fast path for sparse attention
* fix double free segfault by wrong setting of BLOCK_M
* improve extend kernel performance for long context length
* update test_extend.py
* update comment
* fix topk softmax performance issue
* port optimization for image preprocessor in Qwen2VLImageProcessorFast
* apply optimization for image preprocessor
* update docker file
* optimize conv3d used in patch embedding
* resolve conflict
* apply optimized conv3d
* apply optimization for flash_attn_varlen_func (sgl-project#19)
* port optimization for flash_attn_varlen_func
* apply flash_attn_varlen_func
* remove contiguous before rope (sgl-project#20)
* Revert "resolve conflict" (This reverts commit 7622f6d.)
* fix after rebase
* Update pyproject_cpu.toml
* Update xeon.Dockerfile
* minor fix after rebase
* rope: add support for bf16 sincos (sgl-project#102)
* format
* Update xeon.Dockerfile
* odd tp for cpu
* Apply linear_gelu_linear and fix numa memory bind (sgl-project#22)
* [CPU] Optimize small oc GEMM for Qwen3-next on CPU (sgl-project#12446)
* port linear_gelu_linear kernel
* apply linear_gelu_linear for TP=1
* fix numa memory bind
* apply parallel partition patch
* Revert "Fix: test_vlm_offline_throughput output throughput (sgl-project#13279)" (sgl-project#101; this reverts commit 7ee3e36.)
* fix input dtype mismatch issue
* apply optimized layernorm

Co-authored-by: Zheng, Beilei <beilei.zheng@intel.com>
Co-authored-by: ZailiWang <zaili.wang@intel.com>
Co-authored-by: mingfeima <mingfei.ma@intel.com>
Co-authored-by: jianan-gu <jianan.gu@intel.com>
This PR adds optimizations for BF16 GEMMs with small oc shapes (oc < 16), including:

* weight_packed_linear, which is common to all GEMMs with oc < 16
* fused_linear_sigmoid_mul, which is specific to cases found in Qwen3-next (shared_expert_gate, where N = 1 and there is a post-op sigmoid + mul); see the sketch below
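A minimal reference for the semantics of fused_linear_sigmoid_mul as described above. This is a sketch only; the FP32 types, row-major layouts, and the second operand `y` (taken to be the shared-expert output that the sigmoid gate multiplies) are assumptions, not the actual kernel:

```cpp
#include <cmath>
#include <cstdint>

// Reference semantics of the fused op for the shared_expert_gate case:
// gate = sigmoid(x @ w^T) with N = 1, then out = gate * y elementwise.
// x: [M, K], w: [1, K] (N = 1 in Qwen3-next), y and out: [M, H].
void fused_linear_sigmoid_mul_ref(const float* x, const float* w,
                                  const float* y, float* out,
                                  int64_t M, int64_t K, int64_t H) {
  for (int64_t m = 0; m < M; ++m) {
    // Linear part degenerates to a dot product because N = 1.
    float acc = 0.f;
    for (int64_t k = 0; k < K; ++k) {
      acc += x[m * K + k] * w[k];
    }
    const float gate = 1.f / (1.f + std::exp(-acc));  // sigmoid post-op
    for (int64_t h = 0; h < H; ++h) {
      out[m * H + h] = gate * y[m * H + h];           // mul post-op
    }
  }
}
```

Fusing the sigmoid and mul into the GEMV avoids materializing the [M, 1] gate tensor and a separate elementwise pass, which matters at these tiny output widths where the kernel is launch- and bandwidth-bound rather than compute-bound.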