Fused two elementwise kernels for k_nope and k_pe concat by kkHuang-amd · Pull Request #14862 · sgl-project/sglang

kkHuang-amd · 2025-12-11T01:57:49Z

Motivation

Reduce the time spent concatenating k_nope and k_pe before MHA attention.

Modifications

Replace the naive torch operations with a Triton kernel.
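For readers unfamiliar with the operation, here is a minimal reference sketch of what the concat computes. It is not the code from this PR: it uses NumPy in place of torch/Triton, the function names `concat_k_naive` and `concat_k_single_pass` are hypothetical, and the shapes (k_nope as `[tokens, heads, 128]`, k_pe as `[tokens, 1, 64]`, shared across heads) are assumptions chosen for illustration. The first function mirrors an unfused path with two separate elementwise writes; the second writes each output row once, which is the access pattern a fused kernel would be expected to follow.

```python
import numpy as np


def concat_k_naive(k_nope, k_pe):
    """Unfused reference: two separate elementwise writes into k."""
    tokens, heads, nope_dim = k_nope.shape
    rope_dim = k_pe.shape[-1]
    k = np.empty((tokens, heads, nope_dim + rope_dim), dtype=k_nope.dtype)
    k[..., :nope_dim] = k_nope  # first elementwise copy
    k[..., nope_dim:] = k_pe    # second copy; k_pe broadcasts across heads
    return k


def concat_k_single_pass(k_nope, k_pe):
    """Single pass over the output: each (token, head) row is written once."""
    tokens, heads, nope_dim = k_nope.shape
    rope_dim = k_pe.shape[-1]
    k = np.empty((tokens, heads, nope_dim + rope_dim), dtype=k_nope.dtype)
    for t in range(tokens):
        for h in range(heads):
            k[t, h, :nope_dim] = k_nope[t, h]
            k[t, h, nope_dim:] = k_pe[t, 0]  # same k_pe row for every head
    return k


# Assumed illustrative shapes: 4 tokens, 8 heads, 128 nope dims, 64 rope dims.
rng = np.random.default_rng(0)
k_nope = rng.standard_normal((4, 8, 128)).astype(np.float32)
k_pe = rng.standard_normal((4, 1, 64)).astype(np.float32)

k_ref = concat_k_naive(k_nope, k_pe)
k_one = concat_k_single_pass(k_nope, k_pe)
```

A reference like this can serve as a correctness oracle when testing a fused kernel: both paths must produce identical output, since the operation only copies values.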

Accuracy Tests

root@mia1-p01-g07:/sgl-workspace/sglang# python3 benchmark/gsm8k/bench_sglang.py --num-questions 2000 --parallel 2000 --port 8000
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1319/1319 [00:57<00:00, 23.06it/s]
Accuracy: 0.939
Invalid: 0.000
Latency: 57.372 s
Output throughput: 2335.740 token/s

Benchmarking and Profiling

before fusing two elementwise kernels

This PR

The time cost drops from 169 us + 104 us (273 us total for the two elementwise kernels) to 128 us for the single fused kernel, roughly a 2.1x speedup.

Checklist

Format your code according to the Format code with pre-commit.
Add unit tests according to the Run and add unit tests.
Update documentation according to Write documentations.
Provide accuracy and speed benchmark results according to Test the accuracy and Benchmark the speed.
Follow the SGLang code style guidance.
Work with maintainers to merge your PR. See the PR Merge Process

HaiShaw · 2025-12-11T03:47:16Z

                attn_dtype = k_nope.dtype
            k = k_nope.new_empty(*k_shape, dtype=attn_dtype)
            concat_and_cast_mha_k_triton(k, k_nope, k_pe)
+        elif _is_hip and self.current_attention_backend == "aiter":


Please confirm whether this should check _is_hip or _is_gfx95_supported.

kkHuang-amd (reply): _is_hip is enough; this optimization applies to any ROCm platform.

HaiShaw · 2025-12-11T18:41:50Z

/tag-and-rerun-ci

…#14862) 1 TC failure to check, but irrelevant to this code change

Fused two elementwise kernels for k_nope and k_pe concat

0eb3eaf

kkHuang-amd requested review from Fridge003, ch-wan, fzyzcjy, ispobock, merrymercy and zhyncs as code owners December 11, 2025 01:57

github-actions Bot added the deepseek label Dec 11, 2025

kkHuang-amd requested a review from HaiShaw December 11, 2025 01:58

kkHuang-amd added the run-ci label Dec 11, 2025

HaiShaw reviewed Dec 11, 2025

HaiShaw approved these changes Dec 11, 2025

HaiShaw added 2 commits December 11, 2025 18:46

Merge branch 'main' into fused-concat-k-nope-pe

b2f1903

Merge branch 'main' into fused-concat-k-nope-pe

d7a410b

HaiShaw merged commit 2ea844e into sgl-project:main Dec 15, 2025
82 of 89 checks passed

tonyluj pushed a commit to openanolis/sglang that referenced this pull request Dec 17, 2025

Fused two elementwise kernels for k_nope and k_pe concat (sgl-project…

67cdd32

…#14862) 1 TC failure to check, but irrelevant to this code change

YChange01 pushed a commit to YChange01/sglang that referenced this pull request Jan 13, 2026

Fused two elementwise kernels for k_nope and k_pe concat (sgl-project…

7dc42d7

…#14862) 1 TC failure to check, but irrelevant to this code change

HaiShaw merged 3 commits into sgl-project:main from HaiShaw:fused-concat-k-nope-pe

2 participants
