
perf: enable inductor combo_kernels for horizontal fusion #21977

Merged
ispobock merged 1 commit into sgl-project:main from jasperjiaguo:jiaguo/enable-combo-kernels
Apr 10, 2026

Conversation

jasperjiaguo (Contributor) commented Apr 2, 2026

Enable combo_kernels and benchmark_combo_kernel in inductor config to allow horizontal fusion of sibling ops with different shapes. This fuses operations like q_norm + k_norm (QK normalization) into a single triton kernel instead of generating separate kernels for each.

Requires torch >= 2.9.0.
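
A minimal standalone sketch of the two flags this PR enables (plain torch.compile usage under torch >= 2.9 with CUDA; the actual wiring into SGLang's inductor config is in the diff and not reproduced here):

```python
import torch
import torch._inductor.config as inductor_config

# The two flags named in this PR; both live in torch._inductor.config.
inductor_config.combo_kernels = True           # horizontally fuse sibling ops
inductor_config.benchmark_combo_kernel = True  # keep a combo only when it benchmarks faster

def qk_norm(q: torch.Tensor, k: torch.Tensor, eps: float = 1e-6):
    # Sibling RMSNorm-style reductions over differently shaped tensors --
    # the q_norm + k_norm pattern that combo kernels can fuse into one kernel.
    qn = q * torch.rsqrt(q.pow(2).mean(dim=-1, keepdim=True) + eps)
    kn = k * torch.rsqrt(k.pow(2).mean(dim=-1, keepdim=True) + eps)
    return qn, kn

compiled = torch.compile(qk_norm)
q = torch.randn(7168, 16, 128, device="cuda")  # hypothetical shapes: 16 query heads
k = torch.randn(7168, 8, 128, device="cuda")   # hypothetical shapes: 8 KV heads
qn, kn = compiled(q, k)  # with combo kernels on, both norms can land in one triton kernel
```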
(Two profiler screenshots attached to the original PR.)

Profile Results

Qwen3-0.6B FP8 embeddings on H200, PCG inductor, 7k tokens:

| Metric | Before | After |
| --- | --- | --- |
| GPU kernels per forward | 413 | 357 (-14%) |
| QK norm kernels per layer | 4 | 2 |
| `split_with_sizes` / `clone` in kernel names | Present | Gone |

The QK norm reduction + pointwise kernels for q and k are now horizontally fused into single kernels.

Throughput impact is neutral at 60 RPS (kernel launch overhead is not the bottleneck at this load), but the reduced kernel count should help at higher concurrency or with smaller models where launch overhead is proportionally larger.
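
Kernel counts like these can be read off a PyTorch profiler trace; a generic sketch of one way to do it (not necessarily the exact methodology behind the numbers above):

```python
import torch
from torch.profiler import profile, ProfilerActivity

def dump_kernel_table(fn, *args):
    fn(*args)  # warm-up call so compilation is excluded from the trace
    with profile(activities=[ProfilerActivity.CUDA]) as prof:
        fn(*args)
        torch.cuda.synchronize()
    # One row per distinct kernel name; after this PR the separate q/k norm
    # kernels should collapse into a single triton_* combo entry.
    print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=30))
```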

jasperjiaguo requested a review from hebiao064 as a code owner, April 2, 2026 22:42

jasperjiaguo: /tag-and-rerun-ci

github-actions bot added the run-ci label, Apr 2, 2026

jasperjiaguo: /rerun-failed-ci

jasperjiaguo: /tag-and-rerun-ci

jasperjiaguo: /rerun-failed-checks

jasperjiaguo: /rerun-failed-ci

2 similar comments

jasperjiaguo force-pushed the jiaguo/enable-combo-kernels branch from 28c6263 to 5e8404d, April 7, 2026 08:01
jasperjiaguo: /rerun-failed-ci

jasperjiaguo force-pushed the jiaguo/enable-combo-kernels branch 2 times, most recently from 2d4da8e to 3fd92ee, April 7, 2026 18:22
Qiaolin-Yu requested a review from ispobock, April 7, 2026 19:51
jasperjiaguo: /rerun-failed-ci

8 similar comments

jasperjiaguo added a commit to jasperjiaguo/sglang that referenced this pull request Apr 8, 2026
Replace nvjet (cooperative-algorithm) FP8 GEMMs with CUTLASS kernels to
eliminate the 4-byte memset that nvjet requires before each GEMM launch.
This memset creates ~20us pipeline bubbles between triton fusion kernels
and GEMM kernels, totaling ~2.2ms per forward pass (112 GEMMs).

Changes:
- Add Sm90ColOrScalarBroadcast/Sm90RowOrScalarBroadcast custom EVT nodes
  (adapted from vLLM) that handle per-tensor scalar scales natively via
  runtime bool flag, eliminating expand+contiguous overhead
- Add out= parameter to fp8_scaled_mm for zero-copy GEMM output
- Add runtime wrapper that replaces extern_kernels._scaled_mm with
  CUTLASS fp8_scaled_mm, preserving inductor triton fusion
- Update fake tensor implementation for torch.compile compatibility

Profile results (7k token FP8 embedding, H200):
- Memset: 112 -> 0
- nvjet GEMMs: 112 -> 0 (replaced by CUTLASS)
- Total GPU kernels: 357 (unchanged, fusion preserved)

Benchmark (Qwen3-0.6B FP8, production traffic distribution):
- Baseline (main): 30.77 items/sec
- With PRs sgl-project#21734+sgl-project#21971+sgl-project#21977: 37.77 items/sec
- + This PR (CUTLASS): 38.77 items/sec (+26% vs baseline)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
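
A rough sketch of the wrapper idea this commit message describes (every name here apart from the torch builtins is a hypothetical stand-in; the real integration lives in the referenced commit):

```python
import torch

def make_cutlass_scaled_mm(fp8_scaled_mm):
    """Build a drop-in replacement for the extern _scaled_mm call emitted in
    inductor-generated wrapper code, routing FP8 GEMMs to a CUTLASS kernel
    instead of nvjet. `fp8_scaled_mm` stands in for the CUTLASS entry point
    named in the commit message."""
    def scaled_mm(a, b, scale_a, scale_b, bias=None,
                  out_dtype=torch.bfloat16, out=None):
        if out is None:
            # Allocate only when the caller did not hand us an out= buffer;
            # when inductor does, the GEMM writes into it zero-copy.
            out = torch.empty(a.shape[0], b.shape[1],
                              dtype=out_dtype, device=a.device)
        fp8_scaled_mm(a, b, scale_a, scale_b, out_dtype, bias, out=out)
        return out
    return scaled_mm
```

Patching at this extern-kernel boundary, rather than rewriting the compiled graph, is what keeps the surrounding triton fusion intact (hence the unchanged 357-kernel count).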
jasperjiaguo: /rerun-failed-ci

jasperjiaguo: /rerun-failed-ci

jasperjiaguo added the same commit to jasperjiaguo/sglang, referencing this pull request, 28 more times between Apr 14 and May 9, 2026 (commit message identical to the above).

yhyang201 pushed a commit to yhyang201/sglang that referenced this pull request Apr 22, 2026