Add offline auto-tuning for LoRA CSGMV kernel #20391

Merged
Fridge003 merged 8 commits into sgl-project:main from satyamk7054:satyamk/lora-csgmv-tuning
Apr 10, 2026

Conversation

@satyamk7054
Contributor

Motivation

Add an offline auto-tuning script for the LoRA CSGMV shrink / expand kernels (similar to the existing MoE auto-tuning).

On H200 with Qwen3-Embedding-0.6B (rank=64), tuning yields a 2-3x speedup on shrink kernels and a 1.1-1.5x speedup on expand kernels.

Modifications

  • Add lora_tuning_config.py: config loader with an LRU cache, following the same pattern as the MoE config loader (a minimal sketch follows this list)
  • Add tune_lora_csgmv.py: offline script that generates the tuned configs
  • Update kernel invocations to use the loaded configs (falling back to the existing defaults); maxnregs improved occupancy, so it is also passed as a launch kwarg
  • Add a unit test for the loading logic:
python -m unittest test.manual.lora.test_lora_tuning_config -v
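
For reference, a minimal sketch of what such a loader could look like. The file layout, function names, and key scheme below are illustrative assumptions, not the exact code in lora_tuning_config.py:

```python
# Hypothetical sketch of an LRU-cached config loader; names and file layout
# are assumptions for illustration (see lora_tuning_config.py for the real code).
import functools
import json
import os
from typing import Optional

import torch
import triton

CONFIG_DIR = os.path.join(os.path.dirname(__file__), "csgmv_configs")


@functools.lru_cache(maxsize=None)
def get_tuned_config(kernel_name: str, dim: int) -> Optional[dict]:
    """Return the tuned launch config for (kernel_name, dim), or None if absent."""
    device_name = torch.cuda.get_device_name().replace(" ", "_")
    path = os.path.join(
        CONFIG_DIR, triton.__version__, f"{kernel_name}_{device_name}.json"
    )
    if not os.path.exists(path):
        return None  # caller falls back to the upstream defaults
    with open(path) as f:
        configs = json.load(f)  # e.g. {"1024": {"BLOCK_N": 64, ...}, ...}
    return configs.get(str(dim))
```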

Fallback behavior

When no tuned config is found (i.e., no config file exists for the current GPU/model/Triton version), the kernels fall back to the original upstream defaults.
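
In call-site terms the fallback is just a lookup with a default. A minimal sketch, reusing the hypothetical get_tuned_config helper from the previous snippet (the default values below are placeholders, not the real upstream constants):

```python
# Placeholder defaults for illustration; the real upstream defaults live in the kernel code.
DEFAULT_SHRINK_CONFIG = {"BLOCK_N": 64, "BLOCK_K": 64, "num_warps": 4, "num_stages": 3}


def resolve_launch_config(kernel_name: str, dim: int, default: dict) -> dict:
    """Return the tuned config for (kernel_name, dim) if one was generated,
    otherwise fall back to the original upstream default."""
    tuned = get_tuned_config(kernel_name, dim)  # None when no config file matches
    return tuned if tuned is not None else default


# The resulting dict is then forwarded into the Triton launch as kwargs, e.g.
# kernel[grid](..., **resolve_launch_config("csgmv_shrink", k_dim, DEFAULT_SHRINK_CONFIG))
```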

Usage

# Tune for a model (auto-derives all layer dims)
python benchmark/kernels/lora_csgmv/tune_lora_csgmv.py \
    --model Qwen/Qwen3-Embedding-0.6B --rank 64

# Configs saved to python/sglang/srt/lora/triton_ops/csgmv_configs/<triton_version>/
# Server automatically picks them up with --lora-backend csgmv
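
The on-disk schema is not spelled out in this PR, but conceptually each generated file maps a problem size to a launch config. A purely illustrative example of what one entry might look like, with field names assumed from the launch parameters tuned here (block sizes, num_warps, num_stages); the real key scheme likely also encodes chunk size and other dimensions:

```python
# Hypothetical contents of a generated config file, shown as a Python dict.
# Keys and fields are assumptions for illustration only.
example_csgmv_shrink_config = {
    "1024": {"BLOCK_N": 64, "BLOCK_K": 64, "num_warps": 4, "num_stages": 3},
    "2048": {"BLOCK_N": 64, "BLOCK_K": 128, "num_warps": 4, "num_stages": 3},
}
```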

Accuracy Tests

No changes to kernel computation logic — only block size and launch params are tuned. The kernels produce identical outputs with different block sizes (verified by existing LoRA correctness tests).
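
This kind of claim can be sanity-checked by running the same kernel under two launch configs and comparing outputs. A minimal sketch, where run_csgmv_shrink is a hypothetical stand-in for whichever Python wrapper actually launches the shrink kernel:

```python
import torch


def check_config_invariance(run_csgmv_shrink, x, lora_a, default_cfg, tuned_cfg):
    """Outputs must match up to float accumulation order:
    only launch parameters change, not the math."""
    out_default = run_csgmv_shrink(x, lora_a, config=default_cfg)
    out_tuned = run_csgmv_shrink(x, lora_a, config=tuned_cfg)
    torch.testing.assert_close(out_tuned, out_default, rtol=1e-3, atol=1e-3)
```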

Benchmarking and Profiling

Kernel-level tuning results (H200, Triton 3.5.1, Qwen3-Embedding-0.6B, rank=64)

| layer | kernel | K/dim | chunk | baseline | tuned | speedup | config |
|---|---|---|---|---|---|---|---|
| qkv | shrink | 1024 | 16 | 0.228ms | 0.115ms | 1.98x | N=64, K=64, w4, s3 |
| qkv | shrink | 1024 | 32 | 0.153ms | 0.089ms | 1.72x | N=64, K=64, w4, s3 |
| qkv | shrink | 1024 | 64 | 0.191ms | 0.074ms | 2.56x | N=64, K=64, w8, s3 |
| qkv | shrink | 1024 | 128 | 0.230ms | 0.071ms | 3.24x | N=64, K=64, w8, s3 |
| qkv | expand | 4096 | 16 | 1.651ms | 1.170ms | 1.41x | N=64, K=32, w4, s3, mr160 |
| qkv | expand | 4096 | 32 | 1.083ms | 0.827ms | 1.31x | N=64, K=32, w4, s3 |
| qkv | expand | 4096 | 64 | 0.976ms | 0.753ms | 1.30x | N=64, K=32, w4, s1, mr160 |
| qkv | expand | 4096 | 128 | 0.766ms | 0.721ms | 1.06x | N=64, K=16, w8, s2, mr128 |
| o_proj | shrink | 2048 | 16 | 0.152ms | 0.077ms | 1.96x | N=64, K=128, w4, s3 |
| o_proj | shrink | 2048 | 32 | 0.106ms | 0.066ms | 1.60x | N=64, K=64, w4, s3 |
| o_proj | shrink | 2048 | 64 | 0.118ms | 0.059ms | 2.00x | N=64, K=64, w4, s3 |
| o_proj | shrink | 2048 | 128 | 0.145ms | 0.060ms | 2.44x | N=64, K=64, w8, s4 |
| o_proj | expand | 1024 | 16 | 0.382ms | 0.251ms | 1.52x | N=64, K=32, w4, s1, mr128 |
| o_proj | expand | 1024 | 32 | 0.243ms | 0.176ms | 1.38x | N=64, K=32, w4, s2 |
| o_proj | expand | 1024 | 64 | 0.206ms | 0.157ms | 1.31x | N=64, K=32, w4, s2, mr160 |
| o_proj | expand | 1024 | 128 | 0.163ms | 0.150ms | 1.09x | N=64, K=16, w8, s2, mr128 |
| gate_up | shrink | 1024 | 16 | 0.158ms | 0.080ms | 1.99x | N=64, K=64, w4, s3 |
| gate_up | shrink | 1024 | 32 | 0.104ms | 0.063ms | 1.66x | N=64, K=64, w4, s3 |
| gate_up | shrink | 1024 | 64 | 0.122ms | 0.047ms | 2.62x | N=64, K=64, w4, s3 |
| gate_up | shrink | 1024 | 128 | 0.146ms | 0.045ms | 3.22x | N=64, K=64, w4, s4 |
| gate_up | expand | 6144 | 16 | 2.182ms | 1.408ms | 1.55x | N=64, K=32, w4, s2, mr112 |
| gate_up | expand | 6144 | 32 | 1.332ms | 0.959ms | 1.39x | N=64, K=32, w4, s2 |
| gate_up | expand | 6144 | 64 | 1.139ms | 0.850ms | 1.34x | N=64, K=32, w4, s3, mr160 |
| gate_up | expand | 6144 | 128 | 0.902ms | 0.823ms | 1.10x | N=64, K=16, w8, s3, mr128 |
| down_proj | shrink | 3072 | 16 | 0.213ms | 0.104ms | 2.05x | N=64, K=128, w4, s3 |
| down_proj | shrink | 3072 | 32 | 0.148ms | 0.090ms | 1.65x | N=64, K=64, w4, s4 |
| down_proj | shrink | 3072 | 64 | 0.155ms | 0.073ms | 2.11x | N=64, K=64, w4, s3 |
| down_proj | shrink | 3072 | 128 | 0.197ms | 0.075ms | 2.64x | N=64, K=64, w8, s3 |
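
For context, the general shape of an offline sweep like this one is a grid of candidate launch configs timed on a fixed problem size. The candidate grid and helper names below are illustrative; the table's N/K/w/s/mr shorthand appears to correspond to block sizes, num_warps, num_stages, and the max-registers option:

```python
import itertools

from triton.testing import do_bench


def sweep_configs(run_kernel):
    """Time each candidate launch config and keep the fastest.
    `run_kernel(cfg)` is a stand-in for a closure that launches the CSGMV
    kernel with the given config on a fixed problem size."""
    candidates = [
        {"BLOCK_N": bn, "BLOCK_K": bk, "num_warps": w, "num_stages": s}
        for bn, bk, w, s in itertools.product(
            [32, 64], [16, 32, 64, 128], [4, 8], [1, 2, 3, 4]
        )
    ]
    best_cfg, best_ms = None, float("inf")
    for cfg in candidates:
        ms = do_bench(lambda: run_kernel(cfg))  # median latency in ms
        if ms < best_ms:
            best_cfg, best_ms = cfg, ms
    return best_cfg, best_ms
```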

Per-layer net savings at chunk_size=128 (shrink + expand combined)

| layer | baseline total | tuned total | net speedup |
|---|---|---|---|
| qkv | 0.996ms | 0.792ms | 1.26x |
| o_proj | 0.308ms | 0.210ms | 1.47x |
| gate_up | 1.048ms | 0.868ms | 1.21x |
| down_proj | 0.197ms | 0.075ms | 2.63x |
| Total per layer | 2.549ms | 1.945ms | 1.31x |
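
For example, the qkv row is just the sum of the chunk=128 shrink and expand timings above: 0.230 + 0.766 = 0.996ms baseline vs 0.071 + 0.721 = 0.792ms tuned, i.e. roughly a 1.26x net speedup.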

E2E benchmark result

Launch Server

python -m sglang.launch_server \
    --tp-size 1 \
    --model-path $BASE_MODEL \
    --is-embedding \
    --host 127.0.0.1 \
    --port 30000 \
    --disable-radix-cache \
    --chunked-prefill-size -1 \
    --max-prefill-tokens 32768 \
    --enable-lora \
    --lora-paths v1=$UPDATED_ADAPTER \
    --lora-backend csgmv

python -m sglang.bench_serving \
    --backend sglang-embedding \
    --host 127.0.0.1 \
    --port 30000 \
    --model $BASE_MODEL \
    --dataset-name random \
    --random-input-len <num-tokens> \
    --random-range-ratio 1.0 \
    --num-prompts 120 \
    --request-rate <rps> \
    --lora-name v1
| Config | chunk_size | Req throughput (req/s) | Tok throughput (tok/s) |
|---|---|---|---|
| main | 16 | 30.67 | 188,395 |
| tuned | 16 | 34.55 (+12.6%) | 212,231 (+12.7%) |
| main | 128 | 37.17 (+21.2%) | 228,365 (+21.2%) |
| tuned | 128 | 38.13 (+24.3%) | 234,227 (+24.3%) |

All % gains are relative to main @ chunk_size=16 (the current default behavior).
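For example, the tuned chunk_size=16 row: 34.55 / 30.67 ≈ 1.126, i.e. +12.6%.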



@github-actions bot added the lora label on Mar 12, 2026
@zminglei
Collaborator

/tag-and-rerun-ci

@satyamk7054
Contributor Author

satyamk7054 commented Mar 16, 2026

/rerun-failed-ci

Comment thread on python/sglang/srt/lora/triton_ops/lora_tuning_config.py (outdated)
Collaborator

@Fridge003 left a comment


Nice feature

@satyamk7054
Contributor Author

satyamk7054 commented Apr 9, 2026

/rerun-failed-ci try 2

@Fridge003 merged commit 059b287 into sgl-project:main on Apr 10, 2026
171 of 216 checks passed
Fridge003 pushed a commit that referenced this pull request Apr 11, 2026
Co-authored-by: Satyam Kumar <satyamk@linkedin.com>
pyc96 pushed a commit to pyc96/sglang that referenced this pull request Apr 14, 2026
Co-authored-by: Satyam Kumar <satyamk@linkedin.com>
yushengsu-thu pushed a commit that referenced this pull request Apr 17, 2026
Co-authored-by: Satyam Kumar <satyamk@linkedin.com>
yhyang201 pushed a commit to yhyang201/sglang that referenced this pull request Apr 22, 2026
Co-authored-by: Satyam Kumar <satyamk@linkedin.com>
@satyamk7054 deleted the satyamk/lora-csgmv-tuning branch on April 25, 2026