Add offline auto-tuning for LoRA CSGMV kernel #20391

Merged
Fridge003 merged 8 commits into sgl-project:main from satyamk7054:satyamk/lora-csgmv-tuning
Apr 10, 2026

Conversation

@satyamk7054
Contributor

Motivation

Add an offline auto-tuning script for the LoRA CSGMV shrink / expand kernels (similar to the existing MoE auto-tuning).

On H200 with Qwen3-Embedding-0.6B (rank=64), tuning yields a 2-3x speedup on shrink kernels and a 1.1-1.5x speedup on expand kernels.

Modifications

  • Add lora_tuning_config.py: config loader with an LRU cache, following the same pattern as the MoE config loader (a minimal sketch follows this list)
  • Add tune_lora_csgmv.py: offline script that generates the tuned configs
  • Update kernel invocations to use the loaded configs (falling back to the existing defaults); maxnregs improved occupancy, so it is also passed as a launch kwarg
  • Add a unit test for the loading logic:
python -m unittest test.manual.lora.test_lora_tuning_config -v
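
For reference, a minimal sketch of what such a loader could look like. The file layout, function names, and key scheme below are illustrative assumptions, not the exact code in lora_tuning_config.py:

```python
# Hypothetical sketch of an LRU-cached config loader; names and file layout
# are assumptions for illustration (see lora_tuning_config.py for the real code).
import functools
import json
import os
from typing import Optional

import torch
import triton

CONFIG_DIR = os.path.join(os.path.dirname(__file__), "csgmv_configs")


@functools.lru_cache(maxsize=None)
def get_tuned_config(kernel_name: str, dim: int) -> Optional[dict]:
    """Return the tuned launch config for (kernel_name, dim), or None if absent."""
    device_name = torch.cuda.get_device_name().replace(" ", "_")
    path = os.path.join(
        CONFIG_DIR, triton.__version__, f"{kernel_name}_{device_name}.json"
    )
    if not os.path.exists(path):
        return None  # caller falls back to the upstream defaults
    with open(path) as f:
        configs = json.load(f)  # e.g. {"1024": {"BLOCK_N": 64, ...}, ...}
    return configs.get(str(dim))
```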

Fallback behavior

When no tuned config is found (i.e., no config file exists for the current GPU/model/Triton version), the kernels fall back to the original upstream defaults.
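
In call-site terms the fallback is just a lookup with a default. A minimal sketch, reusing the hypothetical get_tuned_config helper from the previous snippet (the default values below are placeholders, not the real upstream constants):

```python
# Placeholder defaults for illustration; the real upstream defaults live in the kernel code.
DEFAULT_SHRINK_CONFIG = {"BLOCK_N": 64, "BLOCK_K": 64, "num_warps": 4, "num_stages": 3}


def resolve_launch_config(kernel_name: str, dim: int, default: dict) -> dict:
    """Return the tuned config for (kernel_name, dim) if one was generated,
    otherwise fall back to the original upstream default."""
    tuned = get_tuned_config(kernel_name, dim)  # None when no config file matches
    return tuned if tuned is not None else default


# The resulting dict is then forwarded into the Triton launch as kwargs, e.g.
# kernel[grid](..., **resolve_launch_config("csgmv_shrink", k_dim, DEFAULT_SHRINK_CONFIG))
```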

Usage

# Tune for a model (auto-derives all layer dims)
python benchmark/kernels/lora_csgmv/tune_lora_csgmv.py \
    --model Qwen/Qwen3-Embedding-0.6B --rank 64

# Configs saved to python/sglang/srt/lora/triton_ops/csgmv_configs/<triton_version>/
# Server automatically picks them up with --lora-backend csgmv
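
The on-disk schema is not spelled out in this PR, but conceptually each generated file maps a problem size to a launch config. A purely illustrative example of what one entry might look like, with field names assumed from the launch parameters tuned here (block sizes, num_warps, num_stages); the real key scheme likely also encodes chunk size and other dimensions:

```python
# Hypothetical contents of a generated config file, shown as a Python dict.
# Keys and fields are assumptions for illustration only.
example_csgmv_shrink_config = {
    "1024": {"BLOCK_N": 64, "BLOCK_K": 64, "num_warps": 4, "num_stages": 3},
    "2048": {"BLOCK_N": 64, "BLOCK_K": 128, "num_warps": 4, "num_stages": 3},
}
```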

Accuracy Tests

No changes to kernel computation logic — only block size and launch params are tuned. The kernels produce identical outputs with different block sizes (verified by existing LoRA correctness tests).
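
This kind of claim can be sanity-checked by running the same kernel under two launch configs and comparing outputs. A minimal sketch, where run_csgmv_shrink is a hypothetical stand-in for whichever Python wrapper actually launches the shrink kernel:

```python
import torch


def check_config_invariance(run_csgmv_shrink, x, lora_a, default_cfg, tuned_cfg):
    """Outputs must match up to float accumulation order:
    only launch parameters change, not the math."""
    out_default = run_csgmv_shrink(x, lora_a, config=default_cfg)
    out_tuned = run_csgmv_shrink(x, lora_a, config=tuned_cfg)
    torch.testing.assert_close(out_tuned, out_default, rtol=1e-3, atol=1e-3)
```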

Benchmarking and Profiling

Kernel-level tuning results (H200, Triton 3.5.1, Qwen3-Embedding-0.6B, rank=64)

| layer | kernel | K/dim | chunk | baseline | tuned | speedup | config |
|---|---|---|---|---|---|---|---|
| qkv | shrink | 1024 | 16 | 0.228ms | 0.115ms | 1.98x | N=64, K=64, w4, s3 |
| qkv | shrink | 1024 | 32 | 0.153ms | 0.089ms | 1.72x | N=64, K=64, w4, s3 |
| qkv | shrink | 1024 | 64 | 0.191ms | 0.074ms | 2.56x | N=64, K=64, w8, s3 |
| qkv | shrink | 1024 | 128 | 0.230ms | 0.071ms | 3.24x | N=64, K=64, w8, s3 |
| qkv | expand | 4096 | 16 | 1.651ms | 1.170ms | 1.41x | N=64, K=32, w4, s3, mr160 |
| qkv | expand | 4096 | 32 | 1.083ms | 0.827ms | 1.31x | N=64, K=32, w4, s3 |
| qkv | expand | 4096 | 64 | 0.976ms | 0.753ms | 1.30x | N=64, K=32, w4, s1, mr160 |
| qkv | expand | 4096 | 128 | 0.766ms | 0.721ms | 1.06x | N=64, K=16, w8, s2, mr128 |
| o_proj | shrink | 2048 | 16 | 0.152ms | 0.077ms | 1.96x | N=64, K=128, w4, s3 |
| o_proj | shrink | 2048 | 32 | 0.106ms | 0.066ms | 1.60x | N=64, K=64, w4, s3 |
| o_proj | shrink | 2048 | 64 | 0.118ms | 0.059ms | 2.00x | N=64, K=64, w4, s3 |
| o_proj | shrink | 2048 | 128 | 0.145ms | 0.060ms | 2.44x | N=64, K=64, w8, s4 |
| o_proj | expand | 1024 | 16 | 0.382ms | 0.251ms | 1.52x | N=64, K=32, w4, s1, mr128 |
| o_proj | expand | 1024 | 32 | 0.243ms | 0.176ms | 1.38x | N=64, K=32, w4, s2 |
| o_proj | expand | 1024 | 64 | 0.206ms | 0.157ms | 1.31x | N=64, K=32, w4, s2, mr160 |
| o_proj | expand | 1024 | 128 | 0.163ms | 0.150ms | 1.09x | N=64, K=16, w8, s2, mr128 |
| gate_up | shrink | 1024 | 16 | 0.158ms | 0.080ms | 1.99x | N=64, K=64, w4, s3 |
| gate_up | shrink | 1024 | 32 | 0.104ms | 0.063ms | 1.66x | N=64, K=64, w4, s3 |
| gate_up | shrink | 1024 | 64 | 0.122ms | 0.047ms | 2.62x | N=64, K=64, w4, s3 |
| gate_up | shrink | 1024 | 128 | 0.146ms | 0.045ms | 3.22x | N=64, K=64, w4, s4 |
| gate_up | expand | 6144 | 16 | 2.182ms | 1.408ms | 1.55x | N=64, K=32, w4, s2, mr112 |
| gate_up | expand | 6144 | 32 | 1.332ms | 0.959ms | 1.39x | N=64, K=32, w4, s2 |
| gate_up | expand | 6144 | 64 | 1.139ms | 0.850ms | 1.34x | N=64, K=32, w4, s3, mr160 |
| gate_up | expand | 6144 | 128 | 0.902ms | 0.823ms | 1.10x | N=64, K=16, w8, s3, mr128 |
| down_proj | shrink | 3072 | 16 | 0.213ms | 0.104ms | 2.05x | N=64, K=128, w4, s3 |
| down_proj | shrink | 3072 | 32 | 0.148ms | 0.090ms | 1.65x | N=64, K=64, w4, s4 |
| down_proj | shrink | 3072 | 64 | 0.155ms | 0.073ms | 2.11x | N=64, K=64, w4, s3 |
| down_proj | shrink | 3072 | 128 | 0.197ms | 0.075ms | 2.64x | N=64, K=64, w8, s3 |
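
For context, the general shape of an offline sweep like this one is a grid of candidate launch configs timed on a fixed problem size. The candidate grid and helper names below are illustrative; the table's N/K/w/s/mr shorthand appears to correspond to block sizes, num_warps, num_stages, and the max-registers option:

```python
import itertools

from triton.testing import do_bench


def sweep_configs(run_kernel):
    """Time each candidate launch config and keep the fastest.
    `run_kernel(cfg)` is a stand-in for a closure that launches the CSGMV
    kernel with the given config on a fixed problem size."""
    candidates = [
        {"BLOCK_N": bn, "BLOCK_K": bk, "num_warps": w, "num_stages": s}
        for bn, bk, w, s in itertools.product(
            [32, 64], [16, 32, 64, 128], [4, 8], [1, 2, 3, 4]
        )
    ]
    best_cfg, best_ms = None, float("inf")
    for cfg in candidates:
        ms = do_bench(lambda: run_kernel(cfg))  # median latency in ms
        if ms < best_ms:
            best_cfg, best_ms = cfg, ms
    return best_cfg, best_ms
```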

Per-layer net savings at chunk_size=128 (shrink + expand combined)

| layer | baseline total | tuned total | net speedup |
|---|---|---|---|
| qkv | 0.996ms | 0.792ms | 1.26x |
| o_proj | 0.308ms | 0.210ms | 1.47x |
| gate_up | 1.048ms | 0.868ms | 1.21x |
| down_proj | 0.197ms | 0.075ms | 2.63x |
| Total per layer | 2.549ms | 1.945ms | 1.31x |
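
For example, the qkv row is just the sum of the chunk=128 shrink and expand timings above: 0.230 + 0.766 = 0.996ms baseline vs 0.071 + 0.721 = 0.792ms tuned, i.e. roughly a 1.26x net speedup.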

E2E benchmark result

Launch Server

python -m sglang.launch_server \
    --tp-size 1 \
    --model-path $BASE_MODEL \
    --is-embedding \
    --host 127.0.0.1 \
    --port 30000 \
    --disable-radix-cache \
    --chunked-prefill-size -1 \
    --max-prefill-tokens 32768 \
    --enable-lora \
    --lora-paths v1=$UPDATED_ADAPTER \
    --lora-backend csgmv

python -m sglang.bench_serving \
    --backend sglang-embedding \
    --host 127.0.0.1 \
    --port 30000 \
    --model $BASE_MODEL \
    --dataset-name random \
    --random-input-len <num-tokens> \
    --random-range-ratio 1.0 \
    --num-prompts 120 \
    --request-rate <rps> \
    --lora-name v1
| Config | chunk_size | Req throughput (req/s) | Tok throughput (tok/s) |
|---|---|---|---|
| main | 16 | 30.67 | 188,395 |
| tuned | 16 | 34.55 (+12.6%) | 212,231 (+12.7%) |
| main | 128 | 37.17 (+21.2%) | 228,365 (+21.2%) |
| tuned | 128 | 38.13 (+24.3%) | 234,227 (+24.3%) |

All % gains are relative to main @ chunk_size=16 (the current default behavior).
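For example, the tuned chunk_size=16 row: 34.55 / 30.67 ≈ 1.126, i.e. +12.6%.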



@github-actions bot added the lora label on Mar 12, 2026
@zminglei
Collaborator

/tag-and-rerun-ci

@satyamk7054
Contributor Author

satyamk7054 commented Mar 16, 2026

/rerun-failed-ci

Comment thread on python/sglang/srt/lora/triton_ops/lora_tuning_config.py (outdated)
Collaborator

@Fridge003 left a comment


Nice feature

@satyamk7054
Contributor Author

satyamk7054 commented Apr 9, 2026

/rerun-failed-ci try 2

@Fridge003 merged commit 059b287 into sgl-project:main on Apr 10, 2026
171 of 216 checks passed
Fridge003 pushed a commit that referenced this pull request Apr 11, 2026
Co-authored-by: Satyam Kumar <satyamk@linkedin.com>
pyc96 pushed a commit to pyc96/sglang that referenced this pull request Apr 14, 2026
Co-authored-by: Satyam Kumar <satyamk@linkedin.com>
yushengsu-thu pushed a commit that referenced this pull request Apr 17, 2026
Co-authored-by: Satyam Kumar <satyamk@linkedin.com>
yhyang201 pushed a commit to yhyang201/sglang that referenced this pull request Apr 22, 2026
Co-authored-by: Satyam Kumar <satyamk@linkedin.com>
@satyamk7054 deleted the satyamk/lora-csgmv-tuning branch on April 25, 2026