Use torch.addmm instead of separate mm and add_ calls for LoRA torch.native (#20562)
Conversation
Replace separate mm + scalar multiply in sgemm_lora_a_fwd and mm + add_ in sgemm_lora_b_fwd with single torch.addmm cuBLAS calls. Store pinned CPU scaling tensor in batch info to avoid GPU->CPU sync.
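Concretely, the fusion folds the LoRA scaling into addmm's alpha and the residual add into addmm's accumulator input. A minimal sketch of the before/after pattern, with assumed shapes and names (the real change lives in sgemm_lora_a_fwd / sgemm_lora_b_fwd):

```python
import torch

# Illustrative shapes/names only; not the actual PR diff.
device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.randn(256, 4096, device=device)     # token activations
w_a = torch.randn(4096, 64, device=device)    # LoRA A weights (rank 64)
w_b = torch.randn(64, 4096, device=device)    # LoRA B weights
base = torch.randn(256, 4096, device=device)  # base-layer output
s = 0.5                                       # LoRA scaling

# Before: two kernel launches per projection.
a_out = torch.mm(x, w_a) * s                 # mm, then scalar multiply
out_unfused = base + torch.mm(a_out, w_b)    # mm, then add (add_ in-place in the original)

# After: one cuBLAS call per projection.
# With beta=0, addmm ignores the (uninitialized) input tensor entirely.
a_out = torch.addmm(torch.empty(256, 64, device=device), x, w_a, beta=0, alpha=s)
out = torch.addmm(base, a_out, w_b)          # mm + add fused into one call
```

Each fused call saves a kernel launch and a full read-modify-write pass over the output tensor.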
🔍 SGLang Domain Expert Review
PR: Use torch.addmm instead of separate mm and add_ calls for LoRA torch.native (#20562)
Routing: lora
Review — Risk Level: Low. Summary: clean optimization replacing separate mm and add_ calls with a single torch.addmm.
Generated by SGLang domain expert review agents.
Motivation
Use torch.addmm instead of separate mm and add_ calls in the LoRA torch.native backend to improve performance.
The torch-native backend performs better than csgmv when the number of LoRA adapters is small (4-8) and the inputs are long prompts (large token counts for prefill / embeddings).
Modifications
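The other half of the change, per the summary above, is keeping the LoRA scaling reachable from the CPU so that building addmm's alpha argument never blocks on a device-to-host copy. A rough sketch of that idea, with hypothetical field names (not SGLang's actual batch-info API):

```python
import torch

class LoRABatchInfo:
    """Illustrative batch info; field names are assumptions, not SGLang's real class."""

    def __init__(self, scalings: list[float]):
        device = "cuda" if torch.cuda.is_available() else "cpu"
        # On-device copy for kernels that consume scalings on the GPU.
        self.scalings = torch.tensor(scalings, dtype=torch.float32, device=device)
        # Pinned CPU mirror, written once at batch-prep time. Reading a value
        # from it (e.g. via .item()) touches only host memory, so it never
        # synchronizes the device. (Pinning requires a CUDA-enabled build.)
        self.scalings_cpu = torch.tensor(scalings, dtype=torch.float32,
                                         pin_memory=torch.cuda.is_available())

info = LoRABatchInfo([0.5, 1.0])
alpha = info.scalings_cpu[0].item()  # sync-free host read, usable as addmm's alpha
```

If the scaling lived only on the GPU, the .item() read would force a GPU->CPU synchronization on every LoRA layer.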
Testing
Used an in-house model + adapter to test embedding LoRA similarity with HF
Ran the test below with torch-native. The LoRA adapter in the test updates embed_tokens, which isn't supported by the torch-native backend yet.
```
python test_lora_ops.py -v
test_sgemm_lora_a_fwd (__main__.TestLoraOps) ... [CI Test Method] TestLoraOps.test_sgemm_lora_a_fwd
ok
test_sgemm_lora_a_fwd_expand (__main__.TestLoraOps) ... [CI Test Method] TestLoraOps.test_sgemm_lora_a_fwd_expand
ok
test_sgemm_lora_b_fwd (__main__.TestLoraOps) ... [CI Test Method] TestLoraOps.test_sgemm_lora_b_fwd
ok
test_sgemm_lora_b_fwd_expand (__main__.TestLoraOps) ... [CI Test Method] TestLoraOps.test_sgemm_lora_b_fwd_expand
ok
----------------------------------------------------------------------
Ran 4 tests in 0.019s

OK
```
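What these ops tests boil down to is the fused addmm path matching the unfused mm-plus-scale and mm-plus-add_ paths; a standalone check in that spirit, with assumed shapes (not the actual test file):

```python
import torch

def test_fused_equals_unfused():
    device = "cuda" if torch.cuda.is_available() else "cpu"
    x = torch.randn(128, 512, device=device)
    w = torch.randn(512, 64, device=device)
    base = torch.randn(128, 64, device=device)
    s = 0.5

    # sgemm_lora_a_fwd-style path: mm + scalar multiply vs. addmm with alpha.
    ref_a = torch.mm(x, w) * s
    out_a = torch.addmm(torch.empty(128, 64, device=device), x, w, beta=0, alpha=s)
    torch.testing.assert_close(out_a, ref_a)

    # sgemm_lora_b_fwd-style path: mm + add vs. addmm.
    ref_b = base + torch.mm(x, w)
    out_b = torch.addmm(base, x, w)
    torch.testing.assert_close(out_b, ref_b)

test_fused_equals_unfused()
```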
Performance
BASE_MODEL is Qwen3-0.6B-Embeddings, and the adapters are rank 64, targeting qkv and o_proj.
```bash
python -m sglang.bench_serving --backend sglang-embedding --host 127.0.0.1 --port 30000 \
    --model $BASE_MODEL --dataset-name random --random-input-len 6144 \
    --random-range-ratio 1.0 --num-prompts 120 --request-rate 45 \
    --lora-name v1 v2 v3 v4
```
Checklist