Use torch.addmm instead of separate mm and add_ calls for LoRA torch.native #20562

Merged
Fridge003 merged 3 commits into sgl-project:main from satyamk7054:perf/torch-lora-addmm-fusion on Mar 26, 2026

Conversation

@satyamk7054 (Contributor) commented Mar 14, 2026

Motivation

Use torch.addmm instead of separate mm and add_ calls in the LoRA torch.native backend to improve performance.

The torch-native backend performs better than csgmv when the number of LoRAs is small (4-8) and the inputs are large prompts (long token sequences for prefill / embeddings).

Modifications

  • Update the torch-native LoRA backend to use torch.addmm (a before/after sketch follows).

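A minimal before/after sketch of the fusion (editor's illustration: the shapes, names, and scaling value are placeholders; the real backend operates per-segment on slices, as the review below describes):

```python
import torch

x = torch.randn(512, 4096)     # segment of input tokens
w_a = torch.randn(64, 4096)    # LoRA A weight (rank 64)
w_b = torch.randn(4096, 64)    # LoRA B weight
base = torch.randn(512, 4096)  # base-layer output to accumulate into
scaling = 2.0                  # per-adapter scaling factor

# lora_a before: two kernels (mm, then a scalar multiply).
a_ref = torch.mm(x, w_a.T) * scaling
# lora_a after: one fused addmm; beta=0 means a_out's initial contents are ignored.
a_out = torch.empty(512, 64)
torch.addmm(a_out, x, w_a.T, beta=0, alpha=scaling, out=a_out)

# lora_b before: mm followed by an in-place add_.
b_ref = base.clone().add_(torch.mm(a_ref, w_b.T))
# lora_b after: one fused addmm; beta=1 accumulates into the existing values.
b_out = base.clone()
torch.addmm(b_out, a_ref, w_b.T, beta=1, alpha=1, out=b_out)

assert torch.allclose(a_ref, a_out, rtol=1e-4, atol=1e-4)
assert torch.allclose(b_ref, b_out, rtol=1e-4, atol=1e-4)
```

Each fused call computes the same result while saving an intermediate allocation and a kernel launch relative to the two-op version.
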
Testing

Used an in-house model + adapter to test embedding LoRA similarity with HF

Ran the test below with torch-native. The LoRA adapter in the test updates embed_tokens,
which the torch-native backend doesn't support yet.

python test_embedding_lora_support.py TestEmbeddingLoraHFComparison.test_embedding_lora_hf_sglang_similarity

HF vs SGLang LoRA Embedding Comparison:
  Text 0: cosine similarity = 1.000311
  Text 1: cosine similarity = 0.999937
  Average similarity: 1.000124
  Threshold: 0.9999
python test_lora_ops.py -v

test_sgemm_lora_a_fwd (__main__.TestLoraOps) ... [CI Test Method] TestLoraOps.test_sgemm_lora_a_fwd
ok
test_sgemm_lora_a_fwd_expand (__main__.TestLoraOps) ... [CI Test Method] TestLoraOps.test_sgemm_lora_a_fwd_expand
ok
test_sgemm_lora_b_fwd (__main__.TestLoraOps) ... [CI Test Method] TestLoraOps.test_sgemm_lora_b_fwd
ok
test_sgemm_lora_b_fwd_expand (__main__.TestLoraOps) ... [CI Test Method] TestLoraOps.test_sgemm_lora_b_fwd_expand
ok

----------------------------------------------------------------------
Ran 4 tests in 0.019s

OK

Performance

BASE_MODEL is Qwen3-0.6B-Embeddings; the adapters are rank 64, targeting qkv and o_proj.

python -m sglang.bench_serving --backend sglang-embedding --host 127.0.0.1 --port 30000 --model $BASE_MODEL --dataset-name random --random-input-len 6144 --random-range-ratio 1.0 --num-prompts 120 --request-rate 45 --lora-name v1 v2 v3 v4
| Branch | Achieved RPS | Achieved TPS |
| --- | --- | --- |
| Main | 40.12 | 246,507.98 |
| This Change | 41.89 (+4.4%) | 257,343.26 |
| Main (csgmv 128 chunk size) | 37.63 | 231,138.80 |

Checklist

  • Replace the separate mm + scalar multiply in sgemm_lora_a_fwd and the mm + add_ in sgemm_lora_b_fwd with single torch.addmm cuBLAS calls.
  • Store a pinned CPU scaling tensor in the batch info to avoid a GPU->CPU sync (a sketch of this pattern follows).
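
A sketch of the pinned-scalings pattern (editor's illustration, assuming a CUDA device; the tensor and field names are hypothetical, not the exact batch-info layout):

```python
import torch

# Per-adapter scalings live on GPU, but the per-segment loop runs in Python on
# the CPU: indexing a GPU tensor there forces an implicit device-to-host sync
# on every iteration.
scalings = torch.tensor([2.0, 1.0, 0.5, 2.0], device="cuda")

# Batch prep: one pinned CPU copy, i.e. one synchronization per batch.
scalings_cpu = torch.empty(scalings.shape, dtype=scalings.dtype, pin_memory=True)
scalings_cpu.copy_(scalings)

# Segment loop: indexing the CPU tensor is cheap and sync-free.
for lora_idx in range(scalings_cpu.numel()):
    alpha = scalings_cpu[lora_idx]  # 0-dim CPU tensor; .item() also works
    # ... torch.addmm(out_slice, x_slice, w_slice.T, beta=0, alpha=alpha, out=out_slice)
```
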
@github-actions bot added the lora label Mar 14, 2026
@zminglei (Collaborator)

/tag-and-rerun-ci

@satyamk7054 (Contributor, Author) commented Mar 14, 2026

/rerun-failed-ci

@claude-pr-review-bot commented:

🔍 SGLang Domain Expert Review

PR: Use torch.addmm instead of separate mm and add_ calls for LoRA torch.native (#20562)

Routing

  • lora [████████░░] 77% — LoRA adapters, torch/triton ops, multi-adapter serving

lora Review

Risk Level: Low

Summary: Clean optimization replacing separate torch.mm + scalar multiply / add_ with fused torch.addmm calls in the torch-native LoRA backend, plus adding scalings_cpu to avoid implicit GPU-to-CPU sync when indexing a GPU tensor in a CPU-side loop. Both changes are semantically equivalent to the original code.

Issues Found:

  • alpha parameter type for torch.addmm: In sgemm_lora_a_fwd, scaling_tensor[lora_idx] is a 0-dim CPU tensor passed as the alpha argument to torch.addmm operating on GPU tensors. PyTorch accepts this, so it works — but calling .item() to extract a Python float would be more explicit and avoids any edge case in future PyTorch versions. Style nit, not a bug.

Suggestions:

  • Consider alpha=scaling_tensor[lora_idx].item() for clarity in sgemm_lora_a_fwd.
  • In sgemm_lora_b_fwd, beta=1, alpha=1 are the defaults for torch.addmm, so they could be omitted: torch.addmm(out_slice, x_slice, w_slice.T, out=out_slice). Keeping them explicit is also a valid choice for readability.

Looks Good:

  • torch.addmm fusion is a well-known optimization — avoids an intermediate allocation and a separate kernel launch. Using beta=0, alpha=scaling in lora_a and beta=1, alpha=1 in lora_b correctly preserves original semantics (overwrite vs. accumulate).
  • Adding scalings_cpu is consistent with the existing pattern (lora_ranks_cpu, seg_indptr_cpu, seg_lens_cpu, weight_indices_cpu) — all tensors used in the CPU-side loop are now explicitly on CPU, eliminating implicit GPU-to-CPU synchronization.
  • The out=out_slice pattern writing back into a view of output is correct, since tensor views share storage (a minimal check of this follows the review).
  • Performance numbers (+4.4% RPS) are credible for this type of kernel fusion with larger prompt inputs.

Generated by SGLang domain expert review agents.
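
A tiny check of the view-aliasing point above (editor's sketch; shapes are arbitrary):

```python
import torch

output = torch.zeros(6, 4)
x_slice = torch.ones(3, 8)
w_slice = torch.full((4, 8), 0.5)

out_slice = output[2:5]  # a view: shares storage with output
torch.addmm(out_slice, x_slice, w_slice.T, beta=1, alpha=1, out=out_slice)

# Rows 2..4 of the parent tensor now hold x_slice @ w_slice.T (every entry 4.0):
# writing through the view updated output directly, with no copy-back step.
print(output)
```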

@satyamk7054 (Contributor, Author) commented Mar 17, 2026

/rerun-failed-ci try again 4

@satyamk7054 (Contributor, Author)

/rerun-stage stage-c-test-4-gpu-b200 (2)

@zminglei (Collaborator)

/rerun-failed-ci

@satyamk7054 (Contributor, Author)

/rerun-stage stage-c-test-4-gpu-b200 (2)

@satyamk7054 (Contributor, Author)

/rerun-failed-ci

@jasperjiaguo (Contributor) left a comment

LGTM

@Fridge003 Fridge003 merged commit be0cca5 into sgl-project:main Mar 26, 2026
388 of 433 checks passed
satyamk7054 added a commit to satyamk7054/sglang that referenced this pull request Apr 3, 2026
…native (sgl-project#20562)

Co-authored-by: Satyam Kumar <satyamk@linkedin.com>
JustinTong0323 pushed a commit to JustinTong0323/sglang that referenced this pull request Apr 7, 2026
…native (sgl-project#20562)

Co-authored-by: Satyam Kumar <satyamk@linkedin.com>
yhyang201 pushed a commit to yhyang201/sglang that referenced this pull request Apr 22, 2026
…native (sgl-project#20562)

Co-authored-by: Satyam Kumar <satyamk@linkedin.com>
@satyamk7054 satyamk7054 deleted the perf/torch-lora-addmm-fusion branch April 25, 2026 01:38