Use torch.addmm instead of separate mm and add_ calls for LoRA torch.native #20562

Merged
Fridge003 merged 3 commits into sgl-project:main from satyamk7054:perf/torch-lora-addmm-fusion on Mar 26, 2026

Conversation

@satyamk7054 (Contributor) commented Mar 14, 2026

Motivation

Use torch.addmm instead of separate mm and add_ calls in the LoRA torch.native backend to improve performance.

The torch-native backend performs better than csgmv when the number of LoRAs is small (4-8) and the inputs are large prompts (long token sequences for prefill / embeddings).

Modifications

  • Update the torch-native LoRA backend to use torch.addmm (a before/after sketch follows).

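A minimal before/after sketch of the fusion (editor's illustration: the shapes, names, and scaling value are placeholders; the real backend operates per-segment on slices, as the review below describes):

```python
import torch

x = torch.randn(512, 4096)     # segment of input tokens
w_a = torch.randn(64, 4096)    # LoRA A weight (rank 64)
w_b = torch.randn(4096, 64)    # LoRA B weight
base = torch.randn(512, 4096)  # base-layer output to accumulate into
scaling = 2.0                  # per-adapter scaling factor

# lora_a before: two kernels (mm, then a scalar multiply).
a_ref = torch.mm(x, w_a.T) * scaling
# lora_a after: one fused addmm; beta=0 means a_out's initial contents are ignored.
a_out = torch.empty(512, 64)
torch.addmm(a_out, x, w_a.T, beta=0, alpha=scaling, out=a_out)

# lora_b before: mm followed by an in-place add_.
b_ref = base.clone().add_(torch.mm(a_ref, w_b.T))
# lora_b after: one fused addmm; beta=1 accumulates into the existing values.
b_out = base.clone()
torch.addmm(b_out, a_ref, w_b.T, beta=1, alpha=1, out=b_out)

assert torch.allclose(a_ref, a_out, rtol=1e-4, atol=1e-4)
assert torch.allclose(b_ref, b_out, rtol=1e-4, atol=1e-4)
```

Each fused call computes the same result while saving an intermediate allocation and a kernel launch relative to the two-op version.
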
Testing

Used an in-house model + adapter to test embedding LoRA similarity with HF

Ran the test below with torch-native. The LoRA adapter in the test updates embed_tokens,
which the torch-native backend doesn't support yet.

python test_embedding_lora_support.py TestEmbeddingLoraHFComparison.test_embedding_lora_hf_sglang_similarity

HF vs SGLang LoRA Embedding Comparison:
  Text 0: cosine similarity = 1.000311
  Text 1: cosine similarity = 0.999937
  Average similarity: 1.000124
  Threshold: 0.9999
python test_lora_ops.py -v

test_sgemm_lora_a_fwd (__main__.TestLoraOps) ... [CI Test Method] TestLoraOps.test_sgemm_lora_a_fwd
ok
test_sgemm_lora_a_fwd_expand (__main__.TestLoraOps) ... [CI Test Method] TestLoraOps.test_sgemm_lora_a_fwd_expand
ok
test_sgemm_lora_b_fwd (__main__.TestLoraOps) ... [CI Test Method] TestLoraOps.test_sgemm_lora_b_fwd
ok
test_sgemm_lora_b_fwd_expand (__main__.TestLoraOps) ... [CI Test Method] TestLoraOps.test_sgemm_lora_b_fwd_expand
ok

----------------------------------------------------------------------
Ran 4 tests in 0.019s

OK

Performance

BASE_MODEL is Qwen3-0.6B-Embeddings; the adapters are rank 64, targeting qkv and o_proj.

python -m sglang.bench_serving --backend sglang-embedding --host 127.0.0.1 --port 30000 --model $BASE_MODEL --dataset-name random --random-input-len 6144 --random-range-ratio 1.0 --num-prompts 120 --request-rate 45 --lora-name v1 v2 v3 v4
| Branch | Achieved RPS | Achieved TPS |
| --- | --- | --- |
| Main | 40.12 | 246,507.98 |
| This Change | 41.89 (+4.4%) | 257,343.26 |
| Main (csgmv 128 chunk size) | 37.63 | 231,138.80 |

Checklist

  • Replace the separate mm + scalar multiply in sgemm_lora_a_fwd and the mm + add_ in sgemm_lora_b_fwd with single torch.addmm cuBLAS calls.
  • Store a pinned CPU scaling tensor in the batch info to avoid a GPU->CPU sync (a sketch of this pattern follows).
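
A sketch of the pinned-scalings pattern (editor's illustration, assuming a CUDA device; the tensor and field names are hypothetical, not the exact batch-info layout):

```python
import torch

# Per-adapter scalings live on GPU, but the per-segment loop runs in Python on
# the CPU: indexing a GPU tensor there forces an implicit device-to-host sync
# on every iteration.
scalings = torch.tensor([2.0, 1.0, 0.5, 2.0], device="cuda")

# Batch prep: one pinned CPU copy, i.e. one synchronization per batch.
scalings_cpu = torch.empty(scalings.shape, dtype=scalings.dtype, pin_memory=True)
scalings_cpu.copy_(scalings)

# Segment loop: indexing the CPU tensor is cheap and sync-free.
for lora_idx in range(scalings_cpu.numel()):
    alpha = scalings_cpu[lora_idx]  # 0-dim CPU tensor; .item() also works
    # ... torch.addmm(out_slice, x_slice, w_slice.T, beta=0, alpha=alpha, out=out_slice)
```
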
@github-actions bot added the lora label Mar 14, 2026
@zminglei (Collaborator)

/tag-and-rerun-ci

@satyamk7054 (Contributor, Author) commented Mar 14, 2026

/rerun-failed-ci

@claude-pr-review-bot commented:

🔍 SGLang Domain Expert Review

PR: Use torch.addmm instead of separate mm and add_ calls for LoRA torch.native (#20562)

Routing

  • lora [████████░░] 77% — LoRA adapters, torch/triton ops, multi-adapter serving

lora Review

Risk Level: Low

Summary: Clean optimization replacing separate torch.mm + scalar multiply / add_ with fused torch.addmm calls in the torch-native LoRA backend, plus adding scalings_cpu to avoid implicit GPU-to-CPU sync when indexing a GPU tensor in a CPU-side loop. Both changes are semantically equivalent to the original code.

Issues Found:

  • alpha parameter type for torch.addmm: In sgemm_lora_a_fwd, scaling_tensor[lora_idx] is a 0-dim CPU tensor passed as the alpha argument to torch.addmm operating on GPU tensors. PyTorch accepts this, so it works — but calling .item() to extract a Python float would be more explicit and avoids any edge case in future PyTorch versions. Style nit, not a bug.

Suggestions:

  • Consider alpha=scaling_tensor[lora_idx].item() for clarity in sgemm_lora_a_fwd.
  • In sgemm_lora_b_fwd, beta=1, alpha=1 are the defaults for torch.addmm, so they could be omitted: torch.addmm(out_slice, x_slice, w_slice.T, out=out_slice). Keeping them explicit is also a valid choice for readability.

Looks Good:

  • torch.addmm fusion is a well-known optimization — avoids an intermediate allocation and a separate kernel launch. Using beta=0, alpha=scaling in lora_a and beta=1, alpha=1 in lora_b correctly preserves original semantics (overwrite vs. accumulate).
  • Adding scalings_cpu is consistent with the existing pattern (lora_ranks_cpu, seg_indptr_cpu, seg_lens_cpu, weight_indices_cpu) — all tensors used in the CPU-side loop are now explicitly on CPU, eliminating implicit GPU-to-CPU synchronization.
  • The out=out_slice pattern writing back into a view of output is correct, since tensor views share storage (a minimal check of this follows the review).
  • Performance numbers (+4.4% RPS) are credible for this type of kernel fusion with larger prompt inputs.

Generated by SGLang domain expert review agents.
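
A tiny check of the view-aliasing point above (editor's sketch; shapes are arbitrary):

```python
import torch

output = torch.zeros(6, 4)
x_slice = torch.ones(3, 8)
w_slice = torch.full((4, 8), 0.5)

out_slice = output[2:5]  # a view: shares storage with output
torch.addmm(out_slice, x_slice, w_slice.T, beta=1, alpha=1, out=out_slice)

# Rows 2..4 of the parent tensor now hold x_slice @ w_slice.T (every entry 4.0):
# writing through the view updated output directly, with no copy-back step.
print(output)
```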

@satyamk7054 (Contributor, Author) commented Mar 17, 2026

/rerun-failed-ci try again 4

@satyamk7054 (Contributor, Author)

/rerun-stage stage-c-test-4-gpu-b200 (2)

@zminglei (Collaborator)

/rerun-failed-ci

@satyamk7054 (Contributor, Author)

/rerun-stage stage-c-test-4-gpu-b200 (2)

@satyamk7054 (Contributor, Author)

/rerun-failed-ci

@jasperjiaguo (Contributor) left a comment

LGTM

@Fridge003 Fridge003 merged commit be0cca5 into sgl-project:main Mar 26, 2026
388 of 433 checks passed
satyamk7054 added a commit to satyamk7054/sglang that referenced this pull request Apr 3, 2026
…native (sgl-project#20562)

Co-authored-by: Satyam Kumar <satyamk@linkedin.com>
JustinTong0323 pushed a commit to JustinTong0323/sglang that referenced this pull request Apr 7, 2026
…native (sgl-project#20562)

Co-authored-by: Satyam Kumar <satyamk@linkedin.com>
yhyang201 pushed a commit to yhyang201/sglang that referenced this pull request Apr 22, 2026
…native (sgl-project#20562)

Co-authored-by: Satyam Kumar <satyamk@linkedin.com>
@satyamk7054 satyamk7054 deleted the perf/torch-lora-addmm-fusion branch April 25, 2026 01:38