[lora] Speedup triton backend sgemm calls with better grid#22386
Merged
Fridge003 merged 5 commits into sgl-project:main on Apr 15, 2026
Conversation
Sort tokens by adapter during decode to merge per-sequence segments into per-adapter segments. This reduces the number of kernel grid blocks and improves GPU utilization for multi-LoRA batches.

Key changes:
- Add _resolve_token_positions() helper for indirection in all sgemm kernels
- Add SORTED_BY_ADAPTER constexpr and early-exit for empty/OOB segments
- Add compute_sgemm_routing() in TritonLoRABackend to build merged batch info
- Pre-allocate sgemm CUDA graph buffers in init_cuda_graph_batch_info()
- Add test_sgemm_sorted_by_adapter.py verifying correctness across all kernels
Contributor
Warning: You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!
Collaborator
/tag-and-rerun-ci
18 similar comments
yushengsu-thu pushed a commit that referenced this pull request on Apr 17, 2026
jmamou pushed a commit to jmamou/sglang that referenced this pull request on Apr 20, 2026
yhyang201 pushed a commit to yhyang201/sglang that referenced this pull request on Apr 22, 2026
zhangying098 pushed a commit to zhangying098/sglang that referenced this pull request on Apr 23, 2026
kyx1999 pushed a commit to KMSorSMS/sglang that referenced this pull request on Apr 27, 2026
Motivation
During multi-LoRA decode, each sequence gets its own segment in the Triton sgemm grid, even when many sequences share the same adapter. This means the grid scales with batch_size instead of num_adapters, launching excessive blocks and wasting GPU cycles. This PR sorts tokens by adapter and merges per-sequence segments into per-adapter segments, so the kernel grid scales with the adapter count instead.
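A minimal sketch of how such per-adapter segments can be derived with argsort + searchsorted, which the PR names as the mechanism in compute_sgemm_routing(). The function name and shapes below are illustrative, not the PR's actual API:

```python
import torch

def build_adapter_segments(token_adapter_ids: torch.Tensor, num_adapters: int):
    # Hypothetical helper: stable sort keeps the original token order
    # within each adapter, so tokens sharing an adapter become contiguous.
    permutation = torch.argsort(token_adapter_ids, stable=True)
    sorted_ids = token_adapter_ids[permutation]
    adapters = torch.arange(num_adapters, device=token_adapter_ids.device)
    # Each adapter's segment is [seg_start[a], seg_end[a]) in sorted order.
    seg_start = torch.searchsorted(sorted_ids, adapters, side="left")
    seg_end = torch.searchsorted(sorted_ids, adapters, side="right")
    # The kernel grid can now launch one segment per adapter instead of
    # one per sequence; `permutation` provides the token indirection.
    return permutation, seg_start, seg_end

# Example: 6 decode tokens over 2 adapters -> 2 segments instead of 6.
ids = torch.tensor([1, 0, 1, 0, 0, 1])
perm, starts, ends = build_adapter_segments(ids, num_adapters=2)
# perm -> [1, 3, 4, 0, 2, 5]; starts -> [0, 3]; ends -> [3, 6]
```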
Modifications
- kernel_utils.py (new): _resolve_token_positions() Triton JIT helper that gathers/scatters through a permutation when sorted, and passes through otherwise (sketched after this list).
- Kernels (sgemm_lora_a, sgemm_lora_b, qkv_lora_b, gate_up_lora_b): added a SORTED_BY_ADAPTER constexpr path with indirection via _resolve_token_positions, plus early exit for empty segments and excess grid blocks.
- triton_backend.py: compute_sgemm_routing() builds the merged per-adapter batch info using argsort + searchsorted; called during decode only. CUDA graph buffers are pre-allocated in init_cuda_graph_batch_info().
- test_sgemm_sorted_by_adapter.py (new): verifies numerical equivalence (bf16, atol=1e-4) between the per-sequence and sorted-by-adapter paths for all four kernels, plus mixed-rank and single-adapter edge cases.
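A minimal sketch of the indirection idea behind a helper like _resolve_token_positions; the real signature in kernel_utils.py may differ. This is a device-side function meant to be called from inside the sgemm kernels:

```python
import triton
import triton.language as tl

@triton.jit
def resolve_token_positions(
    seg_positions,                    # logical positions within a merged segment
    perm_ptr,                         # token permutation (sorted-by-adapter order)
    mask,                             # bounds mask for the segment tail
    SORTED_BY_ADAPTER: tl.constexpr,  # compile-time path selection
):
    if SORTED_BY_ADAPTER:
        # Sorted path: gather the physical token row for each logical
        # position through the permutation table.
        return tl.load(perm_ptr + seg_positions, mask=mask, other=0)
    else:
        # Per-sequence path: positions already index physical rows.
        return seg_positions
```

Because SORTED_BY_ADAPTER is a constexpr, only one branch is compiled into each kernel specialization, so the unsorted path pays no indirection cost.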
Accuracy Tests
The unit test compares the original per-sequence output against the sorted-by-adapter output across all kernels, as illustrated below.
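A self-contained sketch of the correctness property being verified: applying per-adapter LoRA weights on tokens grouped by adapter, then scattering results back, must match a straightforward per-token reference (bf16 inputs, atol=1e-4). All names and shapes here are illustrative, not the test file's actual code:

```python
import torch

torch.manual_seed(0)
num_tokens, hidden, rank, num_adapters = 64, 128, 16, 4
x = torch.randn(num_tokens, hidden, dtype=torch.bfloat16)
lora_a = torch.randn(num_adapters, hidden, rank, dtype=torch.bfloat16)
ids = torch.randint(num_adapters, (num_tokens,))

# Reference (per-sequence path): each token gathers its own adapter weight.
ref = torch.einsum("th,thr->tr", x.float(), lora_a[ids].float())

# Sorted path: group tokens by adapter, one matmul per adapter segment,
# then scatter results back to the original token rows.
perm = torch.argsort(ids, stable=True)
sorted_ids = ids[perm]
out = torch.empty(num_tokens, rank, dtype=torch.float32)
for a in range(num_adapters):
    rows = perm[sorted_ids == a]
    out[rows] = x[rows].float() @ lora_a[a].float()

assert torch.allclose(out, ref, atol=1e-4)
```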
Speed Tests and Profiling
Checklist
Benchmark the speed.