[Feature] Define backends and add Triton backend for Lora#3161
Merged
zhaochenyang20 merged 9 commits into sgl-project:main on Feb 4, 2025
Conversation
Force-pushed from 6a6dadd to 90a5123
Ying1123 approved these changes on Feb 4, 2025
Force-pushed from 90a5123 to bf7ab1f
Collaborator
@Ying1123 we don't have flashinfer on ROCm yet; I found this merge causes a break on AMD.
Collaborator
@HaiShaw From a process perspective, AMD CIs are crucial for preventing such issues.
Collaborator
@HaiShaw Also, could you help fix the top of the main branch?
Collaborator
Yes, let me push on it!
Edenzzzz reviewed on Feb 4, 2025
Collaborator
@Fridge003 Please check these.
Collaborator
I think it has been fixed by the AMD team.
timethink pushed a commit to timethink/sglang that referenced this pull request on Mar 9, 2025: …t#3161) Co-authored-by: Ying Sheng <sqy1415@gmail.com>
Motivation
The current LoRA modules rely on the SGemm kernels provided by flashinfer to do the computation. However, flashinfer is not well optimized for the tall and thin matrices of LoRA modules. Moreover, the way LoraManager manages the segment indices and weight indices of an input batch is inefficient. Together, these issues make LoRA run slowly in SGLang.
Modifications
To improve LoRA efficiency, this PR makes the following modifications on the basis of PR draft #1728:
- Adds BaseLoraBackend, FlashInferLoraBackend and TritonLoraBackend classes, which decouple the GEMM implementation of each backend from the forward logic of the LoRA modules. A new server argument lora-backend is added for controlling the backend.
- Adds a BatchInfo class that packs [bs, seg_lens, seg_indptr, max_len, weight_indices] together. By attaching it to the LoRA backend, it only needs to be set once per batch forward.
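As a rough illustration of the design above, here is a simplified sketch (not the actual SGLang code; field and method names follow the PR description, but the bodies are stand-ins):

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class BatchInfo:
    """Per-batch LoRA metadata, attached to the backend once per forward pass."""
    bs: int
    seg_lens: np.ndarray        # number of tokens in each request's segment
    seg_indptr: np.ndarray      # prefix-sum offsets of segments in the token dim
    max_len: int                # longest segment, bounds the kernel launch grid
    weight_indices: np.ndarray  # which LoRA adapter each request uses


class BaseLoraBackend:
    """Separates the GEMM implementation from the LoRA modules' forward logic."""

    def __init__(self, name: str):
        self.name = name
        self.batch_info = None

    def set_batch_info(self, batch_info: BatchInfo) -> None:
        # Set once per batch forward instead of once per LoRA layer.
        self.batch_info = batch_info

    def run_lora_a_sgemm(self, x: np.ndarray, weights: np.ndarray) -> np.ndarray:
        raise NotImplementedError


class TritonLoraBackend(BaseLoraBackend):
    def run_lora_a_sgemm(self, x: np.ndarray, weights: np.ndarray) -> np.ndarray:
        # The real backend launches a Triton segment-GEMM kernel; a plain
        # matmul stands in here so the sketch stays runnable.
        return x @ weights.T
```

Each LoRA module can then call `run_lora_a_sgemm` on whichever backend was selected, without knowing which GEMM implementation runs underneath.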
Usage
A new argument lora-backend is added to the server arguments. It can be either triton or flashinfer, indicating the backend to use. Its default value is triton.
Accuracy Test
Accuracy test can be run with:
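The exact command was not preserved in this capture. A plausible invocation, assuming the LoRA tests live under `test/srt/models/` (the path is an assumption and may differ in the repo), would be:

```shell
# Hypothetical test path; adjust to wherever the LoRA unit tests live.
cd test/srt/models
python3 test_lora.py
```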
The code passes the accuracy test on both H100 and A6000 machines.
Benchmarking result
To benchmark LoRA, run this command to launch the server:
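The launch command itself is missing from this capture; a representative invocation (model and adapter paths are placeholders, and the extra flags are illustrative) might look like:

```shell
# Placeholder model/adapter paths; --lora-backend is the new flag from this PR.
python3 -m sglang.launch_server \
  --model-path meta-llama/Llama-2-7b-hf \
  --lora-paths lora0=/path/to/lora_adapter \
  --lora-backend triton
```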
Then run this command to send test requests from the client:
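The client command is also missing here; a typical client-side benchmark against a running SGLang server (prompt count and request rate are illustrative) could be:

```shell
# Illustrative benchmark parameters; adjust to match the PR's setup.
python3 -m sglang.bench_serving \
  --backend sglang \
  --num-prompts 100 \
  --request-rate 8
```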
Benchmark configurations:
Further Optimization
There are two main bottlenecks of LoRA with the current Triton backend:
The payoff of autotuning is small, since the sgemm on LoRA modules has low arithmetic intensity; the current kernels without autotuning are already fast enough.
The best way to optimize the LoRA kernel further is to add a CUDA/Cutlass backend, so that the Triton compilation time can be saved.
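To make the low-arithmetic-intensity point concrete, here is a back-of-the-envelope calculation for a tall-and-thin LoRA GEMM (the shapes are illustrative, not taken from the PR's benchmark):

```python
# Arithmetic intensity of a "tall and thin" LoRA GEMM in fp16:
# x: (m, k) tokens-by-hidden, lora_A: (k, n) hidden-by-rank.
m, k, n = 512, 4096, 16          # 512 tokens, hidden size 4096, LoRA rank 16
bytes_per_elem = 2               # fp16

flops = 2 * m * k * n            # multiply-adds in the GEMM
bytes_moved = bytes_per_elem * (m * k + k * n + m * n)  # read A, B; write C

intensity = flops / bytes_moved  # FLOPs per byte
print(round(intensity, 1))       # → 15.5
```

At roughly 15 FLOPs per byte, the kernel is firmly memory-bound (modern GPU tensor cores typically need hundreds of FLOPs per byte to be compute-bound), which is why autotuning tile shapes yields little benefit.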
Checklist