Checklist
Motivation
Our selection criteria prioritize performance first and ease of use second (e.g., interface and installation), while also considering compatibility (e.g., older architectures). Therefore, if the relative performance of the backends changes in the future, we will still choose the best-performing one.
- NVIDIA
  - sm75 -> Triton
  - sm80, sm86, sm89 -> FlashInfer
  - sm90 -> FA3 (Llama, Qwen, Gemma), FlashInfer (others)
  - sm100 -> FlashInfer

  MLA
  - sm90 -> FA3 (DeepSeek)
  - sm100 -> FlashInfer (DeepSeek)

  Other options
  - FlashMLA, cuDNN, etc.
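The mapping above amounts to a dispatch on compute capability, model family, and whether the model uses MLA. A minimal sketch of that rule is below; the function name, model identifiers, and signature are illustrative assumptions, not SGLang's actual API.

```python
# Hypothetical sketch of the default-backend selection described above.
# choose_attention_backend and the model-name strings are illustrative,
# not SGLang's real interface.
def choose_attention_backend(sm: int, model: str, use_mla: bool = False) -> str:
    """Pick a default attention backend from the SM compute capability."""
    if use_mla:
        # MLA models (e.g., DeepSeek): FA3 on sm90, FlashInfer on sm100.
        return "fa3" if sm == 90 else "flashinfer"
    if sm == 75:
        return "triton"
    if sm in (80, 86, 89):
        return "flashinfer"
    if sm == 90:
        # FA3 for Llama/Qwen/Gemma; FlashInfer for other model families.
        return "fa3" if model in ("llama", "qwen", "gemma") else "flashinfer"
    # sm100 (and anything unlisted) falls back to FlashInfer.
    return "flashinfer"
```

For example, `choose_attention_backend(90, "qwen")` returns `"fa3"`, while `choose_attention_backend(90, "mistral")` returns `"flashinfer"`.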
SGLang will install the JIT version of FlashInfer from PyPI for a better installation experience. Alternatively, the wheel size limit for FlashInfer on PyPI could be increased. cc @yzh119
The SGLang wheel will use the JIT version of FlashInfer by default, while the Docker image will use the AOT version.
Currently, FA3 is integrated into sgl-kernel, which makes it more convenient for users to install and use than building from source.
- AMD
@HaiShaw is currently working on improving the performance of the attention backend.
Related resources
No response