[Feature] attention backend default choice #5064

@zhyncs

Description

Motivation

Our selection criteria prioritize performance first and ease of use second (e.g., interface and installation), while also considering compatibility (e.g., older architectures). Therefore, if the relative performance of the backends changes in the future, we will switch the default to the best-performing one. A sketch of the resulting dispatch logic follows the list below.

  1. NVIDIA
sm75 -> Triton
sm80, sm86, sm89 -> FlashInfer
sm90 -> FA3 (Llama, Qwen, Gemma), FlashInfer (Others)
sm100 -> FlashInfer

MLA (multi-head latent attention)
sm90 -> FA3 (DeepSeek)
sm100 -> FlashInfer (DeepSeek)

Other options: FlashMLA, cuDNN, etc.
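
For concreteness, here is a minimal sketch of the default-choice table above as dispatch logic, assuming the backend is picked from the CUDA compute capability via `torch.cuda.get_device_capability`. The function name, the backend string labels, and the model-family check are hypothetical illustrations, not SGLang's actual code.

```python
# Hypothetical sketch of the default-backend table above; not SGLang's
# actual selection code.
import torch

FA3_FAMILIES = {"llama", "qwen", "gemma"}  # families routed to FA3 on sm90

def default_attention_backend(model_family: str, use_mla: bool = False) -> str:
    major, minor = torch.cuda.get_device_capability()
    sm = major * 10 + minor    # e.g. (9, 0) -> 90
    if use_mla:                # MLA models such as DeepSeek
        return "fa3" if sm == 90 else "flashinfer"
    if sm == 75:               # Turing
        return "triton"
    if sm in (80, 86, 89):     # Ampere / Ada
        return "flashinfer"
    if sm == 90:               # Hopper
        return "fa3" if model_family in FA3_FAMILIES else "flashinfer"
    if sm >= 100:              # sm100
        return "flashinfer"
    return "triton"            # conservative fallback for older architectures
```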

SGLang will install the JIT version of FlashInfer from PyPI for a better installation experience. Alternatively, FlashInfer's wheel size limit on PyPI could be increased. cc @yzh119

For FlashInfer, the SGLang wheel will use the JIT version by default, while the Docker image will use the AOT build.

Currently, FA3 is integrated into sgl-kernel, which makes it more convenient to install and use than building from source.
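
As a usage note, the default can already be overridden at launch. The snippet below assumes the offline Engine API accepts the attention backend as the `attention_backend` keyword (mirroring the `--attention-backend` launch flag); treat the exact parameter name and values as assumptions, and the model path as just an example.

```python
# Assumed usage sketch: overriding the default attention backend at launch.
# The attention_backend keyword is assumed to mirror the --attention-backend
# server flag; it is not presented here as a confirmed stable API.
import sglang as sgl

llm = sgl.Engine(
    model_path="meta-llama/Llama-3.1-8B-Instruct",  # example model
    attention_backend="fa3",  # e.g. "flashinfer", "triton", "fa3"
)
print(llm.generate("Hello, my name is", {"max_new_tokens": 16}))
```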

  2. AMD
Triton

@HaiShaw is currently working on improving attention backend performance on AMD.
