[Bug] Remove stream sync in fast decode plan of flashinfer mla backend #4905

@Fridge003

Description

Checklist

  • 1. I have searched related issues but cannot get the expected help.
  • 2. The bug has not been fixed in the latest version.
  • 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
  • 4. If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose. Otherwise, it will be closed.
  • 5. Please use English, otherwise it will be closed.

Describe the bug

flashinfer-ai/flashinfer#969 claims that the FlashInfer MLA backend can be sped up by removing

  with self.device as device:
      stream = torch.cuda.current_stream(device).cuda_stream

from `fast_mla_decode_plan` in `flashinfer_mla_backend.py`.

We need to verify the performance impact after this removal.
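For clarity, the proposed change amounts to deleting the stream query inside `fast_mla_decode_plan`; the rest of the function is elided and sketched here only to show where the deletion lands (the signature and surrounding logic are not spelled out in this issue):

```
 def fast_mla_decode_plan(self, ...):
-    # Querying the current CUDA stream here is the suspected sync point
-    with self.device as device:
-        stream = torch.cuda.current_stream(device).cuda_stream
     ...  # remainder of the plan logic unchanged
```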

Reproduction

python3 -m sglang.launch_server --model deepseek-ai/DeepSeek-R1 --tp 8 --trust-remote-code --enable-flashinfer-mla

Environment

GPU: H200 * 8
Latest versions of sglang and flashinfer

Related PR

#5208 #5538

Labels

bug (Something isn't working)
