[Speculative Decoding] Add FA4-based Spec Support#21080
Fridge003 merged 3 commits into sgl-project:main
Conversation
Warning: You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

Can we update this document?

Done.
> `return FlashAttentionBackend(self.draft_model_runner, skip_prefill=False)`

> `def _create_fa4_decode_backend(self):`

see the comments above

Excellent suggestion, thank you!
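As a toy illustration of the pattern discussed in this thread, the draft model's attention backend is constructed with `skip_prefill=False` so that spec-decode verification can still run through the prefill (extend) path. Apart from the `skip_prefill` flag, all names below are hypothetical stand-ins, not the PR's actual classes:

```python
# Hypothetical sketch; only the skip_prefill flag comes from the diff above.
class FlashAttentionBackend:
    """Toy stand-in for an attention backend bound to a model runner."""

    def __init__(self, model_runner, skip_prefill=True):
        self.model_runner = model_runner
        self.skip_prefill = skip_prefill


def create_draft_backend(draft_model_runner):
    # Keep prefill enabled: speculative-decoding verification scores the
    # draft tokens through the prefill path, so it must not be skipped.
    return FlashAttentionBackend(draft_model_runner, skip_prefill=False)
```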
/tag-and-rerun-ci

/tag-and-rerun-ci

/tag-and-rerun-ci

/tag-and-rerun-ci

/rerun-failed-ci

/tag-and-rerun-ci

Could you help fix the lint?

/tag-and-rerun-ci
There are conflicts with main.
Yes, I found them and am fixing. Sorry!
- Add FlashAttention4 JIT kernel wrapper for speculative decoding
- Update flashattention backend to support FA4 prefill with spec decode
- Add draft_utils changes for FA4 compatibility
- Add CI test for FA4 + EAGLE3 speculative decoding (topk > 1)
- Update attention backend docs
/tag-and-rerun-ci

/tag-and-rerun-ci

/rerun-ut test_flash_attention_4.py
Co-authored-by: luhongyu.4869 <luhongyu.4869@bytedance.com>
Motivation
FA4 (FlashAttention 4) significantly reduces memory footprint and improves throughput, especially for large-scale and multimodal workloads.
However, FA4 is currently not compatible with the speculative decoding pipeline, which limits its adoption in latency-sensitive scenarios where speculation (e.g., EAGLE/EAGLE3) is critical.
This PR enables FA4 to work seamlessly with speculative decoding, allowing users to combine:
- low-precision attention (FA4)
- speculative decoding (low latency)
This unlocks better performance trade-offs in production serving.
Modifications
- Enable the FA4 backend in the speculative decoding flow
  - Support FA4 in both draft and verify stages
  - Ensure correct behavior for prefill and decode paths
- Align FA4 with the speculative execution pipeline
  - Integrate with existing spec scheduling (Spec V2 / overlap schedule)
  - Handle attention backend selection during speculative execution
- Fix compatibility issues and edge cases
  - Resolve backend mismatches between FA4 and non-FA4 paths
  - Ensure correctness when switching between attention backends
- Refactor attention dispatch logic
  - Make FA4 usable under speculative execution without breaking existing flows
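The draft/verify split above follows the standard speculative-decoding loop: the draft model proposes a short run of tokens, and the target model verifies them in one pass, accepting the longest matching prefix. A toy, self-contained greedy sketch (function names are illustrative, not this PR's code):

```python
def draft_tokens(prefix, num_steps, draft_lm):
    """Draft model proposes num_steps tokens autoregressively (toy greedy)."""
    out = []
    for _ in range(num_steps):
        out.append(draft_lm(prefix + out))
    return out


def verify(prefix, proposed, target_lm):
    """Target model accepts the longest matching prefix of the proposal,
    then appends one token of its own (a correction or a bonus token)."""
    accepted = []
    for tok in proposed:
        expected = target_lm(prefix + accepted)
        if tok != expected:
            accepted.append(expected)  # first mismatch: take target's token
            return accepted
        accepted.append(tok)
    accepted.append(target_lm(prefix + accepted))  # all matched: bonus token
    return accepted
```

When the draft model matches the target, one verify pass yields `num_steps + 1` tokens instead of one; that amortization is where the latency win comes from.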
Accuracy Tests
openai-gpt-oss-120b (mxfp4), B200 x4, FA4, output=512, concurrency=1
Performance (output=512, concurrency=1)

Benchmarking and Profiling
baseline:
```shell
python3 -m sglang.launch_server \
  --model openai/gpt-oss-120b \
  --attention-backend fa4 \
  --moe-runner-backend triton_kernel \
  --tp 4 \
  --trust-remote-code \
  --host 0.0.0.0 \
  --port 30000
```
EAGLE3 3/1/4 (num-steps / eagle-topk / num-draft-tokens):
```shell
python3 -m sglang.launch_server \
  --model openai/gpt-oss-120b \
  --speculative-algorithm EAGLE3 \
  --speculative-draft-model-path lmsys/EAGLE3-gpt-oss-120b-bf16 \
  --speculative-num-steps 3 \
  --speculative-eagle-topk 1 \
  --speculative-num-draft-tokens 4 \
  --attention-backend fa4 \
  --moe-runner-backend triton_kernel \
  --tp 4 \
  --trust-remote-code \
  --host 0.0.0.0 \
  --port 30000
```
EAGLE3 6/10/32 (num-steps / eagle-topk / num-draft-tokens):
```shell
python3 -m sglang.launch_server \
  --model openai/gpt-oss-120b \
  --speculative-algorithm EAGLE3 \
  --speculative-draft-model-path lmsys/EAGLE3-gpt-oss-120b-bf16 \
  --speculative-num-steps 6 \
  --speculative-eagle-topk 10 \
  --speculative-num-draft-tokens 32 \
  --attention-backend fa4 \
  --moe-runner-backend triton_kernel \
  --tp 4 \
  --trust-remote-code \
  --host 0.0.0.0 \
  --port 30000
```
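For a quick smoke test of any of the servers launched above, one can POST an OpenAI-compatible chat request to the server's `/v1/chat/completions` endpoint. The sketch below only constructs the request body (sending it requires a running server); the prompt text is illustrative:

```python
import json

# The servers above listen on 0.0.0.0:30000 and expose an
# OpenAI-compatible chat API. Build a request body matching the
# benchmark setup (output=512); the prompt content is a placeholder.
payload = {
    "model": "openai/gpt-oss-120b",
    "messages": [{"role": "user", "content": "Say hello."}],
    "max_tokens": 512,  # matches output=512 in the benchmark setup
}
body = json.dumps(payload)
url = "http://0.0.0.0:30000/v1/chat/completions"
# To send against a live server:
#   curl $url -H 'Content-Type: application/json' -d "$body"
```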
Checklist
Review Process
/tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci