[Speculative Decoding] Add FA4-based Spec Support #21080

Merged
Fridge003 merged 3 commits into sgl-project:main from narutolhy:fa4-spec
Apr 4, 2026

@narutolhy
Contributor

@narutolhy narutolhy commented Mar 21, 2026

Motivation

FA4 (FlashAttention 4) significantly reduces memory footprint and improves throughput, especially for large-scale and multimodal workloads.

However, FA4 is currently not compatible with the speculative decoding pipeline, which limits its adoption in latency-sensitive scenarios where speculation (e.g., EAGLE/EAGLE3) is critical.

This PR enables FA4 to work seamlessly with speculative decoding, allowing users to combine:

- the FA4 attention backend
- speculative decoding (low latency)

This unlocks better performance trade-offs in production serving.

Modifications

- Enable FA4 backend in speculative decoding flow
  - Support FA4 in both draft and verify stages
  - Ensure correct behavior for prefill and decode paths
- Align FA4 with speculative execution pipeline
  - Integrate with existing spec scheduling (Spec V2 / overlap schedule)
  - Handle attention backend selection during speculative execution
- Fix compatibility issues and edge cases
  - Resolve backend mismatches between FA4 and non-FA4 paths
  - Ensure correctness when switching between attention backends
- Refactor attention dispatch logic
  - Make FA4 usable under speculative execution without breaking existing flows
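
As a rough illustration of the dispatch idea in the list above, the sketch below maps one attention backend name to separate factories for the draft and verify stages, and fails loudly on an unknown name. Every name in it (`StageBackends`, `BACKEND_REGISTRY`, `select_backends`, and the string labels the factories return) is hypothetical and is not taken from the SGLang code base; the real logic touched by this PR lives in files such as python/sglang/srt/speculative/draft_utils.py and may be organized quite differently.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class StageBackends:
    """Backends chosen for the two phases of one speculative step (hypothetical)."""

    draft: str
    verify: str


# Hypothetical registry: backend name -> one factory per stage.
# Real backends would be objects wrapping kernels; plain strings keep the sketch runnable.
BACKEND_REGISTRY: Dict[str, Dict[str, Callable[[], str]]] = {
    "fa3": {"draft": lambda: "fa3-draft", "verify": lambda: "fa3-verify"},
    "fa4": {"draft": lambda: "fa4-draft", "verify": lambda: "fa4-verify"},
}


def select_backends(name: str) -> StageBackends:
    """Pick draft and verify backends from a single name; reject unknown names."""
    try:
        factories = BACKEND_REGISTRY[name]
    except KeyError:
        raise ValueError(f"unsupported attention backend for speculation: {name!r}")
    return StageBackends(draft=factories["draft"](), verify=factories["verify"]())
```

Resolving both stages from a single lookup is one way to avoid the kind of backend mismatch between FA4 and non-FA4 paths that the list mentions, since draft and verify can never be configured from different names.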

Accuracy Tests

openai-gpt-oss-120b (mxfp4), B200 x4, FA4, output=512, concurrency=1

Performance (output=512, concurrency=1)
[image: performance results]
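
The setting named above (output=512, concurrency=1) could be approximated with a small client like the following sketch. It assumes the server exposes an OpenAI-style `/v1/completions` route on the host and port used in the launch commands, and that the response carries a `usage.completion_tokens` field; the function names (`build_payload`, `tokens_per_second`, `run_once`) are made up for this example, and this is not the benchmark script the author used.

```python
import json
import time
import urllib.request


def build_payload(prompt: str, max_tokens: int = 512) -> dict:
    """Request body for an OpenAI-style completions endpoint (assumed to be available)."""
    return {
        "model": "openai/gpt-oss-120b",
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": 0.0,
    }


def tokens_per_second(completion_tokens: int, elapsed_s: float) -> float:
    """Output throughput for a single request."""
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    return completion_tokens / elapsed_s


def run_once(base_url: str, prompt: str) -> float:
    """Send one request at a time (concurrency=1) and return output tokens per second."""
    body = json.dumps(build_payload(prompt)).encode("utf-8")
    req = urllib.request.Request(
        f"{base_url}/v1/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    start = time.perf_counter()
    with urllib.request.urlopen(req) as resp:
        result = json.load(resp)
    # Elapsed time includes prefill, so this slightly understates pure decode speed.
    elapsed = time.perf_counter() - start
    return tokens_per_second(result["usage"]["completion_tokens"], elapsed)
```

With a server running, `run_once("http://127.0.0.1:30000", "Explain speculative decoding.")` would return one throughput sample; repeating it against the baseline and the EAGLE3 configurations gives a like-for-like comparison.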

Benchmarking and Profiling

Baseline:
python3 -m sglang.launch_server \
  --model openai/gpt-oss-120b \
  --attention-backend fa4 \
  --moe-runner-backend triton_kernel \
  --tp 4 \
  --trust-remote-code \
  --host 0.0.0.0 \
  --port 30000

EAGLE3 3/1/4:
python3 -m sglang.launch_server \
  --model openai/gpt-oss-120b \
  --speculative-algorithm EAGLE3 \
  --speculative-draft-model-path lmsys/EAGLE3-gpt-oss-120b-bf16 \
  --speculative-num-steps 3 \
  --speculative-eagle-topk 1 \
  --speculative-num-draft-tokens 4 \
  --attention-backend fa4 \
  --moe-runner-backend triton_kernel \
  --tp 4 \
  --trust-remote-code \
  --host 0.0.0.0 \
  --port 30000

EAGLE3 6/10/32:
python3 -m sglang.launch_server \
  --model openai/gpt-oss-120b \
  --speculative-algorithm EAGLE3 \
  --speculative-draft-model-path lmsys/EAGLE3-gpt-oss-120b-bf16 \
  --speculative-num-steps 6 \
  --speculative-eagle-topk 10 \
  --speculative-num-draft-tokens 32 \
  --attention-backend fa4 \
  --moe-runner-backend triton_kernel \
  --tp 4 \
  --trust-remote-code \
  --host 0.0.0.0 \
  --port 30000
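
The labels "3/1/4" and "6/10/32" read as num-steps / eagle-topk / num-draft-tokens, which matches the flag values in each command. The sketch below builds the three command lines from such a triple, using only the flags and values shown above; the helper name `build_launch_args` is hypothetical and exists only for this example.

```python
from typing import List, Optional, Tuple


def build_launch_args(spec: Optional[Tuple[int, int, int]] = None) -> List[str]:
    """Build argv for the launch commands listed in this PR description.

    spec is (num_steps, eagle_topk, num_draft_tokens), e.g. (3, 1, 4).
    None gives the baseline without speculative decoding.
    """
    args = ["python3", "-m", "sglang.launch_server", "--model", "openai/gpt-oss-120b"]
    if spec is not None:
        steps, topk, draft_tokens = spec
        args += [
            "--speculative-algorithm", "EAGLE3",
            "--speculative-draft-model-path", "lmsys/EAGLE3-gpt-oss-120b-bf16",
            "--speculative-num-steps", str(steps),
            "--speculative-eagle-topk", str(topk),
            "--speculative-num-draft-tokens", str(draft_tokens),
        ]
    # Flags shared by the baseline and both EAGLE3 runs.
    args += [
        "--attention-backend", "fa4",
        "--moe-runner-backend", "triton_kernel",
        "--tp", "4",
        "--trust-remote-code",
        "--host", "0.0.0.0",
        "--port", "30000",
    ]
    return args
```

Passing the result to `shlex.join` gives a single shell-ready string, for example `shlex.join(build_launch_args((6, 10, 32)))` for the larger draft tree.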

Checklist

Review Process

  1. Ping Merge Oncalls to start the PR flow. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • /tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci
  4. After green CI and required approvals, ask Merge Oncalls to merge.

@Fridge003
Collaborator

Can we update this document
https://docs.sglang.io/advanced_features/attention_backend.html
with the latest FA4 features (spec topk=1, spec topk>1...)?

@github-actions (bot) added the documentation label Mar 23, 2026
@narutolhy
Contributor Author

> Can we update this document https://docs.sglang.io/advanced_features/attention_backend.html With the latest fa4 features (spec topk=1, spec topk>1...)

Done

Collaborator

@Qiaolin-Yu Qiaolin-Yu left a comment

left several nits

Comment thread python/sglang/srt/speculative/draft_utils.py Outdated

return FlashAttentionBackend(self.draft_model_runner, skip_prefill=False)

def _create_fa4_decode_backend(self):
Collaborator

see the comments above

Comment thread python/sglang/srt/speculative/draft_utils.py
@narutolhy narutolhy requested a review from HaiShaw as a code owner March 24, 2026 03:59
@narutolhy
Contributor Author

> left several nits

Excellent suggestion—thank you!

Comment thread python/sglang/srt/speculative/draft_utils.py Outdated
Comment thread python/sglang/srt/speculative/draft_utils.py Outdated
@Qiaolin-Yu
Collaborator

/tag-and-rerun-ci

@narutolhy
Contributor Author

/tag-and-rerun-ci

@narutolhy
Contributor Author

/tag-and-rerun-ci

@narutolhy
Contributor Author

/tag-and-rerun-ci

@narutolhy
Contributor Author

/rerun-failed-ci

@narutolhy
Contributor Author

/tag-and-rerun-ci

@Qiaolin-Yu
Collaborator

could you help to fix the lint?

@narutolhy
Contributor Author

/tag-and-rerun-ci

@Qiaolin-Yu
Collaborator

have conflicts with main

@narutolhy
Contributor Author

> have conflicts with main

Yes, I found the conflicts and am fixing them. Sorry.

- Add FlashAttention4 JIT kernel wrapper for speculative decoding
- Update flashattention backend to support FA4 prefill with spec decode
- Add draft_utils changes for FA4 compatibility
- Add CI test for FA4 + EAGLE3 speculative decoding (topk > 1)
- Update attention backend docs
@narutolhy
Contributor Author

/tag-and-rerun-ci

@narutolhy
Contributor Author

/tag-and-rerun-ci

@Fridge003
Collaborator

/rerun-ut test_flash_attention_4.py

@Fridge003 Fridge003 merged commit 2476325 into sgl-project:main Apr 4, 2026
247 of 315 checks passed
sundar24295s pushed a commit to sundar24295s/sglang that referenced this pull request Apr 4, 2026
Co-authored-by: luhongyu.4869 <luhongyu.4869@bytedance.com>
JustinTong0323 pushed a commit to JustinTong0323/sglang that referenced this pull request Apr 7, 2026
Co-authored-by: luhongyu.4869 <luhongyu.4869@bytedance.com>
Fridge003 pushed a commit that referenced this pull request Apr 7, 2026
Co-authored-by: luhongyu.4869 <luhongyu.4869@bytedance.com>
xiezhq-hermann pushed a commit to antgroup/sglang that referenced this pull request Apr 7, 2026
Co-authored-by: luhongyu.4869 <luhongyu.4869@bytedance.com>
yhyang201 pushed a commit to yhyang201/sglang that referenced this pull request Apr 22, 2026
Co-authored-by: luhongyu.4869 <luhongyu.4869@bytedance.com>
Labels

documentation, jit-kernel, run-ci
