[Speculative Decoding] Add FA4-based Spec Support#21080
Fridge003 merged 3 commits into sgl-project:main
Conversation
Warning: You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

Can we update this document?

Done.
> `return FlashAttentionBackend(self.draft_model_runner, skip_prefill=False)`

> `def _create_fa4_decode_backend(self):`

see the comments above

Excellent suggestion, thank you!
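As a toy illustration of the pattern discussed in this thread, the draft model's attention backend is constructed with `skip_prefill=False` so that spec-decode verification can still run through the prefill (extend) path. Apart from the `skip_prefill` flag, all names below are hypothetical stand-ins, not the PR's actual classes:

```python
# Hypothetical sketch; only the skip_prefill flag comes from the diff above.
class FlashAttentionBackend:
    """Toy stand-in for an attention backend bound to a model runner."""

    def __init__(self, model_runner, skip_prefill=True):
        self.model_runner = model_runner
        self.skip_prefill = skip_prefill


def create_draft_backend(draft_model_runner):
    # Keep prefill enabled: speculative-decoding verification scores the
    # draft tokens through the prefill path, so it must not be skipped.
    return FlashAttentionBackend(draft_model_runner, skip_prefill=False)
```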
/tag-and-rerun-ci

/tag-and-rerun-ci

/tag-and-rerun-ci

/tag-and-rerun-ci

/rerun-failed-ci

/tag-and-rerun-ci

Could you help fix the lint?

/tag-and-rerun-ci
There are conflicts with main.
Yes, I found them and am fixing. Sorry!
- Add FlashAttention4 JIT kernel wrapper for speculative decoding
- Update flashattention backend to support FA4 prefill with spec decode
- Add draft_utils changes for FA4 compatibility
- Add CI test for FA4 + EAGLE3 speculative decoding (topk > 1)
- Update attention backend docs
/tag-and-rerun-ci

/tag-and-rerun-ci

/rerun-ut test_flash_attention_4.py
Co-authored-by: luhongyu.4869 <luhongyu.4869@bytedance.com>
Motivation
FA4 (FlashAttention 4) significantly reduces memory footprint and improves throughput, especially for large-scale and multimodal workloads.
However, FA4 is currently not compatible with the speculative decoding pipeline, which limits its adoption in latency-sensitive scenarios where speculation (e.g., EAGLE/EAGLE3) is critical.
This PR enables FA4 to work seamlessly with speculative decoding, allowing users to combine:
- low-precision attention (FA4)
- speculative decoding (low latency)
This unlocks better performance trade-offs in production serving.
Modifications
- Enable the FA4 backend in the speculative decoding flow
  - Support FA4 in both draft and verify stages
  - Ensure correct behavior for prefill and decode paths
- Align FA4 with the speculative execution pipeline
  - Integrate with existing spec scheduling (Spec V2 / overlap schedule)
  - Handle attention backend selection during speculative execution
- Fix compatibility issues and edge cases
  - Resolve backend mismatches between FA4 and non-FA4 paths
  - Ensure correctness when switching between attention backends
- Refactor attention dispatch logic
  - Make FA4 usable under speculative execution without breaking existing flows
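The draft/verify split above follows the standard speculative-decoding loop: the draft model proposes a short run of tokens, and the target model verifies them in one pass, accepting the longest matching prefix. A toy, self-contained greedy sketch (function names are illustrative, not this PR's code):

```python
def draft_tokens(prefix, num_steps, draft_lm):
    """Draft model proposes num_steps tokens autoregressively (toy greedy)."""
    out = []
    for _ in range(num_steps):
        out.append(draft_lm(prefix + out))
    return out


def verify(prefix, proposed, target_lm):
    """Target model accepts the longest matching prefix of the proposal,
    then appends one token of its own (a correction or a bonus token)."""
    accepted = []
    for tok in proposed:
        expected = target_lm(prefix + accepted)
        if tok != expected:
            accepted.append(expected)  # first mismatch: take target's token
            return accepted
        accepted.append(tok)
    accepted.append(target_lm(prefix + accepted))  # all matched: bonus token
    return accepted
```

When the draft model matches the target, one verify pass yields `num_steps + 1` tokens instead of one; that amortization is where the latency win comes from.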
Accuracy Tests
openai-gpt-oss-120b (mxfp4), B200 x4, FA4, output=512, concurrency=1
Performance (output=512, concurrency=1)

Benchmarking and Profiling
baseline:
```shell
python3 -m sglang.launch_server \
  --model openai/gpt-oss-120b \
  --attention-backend fa4 \
  --moe-runner-backend triton_kernel \
  --tp 4 \
  --trust-remote-code \
  --host 0.0.0.0 \
  --port 30000
```
EAGLE3 3/1/4 (num-steps / eagle-topk / num-draft-tokens):
```shell
python3 -m sglang.launch_server \
  --model openai/gpt-oss-120b \
  --speculative-algorithm EAGLE3 \
  --speculative-draft-model-path lmsys/EAGLE3-gpt-oss-120b-bf16 \
  --speculative-num-steps 3 \
  --speculative-eagle-topk 1 \
  --speculative-num-draft-tokens 4 \
  --attention-backend fa4 \
  --moe-runner-backend triton_kernel \
  --tp 4 \
  --trust-remote-code \
  --host 0.0.0.0 \
  --port 30000
```
EAGLE3 6/10/32 (num-steps / eagle-topk / num-draft-tokens):
```shell
python3 -m sglang.launch_server \
  --model openai/gpt-oss-120b \
  --speculative-algorithm EAGLE3 \
  --speculative-draft-model-path lmsys/EAGLE3-gpt-oss-120b-bf16 \
  --speculative-num-steps 6 \
  --speculative-eagle-topk 10 \
  --speculative-num-draft-tokens 32 \
  --attention-backend fa4 \
  --moe-runner-backend triton_kernel \
  --tp 4 \
  --trust-remote-code \
  --host 0.0.0.0 \
  --port 30000
```
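For a quick smoke test of any of the servers launched above, one can POST an OpenAI-compatible chat request to the server's `/v1/chat/completions` endpoint. The sketch below only constructs the request body (sending it requires a running server); the prompt text is illustrative:

```python
import json

# The servers above listen on 0.0.0.0:30000 and expose an
# OpenAI-compatible chat API. Build a request body matching the
# benchmark setup (output=512); the prompt content is a placeholder.
payload = {
    "model": "openai/gpt-oss-120b",
    "messages": [{"role": "user", "content": "Say hello."}],
    "max_tokens": 512,  # matches output=512 in the benchmark setup
}
body = json.dumps(payload)
url = "http://0.0.0.0:30000/v1/chat/completions"
# To send against a live server:
#   curl $url -H 'Content-Type: application/json' -d "$body"
```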
Checklist
Review Process
/tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci