Motivation
Agentic and mixed workloads do not keep a stable speculative decoding acceptance pattern. A server may move between low-acceptance phases, where deep speculation mostly creates wasted draft/verify/extend work, and high-acceptance phases, where shallow speculation leaves available speedup unused.
Adaptive speculative decoding makes the speculative depth a runtime decision. SGLang observes accept_lens, batch_size, and seq_lens, smooths them with an EMA or cost-aware policy, and switches among pre-built runtime tiers so each phase can run closer to its preferred depth.
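The observe, smooth, and switch loop described above can be sketched roughly as follows. This is a minimal illustration, not SGLang's implementation: the class name `EmaDepthPolicy`, the tier depths `(1, 3, 5)`, the smoothing factor `alpha`, and the "smallest tier that covers the smoothed accept length" rule are all assumptions made for the example, and it uses only the accept-length signal.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class EmaDepthPolicy:
    """Hypothetical policy: smooth accept lengths, then pick a pre-built tier."""

    tiers: Tuple[int, ...] = (1, 3, 5)  # assumed speculative depths, ascending
    alpha: float = 0.2                  # assumed EMA smoothing factor
    ema: Optional[float] = None         # smoothed accept length, unset until first step

    def observe(self, accept_lens: List[int]) -> None:
        """Fold one decode step's per-request accept lengths into the EMA."""
        if not accept_lens:
            return
        mean = sum(accept_lens) / len(accept_lens)
        if self.ema is None:
            self.ema = mean
        else:
            self.ema = self.alpha * mean + (1 - self.alpha) * self.ema

    def choose_depth(self) -> int:
        """Return the smallest tier whose depth covers the smoothed accept length."""
        if self.ema is None:
            return self.tiers[0]
        for depth in self.tiers:
            if depth >= self.ema:
                return depth
        return self.tiers[-1]


policy = EmaDepthPolicy()
policy.observe([4, 4])         # high-acceptance phase
deep = policy.choose_depth()   # selects the deepest assumed tier
for _ in range(10):
    policy.observe([0, 0])     # low-acceptance phase pulls the EMA down
shallow = policy.choose_depth()
```

Because the EMA decays gradually, a single poor step does not force a tier change; only a sustained shift in acceptance moves the chosen depth.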
Design overview
+--------------------------------------------------------------------+
|                  Speculative Decoding Algorithms                   |
|  +--------+  +-----------+  +-------+  +--------+  +------------+  |
|  | EAGLE  |  |  EAGLE3   |  | NGRAM |  | DFlash |  | STANDALONE |  |
|  +--------+  +-----------+  +-------+  +--------+  +------------+  |
+--------------------------------------------------------------------+
        | report signals                   ^ swap active state
        v                                  |
+--------------------------------------------------------------------+
|                        Adaptive Spec Layer                         |
|  +---------------------------+     +----------------------------+  |
|  | Policy Module             |     | Runtime State Module       |  |
|  |                           |     |                            |  |
|  | signals:                  |     | SpecRuntimeState Pool      |  |
|  |   - accept length         |     |   +------+  +------+       |  |
|  |   - decode batch size     |     |   |tier 0|  |tier 1| ...   |  |
|  |   - seq lens              |     |   +------+  +------+       |  |
|  |   - cost metrics          |     |                            |  |
|  |                           |     | per-tier resources:        |  |
|  | strategies:               |     |   - CUDA graphs            |  |
|  |   - EMA                   |     |   - attention backends     |  |
|  |   - cost-aware            |     |   - candidate shapes       |  |
|  +-------------+-------------+     +----------------------------+  |
+--------------------------------------------------------------------+
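The Runtime State Module in the diagram can be pictured as a pool of tiers built once and then swapped by reference. The sketch below is only an illustration under stated assumptions: `TierState` and `TierPool` are hypothetical names (the diagram names only a SpecRuntimeState pool), and the string fields stand in for real per-tier resources such as captured CUDA graphs and attention backends.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class TierState:
    """Hypothetical stand-in for one pre-built tier's resources."""

    depth: int
    cuda_graph: str    # placeholder handle; a real tier would hold captured graphs
    attn_backend: str  # placeholder handle for the tier's attention backend


class TierPool:
    """Hypothetical pool: build every tier up front, then swap the active one."""

    def __init__(self, depths: Iterable[int]) -> None:
        depths = sorted(set(depths))
        if not depths:
            raise ValueError("at least one tier depth is required")
        # Pre-build all tiers so a switch never has to build resources mid-serving.
        self._states: Dict[int, TierState] = {
            d: TierState(d, f"graph@{d}", f"attn@{d}") for d in depths
        }
        self.active = self._states[depths[0]]  # start at the shallowest tier
        self.switches = 0

    def swap(self, depth: int) -> TierState:
        """Make the tier for `depth` active; unknown depths are rejected."""
        if depth not in self._states:
            raise KeyError(f"no pre-built tier for depth {depth}")
        if depth != self.active.depth:
            self.active = self._states[depth]
            self.switches += 1
        return self.active


pool = TierPool([1, 3, 5])
state = pool.swap(5)  # the algorithm would now run with the depth-5 resources
```

Restricting swaps to pre-built depths keeps a switch down to a reference change, which is the point of preparing the tiers ahead of time.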
Roadmap
Runtime Foundation
Algorithm Coverage
Switch Policy
Compatibility
Reliability