[Roadmap] Adaptive Speculative Decoding Roadmap

## Motivation
Agentic and mixed workloads do not keep a stable speculative decoding acceptance pattern. A server may move between low-acceptance phases, where deep speculation mostly creates wasted draft/verify/extend work, and high-acceptance phases, where shallow speculation leaves available speedup unused.

Adaptive speculative decoding makes the speculative depth a runtime decision. SGLang observes accept_lens/batch_size/seq_lens, smooths them with an EMA/cost-aware policy, and switches among pre-built runtime tiers so each phase can run closer to its preferred depth.
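
To make the observe/smooth/switch loop concrete, here is a minimal sketch of an EMA-based tier selector. It is not the SGLang implementation: the class name `EmaTierPolicy`, the tier depths `(1, 3, 5)`, the smoothing factor, and the "smallest tier that covers the smoothed accept length" rule are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class EmaTierPolicy:
    """Hypothetical policy: smooth accept lengths, then pick a pre-built depth."""

    tiers: Tuple[int, ...] = (1, 3, 5)  # assumed pre-built depths, ascending
    alpha: float = 0.2                  # assumed EMA smoothing factor
    ema: Optional[float] = None         # smoothed accept length, unset until first batch

    def update(self, accept_lens: List[int]) -> int:
        """Fold one batch of accept lengths into the EMA and return the chosen depth."""
        if accept_lens:
            batch_mean = sum(accept_lens) / len(accept_lens)
            if self.ema is None:
                self.ema = batch_mean
            else:
                self.ema = self.alpha * batch_mean + (1 - self.alpha) * self.ema
        if self.ema is None:
            return self.tiers[0]  # no signal yet: stay at the shallowest tier
        # Choose the smallest tier whose depth covers the smoothed accept length.
        for depth in self.tiers:
            if depth >= self.ema:
                return depth
        return self.tiers[-1]


policy = EmaTierPolicy()
low_phase = policy.update([0, 1, 1])   # low acceptance keeps the shallow tier
high_phase = policy.update([5, 5, 5])  # one high batch only nudges the EMA upward
```

Because the EMA damps single-batch spikes, the selected tier moves only after a phase persists for several steps, which is the reason to smooth before switching.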

## Design overview

```
+--------------------------------------------------------------------+
|                  Speculative Decoding Algorithms                   |
|   +--------+  +-----------+  +-------+  +--------+  +------------+ |
|   | EAGLE  |  | EAGLE3    |  | NGRAM |  | DFlash |  | STANDALONE | |
|   +--------+  +-----------+  +-------+  +--------+  +------------+ |
+--------------------------------------------------------------------+
         |  report signals                 ^  swap active state
         v                                 |
+------------------------------------------------------------------+
|                      Adaptive Spec Layer                         |
|  +---------------------------+   +----------------------------+  |
|  |      Policy Module        |   |   Runtime State Module     |  |
|  |                           |   |                            |  |
|  |  signals:                 |   |   SpecRuntimeState Pool    |  |
|  |   - accept length         |   |   +------+ +------+        |  |
|  |   - decode batch size     |   |   |tier 0| |tier 1| ...    |  |
|  |   - seq lens              |   |   +------+ +------+        |  |
|  |   - cost metrics          |   |                            |  |
|  |                           |   |   per-tier resources:      |  |
|  |  strategies:              |   |   - CUDA graphs            |  |
|  |   - EMA                   |   |   - attention backends     |  |
|  |   - cost-aware            |   |   - candidate shapes       |  |
|  +-------------+-------------+   +----------------------------+  |
+------------------------------------------------------------------+
```
# Roadmap

## Runtime Foundation
- [x] Initial PR: add adaptive controller, EMA policy, runtime-state prebuild, and state swap. https://github.com/sgl-project/sglang/pull/21599
- [x] Spec V2 support. https://github.com/sgl-project/sglang/pull/23336
- [ ] Support step = 0 (adaptively enable/disable speculative decoding at runtime). @Qiaolin-Yu

## Algorithm Coverage
- [x] EAGLE/EAGLE3. https://github.com/sgl-project/sglang/pull/21599
- [ ] STANDALONE. https://github.com/sgl-project/sglang/pull/23519
- [ ] NGRAM. https://github.com/sgl-project/sglang/pull/23629
- [ ] DFlash. https://github.com/sgl-project/sglang/pull/24596

## Switch Policy
- [ ] Load-aware policy: adjust speculative depth based on batch_size/seq_lens. https://github.com/sgl-project/sglang/pull/24055
- [ ] Cost-aware policy: choose tiers by expected accepted tokens versus draft, verify, extend, and memory cost. @alphabetc1 
- [ ] Per-request/Mixed-workload policy: use accept-length distribution instead of only batch average.
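
As a rough illustration of the cost-aware item above, the sketch below scores each tier by expected tokens per unit cost and picks the best one. Everything here is an assumption rather than the planned design: the function names, the independent per-token acceptance probability `p_accept`, and the linear cost model with `draft_cost`, `verify_base`, and `verify_per_token`. Extend and memory costs are omitted for brevity.

```python
from typing import Sequence


def expected_tokens(depth: int, p_accept: float) -> float:
    """Expected tokens per step: the accepted draft prefix plus one verified token.

    Assumes each draft token is accepted independently with probability p_accept,
    so a prefix of length k survives with probability p_accept ** k.
    """
    return 1.0 + sum(p_accept ** k for k in range(1, depth + 1))


def pick_tier(
    depths: Sequence[int],
    p_accept: float,
    draft_cost: float,
    verify_base: float,
    verify_per_token: float,
) -> int:
    """Return the depth with the highest expected tokens per unit cost."""

    def throughput(depth: int) -> float:
        # Hypothetical cost model: one draft pass per step, plus a verify pass
        # whose cost grows with the number of tokens it checks (depth + 1).
        cost = depth * draft_cost + verify_base + verify_per_token * (depth + 1)
        return expected_tokens(depth, p_accept) / cost

    return max(depths, key=throughput)


# High acceptance favors deep speculation; low acceptance favors shallow or none.
deep = pick_tier([0, 1, 3, 5], p_accept=0.9, draft_cost=0.1,
                 verify_base=1.0, verify_per_token=0.05)
shallow = pick_tier([0, 1, 3, 5], p_accept=0.2, draft_cost=0.1,
                    verify_base=1.0, verify_per_token=0.05)
```

Including depth 0 in the candidate set lets the same rule cover the step = 0 case, where speculation is switched off because drafting costs more than it returns.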

## Compatibility
- [ ] DP attention compatibility.
- [ ] topk > 1. https://github.com/sgl-project/sglang/pull/24054
- [ ] TBO compatibility.
- [ ] PDMux compatibility.

## Reliability
- [ ] CI coverage: test adaptive spec across Spec V1/V2, CUDA graph, streaming, logprob, grammar, abort, and retract.
- [ ] Observability: add metrics for adaptive spec (active tier, wasted draft ratio). https://github.com/sgl-project/sglang/pull/23942
- [ ] Memory-aware state management/lazy init.
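
For the observability item, one plausible definition of the wasted draft ratio is the fraction of drafted tokens that verification rejects. This definition and the helper name `wasted_draft_ratio` are assumptions for illustration; the linked PR may define the metric differently.

```python
def wasted_draft_ratio(drafted: int, accepted: int) -> float:
    """Fraction of drafted tokens that were rejected (assumed definition)."""
    if drafted <= 0:
        return 0.0  # nothing drafted, so nothing wasted
    if not 0 <= accepted <= drafted:
        raise ValueError("accepted must be between 0 and drafted")
    return (drafted - accepted) / drafted


# Example: 10 drafted tokens with 4 accepted means 60% of draft work was wasted.
ratio = wasted_draft_ratio(drafted=10, accepted=4)
```

A ratio that stays high while a deep tier is active suggests the policy is speculating further than the workload currently rewards.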

[Roadmap] Adaptive Speculative Decoding Roadmap #23705