[Roadmap] Adaptive Speculative Decoding Roadmap #23705

@alphabetc1

Motivation

Agentic and mixed workloads do not keep a stable speculative decoding acceptance pattern. A server may move between low-acceptance phases, where deep speculation mostly creates wasted draft/verify/extend work, and high-acceptance phases, where shallow speculation leaves available speedup unused.

Adaptive speculative decoding makes the speculative depth a runtime decision. SGLang observes accept_lens, batch_size, and seq_lens, smooths them with an EMA or cost-aware policy, and switches among pre-built runtime tiers so each phase can run closer to its preferred depth.
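To make the observe/smooth/switch loop concrete, here is a minimal sketch of an EMA-based tier selector. Everything in it is an assumption for illustration: the class name `EmaTierPolicy`, the tier depths `(1, 3, 5)`, the smoothing factor, and the "pick the smallest tier with headroom above the smoothed accept length" rule are not taken from SGLang, and a cost-aware policy would also weigh batch size and sequence lengths.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class EmaTierPolicy:
    """Hypothetical policy: map a smoothed accept length to a tier index."""

    tiers: Tuple[int, ...] = (1, 3, 5)   # assumed draft depths, ascending
    alpha: float = 0.2                   # assumed EMA smoothing factor
    ema_accept: Optional[float] = None   # smoothed accepted tokens per step

    def observe(self, accept_lens: List[int]) -> None:
        """Fold one step's per-request accept lengths into the EMA."""
        if not accept_lens:
            return
        mean_accept = sum(accept_lens) / len(accept_lens)
        if self.ema_accept is None:
            self.ema_accept = mean_accept
        else:
            self.ema_accept = (
                self.alpha * mean_accept + (1 - self.alpha) * self.ema_accept
            )

    def choose_tier(self) -> int:
        """Return the smallest tier whose depth exceeds the smoothed accept length.

        Requiring strict headroom lets a saturated tier move up, since the
        observed accept length can never exceed the depth currently in use.
        """
        if self.ema_accept is None:
            return 0
        for idx, depth in enumerate(self.tiers):
            if depth > self.ema_accept:
                return idx
        return len(self.tiers) - 1
```

A scheduler loop would call `observe(...)` after each verify step and consult `choose_tier()` at whatever cadence keeps switching cheap; the EMA is what prevents a single unusual batch from flipping tiers.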

Design overview

+--------------------------------------------------------------------+
|                  Speculative Decoding Algorithms                   |
|   +--------+  +-----------+  +-------+  +--------+  +------------+ |
|   | EAGLE  |  | EAGLE3    |  | NGRAM |  | DFlash |  | STANDALONE | |
|   +--------+  +-----------+  +-------+  +--------+  +------------+ |
+--------------------------------------------------------------------+
         |  report signals                 ^  swap active state
         v                                 |
+------------------------------------------------------------------+
|                      Adaptive Spec Layer                         |
|  +---------------------------+   +----------------------------+  |
|  |      Policy Module        |   |   Runtime State Module     |  |
|  |                           |   |                            |  |
|  |  signals:                 |   |   SpecRuntimeState Pool    |  |
|  |   - accept length         |   |   +------+ +------+        |  |
|  |   - decode batch size     |   |   |tier 0| |tier 1| ...    |  |
|  |   - seq lens              |   |   +------+ +------+        |  |
|  |   - cost metrics          |   |                            |  |
|  |                           |   |   per-tier resources:      |  |
|  |  strategies:              |   |   - CUDA graphs            |  |
|  |   - EMA                   |   |   - attention backends     |  |
|  |   - cost-aware            |   |   - candidate shapes       |  |
|  +-------------+-------------+   +----------------------------+  |
+------------------------------------------------------------------+
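The right-hand box of the diagram can be sketched as a small pool of pre-built states with one active entry. Only the name `SpecRuntimeState` comes from the diagram; the fields, the `SpecRuntimeStatePool` class, the `build_state` helper, and the depths are hypothetical placeholders, and real per-tier resources (CUDA graphs, attention backends, candidate shapes) are stood in for by strings.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class SpecRuntimeState:
    """One tier's resources; the fields here are illustrative placeholders."""

    num_draft_steps: int
    cuda_graph_key: str   # stands in for graphs captured at this depth
    attn_backend: str     # stands in for an attention backend handle


class SpecRuntimeStatePool:
    """Holds one pre-built state per tier and tracks which one is active."""

    def __init__(
        self,
        depths: Sequence[int],
        build: Callable[[int], SpecRuntimeState],
    ) -> None:
        # Build every tier up front so a switch never pays setup cost.
        self._states: List[SpecRuntimeState] = [build(d) for d in depths]
        self.active_tier = 0

    @property
    def active(self) -> SpecRuntimeState:
        return self._states[self.active_tier]

    def switch_to(self, tier: int) -> SpecRuntimeState:
        """Make `tier` the active state and return it."""
        if not 0 <= tier < len(self._states):
            raise ValueError(f"unknown tier {tier}")
        self.active_tier = tier
        return self._states[tier]


def build_state(depth: int) -> SpecRuntimeState:
    # Hypothetical builder; a real one would capture graphs and init backends.
    return SpecRuntimeState(depth, f"graph-depth-{depth}", "placeholder-backend")


pool = SpecRuntimeStatePool(depths=[1, 3, 5], build=build_state)
```

In this sketch, swapping the active state is an index update, which is the property that makes pre-building tiers attractive; the cost is holding every tier's resources in memory at once.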

Roadmap

Runtime Foundation

Algorithm Coverage

Switch Policy

Compatibility

Reliability

  • CI coverage: test adaptive spec across Spec V1/V2, CUDA graph, streaming, logprob, grammar, abort, and retract.
  • Observability: add metrics for adaptive spec (active tier, wasted draft ratio). See: Add adaptive speculative decoding observability #23942
  • Memory-aware state management/lazy init.
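As an illustration of the two observability metrics named above, the sketch below tracks the active tier and a wasted draft ratio. The definition used here, one minus accepted draft tokens over drafted tokens, is an assumption made for the example, as are the class and method names; the roadmap does not define the metric.

```python
from dataclasses import dataclass


@dataclass
class AdaptiveSpecMetrics:
    """Hypothetical counters for adaptive speculative decoding."""

    active_tier: int = 0
    drafted_tokens: int = 0
    accepted_tokens: int = 0

    def record_step(self, tier: int, num_drafted: int, num_accepted: int) -> None:
        """Record one verify step: which tier ran and how many drafts survived."""
        if not 0 <= num_accepted <= num_drafted:
            raise ValueError("accepted tokens must be within [0, drafted]")
        self.active_tier = tier
        self.drafted_tokens += num_drafted
        self.accepted_tokens += num_accepted

    @property
    def wasted_draft_ratio(self) -> float:
        """Share of drafted tokens that verification rejected (assumed definition)."""
        if self.drafted_tokens == 0:
            return 0.0
        return 1.0 - self.accepted_tokens / self.drafted_tokens
```

A persistently high ratio at a deep tier is the signal the Motivation section describes as wasted draft/verify work, so exporting it alongside the active tier makes it visible whether the policy is reacting.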
