
[Roadmap] Further Ngram Speculative Decoding Support #21052

@kpham-sgl

Description

Overview

Ngram speculative decoding (paper): during generation, previously decoded tokens are inserted into a trie. Before each forward pass, the trie is queried with the current suffix to produce a draft token tree, constructed via BFS (recency-ordered) or a priority queue (frequency-ordered). The target model then verifies the entire tree in a single pass. No draft model is needed.

Limitations

  1. No external corpus support. The trie is only populated from the current decoding session's output tokens. Ngram speculative decoding works best with a large reference corpus, but there is currently no mechanism to load one.

  2. Insert path does not scale to long inputs. It builds a trie path for almost every suffix, leading to near-O(n²) memory growth; when the cache is at capacity, eviction can only remove leaf nodes and may fail.

Goals

  1. Support external corpus lookup
  2. Scale to long input prefills

Work Items

Refactor

  • Split the monolithic C++ ngram module into separate files - PR #20393
  • Rename branch_length to max_trie_depth - PR #21181
  • Remove max_match_window_size and min_match_window_size so that all suffixes in the trie are matched - PR #21225
  • Maintain per-anchor matching state across decode steps: advance anchors on each new token instead of re-walking from the trie root - PR #21243
  • Fix a race condition in TrieCache::insert() when the worker thread's queue is empty - PR #21186
  • Replace busy-wait polling in Ngram<Cache>::synchronize() with a condition variable - PR #21186
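
The condition-variable pattern referenced in the last item can be sketched as follows (illustrative names and structure, not SGLang's actual Ngram<Cache> API): a background worker drains an insert queue, and synchronize() blocks on a condition variable until all submitted batches are finished, instead of polling a flag in a loop.

```cpp
#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Sketch of asynchronous trie insertion with a condvar-based synchronize().
class AsyncInserter {
    std::mutex mu_;
    std::condition_variable cv_;
    std::queue<std::vector<int>> pending_;
    size_t unfinished_ = 0;        // queued + in-flight batches
    size_t processed_tokens_ = 0;  // stand-in for "tokens inserted so far"
    bool stop_ = false;
    std::thread worker_;

    void run() {
        std::unique_lock<std::mutex> lock(mu_);
        while (true) {
            cv_.wait(lock, [&] { return stop_ || !pending_.empty(); });
            if (pending_.empty()) return;  // stop_ set and nothing left
            std::vector<int> batch = std::move(pending_.front());
            pending_.pop();
            lock.unlock();
            // ... real code would insert `batch` into the trie here ...
            lock.lock();
            processed_tokens_ += batch.size();
            if (--unfinished_ == 0) cv_.notify_all();  // wake synchronize()
        }
    }

public:
    AsyncInserter() : worker_([this] { run(); }) {}
    ~AsyncInserter() {
        { std::lock_guard<std::mutex> g(mu_); stop_ = true; }
        cv_.notify_all();
        worker_.join();
    }
    void submit(std::vector<int> tokens) {
        {
            std::lock_guard<std::mutex> g(mu_);
            pending_.push(std::move(tokens));
            ++unfinished_;
        }
        cv_.notify_one();
    }
    // Blocks until every submitted batch has been inserted -- no busy-wait.
    void synchronize() {
        std::unique_lock<std::mutex> lock(mu_);
        cv_.wait(lock, [&] { return unfinished_ == 0; });
    }
    size_t processed_tokens() {
        std::lock_guard<std::mutex> g(mu_);
        return processed_tokens_;
    }
};
```

Tracking an `unfinished_` counter (rather than just queue emptiness) matters: the queue is briefly empty while a batch is in flight, and waiting on the counter prevents synchronize() from returning early.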

Adaptive Spec Dec

Spec V2

  • Ngram Spec V2 (requires top_k > 1 and, optionally, page_size > 1 for Ngram)

New Features

  • Load an external corpus (in a pre-specified format) via a server argument and build a Suffix Automaton (SAM) over it - PR #21425
  • Support multiple SAMs with dynamic add/remove via an HTTP API - PR #22203
  • Enhance multi-SAM support - PR #22294, PR #22471
  • Accept-length benchmark: using the output as the corpus shows a ≥2x accept-length boost from the SAM - PR #22199
  • Better solution for combineResults: instead of relying on a fixed external_sam_budget, allocate match tokens across the Trie and SAM(s) with weighted dynamic allocation based on match length / recency / frequency - PR #22538
  • Accept-length benchmark: output-as-corpus and irrelevant corpora show that dynamic allocation works - PR #22569
  • Per-request Trie - PR #22737
  • Expand the SAM match length (currently still capped by max_trie_depth)
  • Build a SAM over the user request's input
  • Let user requests pass corpus_ids to match against
  • Transport SAMs across schedulers in multi-scheduler systems (disagg, DP, etc.)
