
[Roadmap] Further Ngram Speculative Decoding Support #21052

@kpham-sgl

Description

Overview

Ngram speculative decoding (paper): during generation, previously decoded tokens are inserted into a trie. Before each forward pass, the trie is queried with the current suffix to produce a draft token tree, constructed via BFS (recency-ordered) or a priority queue (frequency-ordered). The target model then verifies the entire tree in a single pass. No draft model is needed.

Limitations

  1. No external corpus support. The trie is only populated from the current decoding session's output tokens. Ngram speculative decoding works best with a large reference corpus, but there is currently no mechanism to load one.

  2. Insert path does not scale to long inputs. It builds a trie path for almost every suffix, leading to near-O(n²) memory growth; when the cache is at capacity, eviction can only remove leaf nodes and may fail.

Goals

  1. Support external corpus lookup
  2. Scale to long input prefills

Work Items

Refactor

  • Split the monolithic C++ ngram module into separate files - PR #20393
  • Rename branch_length to max_trie_depth - PR #21181
  • Remove max_match_window_size and min_match_window_size so that all suffixes in the trie are matched - PR #21225
  • Maintain per-anchor matching state across decode steps: advance anchors on each new token instead of re-walking from the trie root - PR #21243
  • Fix a race condition in TrieCache::insert() when the worker thread's queue is empty - PR #21186
  • Replace busy-wait polling in Ngram<Cache>::synchronize() with a condition variable - PR #21186
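
The condition-variable pattern referenced in the last item can be sketched as follows (illustrative names and structure, not SGLang's actual Ngram<Cache> API): a background worker drains an insert queue, and synchronize() blocks on a condition variable until all submitted batches are finished, instead of polling a flag in a loop.

```cpp
#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Sketch of asynchronous trie insertion with a condvar-based synchronize().
class AsyncInserter {
    std::mutex mu_;
    std::condition_variable cv_;
    std::queue<std::vector<int>> pending_;
    size_t unfinished_ = 0;        // queued + in-flight batches
    size_t processed_tokens_ = 0;  // stand-in for "tokens inserted so far"
    bool stop_ = false;
    std::thread worker_;

    void run() {
        std::unique_lock<std::mutex> lock(mu_);
        while (true) {
            cv_.wait(lock, [&] { return stop_ || !pending_.empty(); });
            if (pending_.empty()) return;  // stop_ set and nothing left
            std::vector<int> batch = std::move(pending_.front());
            pending_.pop();
            lock.unlock();
            // ... real code would insert `batch` into the trie here ...
            lock.lock();
            processed_tokens_ += batch.size();
            if (--unfinished_ == 0) cv_.notify_all();  // wake synchronize()
        }
    }

public:
    AsyncInserter() : worker_([this] { run(); }) {}
    ~AsyncInserter() {
        { std::lock_guard<std::mutex> g(mu_); stop_ = true; }
        cv_.notify_all();
        worker_.join();
    }
    void submit(std::vector<int> tokens) {
        {
            std::lock_guard<std::mutex> g(mu_);
            pending_.push(std::move(tokens));
            ++unfinished_;
        }
        cv_.notify_one();
    }
    // Blocks until every submitted batch has been inserted -- no busy-wait.
    void synchronize() {
        std::unique_lock<std::mutex> lock(mu_);
        cv_.wait(lock, [&] { return unfinished_ == 0; });
    }
    size_t processed_tokens() {
        std::lock_guard<std::mutex> g(mu_);
        return processed_tokens_;
    }
};
```

Tracking an `unfinished_` counter (rather than just queue emptiness) matters: the queue is briefly empty while a batch is in flight, and waiting on the counter prevents synchronize() from returning early.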

Adaptive Spec Dec

Spec V2

  • Ngram Spec V2 (requires top_k > 1 and, optionally, page_size > 1 for Ngram)

New Features

  • Load an external corpus (in a pre-specified format) via a server argument and build a Suffix Automaton (SAM) over it - PR #21425
  • Support multiple SAMs with dynamic add/remove via an HTTP API - PR #22203
  • Enhance multi-SAM support - PR #22294, PR #22471
  • Accept-length benchmark: using the output as the corpus shows a ≥2x accept-length boost from the SAM - PR #22199
  • Better solution for combineResults: instead of relying on a fixed external_sam_budget, allocate match tokens across the Trie and SAM(s) with weighted dynamic allocation based on match length / recency / frequency - PR #22538
  • Accept-length benchmark: output-as-corpus and irrelevant corpora show that dynamic allocation works - PR #22569
  • Per-request Trie - PR #22737
  • Expand the SAM match length (currently still capped by max_trie_depth)
  • Build a SAM over the user request's input
  • Let user requests pass corpus_ids to match against
  • Transport SAMs across schedulers in multi-scheduler systems (disagg, DP, etc.)
