Overview
Ngram speculative decoding (paper): during generation, previously decoded tokens are inserted into a trie. Before each forward pass, the trie is queried with the current suffix to produce a draft token tree, constructed via BFS (recency) or priority queue (frequency). The target model then verifies the entire tree in one pass. No draft model needed.
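The insert/query cycle described above can be sketched as follows. This is a minimal illustration, not the project's implementation: the names `NgramTrie`, `insert`, `draft`, and the `max_depth` / `max_draft` parameters are hypothetical, children are visited in plain insertion order (no recency or frequency tracking), and verification by the target model is left out.

```python
from collections import deque


class TrieNode:
    def __init__(self):
        # token id -> TrieNode; dicts preserve insertion order
        self.children = {}


class NgramTrie:
    """Hypothetical toy trie over token ids (illustrative only)."""

    def __init__(self, max_depth=4):
        self.root = TrieNode()
        self.max_depth = max_depth

    def insert(self, tokens):
        # Insert the window of up to max_depth tokens starting at every position.
        for start in range(len(tokens)):
            node = self.root
            for tok in tokens[start:start + self.max_depth]:
                node = node.children.setdefault(tok, TrieNode())

    def _walk(self, path):
        node = self.root
        for tok in path:
            node = node.children.get(tok)
            if node is None:
                return None
        return node

    def draft(self, context, max_draft=4):
        # Try the longest suffix first. A suffix of length max_depth would end
        # at a node with no children, so start from max_depth - 1.
        for n in range(min(self.max_depth - 1, len(context)), 0, -1):
            node = self._walk(context[-n:])
            if node is not None and node.children:
                return self._bfs(node, max_draft)
        return []

    def _bfs(self, node, max_draft):
        # Returns (token, parent_index) pairs; parent_index -1 means the
        # token directly follows the matched suffix.
        out = []
        queue = deque([(node, -1)])
        while queue and len(out) < max_draft:
            cur, parent = queue.popleft()
            for tok, child in cur.children.items():
                if len(out) >= max_draft:
                    break
                out.append((tok, parent))
                queue.append((child, len(out) - 1))
        return out


trie = NgramTrie(max_depth=4)
trie.insert([1, 2, 3, 4, 1, 2, 5])
draft_tree = trie.draft([9, 1, 2])
print(draft_tree)  # [(3, -1), (5, -1), (4, 0)]
```

Here the suffix `[1, 2]` has been seen twice, followed by `3` and by `5`, so the draft tree has two branches at the root, and the `3` branch continues with `4`. The target model would then score all three draft tokens in a single pass.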
Limitations
- No external corpus support. The trie is only populated from the current decoding session's output tokens. Ngram speculative decoding works best with a large reference corpus, but there is currently no mechanism to load one.
- Insert path does not scale to long inputs. It builds trie paths for almost every suffix, leading to near-O(n²) memory growth; when capacity is full, eviction can only remove leaf nodes and may fail.
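To see where the near-quadratic growth comes from, the sketch below counts the nodes created when every suffix of an input is inserted into a trie. It is a toy model, not the project's insert path: `count_suffix_trie_nodes` is a hypothetical helper, the input is a worst case with no repeated tokens, and the optional `max_depth` argument is included only for contrast.

```python
def count_suffix_trie_nodes(tokens, max_depth=None):
    """Count nodes created when each suffix (optionally truncated) is inserted."""
    root = {}
    count = 0
    for start in range(len(tokens)):
        if max_depth is None:
            end = len(tokens)
        else:
            end = min(len(tokens), start + max_depth)
        node = root
        for tok in tokens[start:end]:
            if tok not in node:
                node[tok] = {}
                count += 1
            node = node[tok]
    return count


# Worst case: all tokens distinct, so no suffix shares a path with another.
full_100 = count_suffix_trie_nodes(list(range(100)))
full_200 = count_suffix_trie_nodes(list(range(200)))
capped_100 = count_suffix_trie_nodes(list(range(100)), max_depth=8)
print(full_100, full_200, capped_100)  # 5050 20100 772
```

Doubling the input length roughly quadruples the node count (n(n+1)/2 nodes for n distinct tokens), while truncating each inserted path keeps growth linear in n. Real inputs repeat tokens and share prefixes, so actual growth is somewhat lower than this worst case.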
Goals
- Support external corpus lookup
- Scale to long input prefills
Work Items
Refactor
- branch_length to max_trie_depth - PR #21181
- max_match_window_size and min_match_window_size to match all suffixes in the trie - PR #21225
- TrieCache::insert() when the worker thread's queue is empty - PR #21186
- Ngram<Cache>::synchronize() with a condition variable - PR #21186
Adaptive Spec Dec
Spec V2
- top_k > 1 and optionally page_size > 1 for Ngram)
New Features
- combineResults. Instead of relying on fixed external_sam_budget, use weighted dynamic allocation for Trie, SAM(s) match tokens based on longer matching / recency / frequency. - PR #22538
- max_trie_depth)
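One way to picture the weighted dynamic allocation mentioned for combineResults is a proportional split of a draft-token budget across match sources. The sketch below is an assumption-laden illustration, not what PR #22538 does: `allocate_draft_budget` and the source names are hypothetical, and it weights only by match length, ignoring recency and frequency.

```python
def allocate_draft_budget(match_lengths, total_budget):
    """Split total_budget draft tokens across sources in proportion to match length.

    match_lengths maps a source name (e.g. "trie", "sam") to the length of the
    suffix it matched. Uses the largest-remainder method so the integer shares
    always sum to total_budget.
    """
    total = sum(match_lengths.values())
    if total == 0 or total_budget <= 0:
        return {name: 0 for name in match_lengths}

    exact = {
        name: total_budget * length / total
        for name, length in match_lengths.items()
    }
    alloc = {name: int(share) for name, share in exact.items()}
    leftover = total_budget - sum(alloc.values())

    # Hand the remaining tokens to the sources with the largest fractional part.
    by_remainder = sorted(
        match_lengths,
        key=lambda name: exact[name] - alloc[name],
        reverse=True,
    )
    for name in by_remainder[:leftover]:
        alloc[name] += 1
    return alloc


shares = allocate_draft_budget({"trie": 3, "sam": 5}, total_budget=8)
print(shares)  # {'trie': 3, 'sam': 5}
```

A source with a longer match gets a larger share of the draft tree, and a source with no match gets none, which is the behavior a fixed per-source budget cannot provide.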