[Spec][Ngram] 7/N: Dynamically select draft token counts from SAMs and Trie #22538
kpham-sgl wants to merge 6 commits into sgl-project:main
Code Review
This pull request refactors the ngram corpus matching logic to introduce a weighted budget allocation system for speculative decoding. It now distributes draft tokens between the live Trie and external Suffix Automata based on match quality metrics like specificity and confidence. The changes include new helper functions for budget allocation, updated parameter validations, and refactored buildRecency and buildFrequency methods in Trie and SuffixAutomaton to separate anchor matching from result building. An unused batchMatch overload and its FFI binding were removed. Review comments suggest improving the readability and efficiency of a sorting lambda by caching source objects and simplifying a conditional expression for better clarity.
Motivation
Part of Ngram series #21052
This PR removes the old fixed trie/SAM draft-budget split. Instead of reserving draft tokens with external_sam_budget, the trie and each loaded SAM are treated as candidate sources scored by

score = source_prior * (w_specificity * specificity + w_confidence * confidence)

where w_specificity and w_confidence are normalized from the user-provided weights. Sources are merged in score order, and the final merged tree is capped only by num_draft_tokens.

This lets multiple external corpora participate in drafting without hard-partitioning draft capacity across the trie and SAMs.
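The scoring and merge rule above can be sketched in Python. This is only an illustration under stated assumptions, not the PR's C++ implementation: `SourceMatch`, `normalize_weights`, `score`, and `merge_sources` are hypothetical names, the metrics are assumed to lie in [0, 1], each source is assumed to propose token-id paths, and the cap is assumed to count non-root tree nodes.

```python
from dataclasses import dataclass
from typing import List, Sequence, Set, Tuple


@dataclass
class SourceMatch:
    """One candidate draft source (the trie or one SAM). Hypothetical type."""
    name: str
    source_prior: float      # per-source prior, e.g. trie_source_prior for the trie
    specificity: float       # match-quality metric, assumed to lie in [0, 1]
    confidence: float        # match-quality metric, assumed to lie in [0, 1]
    paths: List[List[int]]   # proposed continuations as token-id paths


def normalize_weights(specificity_weight: float,
                      confidence_weight: float) -> Tuple[float, float]:
    """Scale the two user-provided weights so that they sum to 1."""
    total = specificity_weight + confidence_weight
    if specificity_weight < 0 or confidence_weight < 0 or total <= 0:
        raise ValueError("weights must be non-negative with a positive sum")
    return specificity_weight / total, confidence_weight / total


def score(src: SourceMatch, w_specificity: float, w_confidence: float) -> float:
    """score = source_prior * (w_specificity * specificity + w_confidence * confidence)"""
    return src.source_prior * (
        w_specificity * src.specificity + w_confidence * src.confidence
    )


def merge_sources(sources: Sequence[SourceMatch],
                  specificity_weight: float,
                  confidence_weight: float,
                  num_draft_tokens: int) -> Set[Tuple[int, ...]]:
    """Merge sources in descending score order into one prefix tree.

    The tree is represented as a set of prefixes (one per node). The only
    limit is num_draft_tokens; no per-source budget is reserved.
    """
    w_spec, w_conf = normalize_weights(specificity_weight, confidence_weight)
    ranked = sorted(sources, key=lambda s: score(s, w_spec, w_conf), reverse=True)

    nodes: Set[Tuple[int, ...]] = set()
    for src in ranked:
        for path in src.paths:
            # Walk from the shortest prefix outward so that a parent node is
            # always added before its children.
            for depth in range(1, len(path) + 1):
                prefix = tuple(path[:depth])
                if prefix in nodes:
                    continue  # node shared with a higher-ranked source
                if len(nodes) >= num_draft_tokens:
                    return nodes
                nodes.add(prefix)
    return nodes
```

With equal weights, a SAM whose specificity and confidence are both 0.9 outranks a trie at 0.5 with the same prior, so the SAM's paths fill the tree first and the trie contributes only the nodes that still fit under the cap.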
Followed by #22569, which benchmarks this PR's effect on accept length across some experiments.
Modifications
matching, root-result merging, and multi-corpus management.
can be loaded and used during Ngram speculative decoding.
Remove external_sam_budget and min_trie_share; keep trie_source_prior, match_specificity_weight, and match_confidence_weight as source-ranking knobs.
behavior, server args, and the output-as-corpus accept-length regression.
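As a rough illustration of how the three retained knobs might be validated before use, here is a minimal sketch. The container class `NgramSourceRankingArgs`, its default values, and the specific checks are assumptions made for the example; only the three field names come from this PR's description, and the actual server-args validation may differ.

```python
import math
from dataclasses import dataclass


@dataclass
class NgramSourceRankingArgs:
    """Hypothetical container; defaults are placeholders, not the PR's values."""
    trie_source_prior: float = 1.0
    match_specificity_weight: float = 1.0
    match_confidence_weight: float = 1.0

    def validate(self) -> None:
        # Assumed rule: every knob is a finite, non-negative number.
        for name in ("trie_source_prior",
                     "match_specificity_weight",
                     "match_confidence_weight"):
            value = getattr(self, name)
            if not math.isfinite(value) or value < 0:
                raise ValueError(
                    f"{name} must be finite and non-negative, got {value}"
                )
        # Assumed rule: the weights are normalized by their sum, so the sum
        # must be strictly positive.
        if self.match_specificity_weight + self.match_confidence_weight <= 0:
            raise ValueError(
                "match_specificity_weight and match_confidence_weight "
                "must not both be zero"
            )
```

Rejecting a zero weight sum up front keeps the later normalization step free of division-by-zero handling.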
Accuracy Tests
Passed:
python3 -m pytest test/registered/unit/spec/test_ngram_corpus.py -k "TestNgramCorpusExternalSam or TestNgramCorpusMultiSam"
python3 -m pytest test/registered/unit/server_args/test_server_args.py -k "NgramExternalSamArgs"
python3 -m pytest test/registered/spec/test_ngram_speculative_decoding.py -k "TestNgramSpeculativeDecodingFlashinfer and output_as_corpus_boosts_accept_length"
See #22569 for extra benchmarks on accept length
Speed Tests and Profiling
See #22569 for extra benchmarks on accept length