[SPEC][4/N] feat: adaptive spec support ngram #23629
alphabetc1 wants to merge 1 commit into sgl-project:main
Conversation
Code Review
This pull request enables adaptive speculative decoding for the NGRAM algorithm by integrating the AdaptiveController into the NgramWorker. The changes allow for dynamic adjustment of draft token counts through the creation and application of SpecRuntimeState objects. However, several issues were identified: re-allocating GPU tensors during state transitions will invalidate CUDA graphs, and creating separate attention workspaces for each candidate tier may result in excessive memory consumption. Additionally, a redundant variable assignment in build_adaptive_runtime_state should be removed to improve clarity.
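For orientation, here is a minimal sketch of the moving parts named above. The two fields and the method `apply_runtime_state` appear in the diffs below; `select_tier` and the surrounding flow are illustrative assumptions, not the PR's actual code:

```python
from dataclasses import dataclass

@dataclass
class SpecRuntimeState:
    # Tier parameters; these field names appear in the diff below.
    speculative_num_steps: int
    speculative_num_draft_tokens: int

def decode_step(worker, controller):
    # Hypothetical flow: the AdaptiveController picks a draft-token tier
    # and the NgramWorker applies it before drafting.
    state = controller.select_tier()   # select_tier is an assumed name
    worker.apply_runtime_state(state)  # method name from the diff
```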
```python
            state.speculative_num_draft_tokens,
        )
        self.draft_token_num = state.speculative_num_draft_tokens
        self._init_preallocated_tensors()
```
CRITICAL: Re-allocating GPU tensors in apply_runtime_state will break CUDA graphs. The target_graph_runner for each tier is built during init_states and captures the memory addresses of the worker's preallocated tensors (such as tree_mask and positions). When apply_runtime_state calls _init_preallocated_tensors(), it re-allocates these tensors, making the pointers captured in ALL previously built graph runners stale. This will lead to illegal memory access or incorrect results when the graph is replayed.
To fix this, allocate these buffers once for _max_draft_token_num during initialization and only update the batch views (slices) in apply_runtime_state without re-allocating the base tensors.
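A minimal sketch of that suggestion, reusing names from the diff (`_init_preallocated_tensors`, `apply_runtime_state`, `_max_draft_token_num`, `tree_mask`, `positions`); `max_bs`, the dtypes, and the flattened shapes are illustrative assumptions, not the worker's actual layout:

```python
import torch

class NgramWorkerSketch:
    def _init_preallocated_tensors(self):
        # Allocate once, sized for the largest candidate tier, so the
        # addresses captured by CUDA graphs never change afterwards.
        n, bs = self._max_draft_token_num, self.max_bs  # max_bs: assumed
        self._tree_mask_base = torch.zeros(bs * n * n, dtype=torch.bool, device="cuda")
        self._positions_base = torch.zeros(bs * n, dtype=torch.int64, device="cuda")

    def apply_runtime_state(self, state):
        k = state.speculative_num_draft_tokens
        self.draft_token_num = k
        # Re-slice views into the stable base buffers instead of
        # re-allocating; the base pointers never move, so previously
        # captured graph runners remain valid on replay.
        self.tree_mask = self._tree_mask_base[: self.max_bs * k * k]
        self.positions = self._positions_base[: self.max_bs * k]
```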
```python
target_attn_backend = target_model_runner._get_attention_backend(
    init_new_workspace=True
)
```
Passing init_new_workspace=True to _get_attention_backend for every candidate tier during init_states may lead to excessive GPU memory consumption, as each tier's SpecRuntimeState will hold its own attention workspace. Consider if the workspace can be shared or if the allocation can be deferred/optimized, similar to the implementation in EAGLEWorker.
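One possible direction, sketched under the assumption that a single workspace sized for the largest tier can serve the smaller ones; only `_get_attention_backend(init_new_workspace=True)` comes from the diff, the rest is illustrative:

```python
# Allocate one attention workspace, sized for the largest candidate
# tier, and hand it to every tier's runtime state instead of creating
# a fresh workspace per tier.
shared_backend = target_model_runner._get_attention_backend(
    init_new_workspace=True
)
for num_draft_tokens in candidate_tiers:  # candidate_tiers: assumed name
    runtime_states[num_draft_tokens] = build_runtime_state_with_backend(
        num_draft_tokens, shared_backend  # assumed helper, for illustration
    )
```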
```python
    self, speculative_num_steps: int, speculative_num_draft_tokens: int
) -> SpecRuntimeState:
    """Build a NGRAM runtime state for the given draft-token count."""
    speculative_num_steps = speculative_num_draft_tokens - 1
```
Force-pushed from 9316ec7 to 7047972
kpham-sgl left a comment
Thanks for the contribution! I think the Ngram adaptive speculative algorithm can be a bit more sophisticated (e.g., based on SAM and Trie confidence match scores). We can discuss this more later.
I would hold off on merging this PR until the recent PRs in #21052 are merged.
```python
) -> SpecRuntimeState:
    """Build a NGRAM runtime state for the given draft-token count."""
    tic = time.perf_counter()
    before_mem = get_available_gpu_memory(self.server_args.device, self.gpu_id)
```
The Ngram drafter is fully CPU-based, so we probably don't need the available GPU memory information here.
```python
# Truncate if corpus produces more tokens than current config
if self._max_draft_token_num > self.draft_token_num:
    n, k = self._max_draft_token_num, self.draft_token_num
    req_drafts = req_drafts.reshape(bs, n)[:, :k].flatten()
    mask = mask.reshape(bs, n, n)[:, :k, :k].flatten()
```
A bit confused about why this case would happen. Can you elaborate?
Motivation
Support adaptive speculative decoding for the NGRAM algorithm.
Model and hardware:
- /models/LLM-Research/Meta-Llama-3.1-8B-Instruct
- 1x H20 (CUDA_VISIBLE_DEVICES=0)

Benchmark script:
Mode definitions:
- `--speculative-num-steps 1 --speculative-num-draft-tokens 2`
- `--speculative-num-steps 3 --speculative-num-draft-tokens 4`
- `--speculative-num-steps 7 --speculative-num-draft-tokens 8`
- `--speculative-num-steps 15 --speculative-num-draft-tokens 16`
- `--speculative-adaptive --speculative-adaptive-config <cfg>`
Shared NGRAM flags:
- `--speculative-algorithm NGRAM`
- `--speculative-ngram-max-bfs-breadth 1`
- `--attention-backend triton`
- `--skip-server-warmup`
- `--mem-fraction-static 0.7`

Adaptive config in this rerun:
{"candidate_steps": [1, 3, 7, 15]}Note: the config is explicit because the current workspace default remains
[1, 3, 7].Server Launch Commands
Static15
Adaptive
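The commands themselves were collapsed in the original page; assembled from the flags listed above, they would look roughly like the following (the `sglang.launch_server` entry point is assumed; every flag value is taken from the mode definitions):

```bash
# Static15 (steps=15, draft-tokens=16); other static tiers swap these two values.
CUDA_VISIBLE_DEVICES=0 python -m sglang.launch_server \
  --model-path /models/LLM-Research/Meta-Llama-3.1-8B-Instruct \
  --speculative-algorithm NGRAM \
  --speculative-ngram-max-bfs-breadth 1 \
  --attention-backend triton \
  --skip-server-warmup \
  --mem-fraction-static 0.7 \
  --speculative-num-steps 15 \
  --speculative-num-draft-tokens 16

# Adaptive, with the explicit candidate config from above.
CUDA_VISIBLE_DEVICES=0 python -m sglang.launch_server \
  --model-path /models/LLM-Research/Meta-Llama-3.1-8B-Instruct \
  --speculative-algorithm NGRAM \
  --speculative-ngram-max-bfs-breadth 1 \
  --attention-backend triton \
  --skip-server-warmup \
  --mem-fraction-static 0.7 \
  --speculative-adaptive \
  --speculative-adaptive-config '{"candidate_steps": [1, 3, 7, 15]}'
```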
Result Summary
- Adaptive with candidates [1, 3, 7, 15] did not beat the best static tier.

Overall Comparison
Per-Phase Throughput
Modifications
Accuracy Tests
Speed Tests and Profiling
Checklist
Review and Merge Process
`/tag-and-rerun-ci`, `/tag-run-ci-label`, `/rerun-failed-ci`