[SPEC][4/N] feat: adaptive spec support ngram #23629
alphabetc1 wants to merge 1 commit into sgl-project:main
Conversation
Code Review
This pull request enables adaptive speculative decoding for the NGRAM algorithm by integrating the AdaptiveController into the NgramWorker. The changes allow for dynamic adjustment of draft token counts through the creation and application of SpecRuntimeState objects. However, several issues were identified: re-allocating GPU tensors during state transitions will invalidate CUDA graphs, and creating separate attention workspaces for each candidate tier may result in excessive memory consumption. Additionally, a redundant variable assignment in build_adaptive_runtime_state should be removed to improve clarity.
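For orientation, here is a minimal sketch of the moving parts named above. The two fields and the method `apply_runtime_state` appear in the diffs below; `select_tier` and the surrounding flow are illustrative assumptions, not the PR's actual code:

```python
from dataclasses import dataclass

@dataclass
class SpecRuntimeState:
    # Tier parameters; these field names appear in the diff below.
    speculative_num_steps: int
    speculative_num_draft_tokens: int

def decode_step(worker, controller):
    # Hypothetical flow: the AdaptiveController picks a draft-token tier
    # and the NgramWorker applies it before drafting.
    state = controller.select_tier()   # select_tier is an assumed name
    worker.apply_runtime_state(state)  # method name from the diff
```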
```python
            state.speculative_num_draft_tokens,
        )
        self.draft_token_num = state.speculative_num_draft_tokens
        self._init_preallocated_tensors()
```
CRITICAL: Re-allocating GPU tensors in apply_runtime_state will break CUDA graphs. The target_graph_runner for each tier is built during init_states and captures the memory addresses of the worker's preallocated tensors (such as tree_mask and positions). When apply_runtime_state calls _init_preallocated_tensors(), it re-allocates these tensors, making the pointers captured in ALL previously built graph runners stale. This will lead to illegal memory access or incorrect results when the graph is replayed.
To fix this, allocate these buffers once for _max_draft_token_num during initialization and only update the batch views (slices) in apply_runtime_state without re-allocating the base tensors.
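A minimal sketch of that suggestion, reusing names from the diff (`_init_preallocated_tensors`, `apply_runtime_state`, `_max_draft_token_num`, `tree_mask`, `positions`); `max_bs`, the dtypes, and the flattened shapes are illustrative assumptions, not the worker's actual layout:

```python
import torch

class NgramWorkerSketch:
    def _init_preallocated_tensors(self):
        # Allocate once, sized for the largest candidate tier, so the
        # addresses captured by CUDA graphs never change afterwards.
        n, bs = self._max_draft_token_num, self.max_bs  # max_bs: assumed
        self._tree_mask_base = torch.zeros(bs * n * n, dtype=torch.bool, device="cuda")
        self._positions_base = torch.zeros(bs * n, dtype=torch.int64, device="cuda")

    def apply_runtime_state(self, state):
        k = state.speculative_num_draft_tokens
        self.draft_token_num = k
        # Re-slice views into the stable base buffers instead of
        # re-allocating; the base pointers never move, so previously
        # captured graph runners remain valid on replay.
        self.tree_mask = self._tree_mask_base[: self.max_bs * k * k]
        self.positions = self._positions_base[: self.max_bs * k]
```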
```python
target_attn_backend = target_model_runner._get_attention_backend(
    init_new_workspace=True
)
```
Passing init_new_workspace=True to _get_attention_backend for every candidate tier during init_states may lead to excessive GPU memory consumption, as each tier's SpecRuntimeState will hold its own attention workspace. Consider if the workspace can be shared or if the allocation can be deferred/optimized, similar to the implementation in EAGLEWorker.
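One possible direction, sketched under the assumption that a single workspace sized for the largest tier can serve the smaller ones; only `_get_attention_backend(init_new_workspace=True)` comes from the diff, the rest is illustrative:

```python
# Allocate one attention workspace, sized for the largest candidate
# tier, and hand it to every tier's runtime state instead of creating
# a fresh workspace per tier.
shared_backend = target_model_runner._get_attention_backend(
    init_new_workspace=True
)
for num_draft_tokens in candidate_tiers:  # candidate_tiers: assumed name
    runtime_states[num_draft_tokens] = build_runtime_state_with_backend(
        num_draft_tokens, shared_backend  # assumed helper, for illustration
    )
```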
```python
    self, speculative_num_steps: int, speculative_num_draft_tokens: int
) -> SpecRuntimeState:
    """Build a NGRAM runtime state for the given draft-token count."""
    speculative_num_steps = speculative_num_draft_tokens - 1
```
Force-pushed from 9316ec7 to 7047972
kpham-sgl left a comment
Thanks for the contribution! I think the Ngram adaptive speculative algorithm can be a bit more sophisticated (e.g., based on SAM and Trie confidence match scores). We can discuss this more later.
I would hold off on merging this PR until the recent PRs in #21052 are merged.
```python
) -> SpecRuntimeState:
    """Build a NGRAM runtime state for the given draft-token count."""
    tic = time.perf_counter()
    before_mem = get_available_gpu_memory(self.server_args.device, self.gpu_id)
```
The Ngram drafter is fully CPU-based, so we probably don't need the available GPU memory information here.
```python
# Truncate if corpus produces more tokens than current config
if self._max_draft_token_num > self.draft_token_num:
    n, k = self._max_draft_token_num, self.draft_token_num
    req_drafts = req_drafts.reshape(bs, n)[:, :k].flatten()
    mask = mask.reshape(bs, n, n)[:, :k, :k].flatten()
```
A bit confused about why this case would happen. Can you elaborate?
Motivation
Support adaptive speculative decoding for the NGRAM algorithm.
Model and hardware:
- /models/LLM-Research/Meta-Llama-3.1-8B-Instruct
- 1x H20 (CUDA_VISIBLE_DEVICES=0)

Benchmark script:
Mode definitions:
- `--speculative-num-steps 1 --speculative-num-draft-tokens 2`
- `--speculative-num-steps 3 --speculative-num-draft-tokens 4`
- `--speculative-num-steps 7 --speculative-num-draft-tokens 8`
- `--speculative-num-steps 15 --speculative-num-draft-tokens 16`
- `--speculative-adaptive --speculative-adaptive-config <cfg>`
Shared NGRAM flags:
- `--speculative-algorithm NGRAM`
- `--speculative-ngram-max-bfs-breadth 1`
- `--attention-backend triton`
- `--skip-server-warmup`
- `--mem-fraction-static 0.7`

Adaptive config in this rerun:
{"candidate_steps": [1, 3, 7, 15]}Note: the config is explicit because the current workspace default remains
[1, 3, 7].Server Launch Commands
Static15
Adaptive
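The commands themselves were collapsed in the original page; assembled from the flags listed above, they would look roughly like the following (the `sglang.launch_server` entry point is assumed; every flag value is taken from the mode definitions):

```bash
# Static15 (steps=15, draft-tokens=16); other static tiers swap these two values.
CUDA_VISIBLE_DEVICES=0 python -m sglang.launch_server \
  --model-path /models/LLM-Research/Meta-Llama-3.1-8B-Instruct \
  --speculative-algorithm NGRAM \
  --speculative-ngram-max-bfs-breadth 1 \
  --attention-backend triton \
  --skip-server-warmup \
  --mem-fraction-static 0.7 \
  --speculative-num-steps 15 \
  --speculative-num-draft-tokens 16

# Adaptive, with the explicit candidate config from above.
CUDA_VISIBLE_DEVICES=0 python -m sglang.launch_server \
  --model-path /models/LLM-Research/Meta-Llama-3.1-8B-Instruct \
  --speculative-algorithm NGRAM \
  --speculative-ngram-max-bfs-breadth 1 \
  --attention-backend triton \
  --skip-server-warmup \
  --mem-fraction-static 0.7 \
  --speculative-adaptive \
  --speculative-adaptive-config '{"candidate_steps": [1, 3, 7, 15]}'
```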
Result Summary
- Adaptive with candidates [1, 3, 7, 15] did not beat the best static tier.

Overall Comparison
Per-Phase Throughput
Modifications
Accuracy Tests
Speed Tests and Profiling
Checklist
Review and Merge Process
`/tag-and-rerun-ci`, `/tag-run-ci-label`, `/rerun-failed-ci`