[SPEC][4/N] feat: adaptive spec support ngram #23629

Open

alphabetc1 wants to merge 1 commit into sgl-project:main from alphabetc1:feat/adaptive_spec_ngram

Conversation

@alphabetc1 (Collaborator)

Motivation

Add adaptive speculative decoding support for the NGRAM algorithm, allowing the draft token count to be adjusted dynamically at runtime.

Model and hardware:

  • Target model: /models/LLM-Research/Meta-Llama-3.1-8B-Instruct
  • Hardware: 1x H20 (CUDA_VISIBLE_DEVICES=0)

Benchmark script:

PYTHONPATH=python python3 benchmark/bench_adaptive_speculative.py \
  --host 127.0.0.1 \
  --port 7011 \
  --workload transition \
  --requests 16 \
  --concurrency 8 \
  --warmup 8 \
  --max-tokens 256

Mode definitions:

| Mode | Extra flags |
| --- | --- |
| static1 | --speculative-num-steps 1 --speculative-num-draft-tokens 2 |
| static3 | --speculative-num-steps 3 --speculative-num-draft-tokens 4 |
| static7 | --speculative-num-steps 7 --speculative-num-draft-tokens 8 |
| static15 | --speculative-num-steps 15 --speculative-num-draft-tokens 16 |
| adaptive | static7 flags plus --speculative-adaptive --speculative-adaptive-config <cfg> |

Shared NGRAM flags:

  • --speculative-algorithm NGRAM
  • --speculative-ngram-max-bfs-breadth 1
  • --attention-backend triton
  • --skip-server-warmup
  • --mem-fraction-static 0.7

Adaptive config in this rerun:

{"candidate_steps": [1, 3, 7, 15]}

Note: the config is explicit because the current workspace default remains
[1, 3, 7].
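
For reference, a minimal sketch (in Python) of how a config file for --speculative-adaptive-config can be written; the path is illustrative and the only assumption is that the file is plain JSON with a candidate_steps key, as in the rerun above:

import json

# Illustrative path; the benchmark rerun used a timestamped /tmp directory.
cfg_path = "/tmp/adaptive_1_3_7_15.json"
with open(cfg_path, "w") as f:
    json.dump({"candidate_steps": [1, 3, 7, 15]}, f)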

Server Launch Commands

Static15

CUDA_VISIBLE_DEVICES=0 \
PYTHONPATH=python \
python3 -m sglang.launch_server \
  --model /models/LLM-Research/Meta-Llama-3.1-8B-Instruct \
  --speculative-algorithm NGRAM \
  --speculative-num-steps 15 \
  --speculative-num-draft-tokens 16 \
  --speculative-ngram-max-bfs-breadth 1 \
  --attention-backend triton \
  --skip-server-warmup \
  --mem-fraction-static 0.7 \
  --host 0.0.0.0 \
  --port 7011 \
  --log-level info

Adaptive

CUDA_VISIBLE_DEVICES=0 \
PYTHONPATH=python \
python3 -m sglang.launch_server \
  --model /models/LLM-Research/Meta-Llama-3.1-8B-Instruct \
  --speculative-algorithm NGRAM \
  --speculative-num-steps 7 \
  --speculative-num-draft-tokens 8 \
  --speculative-ngram-max-bfs-breadth 1 \
  --speculative-adaptive \
  --speculative-adaptive-config /tmp/ngram_adaptive_bench_update_20260424_064555/adaptive_1_3_7_15.json \
  --attention-backend triton \
  --skip-server-warmup \
  --mem-fraction-static 0.7 \
  --host 0.0.0.0 \
  --port 7011 \
  --log-level info

Result Summary

| Scenario | Date | Best mode | Best throughput | Key point |
| --- | --- | --- | --- | --- |
| NGRAM + Meta-Llama-3.1-8B-Instruct (1x H20) | 2026-04-24 | static7 | 1660.0 tok/s | adaptive with [1,3,7,15] did not beat the best static tier |

Overall

| Mode | Throughput | Avg Latency | Avg Accept Len | Switches | Errors | Ready Time |
| --- | --- | --- | --- | --- | --- | --- |
| static1 | 1093.0 tok/s | 1.613 s | 1.73 | 0 | 0 | 62 s |
| static3 | 1640.0 tok/s | 1.100 s | 2.88 | 0 | 0 | 62 s |
| static7 | 1660.0 tok/s | 0.998 s | 4.46 | 0 | 0 | 62 s |
| static15 | 1301.3 tok/s | 1.334 s | 6.36 | 0 | 0 | 65 s |
| adaptive | 1606.3 tok/s | 1.133 s | 3.64 | 13 | 0 | 70 s |

Comparison

| Pair | Static | Adaptive | Delta | Verdict |
| --- | --- | --- | --- | --- |
| adaptive vs static1 | 1093.0 | 1606.3 | +513.3 tok/s (+46.96%) | improvement |
| adaptive vs static3 | 1640.0 | 1606.3 | -33.7 tok/s (-2.05%) | regression |
| adaptive vs static7 | 1660.0 | 1606.3 | -53.7 tok/s (-3.23%) | regression |
| adaptive vs static15 | 1301.3 | 1606.3 | +305.0 tok/s (+23.44%) | improvement |
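
The deltas and percentages above follow directly from the Overall table; a quick sketch of the arithmetic:

# Recompute the Comparison table from the Overall throughput numbers (tok/s).
static = {"static1": 1093.0, "static3": 1640.0, "static7": 1660.0, "static15": 1301.3}
adaptive = 1606.3

for mode, tput in static.items():
    delta = adaptive - tput
    pct = 100.0 * delta / tput
    verdict = "improvement" if delta > 0 else "regression"
    print(f"adaptive vs {mode}: {delta:+.1f} tok/s ({pct:+.2f}%) {verdict}")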

Per-Phase Throughput

| Phase | static1 | static3 | static7 | static15 | adaptive |
| --- | --- | --- | --- | --- | --- |
| low_1 | 931.3 | 1306.0 | 1113.9 | 906.9 | 1142.7 |
| high_1 | 1076.1 | 1276.1 | 1240.2 | 865.7 | 1377.1 |
| low_2 | 1034.3 | 1913.6 | 2090.0 | 1578.8 | 1771.4 |
| high_2 | 1454.2 | 2728.9 | 4413.2 | 5458.2 | 3058.6 |

Modifications

Accuracy Tests

Speed Tests and Profiling

Checklist

Review and Merge Process

  1. Ping Merge Oncalls to start the process. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • Common commands include /tag-and-rerun-ci, /tag-run-ci-label, /rerun-failed-ci
  4. After green CI and required approvals, ask Merge Oncalls or people with Write permission to merge the PR.

@gemini-code-assist (bot) left a comment


Code Review

This pull request enables adaptive speculative decoding for the NGRAM algorithm by integrating the AdaptiveController into the NgramWorker. The changes allow for dynamic adjustment of draft token counts through the creation and application of SpecRuntimeState objects. However, several issues were identified: re-allocating GPU tensors during state transitions will invalidate CUDA graphs, and creating separate attention workspaces for each candidate tier may result in excessive memory consumption. Additionally, a redundant variable assignment in build_adaptive_runtime_state should be removed to improve clarity.
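
For orientation only, a hypothetical sketch of the tiered runtime-state pattern the review describes; the names SpecRuntimeState and AdaptiveController come from the review text, but the fields and the toy selection policy are assumptions, not the PR's implementation:

from dataclasses import dataclass

@dataclass
class SpecRuntimeState:
    # One candidate tier; for NGRAM in this PR, draft tokens = steps + 1.
    speculative_num_steps: int
    speculative_num_draft_tokens: int

class AdaptiveController:
    def __init__(self, candidate_steps):
        # e.g. candidate_steps = [1, 3, 7, 15] from the adaptive config above.
        self.states = [SpecRuntimeState(s, s + 1) for s in candidate_steps]

    def select(self, avg_accept_len: float) -> SpecRuntimeState:
        # Toy policy: pick the deepest tier whose step count the observed
        # acceptance length can plausibly fill.
        eligible = [s for s in self.states if s.speculative_num_steps <= avg_accept_len]
        return eligible[-1] if eligible else self.states[0]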

state.speculative_num_draft_tokens,
)
self.draft_token_num = state.speculative_num_draft_tokens
self._init_preallocated_tensors()


Severity: critical

CRITICAL: Re-allocating GPU tensors in apply_runtime_state will break CUDA graphs. The target_graph_runner for each tier is built during init_states and captures the memory addresses of the worker's preallocated tensors (such as tree_mask and positions). When apply_runtime_state calls _init_preallocated_tensors(), it re-allocates these tensors, making the pointers captured in ALL previously built graph runners stale. This will lead to illegal memory access or incorrect results when the graph is replayed.

To fix this, allocate these buffers once for _max_draft_token_num during initialization and only update the batch views (slices) in apply_runtime_state without re-allocating the base tensors.
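
A minimal sketch of the suggested fix, assuming PyTorch buffers sized once for the largest tier (_max_draft_token_num) with per-tier slices handed out on switches; the buffer names follow the ones mentioned above, everything else is illustrative:

import torch

class DraftBuffers:
    def __init__(self, max_draft_token_num: int, device: str = "cuda"):
        # Allocate once at the maximum tier size; CUDA graphs capture these
        # base pointers, so they must never be re-allocated on a tier switch.
        self.tree_mask = torch.zeros(
            max_draft_token_num, max_draft_token_num, dtype=torch.bool, device=device
        )
        self.positions = torch.zeros(max_draft_token_num, dtype=torch.int64, device=device)

    def view_for(self, draft_token_num: int):
        # Slices are views into the same storage: switching tiers changes only
        # the shape being used, not the captured memory addresses.
        return (
            self.tree_mask[:draft_token_num, :draft_token_num],
            self.positions[:draft_token_num],
        )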

Comment on lines +215 to +217
target_attn_backend = target_model_runner._get_attention_backend(
init_new_workspace=True
)


Severity: medium

Passing init_new_workspace=True to _get_attention_backend for every candidate tier during init_states may lead to excessive GPU memory consumption, as each tier's SpecRuntimeState will hold its own attention workspace. Consider if the workspace can be shared or if the allocation can be deferred/optimized, similar to the implementation in EAGLEWorker.

self, speculative_num_steps: int, speculative_num_draft_tokens: int
) -> SpecRuntimeState:
"""Build a NGRAM runtime state for the given draft-token count."""
speculative_num_steps = speculative_num_draft_tokens - 1


Severity: medium

This assignment shadows the input argument speculative_num_steps. Since the caller (AdaptiveController.init_states) already passes the correct value (steps), this line is redundant and can be removed to improve code clarity.

@kpham-sgl (Collaborator) left a comment


Thanks for the contribution! I think the Ngram adaptive speculative algorithm could be a bit more sophisticated (e.g., based on SAM and Trie confidence match scores). We can discuss this more later.

I would hold off on merging this PR until the recent PRs in #21052 are merged.

) -> SpecRuntimeState:
"""Build a NGRAM runtime state for the given draft-token count."""
tic = time.perf_counter()
before_mem = get_available_gpu_memory(self.server_args.device, self.gpu_id)


The Ngram drafter is fully CPU-based, so we probably don't need the available_gpu_memory information.

Comment on lines +290 to +294
# Truncate if corpus produces more tokens than current config
if self._max_draft_token_num > self.draft_token_num:
n, k = self._max_draft_token_num, self.draft_token_num
req_drafts = req_drafts.reshape(bs, n)[:, :k].flatten()
mask = mask.reshape(bs, n, n)[:, :k, :k].flatten()
@kpham-sgl (Collaborator), May 2, 2026


A bit confused about why this case would happen. Can you elaborate?
