
[draft] rdk #9539

Closed

keyboardAnt wants to merge 16 commits into sgl-project:main from keyboardAnt:nadav/exact-socp-rdk

Conversation

@keyboardAnt commented Aug 23, 2025

Motivation

Implementing RDK from "Out-of-Vocabulary Sampling Boosts Speculative Decoding" (https://arxiv.org/abs/2506.03206). This PR replaces #8393.
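For context, RDK builds on the standard speculative-decoding verification step. Below is a minimal, hedged sketch of that generic accept/reject rule in Python (this is textbook speculative sampling, not this PR's RDK implementation; the function names are illustrative only):

```python
import random

def verify_draft_token(x, p, q, u=None):
    # Standard speculative-decoding verification (not this PR's RDK code):
    # accept draft token x, sampled from the draft distribution q,
    # with probability min(1, p[x] / q[x]), where p is the target distribution.
    u = random.random() if u is None else u
    return u < min(1.0, p[x] / q[x])

def residual_distribution(p, q):
    # On rejection, the target resamples from the normalized residual
    # max(0, p - q); this keeps the overall output exactly p-distributed.
    r = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    s = sum(r)
    return [ri / s for ri in r]
```

Out-of-vocabulary pruning (the `--speculative-token-map` path below) restricts the draft's vocabulary, so how the residual mass is handled for pruned tokens is exactly where methods like RDK differ.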

Modifications

Accuracy Tests

Benchmarking and Profiling

Comparing the acceptance lengths across methods (one value per repeated run):

| method | pruning method | acceptance | command |
| --- | --- | --- | --- |
| eagle | N/A | 1.70 (for more repeats, see #8391 (comment)) | `python3 -m sglang.bench_offline_throughput --model-path meta-llama/Meta-Llama-3.1-8B-Instruct --num-prompts 10 --speculative-algorithm EAGLE --speculative-draft-model-path lmsys/sglang-EAGLE-LLaMA3-Instruct-8B --speculative-num-steps 5 --speculative-eagle-topk 8 --speculative-num-draft-tokens 64 --mem-fraction 0.7 --cuda-graph-max-bs 2 --dtype float16 --disable-cuda-graph` |
| frspec | freq | 1.64, 1.65, 1.66 | `python3 -m sglang.bench_offline_throughput --model-path meta-llama/Meta-Llama-3.1-8B-Instruct --num-prompts 10 --speculative-algorithm EAGLE --speculative-draft-model-path lmsys/sglang-EAGLE-LLaMA3-Instruct-8B --speculative-num-steps 5 --speculative-eagle-topk 8 --speculative-num-draft-tokens 64 --mem-fraction 0.7 --cuda-graph-max-bs 2 --dtype float16 --speculative-token-map python/sglang/srt/speculative/token_maps/indices_to_keep_by_tokenizer_freq.pt --disable-cuda-graph` |
| frspec | var | 1.31, 1.31 | `python3 -m sglang.bench_offline_throughput --model-path meta-llama/Meta-Llama-3.1-8B-Instruct --num-prompts 10 --speculative-algorithm EAGLE --speculative-draft-model-path lmsys/sglang-EAGLE-LLaMA3-Instruct-8B --speculative-num-steps 5 --speculative-eagle-topk 8 --speculative-num-draft-tokens 64 --mem-fraction 0.7 --cuda-graph-max-bs 2 --dtype float16 --speculative-token-map python/sglang/srt/speculative/token_maps/indices_to_keep_by_target_var.pt --disable-cuda-graph` |
| rdk | freq | 1.65, 1.61, 1.60 | `python3 -m sglang.bench_offline_throughput --model-path meta-llama/Meta-Llama-3.1-8B-Instruct --num-prompts 10 --speculative-algorithm EAGLE --speculative-draft-model-path lmsys/sglang-EAGLE-LLaMA3-Instruct-8B --speculative-num-steps 5 --speculative-eagle-topk 8 --speculative-num-draft-tokens 64 --mem-fraction 0.7 --cuda-graph-max-bs 2 --dtype float16 --speculative-token-map python/sglang/srt/speculative/token_maps/indices_to_keep_by_tokenizer_freq.pt --speculative-weaker-drafter-probs python/sglang/srt/speculative/token_maps/target-probs-meta-llama-Llama-3.1-8B-Instruct-wikitext-wikitext-103-raw-v1-train.pt --disable-cuda-graph` |
| rdk | var | 1.34, 1.34 | `python3 -m sglang.bench_offline_throughput --model-path meta-llama/Meta-Llama-3.1-8B-Instruct --num-prompts 10 --speculative-algorithm EAGLE --speculative-draft-model-path lmsys/sglang-EAGLE-LLaMA3-Instruct-8B --speculative-num-steps 5 --speculative-eagle-topk 8 --speculative-num-draft-tokens 64 --mem-fraction 0.7 --cuda-graph-max-bs 2 --dtype float16 --speculative-token-map python/sglang/srt/speculative/token_maps/indices_to_keep_by_target_var.pt --speculative-weaker-drafter-probs python/sglang/srt/speculative/token_maps/target-probs-meta-llama-Llama-3.1-8B-Instruct-wikitext-wikitext-103-raw-v1-train.pt --disable-cuda-graph` |
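The per-run acceptance numbers above can be summarized as simple per-configuration means (a quick sanity check; the values are copied verbatim from the table, and the dict layout is just illustrative):

```python
# Acceptance-length repeats copied from the table above,
# keyed by (method, pruning method).
runs = {
    ("eagle", "N/A"): [1.70],
    ("frspec", "freq"): [1.64, 1.65, 1.66],
    ("frspec", "var"): [1.31, 1.31],
    ("rdk", "freq"): [1.65, 1.61, 1.60],
    ("rdk", "var"): [1.34, 1.34],
}

# Mean acceptance length per configuration, rounded for display.
means = {k: round(sum(v) / len(v), 3) for k, v in runs.items()}
```

With freq pruning, rdk's mean (1.62) lands close to frspec's (1.65); with var pruning, rdk's 1.34 slightly exceeds frspec's 1.31.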

Code used: https://github.com/keyboardAnt/sglang/tree/nadav/exact-socp-rdk (commit 9181631d7ff9e68b9570cc4177470efcf849175f)

`python3 -m sglang.check_env`:

```
Python: 3.12.11 (main, Jul 23 2025, 00:34:44) [Clang 20.1.4 ]
CUDA available: True
GPU 0: NVIDIA H100 NVL
GPU 0 Compute Capability: 9.0
CUDA_HOME: /apps/easybd/easybuild/amd/software/CUDA/12.6.0
NVCC: Cuda compilation tools, release 12.6, V12.6.20
CUDA Driver Version: 560.35.05
PyTorch: 2.8.0+cu128
sglang: 0.5.0rc2
sgl_kernel: 0.3.5
flashinfer_python: 0.2.11.post3
triton: 3.4.0
transformers: 4.55.2
torchao: 0.9.0
numpy: 2.3.2
aiohttp: 3.12.15
fastapi: 0.116.1
hf_transfer: 0.1.9
huggingface_hub: 0.34.4
interegular: 0.3.3
modelscope: 1.29.0
orjson: 3.11.2
outlines: 0.1.11
packaging: 25.0
psutil: 7.0.0
pydantic: 2.11.7
python-multipart: 0.0.20
pyzmq: 27.0.1
uvicorn: 0.35.0
uvloop: 0.21.0
vllm: Module Not Found
xgrammar: 0.1.22
openai: 1.99.1
tiktoken: 0.11.0
anthropic: 0.64.0
litellm: Module Not Found
decord: 0.6.0
NVIDIA Topology:
        GPU0    NIC0    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      NODE    109     1               N/A
NIC0    NODE     X

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

NIC Legend:

  NIC0: mlx5_0

ulimit soft: 66000
```

Checklist

@keyboardAnt (Author) commented Sep 1, 2025

@zhyncs, FYI: #9877 seems to block this draft PR.

@hnyls2002 (Collaborator) commented:

Closed due to inactivity. Please reopen with a clear roadmap if needed.

@hnyls2002 closed this Mar 17, 2026