
[draft] rdk #9539

Closed

keyboardAnt wants to merge 16 commits into sgl-project:main from keyboardAnt:nadav/exact-socp-rdk

Conversation

@keyboardAnt commented Aug 23, 2025

Motivation

Implementing RDK from "Out-of-Vocabulary Sampling Boosts Speculative Decoding" (https://arxiv.org/abs/2506.03206). This PR replaces #8393.
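For context, RDK builds on the standard speculative-decoding verification step. Below is a minimal, hedged sketch of that generic accept/reject rule in Python (this is textbook speculative sampling, not this PR's RDK implementation; the function names are illustrative only):

```python
import random

def verify_draft_token(x, p, q, u=None):
    # Standard speculative-decoding verification (not this PR's RDK code):
    # accept draft token x, sampled from the draft distribution q,
    # with probability min(1, p[x] / q[x]), where p is the target distribution.
    u = random.random() if u is None else u
    return u < min(1.0, p[x] / q[x])

def residual_distribution(p, q):
    # On rejection, the target resamples from the normalized residual
    # max(0, p - q); this keeps the overall output exactly p-distributed.
    r = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    s = sum(r)
    return [ri / s for ri in r]
```

Out-of-vocabulary pruning (the `--speculative-token-map` path below) restricts the draft's vocabulary, so how the residual mass is handled for pruned tokens is exactly where methods like RDK differ.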

Modifications

Accuracy Tests

Benchmarking and Profiling

Comparing the acceptance lengths across methods (one value per repeated run):

| method | pruning method | acceptance | command |
| --- | --- | --- | --- |
| eagle | N/A | 1.70 (for more repeats, see #8391 (comment)) | `python3 -m sglang.bench_offline_throughput --model-path meta-llama/Meta-Llama-3.1-8B-Instruct --num-prompts 10 --speculative-algorithm EAGLE --speculative-draft-model-path lmsys/sglang-EAGLE-LLaMA3-Instruct-8B --speculative-num-steps 5 --speculative-eagle-topk 8 --speculative-num-draft-tokens 64 --mem-fraction 0.7 --cuda-graph-max-bs 2 --dtype float16 --disable-cuda-graph` |
| frspec | freq | 1.64, 1.65, 1.66 | `python3 -m sglang.bench_offline_throughput --model-path meta-llama/Meta-Llama-3.1-8B-Instruct --num-prompts 10 --speculative-algorithm EAGLE --speculative-draft-model-path lmsys/sglang-EAGLE-LLaMA3-Instruct-8B --speculative-num-steps 5 --speculative-eagle-topk 8 --speculative-num-draft-tokens 64 --mem-fraction 0.7 --cuda-graph-max-bs 2 --dtype float16 --speculative-token-map python/sglang/srt/speculative/token_maps/indices_to_keep_by_tokenizer_freq.pt --disable-cuda-graph` |
| frspec | var | 1.31, 1.31 | `python3 -m sglang.bench_offline_throughput --model-path meta-llama/Meta-Llama-3.1-8B-Instruct --num-prompts 10 --speculative-algorithm EAGLE --speculative-draft-model-path lmsys/sglang-EAGLE-LLaMA3-Instruct-8B --speculative-num-steps 5 --speculative-eagle-topk 8 --speculative-num-draft-tokens 64 --mem-fraction 0.7 --cuda-graph-max-bs 2 --dtype float16 --speculative-token-map python/sglang/srt/speculative/token_maps/indices_to_keep_by_target_var.pt --disable-cuda-graph` |
| rdk | freq | 1.65, 1.61, 1.60 | `python3 -m sglang.bench_offline_throughput --model-path meta-llama/Meta-Llama-3.1-8B-Instruct --num-prompts 10 --speculative-algorithm EAGLE --speculative-draft-model-path lmsys/sglang-EAGLE-LLaMA3-Instruct-8B --speculative-num-steps 5 --speculative-eagle-topk 8 --speculative-num-draft-tokens 64 --mem-fraction 0.7 --cuda-graph-max-bs 2 --dtype float16 --speculative-token-map python/sglang/srt/speculative/token_maps/indices_to_keep_by_tokenizer_freq.pt --speculative-weaker-drafter-probs python/sglang/srt/speculative/token_maps/target-probs-meta-llama-Llama-3.1-8B-Instruct-wikitext-wikitext-103-raw-v1-train.pt --disable-cuda-graph` |
| rdk | var | 1.34, 1.34 | `python3 -m sglang.bench_offline_throughput --model-path meta-llama/Meta-Llama-3.1-8B-Instruct --num-prompts 10 --speculative-algorithm EAGLE --speculative-draft-model-path lmsys/sglang-EAGLE-LLaMA3-Instruct-8B --speculative-num-steps 5 --speculative-eagle-topk 8 --speculative-num-draft-tokens 64 --mem-fraction 0.7 --cuda-graph-max-bs 2 --dtype float16 --speculative-token-map python/sglang/srt/speculative/token_maps/indices_to_keep_by_target_var.pt --speculative-weaker-drafter-probs python/sglang/srt/speculative/token_maps/target-probs-meta-llama-Llama-3.1-8B-Instruct-wikitext-wikitext-103-raw-v1-train.pt --disable-cuda-graph` |
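The per-run acceptance numbers above can be summarized as simple per-configuration means (a quick sanity check; the values are copied verbatim from the table, and the dict layout is just illustrative):

```python
# Acceptance-length repeats copied from the table above,
# keyed by (method, pruning method).
runs = {
    ("eagle", "N/A"): [1.70],
    ("frspec", "freq"): [1.64, 1.65, 1.66],
    ("frspec", "var"): [1.31, 1.31],
    ("rdk", "freq"): [1.65, 1.61, 1.60],
    ("rdk", "var"): [1.34, 1.34],
}

# Mean acceptance length per configuration, rounded for display.
means = {k: round(sum(v) / len(v), 3) for k, v in runs.items()}
```

With freq pruning, rdk's mean (1.62) lands close to frspec's (1.65); with var pruning, rdk's 1.34 slightly exceeds frspec's 1.31.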

Code used: https://github.com/keyboardAnt/sglang/tree/nadav/exact-socp-rdk (commit 9181631d7ff9e68b9570cc4177470efcf849175f)

`python3 -m sglang.check_env`:

```
Python: 3.12.11 (main, Jul 23 2025, 00:34:44) [Clang 20.1.4 ]
CUDA available: True
GPU 0: NVIDIA H100 NVL
GPU 0 Compute Capability: 9.0
CUDA_HOME: /apps/easybd/easybuild/amd/software/CUDA/12.6.0
NVCC: Cuda compilation tools, release 12.6, V12.6.20
CUDA Driver Version: 560.35.05
PyTorch: 2.8.0+cu128
sglang: 0.5.0rc2
sgl_kernel: 0.3.5
flashinfer_python: 0.2.11.post3
triton: 3.4.0
transformers: 4.55.2
torchao: 0.9.0
numpy: 2.3.2
aiohttp: 3.12.15
fastapi: 0.116.1
hf_transfer: 0.1.9
huggingface_hub: 0.34.4
interegular: 0.3.3
modelscope: 1.29.0
orjson: 3.11.2
outlines: 0.1.11
packaging: 25.0
psutil: 7.0.0
pydantic: 2.11.7
python-multipart: 0.0.20
pyzmq: 27.0.1
uvicorn: 0.35.0
uvloop: 0.21.0
vllm: Module Not Found
xgrammar: 0.1.22
openai: 1.99.1
tiktoken: 0.11.0
anthropic: 0.64.0
litellm: Module Not Found
decord: 0.6.0
NVIDIA Topology:
        GPU0    NIC0    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      NODE    109     1               N/A
NIC0    NODE     X

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

NIC Legend:

  NIC0: mlx5_0

ulimit soft: 66000
```

Checklist

@keyboardAnt (Author) commented Sep 1, 2025

@zhyncs, FYI: #9877 seems to block this draft PR.

@hnyls2002 (Collaborator) commented:

Closed due to inactivity. Please reopen with a clear roadmap if needed.

@hnyls2002 closed this Mar 17, 2026