support spec decoding in triton unified kernel #11789

Merged
hebiao064 merged 6 commits into sgl-project:bhe/1_stage_triton_kernel from zminglei:triton-spec-decoding
Oct 19, 2025

@zminglei zminglei commented Oct 18, 2025

Motivation

Support speculative decoding in the Triton unified kernel used for deterministic inference.

Modifications

Accuracy Tests

Before:
The server fails to launch when speculative decoding is combined with --enable-deterministic-inference (the unified Triton kernel).

After:

python3 -m sglang.launch_server --model /shared/public/elr-models/meta-llama/Meta-Llama-3.1-8B-Instruct/07eb05b21d191a58c577b4a45982fe0c049d0693 --speculative-algorithm EAGLE3 --speculative-draft-model-path /shared/public/elr-models/jamesliu1/sglang-EAGLE3-Llama-3.1-Instruct-8B/e5ed08d66f528a95ce89f5d4fd136a28f6def714 --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 --trust-remote-code --dtype float16 --enable-torch-compile --attention-backend triton --cuda-graph-max-bs 2 --enable-deterministic-inference

python benchmark/gsm8k/bench_sglang.py --data-path /shared/public/data/gsm8k/test.jsonl
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 200/200 [00:18<00:00, 10.70it/s]
Accuracy: 0.770
Invalid: 0.000
Latency: 18.762 s
Output throughput: 909.155 token/s

python3 -m sglang.test.test_deterministic --test-mode prefix --n-trials 50
Prompt 0 with prefix length 1: total samples: 286, Unique samples: 1
Prompt 1 with prefix length 511: total samples: 333, Unique samples: 1
Prompt 2 with prefix length 2048: total samples: 311, Unique samples: 1
Prompt 3 with prefix length 4097: total samples: 345, Unique samples: 1

The GSM8K accuracy above (0.770) is in line with the source-of-truth score for this model, which was verified by running without --enable-deterministic-inference.
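
For readers unfamiliar with the prefix-mode output above, the following is a minimal sketch of how "total samples" and "Unique samples" can be read. It is not the implementation of sglang.test.test_deterministic: the function names, the dictionary layout, and the example strings are all hypothetical, chosen only to illustrate that a deterministic run yields exactly one unique output per prompt.

```python
from collections import Counter


def summarize_determinism(outputs_by_prefix_len):
    """Map each prefix length to (total samples, unique samples).

    Hypothetical helper: `outputs_by_prefix_len` maps a prompt's prefix
    length to the list of generated strings collected across trials.
    """
    summary = {}
    for prefix_len, outputs in outputs_by_prefix_len.items():
        counts = Counter(outputs)  # identical generations collapse to one key
        summary[prefix_len] = (len(outputs), len(counts))
    return summary


def is_deterministic(summary):
    """A run is deterministic when every prompt has exactly one unique output."""
    return all(unique == 1 for _total, unique in summary.values())


# Made-up data, not taken from the PR: the second prompt has one divergent output.
example = {
    1: ["The answer is 4."] * 3,
    511: ["Paris is the capital."] * 2 + ["Paris is the capital"],
}

summary = summarize_determinism(example)
for prefix_len, (total, unique) in summary.items():
    print(f"prefix length {prefix_len}: total samples: {total}, Unique samples: {unique}")
print("deterministic:", is_deterministic(summary))
```

Under this reading, the four "Unique samples: 1" lines reported above mean that every sampled generation for a given prompt was identical across trials and batch compositions.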

Benchmarking and Profiling

Checklist

@zminglei zminglei marked this pull request as ready for review October 18, 2025 22:03
@zminglei zminglei changed the title from "support spec decoding in unified kernel" to "support spec decoding in triton unified kernel" Oct 18, 2025
@hebiao064 hebiao064 merged commit 6d7e4c2 into sgl-project:bhe/1_stage_triton_kernel Oct 19, 2025
1 check passed
@hebiao064 hebiao064 added the deterministic label (Issues on deterministic inference/kernels) Oct 19, 2025