[Feature] Support EAGLE 3 #4247
Benchmarks on MT-Bench (bsz 1), autoregressive vs. EAGLE-3: [benchmark result images not preserved]
@chromecast56 Can we use
@chromecast56 Ours is computed at the round level: we count the accepted tokens in each speculative decoding round, add 1 (the last token of every round is always accepted, because it comes from the target model), and average the result across all rounds.
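A minimal sketch of the round-level metric described above. The function name and its input format (a list holding the number of accepted draft tokens per round) are hypothetical, chosen for illustration; this is not SGLang's implementation.

```python
from typing import Sequence


def average_accept_length(accepted_draft_tokens: Sequence[int]) -> float:
    """Round-level average acceptance length.

    accepted_draft_tokens[i] is the number of draft tokens accepted in
    speculative decoding round i. Every round also emits one token that
    comes from the target model itself, so each round contributes
    (accepted + 1) tokens.
    """
    if not accepted_draft_tokens:
        raise ValueError("need at least one speculative decoding round")
    per_round = [n + 1 for n in accepted_draft_tokens]
    # Average over rounds, not over tokens.
    return sum(per_round) / len(per_round)


# Three rounds that accepted 3, 0 and 5 draft tokens give per-round
# lengths 4, 1 and 6, so the average is 11 / 3.
print(average_accept_length([3, 0, 5]))
```

Because the +1 is added in every round, a round where all drafts are rejected still counts as length 1, so this metric is always at least 1.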
@chromecast56 Could you fix the merge conflicts in the docs? @zhyncs @merrymercy Could we merge it?
@merrymercy @Ying1123 Reminder.
Hi @chromecast56, could you help fix the conflicts?
merrymercy left a comment:
LGTM! Thanks for the great work.
cc @simveit Hey Simon, EAGLE-3 is now merged into SGLang, and Yineng (@zhyncs) will profile it today. Could you help update the docs at https://docs.sglang.ai/backend/speculative_decoding.html once Yineng shares the performance results? Thanks so much!
@zhaochenyang20 Yes. I'll read the paper over the next few days.
Should we change
@ispobock Can you share the exact commands to launch the servers for EAGLE-2 and EAGLE-3?
@merrymercy For bs=1, the launch commands are:
# EAGLE2
python3 -m sglang.launch_server --model meta-llama/Llama-3.1-8B-Instruct --speculative-algo EAGLE \
--speculative-draft jamesliu1/sglang-EAGLE-Llama-3.1-Instruct-8B --speculative-num-steps 5 \
--speculative-eagle-topk 8 --speculative-num-draft-tokens 64 \
--cuda-graph-max-bs 1 --dtype float16 --port 30000 --tp 1 --disable-radix --mem-frac 0.7
# EAGLE3
python3 -m sglang.launch_server --model meta-llama/Llama-3.1-8B-Instruct --speculative-algo EAGLE3 \
--speculative-draft jamesliu1/sglang-EAGLE3-Llama-3.1-Instruct-8B --speculative-num-steps 8 \
--speculative-eagle-topk 8 --speculative-num-draft-tokens 64 \
--cuda-graph-max-bs 1 --dtype float16 --port 30000 --tp 1 --disable-radix --mem-frac 0.7
MT-Bench is the default benchmark dataset in EAGLE's evaluation code; we use it to stay aligned with the setup in the paper.
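A minimal client sketch for either server above, assuming it is listening on localhost:30000 (the `--port` in both commands) and serving SGLang's native `/generate` endpoint, which takes a `text` prompt plus `sampling_params`. The helper names `build_generate_request` and `generate` are hypothetical, and only the Python standard library is used.

```python
import json
import urllib.request

# Assumption: one of the servers launched above is running on port 30000.
BASE_URL = "http://localhost:30000"


def build_generate_request(prompt: str, max_new_tokens: int = 64) -> urllib.request.Request:
    """Build (but do not send) a POST request for the /generate endpoint."""
    payload = {
        "text": prompt,
        # temperature 0 requests greedy decoding.
        "sampling_params": {"temperature": 0, "max_new_tokens": max_new_tokens},
    }
    return urllib.request.Request(
        f"{BASE_URL}/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt: str, max_new_tokens: int = 64) -> str:
    """Send the request and return the generated text (needs a running server)."""
    req = build_generate_request(prompt, max_new_tokens)
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["text"]
```

With a server running, `generate("What is speculative decoding?")` returns the completion. Speculative decoding happens entirely on the server side, so the same client should work against both the EAGLE-2 and EAGLE-3 servers.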
I'll also work on the docs for this feature over the next few days.
Motivation
Add support for EAGLE-3: https://arxiv.org/abs/2503.01840
Modifications
- EAGLE3 speculative method in server args
- llama.py, logits_processor.py: support capturing auxiliary hidden states
- eagle_worker.py: support EAGLE-3 token map + untied LM head
- llama_eagle3.py model
Checklist