[Feature] Support EAGLE 3#4247

Merged
zhyncs merged 31 commits into
sgl-project:mainfrom
chromecast56:eagle3
Mar 18, 2025

@chromecast56

Contributor

Motivation

Add support for EAGLE-3: https://arxiv.org/abs/2503.01840

Modifications

  • Add EAGLE3 speculative method to server args
  • Refactor llama.py, logits_processor.py to support capturing auxiliary hidden states
  • Modify eagle_worker.py to support EAGLE-3 token map + untied LM head
  • Add llama_eagle3.py model
  • Tests and Documentation
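
To illustrate the token-map item above: one way such a map could work is as a lookup table from draft-vocabulary ids to target-vocabulary ids, applied to the draft model's top-k candidates before verification. This is a minimal sketch under that assumption. The names `draft_to_target` and `map_draft_tokens` and the table values are hypothetical and are not taken from eagle_worker.py.

```python
# Hypothetical lookup table: index = draft-vocabulary id,
# value = corresponding target-vocabulary id (illustrative values only).
draft_to_target = [0, 5, 9, 42, 1000]


def map_draft_tokens(draft_ids, token_map):
    """Translate draft-vocabulary ids into target-vocabulary ids."""
    return [token_map[i] for i in draft_ids]


# Top-k candidates proposed by the draft head, expressed in draft-vocabulary ids.
top_k = [3, 1, 4]
mapped = map_draft_tokens(top_k, draft_to_target)
print(mapped)  # [42, 5, 1000]
```

The target model then only ever sees ids from its own vocabulary, which is why the draft head can use a smaller, untied LM head.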

@zhyncs

zhyncs commented Mar 10, 2025

Collaborator

cc @Liyuhui-12 @hongyanz

@chromecast56

Contributor Author

Benchmarks on MT-Bench (bsz 1):

Autoregressive:


python -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct --port 30000 --cuda-graph-max-bs 1

#questions: 1, Throughput: 147.05 token/s, Acceptance length: 1.00

EAGLE-3:

python3 -m sglang.launch_server --model meta-llama/Llama-3.1-8B-Instruct --speculative-algo EAGLE3 \
    --speculative-draft jamesliu1/sglang-EAGLE3-Llama-3.1-Instruct-8B --speculative-num-steps 5 \
    --speculative-eagle-topk 8 --speculative-num-draft-tokens 64 \
    --cuda-graph-max-bs 1 --mem-fraction 0.7 --dtype float16 --port 30000

#questions: 80, Throughput: 336.61 token/s, Acceptance length: 4.29
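
For a rough sense of the gain, the two throughput figures above can be compared directly. This is a minimal sketch: the function name `speedup` is only illustrative, and because the two runs report different question counts (1 vs. 80), the ratio should be read as indicative rather than as a controlled comparison.

```python
# Throughput figures copied from the two runs above (token/s).
autoregressive_tps = 147.05
eagle3_tps = 336.61


def speedup(baseline: float, candidate: float) -> float:
    """Ratio of candidate throughput to baseline throughput."""
    if baseline <= 0:
        raise ValueError("baseline throughput must be positive")
    return candidate / baseline


ratio = speedup(autoregressive_tps, eagle3_tps)
print(f"EAGLE-3 speedup over autoregressive: {ratio:.2f}x")  # 2.29x
```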

@zhyncs

zhyncs commented Mar 10, 2025

Collaborator

@chromecast56 Can we use --speculative-num-steps 8?

@hongyanz

hongyanz commented Mar 12, 2025


@chromecast56 Ours is computed at the round level. We count the number of accepted tokens in each speculative decoding round, add 1 (because the very last token in each round is always accepted, as it comes from the target model), and average these numbers across all rounds.
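
The round-level calculation described here can be sketched as follows. This is an illustration only, not the EAGLE evaluation code: the function name `acceptance_length` and the per-round counts are hypothetical.

```python
from statistics import mean


def acceptance_length(accepted_per_round):
    """Average tokens emitted per round: accepted draft tokens plus
    the one token that always comes from the target model."""
    if not accepted_per_round:
        raise ValueError("need at least one round")
    return mean(n + 1 for n in accepted_per_round)


# Hypothetical counts of accepted draft tokens in four rounds.
rounds = [3, 4, 2, 5]
result = acceptance_length(rounds)
print(result)  # (4 + 5 + 3 + 6) / 4 = 4.5
```

Under this definition, a run in which no draft tokens are ever accepted gives an acceptance length of exactly 1, which is consistent with the 1.00 reported for the autoregressive baseline.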

@ispobock ispobock left a comment

Collaborator

LGTM

@zhaochenyang20

Collaborator

@chromecast56 Could you fix the conflicts in the docs? @zhyncs @merrymercy Could we merge it?

@zhyncs

zhyncs commented Mar 13, 2025

Collaborator

@merrymercy @Ying1123 reminder

@zhyncs

zhyncs commented Mar 17, 2025

Collaborator

Hi @chromecast56, could you help fix the conflicts?

@merrymercy merrymercy left a comment

Contributor

LGTM! Thanks for the great work.

@zhyncs zhyncs merged commit 9e0186f into sgl-project:main Mar 18, 2025
@zhaochenyang20

Collaborator

cc @simveit Hey Simon, EAGLE-3 is now merged into SGLang, and Yineng (@zhyncs) will profile it today. Could you help update the docs at https://docs.sglang.ai/backend/speculative_decoding.html after Yineng provides the performance numbers? Thanks so much!

@simveit

simveit commented Mar 18, 2025

Contributor

@zhaochenyang20 Yes. Let me read the paper in the next few days.

@finger92

Contributor

Should we change
"You can enable EAGLE-3 decoding by setting --speculative_draft_model_path: EAGLE3:"
to
"You can enable EAGLE-3 decoding by setting --speculative-algorithm EAGLE3:"
?

@zhyncs zhyncs mentioned this pull request Mar 22, 2025
6 tasks
@merrymercy

Contributor

@ispobock Can you share the exact commands to launch the servers for EAGLE-2 and EAGLE-3?
We do not need to benchmark on MT-Bench; we can just use python3 -m sglang.test.send_one

@ispobock

Collaborator

@merrymercy For bs=1, the launch commands are:

# EAGLE2
python3 -m sglang.launch_server --model meta-llama/Llama-3.1-8B-Instruct --speculative-algo EAGLE \
    --speculative-draft jamesliu1/sglang-EAGLE-Llama-3.1-Instruct-8B --speculative-num-steps 5 \
    --speculative-eagle-topk 8 --speculative-num-draft-tokens 64 \
    --cuda-graph-max-bs 1 --dtype float16 --port 30000 --tp 1 --disable-radix --mem-frac 0.7

# EAGLE3
python3 -m sglang.launch_server --model meta-llama/Llama-3.1-8B-Instruct --speculative-algo EAGLE3 \
    --speculative-draft jamesliu1/sglang-EAGLE3-Llama-3.1-Instruct-8B --speculative-num-steps 8 \
    --speculative-eagle-topk 8 --speculative-num-draft-tokens 64 \
    --cuda-graph-max-bs 1 --dtype float16 --port 30000 --tp 1 --disable-radix --mem-frac 0.7

MT-Bench is the default benchmark dataset in EAGLE's evaluation code. It's used to stay aligned with the settings in the paper.
python3 -m sglang.test.send_one also works for benchmarking.
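
For anyone who wants to sanity-check a running server by hand, a single request can also be sent from a short script. This is a minimal sketch, not what sglang.test.send_one does internally: it assumes the server launched above is listening on localhost port 30000 and exposes SGLang's native /generate endpoint with a `text` plus `sampling_params` payload. The helper names `build_generate_request` and `send_prompt` are hypothetical.

```python
import json
import urllib.request


def build_generate_request(prompt, max_new_tokens=64,
                           base_url="http://localhost:30000"):
    """Build a POST request for the assumed /generate endpoint."""
    payload = {
        "text": prompt,
        "sampling_params": {"temperature": 0, "max_new_tokens": max_new_tokens},
    }
    return urllib.request.Request(
        f"{base_url}/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_prompt(prompt):
    """Send one prompt and return the decoded JSON response.
    Requires a server to be running at the assumed address."""
    with urllib.request.urlopen(build_generate_request(prompt)) as resp:
        return json.loads(resp.read())


# Example (only works with a live server):
# print(send_prompt("The capital of France is"))
```

Running the same prompt against the EAGLE-2 and EAGLE-3 servers in turn gives a quick qualitative check before a full benchmark.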

@ryang-max

Contributor

I'll also work on the docs for this feature over the next few days.
cc @zhaochenyang20 @simveit

0826joyce pushed a commit to 0826joyce/sglang-perf-opt that referenced this pull request May 19, 2026