My target model is Qwen2.5-14B.
I trained a draft model with the default config; train_eagle3_online.py generated an eagle3-config.json with the following content:
{
"architectures": [
"LlamaForCausalLMEagle3"
],
"bos_token_id": 151643,
"eos_token_id": 151645,
"hidden_act": "silu",
"hidden_size": 5120,
"initializer_range": 0.02,
"intermediate_size": 13824,
"max_position_embeddings": 131072,
"model_type": "llama",
"num_attention_heads": 40,
"num_key_value_heads": 8,
"num_hidden_layers": 1,
"pad_token_id": 0,
"rms_norm_eps": 1e-06,
"tie_word_embeddings": false,
"torch_dtype": "bfloat16",
"transformers_version": "4.28.1",
"use_cache": true,
"vocab_size": 152064,
"draft_vocab_size": 32000
}
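As a sanity check on my own setup (not something the training script does), I compared the draft config's dimensions against the target model's. The values below are inlined from the configs above; in practice one would json.load() each model directory's config.json instead:

```python
# Draft head must share the target's hidden dimensions to consume its
# hidden states; a mismatch usually fails earlier than CUDA graph capture.
target = {"hidden_size": 5120, "num_attention_heads": 40,
          "num_key_value_heads": 8, "vocab_size": 152064}
draft = {"hidden_size": 5120, "num_attention_heads": 40,
         "num_key_value_heads": 8, "vocab_size": 152064}

mismatches = {k: (target[k], draft[k]) for k in target if target[k] != draft[k]}
assert not mismatches, f"draft/target dimension mismatch: {mismatches}"
print("draft config matches target dimensions")
```

The dimensions line up with Qwen2.5-14B, so the config itself does not look like the problem.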
After training, I used sglang to load the draft model with this command:
python -m sglang.launch_server --model-path /home/Models/Qwen2.5-14B-Instruct --host 0.0.0.0 --port 30000 --tp-size 2 --served-model-name qwen2 --context-length 2048 --speculative-algorithm EAGLE3 --speculative-num-steps 5 --speculative-eagle-topk 4 --speculative-num-draft-tokens 8 --mem-fraction 0.6 --cuda-graph-max-bs 2 --dtype float16 --speculative-draft-model-path /home/Models/epoch_2
The error stack is:
Capturing batches (bs=2 avail_mem=14.40 GB): 0%| | 0/2 [00:00<?, ?it/s][2025-09-13 18:14:09 TP1] Registering 0 cuda graph addresses
Capturing batches (bs=2 avail_mem=14.40 GB): 0%| | 0/2 [00:01<?, ?it/s]
[2025-09-13 18:14:09 TP0] Registering 0 cuda graph addresses
[2025-09-13 18:14:09 TP1] Scheduler hit an exception: Traceback (most recent call last):
File "/usr/local/lib/python3.12/site-packages/sglang/srt/managers/scheduler.py", line 2587, in run_scheduler_process
scheduler = Scheduler(
^^^^^^^^^^
File "/usr/local/lib/python3.12/site-packages/sglang/srt/managers/scheduler.py", line 329, in __init__
self.tp_worker = TpWorkerClass(
^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/site-packages/sglang/srt/managers/tp_worker.py", line 93, in __init__
self.model_runner = ModelRunner(
^^^^^^^^^^^^
File "/usr/local/lib/python3.12/site-packages/sglang/srt/model_executor/model_runner.py", line 250, in __init__
self.initialize(min_per_gpu_memory)
File "/usr/local/lib/python3.12/site-packages/sglang/srt/model_executor/model_runner.py", line 386, in initialize
self.init_device_graphs()
File "/usr/local/lib/python3.12/site-packages/sglang/srt/model_executor/model_runner.py", line 1761, in init_device_graphs
self.graph_runner = graph_runners[self.device](self)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/site-packages/sglang/srt/model_executor/cuda_graph_runner.py", line 389, in __init__
self.capture()
File "/usr/local/lib/python3.12/site-packages/sglang/srt/model_executor/cuda_graph_runner.py", line 497, in capture
) = self.capture_one_batch_size(bs, forward)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/site-packages/sglang/srt/model_executor/cuda_graph_runner.py", line 668, in capture_one_batch_size
run_once()
File "/usr/local/lib/python3.12/site-packages/sglang/srt/model_executor/cuda_graph_runner.py", line 657, in run_once
logits_output_or_pp_proxy_tensors = forward(
^^^^^^^^
File "/usr/local/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/site-packages/sglang/srt/models/qwen2.py", line 489, in forward
hidden_states, aux_hidden_states = hidden_states
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ValueError: too many values to unpack (expected 2)
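For context on the final error: the unpack at qwen2.py line 489 expects the forward pass to hand back exactly a 2-tuple of (hidden_states, aux_hidden_states), and it received something with more elements. A toy illustration of the same Python error, unrelated to sglang internals:

```python
# Unpacking into two names requires an iterable of exactly two elements.
hidden_states = ("h", "aux", "extra")  # three elements, one too many
try:
    h, aux = hidden_states
    err = None
except ValueError as exc:
    err = str(exc)
print(err)  # too many values to unpack (expected 2)
```

So the question seems to be why the target model's forward returns extra values during EAGLE3 CUDA graph capture in my setup.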
I'm new to this and need some help!