Checklist
Describe the bug
Running into some odd issues with spec decode for DeepSeek; it seems to crash only at higher batch sizes, when launched with:
--speculative-algorithm EAGLE \
--speculative-num-steps 2 \
--speculative-eagle-topk 1 \
--speculative-num-draft-tokens 4
My hunch is that it's related to --speculative-num-steps.
According to Gemini, the RuntimeError in eagle_utils.py is caused by a mismatch between the number of requests being processed and the data used to process them during the verification step of speculative decoding. Specifically, when some requests in a batch are filtered out, the corresponding sampling_info (which holds the logit_bias for each request) is not updated, so the logits tensor ends up with a smaller batch size than the stale logit_bias tensor (Gemini's example used 7 vs. 8; the trace below shows 19 vs. 24).
Gemini also suggested that something like this may help:
import copy

if bs != len(batch.reqs):
    # Some requests were filtered out of the batch, so rebuild sampling_info
    # to keep logit_bias in sync with the new batch size.
    sampling_info = copy.deepcopy(sampling_info)
    # NOTE: retrive_index are the indices of the requests that are kept.
    sampling_info.filter_batch(self.retrive_index.tolist(), self.retrive_index)
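For intuition, here is a minimal standalone sketch of the same failure mode; the 19/24 shapes are taken from the trace below, and vocab_size is an arbitrary placeholder:

import torch

vocab_size = 32  # placeholder; the real vocab size doesn't matter here

# After filtering, logits has rows only for the 19 surviving requests,
# but the stale logit_bias still covers all 24 original requests.
logits = torch.zeros(19, vocab_size)
logit_bias = torch.zeros(24, vocab_size)

# Broadcasting fails at non-singleton dimension 0, reproducing:
# RuntimeError: The size of tensor a (19) must match the size of
# tensor b (24) at non-singleton dimension 0
logits.add(logit_bias)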
Trace:
[2025-06-26 19:44:51 TP0] Scheduler hit an exception: Traceback (most recent call last):
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 2647, in run_scheduler_process
    scheduler.event_loop_normal()
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 732, in event_loop_normal
    result = self.run_batch(batch)
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 1701, in run_batch
    ) = self.draft_worker.forward_batch_speculative_generation(batch)
  File "/sgl-workspace/sglang/python/sglang/srt/speculative/eagle_worker.py", line 323, in forward_batch_speculative_generation
    self.verify(batch, spec_info)
  File "/sgl-workspace/sglang/python/sglang/srt/speculative/eagle_worker.py", line 685, in verify
    res: EagleVerifyOutput = spec_info.verify(
  File "/sgl-workspace/sglang/python/sglang/srt/speculative/eagle_utils.py", line 381, in verify
    sampling_info.apply_logits_bias(linear_penalty)
  File "/sgl-workspace/sglang/python/sglang/srt/sampling/sampling_batch_info.py", line 223, in apply_logits_bias
    logits.add(self.logit_bias)
RuntimeError: The size of tensor a (19) must match the size of tensor b (24) at non-singleton dimension 0
Reproduction
DeepSeek, using tp=8 and MTP (a fuller launch command is sketched after the flags):
--speculative-algorithm EAGLE \
--speculative-num-steps 2 \
--speculative-eagle-topk 1 \
--speculative-num-draft-tokens 4
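For completeness, a full launch command would look roughly like the following; the model path is a placeholder, and everything beyond the speculative flags is my assumption rather than copied from the original report:

python -m sglang.launch_server \
    --model-path <deepseek-model-path> \
    --tp 8 \
    --speculative-algorithm EAGLE \
    --speculative-num-steps 2 \
    --speculative-eagle-topk 1 \
    --speculative-num-draft-tokens 4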
This happened on the lmsysorg/sglang:v0.4.8-cu128-b200 image, and only at higher batch sizes (around 20 or more).
Environment
Blackwell (B200); see the image tag above.