[Speculative] Fix Eagle3/DFLASH aux hidden state capture during CUDA graph init #22836

Merged
merrymercy merged 3 commits into main from
lianmin/fix-eagle-capture
Apr 15, 2026

@merrymercy
Contributor

Summary

  • Move set_eagle3_layers_to_capture() and set_dflash_layers_to_capture() into a new init_aux_hidden_state_capture() method
  • Call it BEFORE init_device_graphs() in ModelRunner.initialize() so CUDA graphs are captured with aux hidden state paths enabled
  • Remove redundant calls from CudaGraphRunner.__init__() and _dummy_run()
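The ordering described in the summary can be sketched with toy classes. This is a minimal illustration, not the SGLang implementation: `ToyModel`, `ToyModelRunner`, the `call_order` and `layers_seen_at_capture` fields, and the placeholder default `[0]` are all assumptions made for the example. Only the method names `init_aux_hidden_state_capture()`, `init_device_graphs()`, `initialize()` and `set_eagle3_layers_to_capture()` come from this PR.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ToyModel:
    # Layers whose hidden states are exposed to the draft model; None means disabled.
    layers_to_capture: Optional[List[int]] = None

    def set_eagle3_layers_to_capture(self, layer_ids: Optional[List[int]] = None) -> None:
        # Placeholder default for illustration; real defaults are model-specific.
        self.layers_to_capture = list(layer_ids) if layer_ids is not None else [0]


@dataclass
class ToyModelRunner:
    model: ToyModel
    # Hypothetical stand-in for the config-specified layer IDs.
    eagle_aux_hidden_state_layer_ids: Optional[List[int]] = None
    call_order: List[str] = field(default_factory=list)
    layers_seen_at_capture: Optional[List[int]] = None

    def init_aux_hidden_state_capture(self) -> None:
        self.call_order.append("init_aux_hidden_state_capture")
        self.model.set_eagle3_layers_to_capture(self.eagle_aux_hidden_state_layer_ids)

    def init_device_graphs(self) -> None:
        self.call_order.append("init_device_graphs")
        # Stand-in for graph capture: record what the model is configured to do now.
        layers = self.model.layers_to_capture
        self.layers_seen_at_capture = None if layers is None else list(layers)

    def initialize(self) -> None:
        # Aux hidden state capture is configured first, then graphs are captured.
        self.init_aux_hidden_state_capture()
        self.init_device_graphs()


runner = ToyModelRunner(ToyModel(), eagle_aux_hidden_state_layer_ids=[1, 5, 9])
runner.initialize()
print(runner.call_order)
print(runner.layers_seen_at_capture)
```

With this ordering the capture step sees the config-specified layer IDs, so no separate setup is needed inside the graph runner.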

Motivation

Previously set_eagle3_layers_to_capture() was called AFTER CUDA graph capture in initialize(), so the captured graphs ran without aux hidden state capture enabled. For Eagle3 this caused zero acceptance length at runtime. The CudaGraphRunner had a workaround that called set_eagle3_layers_to_capture() without args, which used default layer IDs instead of config-specified ones — breaking models with custom eagle_aux_hidden_state_layer_ids.
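Why the order matters can be shown with a toy capture/replay model. A captured CUDA graph replays the operations recorded at capture time and does not re-run the Python branches that decided which operations to launch. The sketch below imitates that by snapshotting the model with `copy.deepcopy`; `ToyLayerStack` and `capture` are hypothetical names, and no real CUDA graph is involved.

```python
import copy
from typing import Callable, List, Tuple


class ToyLayerStack:
    def __init__(self, num_layers: int) -> None:
        self.num_layers = num_layers
        self.layers_to_capture: List[int] = []  # empty list: aux capture disabled

    def forward(self, x: int) -> Tuple[int, List[int]]:
        aux = []
        for i in range(self.num_layers):
            if i in self.layers_to_capture:
                aux.append(x)  # expose the hidden state entering layer i
            x = x + 1  # stand-in for a transformer layer
        return x, aux


def capture(model: ToyLayerStack) -> Callable[[int], Tuple[int, List[int]]]:
    """Freeze the model's current configuration, loosely imitating graph capture."""
    snapshot = copy.deepcopy(model)
    return snapshot.forward


# Old order: capture first, enable aux hidden states afterwards.
old_model = ToyLayerStack(4)
old_graph = capture(old_model)
old_model.layers_to_capture = [1, 3]  # too late, the snapshot is already frozen
print(old_graph(0))  # (4, [])

# Fixed order: enable aux hidden states, then capture.
new_model = ToyLayerStack(4)
new_model.layers_to_capture = [1, 3]
new_graph = capture(new_model)
print(new_graph(0))  # (4, [1, 3])
```

In the old order the replayed path returns no aux hidden states, which leaves the draft model without the inputs it needs and is consistent with the zero acceptance length described above.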

Test plan

  • Tested Eagle3 spec decode with Llama-3.1-8B-Instruct + CUDA graphs: non-zero acceptance length confirmed
  • Tested with custom eagle_aux_hidden_state_layer_ids from config

…graph init

Move `set_eagle3_layers_to_capture()` and `set_dflash_layers_to_capture()`
into a new `init_aux_hidden_state_capture()` method and call it BEFORE
`init_device_graphs()` in `ModelRunner.initialize()`.

Previously these were called AFTER CUDA graph capture, so the captured
graphs ran without aux hidden state capture enabled. For Eagle3 this
caused zero acceptance length at runtime. The `CudaGraphRunner` had a
workaround that called `set_eagle3_layers_to_capture()` without args,
which used default layer IDs instead of the config-specified ones.

Remove the redundant aux hidden state setup from
`CudaGraphRunner.__init__()` and `_dummy_run()`.

Comment thread python/sglang/srt/model_executor/cuda_graph_runner.py Outdated
Co-authored-by: Lianmin Zheng <lianminzheng@gmail.com>

self.tbo_plugin = TboCudaGraphRunnerPlugin()

# Speculative_inference
Contributor Author

We should call this in model_runner.py::initialize instead of in the CUDA graph runner.

The old code called similar code twice with different arguments, which is wrong.
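A rough sketch of the double-call problem follows. `ToyTargetModel`, `NUM_LAYERS`, the fallback layer IDs and `config_layer_ids` are made up for illustration and are not taken from the SGLang source; only the method name `set_eagle3_layers_to_capture()` comes from this PR.

```python
from typing import List, Optional

NUM_LAYERS = 32  # hypothetical model depth


class ToyTargetModel:
    def __init__(self) -> None:
        self.layers_to_capture: List[int] = []

    def set_eagle3_layers_to_capture(self, layer_ids: Optional[List[int]] = None) -> None:
        if layer_ids is None:
            # Hypothetical fallback used when no layer IDs are passed.
            self.layers_to_capture = [2, NUM_LAYERS // 2, NUM_LAYERS - 3]
        else:
            self.layers_to_capture = list(layer_ids)


config_layer_ids = [4, 12, 20]  # stand-in for eagle_aux_hidden_state_layer_ids
model = ToyTargetModel()

# Old flow, call 1: the graph runner calls the setter without arguments before capture.
model.set_eagle3_layers_to_capture()
layers_in_graph = list(model.layers_to_capture)  # what the captured graph would use

# Old flow, call 2: initialize() calls it again with the config IDs after capture.
model.set_eagle3_layers_to_capture(config_layer_ids)
layers_after_init = list(model.layers_to_capture)

print(layers_in_graph)    # [2, 16, 29]
print(layers_after_init)  # [4, 12, 20]
```

The two calls disagree, so the captured graph and the rest of the runtime would use different layers whenever the config specifies custom IDs. A single call with the config-specified IDs before capture removes the mismatch.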

@merrymercy
Contributor Author

/tag-and-rerun-ci

@merrymercy merrymercy merged commit 43925d1 into main Apr 15, 2026
340 of 443 checks passed
@merrymercy merrymercy deleted the lianmin/fix-eagle-capture branch April 15, 2026 21:04
jmamou pushed a commit to jmamou/sglang that referenced this pull request Apr 20, 2026
yhyang201 pushed a commit to yhyang201/sglang that referenced this pull request Apr 22, 2026
zhangying098 pushed a commit to zhangying098/sglang that referenced this pull request Apr 23, 2026
kyx1999 pushed a commit to KMSorSMS/sglang that referenced this pull request Apr 27, 2026
empty-quiver pushed a commit to empty-quiver/sglang-turboquant that referenced this pull request Apr 28, 2026
empty-quiver pushed a commit to empty-quiver/sglang-turboquant that referenced this pull request Apr 28, 2026

2 participants