[WIP] Fix GPT-OSS prefix caching not working with EAGLE by mgoin · Pull Request #32801 · vllm-project/vllm

mgoin · 2026-01-21T19:22:12Z

Purpose

Test Plan

Test Result

Essential Elements of an Effective PR Description Checklist

The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
The test plan, such as providing test command.
The test results, such as pasting the results comparison before and after, or e2e results.
(Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
(Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Signed-off-by: mgoin <mgoin64@gmail.com>

gemini-code-assist

Code Review

This pull request refactors the EAGLE speculative decoding feature to support hybrid models by replacing the use_eagle boolean with an EagleMode enum. This allows for more granular control, designating a primary handler (like full attention) to manage EAGLE logic while secondary handlers (like sliding window) adjust accordingly. The changes in kv_cache_coordinator.py and single_type_kv_cache_manager.py correctly implement this new logic. A new test, test_hybrid_model_with_eagle, is added to verify this behavior. However, I've found a potential issue in the test's assertion, which I've detailed in a specific comment.
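To make the primary/secondary split concrete, here is a minimal sketch of how an enum could replace a boolean flag. The member names (DISABLED, PRIMARY, SECONDARY) and the helper usable_hit_blocks are hypothetical; the page does not show the PR's actual EagleMode definition, and the assumption that secondary handlers leave the hit untouched is an illustration only.

```python
from enum import Enum, auto


class EagleMode(Enum):
    """Hypothetical members; the PR's real enum may differ."""

    DISABLED = auto()   # EAGLE is off for this handler
    PRIMARY = auto()    # this handler applies the EAGLE block drop
    SECONDARY = auto()  # another handler is primary; assumed not to drop again


def usable_hit_blocks(hit_blocks: int, mode: EagleMode) -> int:
    """Return how many cached blocks a handler reports as computed.

    Assumption: only the primary handler drops the last block of a hit.
    """
    if mode is EagleMode.PRIMARY and hit_blocks > 0:
        return hit_blocks - 1
    return hit_blocks
```

With a plain boolean, every handler in a hybrid model would apply the same rule; the enum lets exactly one handler own the drop.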

gemini-code-assist · 2026-01-21T19:25:09Z

+    assert full_attn_blocks == num_blocks - 2, (
+        f"Expected {num_blocks - 2}, got {full_attn_blocks} blocks"
+    )


The assertion for the number of computed blocks appears to be incorrect. With EAGLE enabled, the FullAttentionManager (as the primary handler) should drop one block from the cache hit. If 20 blocks are found, this should result in 19 computed blocks.

Based on my analysis of the fixed-point iteration in HybridKVCacheCoordinator, the process should converge to 19 blocks, not 18. The assertion should likely be for num_blocks - 1.

Suggested change

-    assert full_attn_blocks == num_blocks - 2, (
-        f"Expected {num_blocks - 2}, got {full_attn_blocks} blocks"
-    )
+    assert full_attn_blocks == num_blocks - 1, (
+        f"Expected {num_blocks - 1}, got {full_attn_blocks} blocks"
+    )
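The reviewer's arithmetic can be illustrated with a toy fixed-point loop. This is a sketch of the reasoning in the comment only, not the code in HybridKVCacheCoordinator: the handler functions, the NUM_BLOCKS constant, and the assumption that the primary handler caps the hit at one block short of the cached prefix are all hypothetical. It does not settle whether num_blocks - 1 or num_blocks - 2 is the right assertion for the real test.

```python
from typing import Callable, List

# A handler takes a block limit and returns a hit length no larger than it.
Handler = Callable[[int], int]

NUM_BLOCKS = 20  # blocks found in the cache, as in the review comment


def full_attention_primary(limit: int) -> int:
    # Assumed: the primary handler drops one block from the cached prefix.
    return min(limit, NUM_BLOCKS - 1)


def sliding_window_secondary(limit: int) -> int:
    # Assumed: the secondary handler never drops an extra block.
    return min(limit, NUM_BLOCKS)


def fixed_point_hit(handlers: List[Handler], start: int) -> int:
    """Shrink the hit length until no handler reduces it further."""
    hit = start
    while True:
        new_hit = hit
        for handler in handlers:
            new_hit = handler(new_hit)
        if new_hit == hit:
            return hit
        hit = new_hit


result = fixed_point_hit(
    [full_attention_primary, sliding_window_secondary], NUM_BLOCKS
)
print(result)  # 19 under the assumptions above
```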

[WIP] Fix GPT-OSS prefix caching not working with EAGLE

f43be2b

Signed-off-by: mgoin <mgoin64@gmail.com>

mgoin mentioned this pull request Jan 21, 2026

[Bug]: GPT-OSS 0% prefix cache hits with hybrid attention + EAGLE #32802

Closed

1 task

mergify Bot added the gpt-oss (Related to GPT-OSS models) and v1 labels Jan 21, 2026

github-project-automation Bot moved this to To Triage in gpt-oss Issues & Enhancements Jan 21, 2026

github-project-automation Bot added this to gpt-oss Issues & Enhancements Jan 21, 2026

gemini-code-assist Bot reviewed Jan 21, 2026

mgoin closed this Feb 9, 2026

github-project-automation Bot moved this from To Triage to Done in gpt-oss Issues & Enhancements Feb 9, 2026

mgoin wants to merge 1 commit into vllm-project:main from neuralmagic:sliding-window-eagle-fix
