Skip to content

[WIP] Fix GPT-OSS prefix caching not working with EAGLE#32801

Closed
mgoin wants to merge 1 commit into
vllm-project:mainfrom
neuralmagic:sliding-window-eagle-fix
Closed

[WIP] Fix GPT-OSS prefix caching not working with EAGLE#32801
mgoin wants to merge 1 commit into
vllm-project:mainfrom
neuralmagic:sliding-window-eagle-fix

Conversation

@mgoin

@mgoin mgoin commented Jan 21, 2026

Copy link
Copy Markdown
Member

Purpose

Test Plan

Test Result


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Signed-off-by: mgoin <mgoin64@gmail.com>

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request refactors the EAGLE speculative decoding feature to support hybrid models by replacing the use_eagle boolean with an EagleMode enum. This allows for more granular control, designating a primary handler (like full attention) to manage EAGLE logic while secondary handlers (like sliding window) adjust accordingly. The changes in kv_cache_coordinator.py and single_type_kv_cache_manager.py correctly implement this new logic. A new test, test_hybrid_model_with_eagle, is added to verify this behavior. However, I've found a potential issue in the test's assertion, which I've detailed in a specific comment.

Comment on lines +1937 to +1939
assert full_attn_blocks == num_blocks - 2, (
f"Expected {num_blocks - 2}, got {full_attn_blocks} blocks"
)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The assertion for the number of computed blocks appears to be incorrect. With EAGLE enabled, the FullAttentionManager (as the primary handler) should drop one block from the cache hit. If 20 blocks are found, this should result in 19 computed blocks.

Based on my analysis of the fixed-point iteration in HybridKVCacheCoordinator, the process should converge to 19 blocks, not 18. The assertion should likely be for num_blocks - 1.

Suggested change
assert full_attn_blocks == num_blocks - 2, (
f"Expected {num_blocks - 2}, got {full_attn_blocks} blocks"
)
assert full_attn_blocks == num_blocks - 1, (
f"Expected {num_blocks - 1}, got {full_attn_blocks} blocks"
)

@mgoin mgoin closed this Feb 9, 2026
@github-project-automation github-project-automation Bot moved this from To Triage to Done in gpt-oss Issues & Enhancements Feb 9, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

gpt-oss Related to GPT-OSS models v1

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

1 participant