[WIP] Fix GPT-OSS prefix caching not working with EAGLE#32801
Conversation
Signed-off-by: mgoin <mgoin64@gmail.com>
There was a problem hiding this comment.
Code Review
This pull request refactors the EAGLE speculative decoding feature to support hybrid models by replacing the use_eagle boolean with an EagleMode enum. This allows for more granular control, designating a primary handler (like full attention) to manage EAGLE logic while secondary handlers (like sliding window) adjust accordingly. The changes in kv_cache_coordinator.py and single_type_kv_cache_manager.py correctly implement this new logic. A new test, test_hybrid_model_with_eagle, is added to verify this behavior. However, I've found a potential issue in the test's assertion, which I've detailed in a specific comment.
| assert full_attn_blocks == num_blocks - 2, ( | ||
| f"Expected {num_blocks - 2}, got {full_attn_blocks} blocks" | ||
| ) |
There was a problem hiding this comment.
The assertion for the number of computed blocks appears to be incorrect. With EAGLE enabled, the FullAttentionManager (as the primary handler) should drop one block from the cache hit. If 20 blocks are found, this should result in 19 computed blocks.
Based on my analysis of the fixed-point iteration in HybridKVCacheCoordinator, the process should converge to 19 blocks, not 18. The assertion should likely be for num_blocks - 1.
| assert full_attn_blocks == num_blocks - 2, ( | |
| f"Expected {num_blocks - 2}, got {full_attn_blocks} blocks" | |
| ) | |
| assert full_attn_blocks == num_blocks - 1, ( | |
| f"Expected {num_blocks - 1}, got {full_attn_blocks} blocks" | |
| ) |
Purpose
Test Plan
Test Result
Essential Elements of an Effective PR Description Checklist
supported_models.mdandexamplesfor a new model.