Your current environment
The output of python collect_env.py
Your output of `python collect_env.py` here
🐛 Describe the bug
# Eval script used
python tests/evals/gsm8k/gsm8k_eval.py
# With EAGLE: 0% prefix cache hit rate
vllm serve openai/gpt-oss-120b \
--speculative-config '{"model": "nvidia/gpt-oss-120b-Eagle3-long-context", "num_speculative_tokens": 3, "method": "eagle3"}'
# Prefix cache hit rate: 0.0%
# Without EAGLE: ~80% hit rate as expected
vllm serve openai/gpt-oss-120b
# Prefix cache hit rate: 88.7%
# Workaround: disable hybrid KV cache manager
vllm serve openai/gpt-oss-120b \
--speculative-config '...' \
--disable-hybrid-kv-cache-manager
# Prefix cache hit rate: ~80%
Qwen3 with EAGLE works fine (not a hybrid attention model).
Root Cause
In HybridKVCacheCoordinator.find_longest_cache_hit():
- Full attention finds N blocks, does EAGLE drop → returns N-1 blocks
- This sets
max_length = (N-1) * block_size for sliding window
- Sliding window searches range 0 to N-2, but its cached blocks may be at positions N-k to N-1 (within the window)
- The last cached block (N-1) is now outside the search range
- Sliding window can't find enough contiguous blocks to meet its threshold → returns 0
- Sliding window also does its own EAGLE drop, causing cascading reductions in the convergence loop
Potential Solutions
Option 1: Coordinator-level EAGLE handling
- Remove EAGLE logic from individual managers
- Coordinator applies EAGLE drop once at the end after all managers converge
- Cleanest separation of concerns, but requires careful handling of per-manager threshold requirements
Option 2: EagleMode enum (implemented in my experimental PR #32801)
- Single
eagle_mode parameter: DISABLED, PRIMARY, SECONDARY
PRIMARY: This manager handles EAGLE (threshold+1, do pop)
SECONDARY: Another manager handled EAGLE (threshold-1, no pop)
- Coordinator assigns: full attention = PRIMARY, others = SECONDARY
- If no full attention, all managers = PRIMARY (this is the part I'm least sure about)
Option 3: Skip EAGLE for non-full-attention in hybrid models
- In
HybridKVCacheCoordinator, only pass EAGLE flag to full attention
- Other managers just find best match within the (already reduced) range
- Simpler but may need threshold adjustment for sliding window
Before submitting a new issue...
Your current environment
The output of
python collect_env.py🐛 Describe the bug
Qwen3 with EAGLE works fine (not a hybrid attention model).
Root Cause
In
HybridKVCacheCoordinator.find_longest_cache_hit():max_length = (N-1) * block_sizefor sliding windowPotential Solutions
Option 1: Coordinator-level EAGLE handling
Option 2: EagleMode enum (implemented in my experimental PR #32801)
eagle_modeparameter:DISABLED,PRIMARY,SECONDARYPRIMARY: This manager handles EAGLE (threshold+1, do pop)SECONDARY: Another manager handled EAGLE (threshold-1, no pop)Option 3: Skip EAGLE for non-full-attention in hybrid models
HybridKVCacheCoordinator, only pass EAGLE flag to full attentionBefore submitting a new issue...