[Bug]: GPT-OSS 0% prefix cache hits with hybrid attention + EAGLE

### Your current environment

<details>
<summary>The output of <code>python collect_env.py</code></summary>

```text
Your output of `python collect_env.py` here
```

</details>


### 🐛 Describe the bug

```bash
# Eval script used
python tests/evals/gsm8k/gsm8k_eval.py

# With EAGLE: 0% prefix cache hit rate
vllm serve openai/gpt-oss-120b \
  --speculative-config '{"model": "nvidia/gpt-oss-120b-Eagle3-long-context", "num_speculative_tokens": 3, "method": "eagle3"}'
# Prefix cache hit rate: 0.0%

# Without EAGLE: ~80% hit rate as expected
vllm serve openai/gpt-oss-120b
# Prefix cache hit rate: 88.7%

# Workaround: disable hybrid KV cache manager
vllm serve openai/gpt-oss-120b \
  --speculative-config '...' \
  --disable-hybrid-kv-cache-manager
# Prefix cache hit rate: ~80%
```

Qwen3 with EAGLE works fine (not a hybrid attention model).

### Root Cause

In `HybridKVCacheCoordinator.find_longest_cache_hit()`:

1. Full attention finds N blocks, does EAGLE drop → returns N-1 blocks
2. This sets `max_length = (N-1) * block_size` for sliding window
3. Sliding window searches range 0 to N-2, but its cached blocks may be at positions N-k to N-1 (within the window)
4. The last cached block (N-1) is now outside the search range
5. Sliding window can't find enough contiguous blocks to meet its threshold → returns 0
6. Sliding window also does its own EAGLE drop, causing cascading reductions in the convergence loop

### Potential Solutions

**Option 1: Coordinator-level EAGLE handling**
- Remove EAGLE logic from individual managers
- Coordinator applies EAGLE drop once at the end after all managers converge
- Cleanest separation of concerns, but requires careful handling of per-manager threshold requirements

**Option 2: EagleMode enum (implemented in my experimental PR https://github.com/vllm-project/vllm/pull/32801)**
- Single `eagle_mode` parameter: `DISABLED`, `PRIMARY`, `SECONDARY`
- `PRIMARY`: This manager handles EAGLE (threshold+1, do pop)
- `SECONDARY`: Another manager handled EAGLE (threshold-1, no pop)
- Coordinator assigns: full attention = PRIMARY, others = SECONDARY
- If no full attention, all managers = PRIMARY (this is the part I'm least sure about)

**Option 3: Skip EAGLE for non-full-attention in hybrid models**
- In `HybridKVCacheCoordinator`, only pass EAGLE flag to full attention
- Other managers just find best match within the (already reduced) range
- Simpler but may need threshold adjustment for sliding window

### Before submitting a new issue...

- [x] Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the [documentation page](https://docs.vllm.ai/en/latest/), which can answer lots of frequently asked questions.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[Bug]: GPT-OSS 0% prefix cache hits with hybrid attention + EAGLE #32802

Your current environment

🐛 Describe the bug

Root Cause

Potential Solutions

Before submitting a new issue...

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Uh oh!

[Bug]: GPT-OSS 0% prefix cache hits with hybrid attention + EAGLE #32802

Description

Your current environment

🐛 Describe the bug

Root Cause

Potential Solutions

Before submitting a new issue...

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions