Skip to content

[Bug]: GPT-OSS 0% prefix cache hits with hybrid attention + EAGLE #32802

@mgoin

Description

@mgoin

Your current environment

The output of python collect_env.py
Your output of `python collect_env.py` here

🐛 Describe the bug

# Eval script used
python tests/evals/gsm8k/gsm8k_eval.py

# With EAGLE: 0% prefix cache hit rate
vllm serve openai/gpt-oss-120b \
  --speculative-config '{"model": "nvidia/gpt-oss-120b-Eagle3-long-context", "num_speculative_tokens": 3, "method": "eagle3"}'
# Prefix cache hit rate: 0.0%

# Without EAGLE: ~80% hit rate as expected
vllm serve openai/gpt-oss-120b
# Prefix cache hit rate: 88.7%

# Workaround: disable hybrid KV cache manager
vllm serve openai/gpt-oss-120b \
  --speculative-config '...' \
  --disable-hybrid-kv-cache-manager
# Prefix cache hit rate: ~80%

Qwen3 with EAGLE works fine (not a hybrid attention model).

Root Cause

In HybridKVCacheCoordinator.find_longest_cache_hit():

  1. Full attention finds N blocks, does EAGLE drop → returns N-1 blocks
  2. This sets max_length = (N-1) * block_size for sliding window
  3. Sliding window searches range 0 to N-2, but its cached blocks may be at positions N-k to N-1 (within the window)
  4. The last cached block (N-1) is now outside the search range
  5. Sliding window can't find enough contiguous blocks to meet its threshold → returns 0
  6. Sliding window also does its own EAGLE drop, causing cascading reductions in the convergence loop

Potential Solutions

Option 1: Coordinator-level EAGLE handling

  • Remove EAGLE logic from individual managers
  • Coordinator applies EAGLE drop once at the end after all managers converge
  • Cleanest separation of concerns, but requires careful handling of per-manager threshold requirements

Option 2: EagleMode enum (implemented in my experimental PR #32801)

  • Single eagle_mode parameter: DISABLED, PRIMARY, SECONDARY
  • PRIMARY: This manager handles EAGLE (threshold+1, do pop)
  • SECONDARY: Another manager handled EAGLE (threshold-1, no pop)
  • Coordinator assigns: full attention = PRIMARY, others = SECONDARY
  • If no full attention, all managers = PRIMARY (this is the part I'm least sure about)

Option 3: Skip EAGLE for non-full-attention in hybrid models

  • In HybridKVCacheCoordinator, only pass EAGLE flag to full attention
  • Other managers just find best match within the (already reduced) range
  • Simpler but may need threshold adjustment for sliding window

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't working

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions