[Performance] Missing global conversation history state causes 314x slowdown on llama.cpp with Qwen3.5-27B #13442

@lsunay

System: RTX 3090 (24GB VRAM) + llama.cpp + Qwen3.5-27B-Q4_K_M + -np 1


🚨 Problem Summary

Every LLM request resends the system prompt and builds a new messages array, which invalidates the KV cache on every request. With --keep -1 on a llama.cpp backend, responses take ~122s instead of the ~0.39s expected with a warm cache, a slowdown of roughly 314x.

Impact: Affects local LLM users running llama.cpp with a single-slot configuration (-np 1) and --keep -1 (keep all tokens of the initial prompt).

Note: This analysis is based on testing with Qwen3.5-27B on llama.cpp. Other models/backends may behave differently.


🎯 Environment & Reproduction

Hardware/Software Stack:

  • GPU: NVIDIA RTX 3090 (24GB VRAM)
  • Backend: llama-cpp-turboquant:cuda (Docker)
  • Model: Qwen3.5-27B-Q4_K_M.gguf (~16GB)
  • Context: 32K-256K tokens
  • Configuration: -np 1 (single parallel slot)

llama.cpp Server Configuration:

docker run -d --name llama27b-turbo4 \
  --gpus all \
  -p 8089:8080 \
  -v /models:/models:ro \
  llama-cpp-turboquant:cuda \
    llama-server \
    -m /models/Qwen3.5-27B-Q4_K_M.gguf \
    --cache-prompt \
    --cache-reuse 1024 \
    --keep -1 \
    -ngl 99 \
    -c 262144 \
    --flash-attn on \
    --cache-type-k q4_0 \
    --cache-type-v q4_0 \
    --batch-size 2048 \
    --ubatch-size 512 \
    -np 1 \
    --host 0.0.0.0 \
    --port 8080

Reproduction Steps:

  1. Start llama.cpp server with --keep -1 and -np 1
  2. Run Hermes CLI with base_url=http://localhost:8089/v1
  3. Send first user message → ~122s response time
  4. Send second user message → ~122s response time (should be <1s with cache)
  5. Monitor llama.cpp logs for n_past values:
    docker logs llama27b-turbo4 2>&1 | grep "n_past"
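To separate server-side caching behaviour from client behaviour, it may help to time an append-only conversation sent directly to the server, bypassing Hermes. The following is a minimal sketch, assuming the server exposes the OpenAI-compatible chat completions endpoint under the base_url from step 2. The helper names (post_chat, time_turns) and the "local" model placeholder are hypothetical and not part of Hermes or llama.cpp.

```python
import json
import time
import urllib.request
from typing import Any, Callable, Dict, List

BASE_URL = "http://localhost:8089/v1"  # base_url from the reproduction steps

Message = Dict[str, Any]


def post_chat(messages: List[Message], base_url: str = BASE_URL) -> Dict[str, Any]:
    """POST one chat completion request and return the decoded JSON body."""
    body = json.dumps({"model": "local", "messages": messages}).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def time_turns(
    user_turns: List[str],
    send: Callable[[List[Message]], Dict[str, Any]] = post_chat,
) -> List[float]:
    """Send each user turn with the full, append-only history and time it."""
    history: List[Message] = []
    timings: List[float] = []
    for text in user_turns:
        history.append({"role": "user", "content": text})
        start = time.perf_counter()
        reply = send(history)
        timings.append(time.perf_counter() - start)
        # Keep the assistant reply so the next request extends the same prefix.
        history.append(reply["choices"][0]["message"])
    return timings
```

Calling time_turns with two user messages against the server from step 1 gives a baseline: if the second timing is much lower than the first here but not through Hermes, the difference would point at how the client builds its requests.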

Expected vs Actual:

Expected (with KV cache):

slot update_slots: id  0 | task XXXXX | n_past = 34576, slot.prompt.tokens.size() = 34775

Response time: <1s

Actual (cache invalid):

slot update_slots: id  0 | task XXXXX | n_past = 3, slot.prompt.tokens.size() = 298
erased invalidated context checkpoint (15 instances)

Response time: ~122s
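The two log lines above can be compared numerically. The sketch below treats n_past as the number of reused prefix tokens and slot.prompt.tokens.size() as the size of the cached prompt it is compared against. That reading is an interpretation of the quoted log lines, not documented llama.cpp semantics, and the function names are hypothetical.

```python
import re
from typing import Optional, Tuple

# Matches the two counters in lines such as:
#   slot update_slots: id  0 | task XXXXX | n_past = 34576, slot.prompt.tokens.size() = 34775
_LOG_RE = re.compile(r"n_past = (\d+), slot\.prompt\.tokens\.size\(\) = (\d+)")


def parse_reuse(line: str) -> Optional[Tuple[int, int]]:
    """Return (n_past, prompt_tokens) from a log line, or None if absent."""
    match = _LOG_RE.search(line)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def reuse_ratio(line: str) -> Optional[float]:
    """Fraction of prompt tokens covered by n_past (assumed to mean reuse)."""
    parsed = parse_reuse(line)
    if parsed is None or parsed[1] == 0:
        return None
    n_past, prompt_tokens = parsed
    return n_past / prompt_tokens


expected = "slot update_slots: id  0 | task XXXXX | n_past = 34576, slot.prompt.tokens.size() = 34775"
actual = "slot update_slots: id  0 | task XXXXX | n_past = 3, slot.prompt.tokens.size() = 298"

print(f"{reuse_ratio(expected):.1%}")  # 99.4%
print(f"{reuse_ratio(actual):.1%}")    # 1.0%
```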


🔍 Root Cause Analysis

Problem Location: run_agent.py

Issue 1: No Persistent Global State (Line ~7783)

Current Code:

# Initialize conversation (copy to avoid mutating the caller's list)
messages = list(conversation_history) if conversation_history else []

Problem:

  • messages array is recreated from scratch on every LLM request
  • Only the initial conversation_history parameter is used
  • No persistent state between LLM requests within a single run_conversation() call
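For reference, the quoted expression makes a shallow copy, so anything appended to messages afterwards never reaches the caller's list. A small standalone illustration follows; start_turn is a hypothetical stand-in, not the real run_conversation().

```python
from typing import Any, Dict, List, Optional

Message = Dict[str, Any]


def start_turn(conversation_history: Optional[List[Message]]) -> List[Message]:
    # Same expression as the quoted line: a shallow copy, or a new empty list.
    messages = list(conversation_history) if conversation_history else []
    # Appending here changes only the local copy.
    messages.append({"role": "assistant", "content": "reply"})
    return messages


caller_history = [{"role": "user", "content": "hello"}]
messages = start_turn(caller_history)
print(len(caller_history), len(messages))  # 1 2
```

Whether the appended messages survive to the next request therefore depends on what the caller does with the returned list, not on the list it passed in.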

Issue 2: System Prompt Sent Every Request (Line ~8127)

Current Code:

if effective_system:
    api_messages = [{"role": "system", "content": effective_system}] + api_messages

Problem:

  • The system prompt is prepended to api_messages on EVERY LLM request
  • If this changes the token sequence the server sees, the cached prefix no longer matches and the KV cache is invalidated
  • --keep -1 cannot help because the prefix changes

Important Note: The code creates a copy called api_messages from messages (lines 8090-8113), then adds system prompt to api_messages. But api_messages is recreated fresh on every LLM request, so the system prompt is always added.
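The effect of a changing prefix can be illustrated at message level. This is only an approximation: llama.cpp compares token sequences after the chat template is applied, not message dictionaries, and the example messages and the function name below are hypothetical.

```python
from typing import Any, Dict, List

Message = Dict[str, Any]


def shared_prefix_len(previous: List[Message], current: List[Message]) -> int:
    """Count leading messages that are identical in both requests."""
    count = 0
    for old, new in zip(previous, current):
        if old != new:
            break
        count += 1
    return count


first = [
    {"role": "system", "content": "You are a helpful agent."},
    {"role": "user", "content": "hello"},
]
# Follow-up that extends the first request without touching earlier messages.
stable = first + [
    {"role": "assistant", "content": "hi"},
    {"role": "user", "content": "next question"},
]
# Follow-up whose system prompt text differs (hypothetical example content).
changed = [{"role": "system", "content": "You are a helpful agent. (edited)"}] + stable[1:]

print(shared_prefix_len(first, stable))   # 2
print(shared_prefix_len(first, changed))  # 0
```

A difference in the very first message leaves nothing to reuse, which matches the single-digit n_past values in the logs if the leading tokens differ between requests.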

Issue 3: No Message Persistence (Lines ~7468, ~10021)

Current Code:

# Tool results
tool_msg = {"role": "tool", "content": function_result, "tool_call_id": tool_call.id}
messages.append(tool_msg)

# Assistant responses  
assistant_msg = {"role": "assistant", "content": final_response}
messages.append(assistant_msg)

Problem:

  • Messages appended to local messages list
  • List is discarded after LLM request returns
  • Next LLM request starts fresh → cache broken

📊 Performance Impact

Metrics:

| Metric           | Current (Broken)   | Expected (Fixed)    | Degradation    |
|------------------|--------------------|---------------------|----------------|
| Response Time    | ~122s              | ~0.39s              | 314x slower    |
| Cache Hit Rate   | ~1%                | ~99%                | 98% worse      |
| n_past (typical) | 3-10               | 30,000+             | Cache not used |
| Token Processing | 32K tokens/request | ~100 tokens/request | 320x waste     |
| VRAM Usage       | 21.9/24GB (89%)    | ~20/24GB (83%)      | Cache overflow |
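The ratios in the table can be checked against the rounded figures quoted in this report. The variable names below are illustrative only.

```python
# Rounded figures taken from the metrics in this report.
observed_seconds = 122.0
expected_seconds = 0.39
tokens_per_request_now = 32_000
tokens_per_request_fixed = 100

slowdown = observed_seconds / expected_seconds
token_waste = tokens_per_request_now // tokens_per_request_fixed

print(round(slowdown))  # 313
print(token_waste)      # 320
```

The rounded inputs give roughly 313x, in line with the reported 314x, which presumably comes from unrounded timings.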

llama.cpp Log Evidence:

# Request 1 (first message)
slot 0 | n_past = 0, processing 32000 tokens... → 122s

# Request 2 (follow-up, should use cache)
slot 0 | n_past = 3, processing 32000 tokens... → 122s
erased invalidated context checkpoint (15 instances)

# Request 3 (same pattern)
slot 0 | n_past = 4, processing 32000 tokens... → 122s

🧪 What We Attempted

We attempted to implement a fix by adding global conversation history state to AIAgent. Here's what we tried:

Attempted Implementation:

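from typing import Any, Dict, List

class AIAgent:
    # The real constructor signature is elided in this report.
    def __init__(self, *args: Any, **kwargs: Any) -> None:
        # Persistent state shared across LLM requests
        self._global_conversation_history: List[Dict[str, Any]] = []
        self._system_prompt_sent: bool = False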

Changes Made:

  1. Modified message initialization to use global state
  2. Added system prompt only on first LLM request
  3. Persisted messages to global history after each LLM request

Results:

  • ✅ Initial tests showed promise (n_past increased)
  • ❌ Integration with existing code caused errors
  • ❌ Not fully compatible with background review system
  • ❌ Had to revert all changes

See: IMPLEMENTATION-ATTEMPT-ANALYSIS.md for detailed attempt documentation.

Additional Issues Found:

  1. Speculative Decoding: Breaks KV cache on llama.cpp
  2. Cache Reuse: --cache-reuse 1024 too low for 32K context
  3. VRAM Pressure: 28GB needed, 24GB available (with speculative decoding)
  4. Parallel Requests: -np 1 forces single slot

🔗 Related Issues

Our research found several related but different issues:

Issue #4555: KV cache invalidation on new user message

Issue #4319: KV cache invalidation on compression

  • Different: That issue is about compression triggering system prompt rebuild
  • Our finding: System prompt sent on EVERY LLM request (not just compression)
  • Relationship: Related concern, different scope

Issue #12089: Conversation-aware sliding cache breakpoints

Issue #8687: System prompt timestamp changes after compression

  • Related: Both about system prompt stability
  • Our finding: System prompt shouldn't change, but shouldn't be resent either
  • Relationship: Complementary issue

Issue #3353: Runtime metadata in cached system prompt


🔗 Related PR


💡 Proposed Solution Approach

Core Concept: Persistent Conversation State

Add persistent state to AIAgent class to maintain conversation history across LLM requests.

Key Changes Needed:

  1. Message Initialization (line ~7783)

    • Use global state instead of recreating from parameter
    • Check if system prompt already exists
  2. System Prompt Injection (line ~8127)

    • Add system prompt only once (first LLM request)
    • Track with _system_prompt_sent flag
  3. Message Persistence (after each LLM response)

    • Append assistant/tool messages to global history
    • Ensure consistency with session DB
  4. Session Integration (save/load)

    • Persist global history to session DB
    • Restore on session reload
    • Handle context compression
  5. Background Review Compatibility (ORIGIN-ANALYSIS-REPORT.md)

    • Merge with origin/main's _spawn_background_review()
    • Ensure background review doesn't break cache

❓ Questions for Maintainers

We need maintainer guidance to properly implement this fix:

1. Design Intent:

  • Was the lack of global state a design decision?
  • If so, what are the trade-offs we're missing?

2. Background Review:

  • Origin/main has _spawn_background_review() that forks AIAgent
  • How should this interact with global conversation state?
  • Should background review have its own history?

3. Session Persistence:

  • Should global history be stored in session DB?
  • How to handle context compression with persistent state?

4. Model/Backend Compatibility:

  • Does this approach work with vLLM, Ollama, cloud providers?
  • Are there backends where this would break?
  • We only tested with llama.cpp + Qwen3.5-27B

5. Alternative Approaches:

  • Is there a better architectural solution?
  • Should we modify the message format instead?
  • What would you recommend?

📚 Research Documentation

We created detailed analysis documents. Available for review:

  1. LLM-MESSAGE-ARRAY-ISSUE-ANALYSIS.md - Initial problem identification
  2. ROOT-CAUSE-ANALYSIS.md - Root cause with 4 solution options
  3. ORIGIN-ANALYSIS-REPORT.md - Comparison with origin/main
  4. IMPLEMENTATION-ATTEMPT-ANALYSIS.md - Our attempted fix (unsuccessful)

Note: We can share these files if helpful for understanding our investigation.


🎯 Expected Impact

If this issue is resolved:

  • 314x performance improvement for llama.cpp users with -np 1
  • Proper KV cache utilization with --keep -1
  • Reduced token waste (system prompt sent once)
  • Better VRAM efficiency (cache doesn't overflow)
  • Improved UX for all local LLM users

User Impact: Affects anyone running:

  • llama.cpp with --keep -1 and -np 1
  • Long conversations (30K+ tokens)
  • Qwen3.5-27B or similar large models
  • RTX 3090 or similar 24GB VRAM GPUs

🏷️ Suggested Labels

  • performance
  • enhancement
  • backend:llama.cpp
  • KV-cache
  • priority:high
  • needs-maintainer-input

👤 Reporter

Levent Sunay (@lsunay1)
Date: 2026-04-21
System: RTX 3090 (24GB) + llama-cpp-turboquant + Qwen3.5-27B-Q4_K_M
Impact: 314x slowdown (~122s observed vs. ~0.39s expected)


Important Note: We attempted to implement a fix but were unable to complete it successfully due to integration issues with the existing codebase. We believe this is a fundamental architecture issue that needs maintainer input on the best approach.

We're sharing our detailed research and partial implementation attempt to:

  1. Document the problem we found
  2. Show our investigation and analysis
  3. Request maintainer guidance on proper implementation
  4. Offer to collaborate on testing and refinement

We're happy to help test any proposed solution! 🙏

Labels: P2 (Medium: degraded but workaround exists), comp/agent (Core agent loop, run_agent.py, prompt builder), provider/ollama (Ollama / local models), type/perf (Performance improvement or optimization)
