[Bug]: llama-cpp and Ollama providers return incorrect context usage due to field name mismatch #53448

@oven1231231234

Description

Bug type

Behavior bug (incorrect output/state without crash)

Summary

Problem Description

OpenClaw fails to accurately track token usage due to mismatched field names between expected and actual API responses, causing context usage to display as 0/80k (0%) even when the model is actively consuming significant tokens.

Environment:

  • 🦞 OpenClaw: 2026.3.23-1
  • 🧠 Model: llama-cpp/qwen35b-local
  • 📚 Context Display: 0/80k (0%)
  • 🧵 Session: agent:main:main
  • 🪢 Runtime: direct

Affected Frameworks

| Framework | Status | Notes |
| --- | --- | --- |
| llama.cpp server | AFFECTED | Most common local deployment solution |
| Ollama | AFFECTED | Popular model management service |
| vLLM | NOT AFFECTED | Compatible (OpenAI format) |
| HuggingFace TGI | NOT AFFECTED | Compatible (OpenAI format) |
| OpenAI API | NOT AFFECTED | Compatible (OpenAI format) |

Root Cause

OpenClaw reads these field names at approximately line 181675 of the bundled file listed under Code Location:

input: response.usage?.input_tokens ?? 0,
output: response.usage?.output_tokens ?? 0,

However, different frameworks return different field names:

llama.cpp server (OpenAI-compatible format)

{
  "usage": {
    "prompt_tokens": 11,
    "completion_tokens": 1,
    "total_tokens": 12
  }
}

Ollama (custom format)

{
  "prompt_eval_count": 26,
  "eval_count": 259
}

vLLM / TGI / OpenAI (OpenAI standard format)

{
  "usage": {
    "prompt_tokens": 100,
    "completion_tokens": 50,
    "total_tokens": 150
  }
}

Real-World Case

User Configuration:

  • OpenClaw Display: 0/80k (0%)
  • Remote llama-server (192.168.3.77:8080) Actual Usage: 43250/80000 (54%)

Cause: llama.cpp server returns prompt_tokens, but OpenClaw expects input_tokens.
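
To make the gap between the two numbers concrete, here is a minimal sketch of how a status line such as `0/80k (0%)` could be derived from a token count. `formatContextUsage` is a hypothetical helper written for this report, not a function taken from OpenClaw, and the rounding rules are an assumption.

```typescript
// Hypothetical helper (not OpenClaw code): renders "<used>/<limit> (<percent>%)".
function formatContextUsage(usedTokens: number, limitTokens: number): string {
  // Abbreviate values of 1000 or more as "<n>k"; smaller values are printed as-is.
  const toK = (n: number): string => (n >= 1000 ? `${Math.round(n / 1000)}k` : `${n}`);
  // Guard against a zero limit so the percentage is always a finite number.
  const percent = limitTokens > 0 ? Math.round((usedTokens / limitTokens) * 100) : 0;
  return `${toK(usedTokens)}/${toK(limitTokens)} (${percent}%)`;
}

// What the server-side count would show:
console.log(formatContextUsage(43250, 80000)); // "43k/80k (54%)"
// What is shown when the usage fields are not found and default to 0:
console.log(formatContextUsage(0, 80000)); // "0/80k (0%)"
```

With the count defaulting to 0, the display stays at 0% no matter how much of the window the server has actually filled.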


Chain Reactions from Failed Context Statistics

1. Context Window Overflow Risk

Due to inability to accurately track token usage:

Chain Reactions:

  1. User cannot see real-time token usage rate
  2. Cannot determine if conversation is approaching the 80k context limit
  3. May lead to:
    • Model truncation: Ultra-long conversations are forcibly truncated
    • Quality degradation: Context overflow causes model to forget early conversation
    • Session crash: API returns errors after exceeding limits

Actual Impact:

  • In long conversation scenarios, users may encounter context overflow without warning
  • Important conversation content may be lost

2. Conversation Management Failure

OpenClaw's conversation management mechanisms rely on accurate token counting:

Chain Reactions:

  1. Auto-compression mechanism fails:

    • OpenClaw may decide to compress historical messages based on token usage rate
    • If the count is 0, compression never triggers
    • Historical messages accumulate without bound and can eventually exhaust memory
  2. Session reset strategy fails:

    • Under some configurations, sessions automatically reset when token usage reaches a threshold
    • Because the count is 0, the reset never triggers
    • Session length grows unchecked
  3. Resource waste:

    • Cannot accurately evaluate token cost per session
    • May lead to unnecessarily long conversations

3. Cost Monitoring Failure

Even with free local models, token statistics are important performance metrics:

Chain Reactions:

  1. Performance analysis difficulty:

    • Cannot analyze token consumption across different conversations
    • Cannot identify abnormally high token usage patterns
    • Difficult to optimize conversation strategies
  2. Multi-model comparison fails:

    • If multiple model backends exist, cannot fairly compare token efficiency
    • Cannot make model switching decisions based on token usage
  3. API quota monitoring fails (if using paid APIs):

    • Cannot accurately track API quota usage
    • May unexpectedly exceed quota causing service interruption

4. LCM (Lossless Context Management) Function Abnormalities

OpenClaw's LCM system relies on token statistics to manage conversation history:

Chain Reactions:

  1. Historical message compression strategy fails:

    • LCM decides whether to compress history based on token usage rate
    • When count is 0, compression never triggers
    • Leads to uncontrolled memory usage
  2. Context optimization fails:

    • LCM cannot intelligently retain important conversations
    • May lead to important information being discarded too early
  3. Search and retrieval functionality affected:

    • LCM's search function may rely on token statistics
    • Leads to inaccurate search results

5. User Experience Degradation

Chain Reactions:

  1. User confusion:

    • See 0/80k (0%) display
    • User cannot determine conversation status
    • May mistakenly think system is malfunctioning
  2. Trust reduction:

    • Key metrics display incorrectly
    • User may question the reliability of the entire system
  3. Cannot optimize conversation strategy:

    • User cannot adjust conversation methods based on token usage
    • Cannot learn how to efficiently use the context window

6. Diagnosis and Debugging Difficulty

Chain Reactions:

  1. Problem troubleshooting difficulty:

    • If conversation anomalies occur, cannot locate issues through token statistics
    • Increases troubleshooting time costs
  2. Performance optimization blocked:

    • Cannot perform performance optimization based on token statistics
    • Difficult to identify performance bottlenecks
  3. Automated testing fails:

    • Automated tests may rely on token statistics as success metrics
    • Leads to inaccurate test results

7. Resource Allocation Issues in Multi-User/Multi-Session Scenarios

If multiple users or concurrent sessions exist:

Chain Reactions:

  1. Unequal resource allocation:

    • Cannot accurately track token usage per session
    • Leads to some sessions consuming excessive resources
  2. Service quality degradation:

    • Some sessions may respond slowly due to resource exhaustion
    • Affects overall user experience
  3. Quota management difficult to implement:

    • Cannot fairly allocate token quotas
    • May lead to certain users monopolizing resources

Problem Severity Assessment

| Issue | Severity | Affected Scope | Probability |
| ------------------------------- | --------- | ------------------------- | ----------- |
| Context window overflow | 🔴 High | All long conversations | High |
| Conversation management failure | 🟡 Medium | LCM users | Medium |
| Cost monitoring failure | 🟡 Medium | All users | High |
| LCM function abnormality | 🔴 High | LCM users | High |
| User experience degradation | 🟢 Low | All users | High |
| Diagnosis difficulty | 🟡 Medium | Developers/Advanced users | Medium |
| Resource allocation issues | 🟡 Medium | Multi-user scenarios | Medium |

Overall Severity: 🔴 High


Case Study 1: Long Conversation Leading to Content Loss

User Scenario:

  • Conducting a 50+ turn technical discussion
  • OpenClaw Display: 0/80k (0%)
  • Actual llama-server Usage: 65000/80000 (81%)

Result:

  • User thought there was still ample context available
  • Continued conversation until model started truncating early content
  • Key information from technical discussion was forgotten
  • Conversation quality deteriorated rapidly

Case Study 2: LCM Compression Mechanism Failure

User Scenario:

  • Configured automatic compression of historical messages
  • Expected compression to trigger when token usage reached 70%

Result:

  • Because the reported count stayed at 0, compression never triggered
  • Historical messages accumulated without bound
  • Memory usage grew and the system eventually became slow to respond
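
The failure mode in this case study can be sketched with a threshold check. `shouldCompress` and its 70% default are illustrative assumptions based on the scenario above. OpenClaw's actual trigger logic was not inspected.

```typescript
// Hypothetical trigger (not OpenClaw code): compress once usage reaches the threshold.
function shouldCompress(usedTokens: number, limitTokens: number, threshold = 0.7): boolean {
  return limitTokens > 0 && usedTokens / limitTokens >= threshold;
}

// With the count taken from the server (Case Study 1 figures), the trigger fires:
console.log(shouldCompress(65000, 80000)); // true
// With the count stuck at 0, it can never fire, whatever the real usage is:
console.log(shouldCompress(0, 80000)); // false
```

Any mechanism gated on a ratio like this is effectively disabled while the reported count is 0.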

Code Location

File: ~/.npm-global/lib/node_modules/openclaw/dist/pi-embedded-CwMQzdKD.js
Line: ~181675 (exact line may vary by version)


Test Steps

  1. Configure llama.cpp server as model backend
  2. Send a test message
  3. Check if context display updates

Expected Result:

  • Display actual token usage rate
  • Example: 12k/80k (15%) instead of 0/80k (0%)

Environment Information

| Item | Value |
| --- | --- |
| OpenClaw Version | 2026.3.23-1 |
| Remote llama-server | 192.168.3.77:8080 |
| Model | Qwen3.5-35B-A3B-GGUF |
| Operating System | macOS (user) / Ubuntu 24.04 (server) |
| llama.cpp Version | 8419 (commit: 509a31d00) |
| Model File | unsloth_Qwen3.5-35B-A3B-GGUF_Qwen3.5-35B-A3B-UD-Q4_K_XL.gguf |
| GPU | NVIDIA GeForce RTX 3090 (24GB) |

Recommended Solution

Modify OpenClaw code to support multiple field name formats:

// Before
input: response.usage?.input_tokens ?? 0,
output: response.usage?.output_tokens ?? 0,

// After: accept all three formats.
// Ollama reports its counters at the top level of the response, not under `usage`.
input: response.usage?.prompt_tokens ??
       response.usage?.input_tokens ??
       response.prompt_eval_count ?? 0,

output: response.usage?.completion_tokens ??
        response.usage?.output_tokens ??
        response.eval_count ?? 0,

This solution:

  1. ✅ Backward compatible with all existing configurations
  2. ✅ Supports llama.cpp server, Ollama, vLLM, and other frameworks
  3. ✅ Zero configuration, works out of the box
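
For reference, the fallback chain can be exercised in isolation. The sketch below is a standalone illustration, not a patch against the bundled file. `ProviderResponse`, `TokenUsage`, and `normalizeUsage` are hypothetical names, and the response shape is assumed to cover only the three formats shown in this report, with Ollama's counters read from the top level as in the Ollama example above.

```typescript
// Hypothetical types: a loose shape covering the three formats in this report.
interface ProviderResponse {
  usage?: {
    input_tokens?: number;
    output_tokens?: number;
    prompt_tokens?: number;
    completion_tokens?: number;
    total_tokens?: number;
  };
  prompt_eval_count?: number; // Ollama, top level
  eval_count?: number; // Ollama, top level
}

interface TokenUsage {
  input: number;
  output: number;
}

// Try each known field name in turn and fall back to 0 only when none is present.
function normalizeUsage(response: ProviderResponse): TokenUsage {
  const usage = response.usage;
  return {
    input: usage?.input_tokens ?? usage?.prompt_tokens ?? response.prompt_eval_count ?? 0,
    output: usage?.output_tokens ?? usage?.completion_tokens ?? response.eval_count ?? 0,
  };
}

// llama.cpp server / OpenAI-style payload from this report:
console.log(normalizeUsage({ usage: { prompt_tokens: 11, completion_tokens: 1, total_tokens: 12 } }));
// Ollama payload from this report:
console.log(normalizeUsage({ prompt_eval_count: 26, eval_count: 259 }));
```

Using `??` rather than `||` means a legitimate count of 0 in one format is kept instead of falling through to the next field.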

Expected Fix Priority

Recommended: HIGH

This issue has wide-ranging impact and may cause severe user experience problems.


Server Information

192.168.3.77 Server Details

Basic Information:

  • Hostname: vllm-server
  • IP Address: 192.168.3.77
  • OS: Ubuntu 24.04 (Linux 6.8.0-106-generic)
  • Architecture: x86_64
  • Uptime: 3 days 13 hours

Hardware:

  • GPU: NVIDIA GeForce RTX 3090 (24GB VRAM)
  • System Memory: 62GB
  • Disk: 836GB (138GB used, 656GB available)

Software:

  • llama.cpp Version: 8419 (commit: 509a31d00)
  • GCC Version: 13.3.0
  • NVIDIA Driver: 580.126.09
  • CUDA Support: Enabled

llama-server Configuration:

/home/XXX/llama.cpp/build/bin/llama-server \
  -m /home/XXX/.cache/llama.cpp/unsloth_Qwen3.5-35B-A3B-GGUF_Qwen3.5-35B-A3B-UD-Q4_K_XL.gguf \
  --mmproj /home/XXX/.cache/llama.cpp/mmproj-F16.gguf \
  --cache-type-k q8_0 \
  --cache-type-v q8_0 \
  -ngl 99 \
  -np 1 \
  -fa on \
  --ctx-size 96000 \
  --image-min-tokens 1024 \
  --image-max-tokens 4096 \
  --host 0.0.0.0 \
  --port 8080

Key Configuration Notes:

  • Context window: 96,000 tokens (configured)
  • Model size: 21 GB (Q4_K_XL quantized)
  • GPU layers: 99 (all layers on GPU)
  • Flash Attention: Enabled

Steps to reproduce

Send /status in Telegram

Expected behavior

  • 📚 Context Display: 10k/100k (10%)

Actual behavior

  • 📚 Context Display: 0/100k (0%)

OpenClaw version

2026.3.8 through 2026.3.23

Operating system

macOS 12.7; llama.cpp 8419 (commit: 509a31d00)

Install method

npm

Model

unsloth_Qwen3.5-35B-A3B-GGUF_Qwen3.5-35B-A3B-UD-Q4_K_XL.gguf

Provider / routing chain

openclaw -> llama-server
