[Bug]: llama-cpp and Ollama providers return incorrect context usage due to field name mismatch

### Bug type

Behavior bug (incorrect output/state without crash)

### Summary

### Problem Description

OpenClaw fails to accurately track token usage due to mismatched field names between expected and actual API responses, causing context usage to display as `0/80k (0%)` even when the model is actively consuming significant tokens.

**Environment:**
- 🦞 OpenClaw: 2026.3.23-1
- 🧠 Model: llama-cpp/qwen35b-local
- 📚 Context Display: 0/80k (0%)
- 🧵 Session: agent:main:main
- 🪢 Runtime: direct

### Affected Frameworks

| Framework | Status | Notes |
|-----------|--------|-------|
| ❌ **llama.cpp server** | AFFECTED | Most common local deployment solution |
| ❌ **Ollama** | AFFECTED | Popular model management service |
| ✅ **vLLM** | NOT AFFECTED | Compatible (OpenAI format) |
| ✅ **HuggingFace TGI** | NOT AFFECTED | Compatible (OpenAI format) |
| ✅ **OpenAI API** | NOT AFFECTED | Compatible (OpenAI format) |

### Root Cause

OpenClaw expects these field names at line ~181675:

```javascript
input: response.usage?.input_tokens ?? 0,
output: response.usage?.output_tokens ?? 0,
```

However, different frameworks return different field names:

#### llama.cpp server (OpenAI-compatible format)
```json
{
  "usage": {
    "prompt_tokens": 11,
    "completion_tokens": 1,
    "total_tokens": 12
  }
}
```

#### Ollama (custom format)
```json
{
  "prompt_eval_count": 26,
  "eval_count": 259
}
```

#### vLLM / TGI / OpenAI (OpenAI standard format)
```json
{
  "usage": {
    "prompt_tokens": 100,
    "completion_tokens": 50,
    "total_tokens": 150
  }
}
```

### Real-World Case

**User Configuration:**
- OpenClaw Display: `0/80k (0%)`
- Remote llama-server (192.168.3.77:8080) Actual Usage: `43250/80000 (54%)`

**Cause:** llama.cpp server returns `prompt_tokens`, but OpenClaw expects `input_tokens`.
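The mismatch can be reproduced in isolation. The sketch below applies the two expressions quoted under Root Cause to the llama.cpp payload from this report. `readUsage` is a hypothetical name used only for illustration and is not a function in OpenClaw.

```javascript
// Hypothetical stand-in for the extraction logic quoted under Root Cause.
function readUsage(response) {
  return {
    input: response.usage?.input_tokens ?? 0,
    output: response.usage?.output_tokens ?? 0,
  };
}

// Payload shape returned by llama.cpp server (copied from this report).
const llamaCppResponse = {
  usage: { prompt_tokens: 11, completion_tokens: 1, total_tokens: 12 },
};

// Both fields fall back to 0 because the field names differ.
const usage = readUsage(llamaCppResponse);
console.log(usage);
```

With both counters at 0, any display derived from them would show no usage regardless of how many tokens the server actually processed.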

---

## Chain Reactions from Failed Context Statistics

### 1. Context Window Overflow Risk

**Due to inability to accurately track token usage:**

**Chain Reactions:**
1. User cannot see real-time token usage rate
2. Cannot determine if conversation is approaching the 80k context limit
3. May lead to:
   - **Model truncation:** Ultra-long conversations are forcibly truncated
   - **Quality degradation:** Context overflow causes model to forget early conversation
   - **Session crash:** API returns errors after exceeding limits

**Actual Impact:**
- In long conversation scenarios, users may encounter context overflow without warning
- Important conversation content may be lost

---

### 2. Conversation Management Failure

OpenClaw's conversation management mechanisms rely on accurate token counting:

**Chain Reactions:**

1. **Auto-compression mechanism fails:**
   - OpenClaw may decide to compress historical messages based on token usage rate
   - If count is 0, compression never triggers
   - Historical messages accumulate without bound, which can eventually exhaust memory

2. **Session reset strategy fails:**
   - Under some configurations, sessions automatically reset when token usage reaches a threshold
   - Due to count being 0, reset never triggers
   - Leads to uncontrolled session length

3. **Resource waste:**
   - Cannot accurately evaluate token cost per session
   - May lead to unnecessary long conversations

---

### 3. Cost Monitoring Failure

Even with free local models, token statistics are important performance metrics:

**Chain Reactions:**

1. **Performance analysis difficulty:**
   - Cannot analyze token consumption across different conversations
   - Cannot identify abnormally high token usage patterns
   - Difficult to optimize conversation strategies

2. **Multi-model comparison fails:**
   - If multiple model backends exist, cannot fairly compare token efficiency
   - Cannot make model switching decisions based on token usage

3. **API quota monitoring fails** (if using paid APIs):
   - Cannot accurately track API quota usage
   - May unexpectedly exceed quota causing service interruption

---

### 4. LCM (Lossless Context Management) Function Abnormalities

OpenClaw's LCM system relies on token statistics to manage conversation history:

**Chain Reactions:**

1. **Historical message compression strategy fails:**
   - LCM decides whether to compress history based on token usage rate
   - When count is 0, compression never triggers
   - Leads to uncontrolled memory usage

2. **Context optimization fails:**
   - LCM cannot intelligently retain important conversations
   - May lead to important information being discarded too early

3. **Search and retrieval functionality affected:**
   - LCM's search function may rely on token statistics
   - If so, search results may be inaccurate

---

### 5. User Experience Degradation

**Chain Reactions:**

1. **User confusion:**
   - See `0/80k (0%)` display
   - User cannot determine conversation status
   - May mistakenly think system is malfunctioning

2. **Trust reduction:**
   - Key metrics display incorrectly
   - User may question the reliability of the entire system

3. **Cannot optimize conversation strategy:**
   - User cannot adjust conversation methods based on token usage
   - Cannot learn how to efficiently use the context window

---

### 6. Diagnosis and Debugging Difficulty

**Chain Reactions:**

1. **Problem troubleshooting difficulty:**
   - If conversation anomalies occur, cannot locate issues through token statistics
   - Increases troubleshooting time costs

2. **Performance optimization blocked:**
   - Cannot perform performance optimization based on token statistics
   - Difficult to identify performance bottlenecks

3. **Automated testing fails:**
   - Automated tests may rely on token statistics as success metrics
   - Leads to inaccurate test results

---

### 7. Resource Allocation Issues in Multi-User/Multi-Session Scenarios

If multiple users or concurrent sessions exist:

**Chain Reactions:**

1. **Unequal resource allocation:**
   - Cannot accurately track token usage per session
   - Leads to some sessions consuming excessive resources

2. **Service quality degradation:**
   - Some sessions may respond slowly due to resource exhaustion
   - Affects overall user experience

3. **Quota management difficult to implement:**
   - Cannot fairly allocate token quotas
   - May lead to certain users monopolizing resources

---

### Problem Severity Assessment

| Issue | Severity | Affected Scope | Probability |
|-------|----------|----------------|-------------|
| Context window overflow | 🔴 High | All long conversations | High |
| Conversation management failure | 🟡 Medium | LCM users | Medium |
| Cost monitoring failure | 🟡 Medium | All users | High |
| LCM function abnormality | 🔴 High | LCM users | High |
| User experience degradation | 🟢 Low | All users | High |
| Diagnosis difficulty | 🟡 Medium | Developers/Advanced users | Medium |
| Resource allocation issues | 🟡 Medium | Multi-user scenarios | Medium |

**Overall Severity: 🔴 High**

---

### Case Study 1: Long Conversation Leading to Content Loss

**User Scenario:**
- Conducting a 50+ turn technical discussion
- OpenClaw Display: `0/80k (0%)`
- Actual llama-server Usage: `65000/80000 (81%)`

**Result:**
- User thought there was still ample context available
- Continued conversation until model started truncating early content
- Key information from technical discussion was forgotten
- Conversation quality deteriorated rapidly

### Case Study 2: LCM Compression Mechanism Failure

**User Scenario:**
- Configured automatic compression of historical messages
- Expected compression to trigger when token usage reached 70%

**Result:**
- Due to count being 0, compression never triggered
- Historical messages accumulated without bound
- Eventually led to excessive memory usage and slow system response
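A threshold check of the kind described in this case study might look like the sketch below. `shouldCompress` and its default threshold of 0.7 are assumptions for illustration; OpenClaw's actual compression logic may be structured differently.

```javascript
// Hypothetical threshold check; not taken from OpenClaw's source.
// Returns true once used tokens reach the given fraction of the context limit.
function shouldCompress(usedTokens, contextLimit, threshold = 0.7) {
  if (contextLimit <= 0) return false; // guard against division by zero
  return usedTokens / contextLimit >= threshold;
}

// Count reported as 0 because of the field-name mismatch: never triggers.
console.log(shouldCompress(0, 80000)); // false

// Actual server-side usage from Case Study 1 (65000/80000): would trigger.
console.log(shouldCompress(65000, 80000)); // true
```

Any trigger that compares a usage ratio against a threshold behaves the same way: a count stuck at 0 never crosses it.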

---

## Code Location

- **File:** `~/.npm-global/lib/node_modules/openclaw/dist/pi-embedded-CwMQzdKD.js`
- **Line:** ~181675 (exact line may vary by version)

---

## Test Steps

1. Configure llama.cpp server as model backend
2. Send a test message
3. Check if context display updates

**Expected Result:**
- Display actual token usage rate
- Example: `12k/80k (15%)` instead of `0/80k (0%)`
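To confirm what the server itself reports, independent of OpenClaw, the raw `usage` object can be requested directly from llama-server's OpenAI-compatible `/v1/chat/completions` endpoint. The sketch below assumes Node 18+ (global `fetch`). The helper names and the `"local"` model placeholder are hypothetical.

```javascript
// Build an OpenAI-compatible chat request for llama-server.
// "local" is a placeholder model name chosen for this sketch.
function buildChatRequest(baseUrl) {
  return {
    url: `${baseUrl.replace(/\/+$/, "")}/v1/chat/completions`,
    init: {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({
        model: "local",
        messages: [{ role: "user", content: "ping" }],
        max_tokens: 1,
      }),
    },
  };
}

// Send the request and return the raw `usage` object reported by the server.
// Needs a reachable server, so it is not invoked automatically here.
async function fetchRawUsage(baseUrl) {
  const { url, init } = buildChatRequest(baseUrl);
  const res = await fetch(url, init);
  if (!res.ok) throw new Error(`HTTP ${res.status}`);
  const data = await res.json();
  return data.usage;
}

// Example:
// fetchRawUsage("http://192.168.3.77:8080").then(console.log);
```

If the returned object contains non-zero `prompt_tokens` while OpenClaw still shows `0/80k (0%)`, the server is reporting usage and the loss happens on the client side.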

---

## Environment Information

| Item | Value |
|------|-------|
| **OpenClaw Version** | 2026.3.23-1 |
| **Remote llama-server** | 192.168.3.77:8080 |
| **Model** | Qwen3.5-35B-A3B-GGUF |
| **Operating System** | macOS (user) / Ubuntu 24.04 (server) |
| **llama.cpp Version** | 8419 (commit: 509a31d00) |
| **Model File** | unsloth_Qwen3.5-35B-A3B-GGUF_Qwen3.5-35B-A3B-UD-Q4_K_XL.gguf |
| **GPU** | NVIDIA GeForce RTX 3090 (24GB) |

---

## Recommended Solution

**Modify OpenClaw code to support multiple field name formats:**

```javascript
// Before
input: response.usage?.input_tokens ?? 0,
output: response.usage?.output_tokens ?? 0,

// After: support all formats
input: response.usage?.prompt_tokens ??      // OpenAI-compatible (llama.cpp server, vLLM, TGI)
       response.usage?.input_tokens ??       // field name OpenClaw currently reads
       response.prompt_eval_count ?? 0,      // Ollama: top-level field, not under `usage`

output: response.usage?.completion_tokens ??
        response.usage?.output_tokens ??
        response.eval_count ?? 0,            // Ollama: top-level field, not under `usage`
```

This solution:
1. ✅ Backward compatible with all existing configurations
2. ✅ Supports llama.cpp server, Ollama, vLLM, and other frameworks
3. ✅ Zero configuration, works out of the box
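As a self-contained illustration, the fallback idea can be wrapped in a helper and checked against the sample payloads from this report. `normalizeUsage` is a hypothetical name, not an existing OpenClaw function. Note that in the Ollama sample shown under Root Cause the counters sit at the top level of the response rather than under `usage`, so the sketch reads them from there.

```javascript
// Hypothetical helper: normalize token usage across the response shapes
// described in this report. Not part of OpenClaw's actual code.
function normalizeUsage(response) {
  const usage = response?.usage ?? {};
  const input =
    usage.prompt_tokens ??         // OpenAI-compatible (llama.cpp server, vLLM, TGI)
    usage.input_tokens ??          // field name OpenClaw currently reads
    response?.prompt_eval_count ?? // Ollama: top-level field
    0;
  const output =
    usage.completion_tokens ??
    usage.output_tokens ??
    response?.eval_count ??
    0;
  return { input, output, total: input + output };
}

// Sample payloads copied from the Root Cause section.
const llamaCppSample = { usage: { prompt_tokens: 11, completion_tokens: 1, total_tokens: 12 } };
const ollamaSample = { prompt_eval_count: 26, eval_count: 259 };

console.log(normalizeUsage(llamaCppSample)); // { input: 11, output: 1, total: 12 }
console.log(normalizeUsage(ollamaSample));   // { input: 26, output: 259, total: 285 }
```

Using `??` rather than `||` keeps a legitimate count of `0` from being skipped in favor of a later fallback.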

---

## Expected Fix Priority

**Recommended: HIGH**

This issue has wide-ranging impact and may cause severe user experience problems.

---

## Server Information

### 192.168.3.77 Server Details

**Basic Information:**
- Hostname: vllm-server
- IP Address: 192.168.3.77
- OS: Ubuntu 24.04 (Linux 6.8.0-106-generic)
- Architecture: x86_64
- Uptime: 3 days 13 hours

**Hardware:**
- GPU: NVIDIA GeForce RTX 3090 (24GB VRAM)
- System Memory: 62GB
- Disk: 836GB (138GB used, 656GB available)

**Software:**
- llama.cpp Version: 8419 (commit: 509a31d00)
- GCC Version: 13.3.0
- NVIDIA Driver: 580.126.09
- CUDA Support: Enabled

**llama-server Configuration:**
```bash
/home/XXX/llama.cpp/build/bin/llama-server \
  -m /home/XXX/.cache/llama.cpp/unsloth_Qwen3.5-35B-A3B-GGUF_Qwen3.5-35B-A3B-UD-Q4_K_XL.gguf \
  --mmproj /home/XXX/.cache/llama.cpp/mmproj-F16.gguf \
  --cache-type-k q8_0 \
  --cache-type-v q8_0 \
  -ngl 99 \
  -np 1 \
  -fa on \
  --ctx-size 96000 \
  --image-min-tokens 1024 \
  --image-max-tokens 4096 \
  --host 0.0.0.0 \
  --port 8080
```

**Key Configuration Notes:**
- Context window: 96,000 tokens (configured)
- Model size: 21 GB (Q4_K_XL quantized)
- GPU layers: 99 (all layers on GPU)
- Flash Attention: Enabled

---

## References

- Ollama API Docs: https://github.com/ollama/ollama/blob/main/docs/api.md
- vLLM OpenAI Compatible API: https://docs.vllm.ai/en/latest/serving/openai_compatible_server.html
- HuggingFace TGI API: https://huggingface.co/docs/text-generation-inference/openai_api



### Steps to reproduce

Run `/status` in Telegram.

### Expected behavior

- 📚 Context Display: 10k/100k (10%)

### Actual behavior

- 📚 Context Display: 0/100k (0%)

### OpenClaw version

2026.3.8 through 2026.3.23

### Operating system

macOS 12.7; llama.cpp 8419 (commit: 509a31d00)

### Install method

npm

### Model

unsloth_Qwen3.5-35B-A3B-GGUF_Qwen3.5-35B-A3B-UD-Q4_K_XL.gguf

### Provider / routing chain

OpenClaw -> llama-server

### Additional provider/model setup details

llama.cpp server (OpenAI-compatible format)


### Logs, screenshots, and evidence

- OpenClaw display: `0/80k (0%)`
- Remote llama-server (192.168.3.77:8080) actual usage: `43250/80000 (54%)`

Cause: llama.cpp server returns `prompt_tokens`, but OpenClaw expects `input_tokens`.

### Additional information

_No response_

[Bug]: llama-cpp and Ollama providers return incorrect context usage due to field name mismatch #53448
