Summary
The OpenAI-compatible provider (OpenAiCompatibleChatClient) only parses the top-level usage object from llama.cpp responses. llama.cpp returns a sibling timings object with rich performance and cache data that we completely ignore.
What llama.cpp actually returns
{
  "usage": {
    "prompt_tokens": 11,
    "completion_tokens": 5,
    "total_tokens": 16
  },
  "timings": {
    "cache_n": 0,
    "prompt_n": 11,
    "prompt_ms": 139.655,
    "prompt_per_token_ms": 12.696,
    "prompt_per_second": 78.766,
    "predicted_n": 5,
    "predicted_ms": 160.048,
    "predicted_per_token_ms": 32.010,
    "predicted_per_second": 31.241
  }
}
Key fields we're missing
| Field | What it tells us |
| --- | --- |
| cache_n | Cached prompt tokens (the KV cache hit count). This is the metric for validating session-sticky routing (#610) |
| prompt_n | Non-cached prompt tokens processed |
| prompt_ms | Prompt processing time (effective TTFT at the server) |
| prompt_per_second | Prefill throughput (tok/s) |
| predicted_n | Output tokens generated |
| predicted_ms | Generation time |
| predicted_per_second | Output generation throughput (tok/s); a clean metric with no reasoning-token confusion |
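As a quick sanity check on how these fields relate, the sketch below recomputes the two throughput values from the raw counts and durations in the sample response, and derives a cache hit ratio. It is written in Python purely for illustration. The helper names are hypothetical, and the hit-ratio formula assumes that cache_n + prompt_n equals the full prompt length, which is implied by the table above but not confirmed against llama.cpp itself.

```python
# Values copied from the sample timings object above.
timings = {
    "cache_n": 0,
    "prompt_n": 11,
    "prompt_ms": 139.655,
    "prompt_per_second": 78.766,
    "predicted_n": 5,
    "predicted_ms": 160.048,
    "predicted_per_second": 31.241,
}


def tokens_per_second(token_count: int, elapsed_ms: float) -> float:
    """Throughput in tokens per second from a token count and a duration in ms."""
    return token_count / elapsed_ms * 1000.0


def cache_hit_ratio(cache_n: int, prompt_n: int) -> float:
    """Share of prompt tokens served from the KV cache (0.0 for an empty prompt).

    Assumption: cache_n + prompt_n is the total prompt length.
    """
    total = cache_n + prompt_n
    return cache_n / total if total else 0.0


prefill_tps = tokens_per_second(timings["prompt_n"], timings["prompt_ms"])
decode_tps = tokens_per_second(timings["predicted_n"], timings["predicted_ms"])
hit_ratio = cache_hit_ratio(timings["cache_n"], timings["prompt_n"])

print(f"prefill {prefill_tps:.3f} tok/s, decode {decode_tps:.3f} tok/s, cache hit {hit_ratio:.0%}")
```

For the sample response, the recomputed throughputs agree with the server-reported prompt_per_second and predicted_per_second to three decimal places, so the per-second fields can be treated as derived from the count and duration fields.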
What changes
1. Parse timings in OpenAiCompatibleChatClient.ParseUsage()
Extend ParseUsage() to read the timings object when present:
- Map cache_n → UsageDetails.CachedInputTokenCount
- Store timing fields in UsageDetails.AdditionalCounts (or a new extension property) so they flow through the existing pipeline
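The mapping described above can be sketched as follows. This is a language-neutral illustration in Python, not the project's actual ParseUsage(): the UsageDetails dataclass is a hypothetical stand-in for the real type, and the "timings." key prefix used in additional_counts is an assumption made for the example.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class UsageDetails:
    """Hypothetical stand-in for the real UsageDetails type."""
    input_token_count: Optional[int] = None
    output_token_count: Optional[int] = None
    total_token_count: Optional[int] = None
    cached_input_token_count: Optional[int] = None
    additional_counts: dict[str, float] = field(default_factory=dict)


# Timing fields copied into additional_counts; the key naming is illustrative.
TIMING_FIELDS = (
    "prompt_n", "prompt_ms", "prompt_per_second",
    "predicted_n", "predicted_ms", "predicted_per_second",
)


def parse_usage(response: dict[str, Any]) -> UsageDetails:
    """Read the usage object, then the optional sibling timings object."""
    usage = response.get("usage") or {}
    details = UsageDetails(
        input_token_count=usage.get("prompt_tokens"),
        output_token_count=usage.get("completion_tokens"),
        total_token_count=usage.get("total_tokens"),
    )
    timings = response.get("timings")
    if isinstance(timings, dict):
        # cache_n maps to the cached input token count.
        details.cached_input_token_count = timings.get("cache_n")
        for name in TIMING_FIELDS:
            if name in timings:
                details.additional_counts[f"timings.{name}"] = timings[name]
    return details
```

One design point to check before choosing AdditionalCounts: several timing values are fractional (prompt_ms is 139.655 in the sample). If the real AdditionalCounts only accepts integer values, those fields would need rounding or a different carrier, which is an argument for the "new extension property" alternative.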
2. Surface in UsageOutput
The UsageOutput record already has CachedInputTokens — it just never gets populated for the OpenAI-compatible provider. Populating CachedInputTokenCount in UsageDetails will automatically flow through LlmSessionActor.EmitUsageOutput().
For the timing fields (prompt_ms, predicted_per_second, etc.), decide whether to:
- Add dedicated properties to UsageOutput (clean, typed)
- Use AdditionalCounts on UsageDetails (extensible, no protocol change)
3. Add timing metrics to headless --json envelope
The chat -p --json output (#611) currently includes usage.inputTokens/outputTokens/totalTokens. Extend it to include:
- cachedInputTokens: cache hit count
- promptMs: server-side prefill time
- predictedPerSecond: output tok/s
- ttftMs: client-side time to first text delta (measured in HeadlessChannel)
- totalMs: client-side prompt-to-turn-completed wall time
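The two client-side fields can be measured while consuming the stream, as in the minimal Python sketch below. It is an illustration only: run_turn is a hypothetical function, the "text" key is invented for the example, and placing ttftMs and totalMs under usage next to the server-side fields is an assumption, since the envelope layout for the new fields is not specified here.

```python
import json
import time
from typing import Any, Iterable, Optional


def run_turn(deltas: Iterable[str], server_usage: dict[str, Any]) -> dict[str, Any]:
    """Consume streamed text deltas, timing the first non-empty delta and the whole turn."""
    started = time.monotonic()
    ttft_ms: Optional[float] = None
    parts: list[str] = []
    for delta in deltas:
        if ttft_ms is None and delta:
            # First text delta observed by the client.
            ttft_ms = (time.monotonic() - started) * 1000.0
        parts.append(delta)
    total_ms = (time.monotonic() - started) * 1000.0
    return {
        "text": "".join(parts),  # hypothetical key, for illustration only
        "usage": {**server_usage, "ttftMs": ttft_ms, "totalMs": total_ms},
    }


envelope = run_turn(
    iter(["Hel", "lo"]),
    {
        "inputTokens": 11,
        "outputTokens": 5,
        "totalTokens": 16,
        "cachedInputTokens": 0,
        "promptMs": 139.655,
        "predictedPerSecond": 31.241,
    },
)
print(json.dumps(envelope, indent=2))
```

Using a monotonic clock keeps both durations immune to wall-clock adjustments, and ttftMs stays null when the turn produces no text at all.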
4. Graceful degradation
The timings object is llama.cpp-specific — other OpenAI-compatible servers (vLLM, TGI, Ollama) may not include it. Parsing must be optional: if timings is absent, all derived fields stay null.
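A defensive reader for this case might look like the Python sketch below. It assumes that a missing timings object, a timings value that is not an object, and a non-numeric field should all degrade to null rather than raise. The helper names are hypothetical, and the output keys reuse the envelope names listed in section 3.

```python
from typing import Any, Optional


def read_number(obj: Any, key: str) -> Optional[float]:
    """Return obj[key] as a float, or None if obj is not a dict or the value is not numeric."""
    if not isinstance(obj, dict):
        return None
    value = obj.get(key)
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        return None
    return float(value)


def timing_fields(response: dict[str, Any]) -> dict[str, Optional[float]]:
    """Extract the derived fields; every field is None when timings is unusable."""
    timings = response.get("timings")
    cached = read_number(timings, "cache_n")
    return {
        "cachedInputTokens": int(cached) if cached is not None else None,
        "promptMs": read_number(timings, "prompt_ms"),
        "predictedPerSecond": read_number(timings, "predicted_per_second"),
    }
```

With this shape, a response from a server that omits timings yields the same keys with null values, so downstream consumers never need to special-case the provider.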
Motivation
- Session-sticky routing (#610): cache_n on turn 2+ should be > 0 when the same GPU handles consecutive turns for a session.
- netclaw chat for scripted multi-turn sessions (#611): Multi-turn evals can now compare cache_n and prompt_ms across turns.
- predicted_per_second gives clean output throughput without reasoning token noise.
- prompt_ms + client-side TTFT gives end-to-end latency breakdown.
Acceptance criteria
- ParseUsage() reads timings.cache_n → CachedInputTokenCount when present
- ParseUsage() reads timing fields into AdditionalCounts (or dedicated properties)
- chat -p --json output includes cachedInputTokens and timing metrics when available
- Derived fields stay null when timings is absent (graceful degradation)
- Tests cover ParseUsage with and without the timings object
- quick-multi-turn-test.sh updated to assert cachedInputTokens on turn 2 (when using llama.cpp)
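The turn-2 cache assertion could take roughly the following form. This Python sketch is not the contents of quick-multi-turn-test.sh. It assumes each turn yields one JSON envelope with cachedInputTokens under usage, and it treats a null value as "server sent no timings" and skips the check; both are assumptions made for illustration.

```python
from typing import Any


def check_cache_hits(envelopes: list[dict[str, Any]]) -> list[str]:
    """Return failure messages for turns 2+; an empty list means passed or skipped."""
    failures: list[str] = []
    for turn, envelope in enumerate(envelopes, start=1):
        cached = envelope.get("usage", {}).get("cachedInputTokens")
        if turn == 1 or cached is None:
            # Turn 1 is not checked; None means no timings were reported.
            continue
        if cached <= 0:
            failures.append(f"turn {turn}: expected cachedInputTokens > 0, got {cached}")
    return failures
```

Returning messages instead of raising lets a script report every failing turn in one run and decide for itself whether a skipped check should count as a pass.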