feat(providers): parse llama.cpp timings object for cache + performance metrics #614

@Aaronontheweb

Summary

The OpenAI-compatible provider (OpenAiCompatibleChatClient) only parses the top-level usage object from llama.cpp responses. llama.cpp returns a sibling timings object with rich performance and cache data that we completely ignore.

What llama.cpp actually returns

{
  "usage": {
    "prompt_tokens": 11,
    "completion_tokens": 5,
    "total_tokens": 16
  },
  "timings": {
    "cache_n": 0,
    "prompt_n": 11,
    "prompt_ms": 139.655,
    "prompt_per_token_ms": 12.696,
    "prompt_per_second": 78.766,
    "predicted_n": 5,
    "predicted_ms": 160.048,
    "predicted_per_token_ms": 32.010,
    "predicted_per_second": 31.241
  }
}

Key fields we're missing

| Field | What it tells us |
| --- | --- |
| cache_n | Cached prompt tokens (the KV cache hit count). This is the metric for validating session-sticky routing (#610) |
| prompt_n | Non-cached prompt tokens processed |
| prompt_ms | Prompt processing time (effective TTFT at the server) |
| prompt_per_second | Prefill throughput (tok/s) |
| predicted_n | Output tokens generated |
| predicted_ms | Generation time |
| predicted_per_second | Output generation throughput (tok/s). A clean metric with no reasoning-token confusion |
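To make the field semantics concrete, here is a minimal sketch that derives a cache hit ratio and output throughput from the sample response above. It is written in Python purely for illustration and is not part of the provider code. The helper names `cache_hit_ratio` and `tokens_per_second` are hypothetical, and the ratio assumes that `cache_n + prompt_n` equals the full prompt length, which matches the sample but should be checked against the llama.cpp version in use.

```python
import json
from typing import Optional

# Trimmed copy of the sample response shown earlier in this issue.
SAMPLE_RESPONSE = """
{
  "usage": {"prompt_tokens": 11, "completion_tokens": 5, "total_tokens": 16},
  "timings": {
    "cache_n": 0, "prompt_n": 11, "prompt_ms": 139.655,
    "predicted_n": 5, "predicted_ms": 160.048
  }
}
"""


def cache_hit_ratio(timings: dict) -> Optional[float]:
    """Fraction of prompt tokens served from the KV cache, or None if unknown."""
    cached = timings.get("cache_n")
    processed = timings.get("prompt_n")
    if cached is None or processed is None:
        return None
    # Assumption: cached + non-cached tokens add up to the whole prompt.
    total = cached + processed
    return cached / total if total > 0 else None


def tokens_per_second(count: int, elapsed_ms: float) -> Optional[float]:
    """Throughput in tok/s from a token count and a duration in milliseconds."""
    return count * 1000.0 / elapsed_ms if elapsed_ms > 0 else None


timings = json.loads(SAMPLE_RESPONSE)["timings"]
print(cache_hit_ratio(timings))
print(tokens_per_second(timings["predicted_n"], timings["predicted_ms"]))
```

On a second turn with sticky routing working, the ratio would be expected to rise well above zero, which is what makes `cache_n` useful for validating #610.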

What changes

1. Parse timings in OpenAiCompatibleChatClient.ParseUsage()

Extend ParseUsage() to read the timings object when present:

  • Map cache_n → UsageDetails.CachedInputTokenCount
  • Store timing fields in UsageDetails.AdditionalCounts (or a new extension property) so they flow through the existing pipeline
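The mapping described in this step could look roughly like the sketch below. It is a Python illustration of the logic only, not the C# implementation: the `UsageDetails` dataclass here is a hypothetical stand-in that mirrors the property names mentioned in this issue, and `parse_usage` and `TIMING_FIELDS` are assumed names.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional

# Timing fields copied from the sample llama.cpp response in this issue.
TIMING_FIELDS = (
    "prompt_n", "prompt_ms", "prompt_per_token_ms", "prompt_per_second",
    "predicted_n", "predicted_ms", "predicted_per_token_ms",
    "predicted_per_second",
)


@dataclass
class UsageDetails:
    """Hypothetical stand-in for the real UsageDetails type."""
    input_token_count: Optional[int] = None
    output_token_count: Optional[int] = None
    total_token_count: Optional[int] = None
    cached_input_token_count: Optional[int] = None
    additional_counts: Dict[str, float] = field(default_factory=dict)


def _is_number(value: Any) -> bool:
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def parse_usage(response: Dict[str, Any]) -> UsageDetails:
    usage = response.get("usage")
    if not isinstance(usage, dict):
        usage = {}
    details = UsageDetails(
        input_token_count=usage.get("prompt_tokens"),
        output_token_count=usage.get("completion_tokens"),
        total_token_count=usage.get("total_tokens"),
    )

    # "timings" is a sibling of "usage"; skip it entirely when absent.
    timings = response.get("timings")
    if isinstance(timings, dict):
        cache_n = timings.get("cache_n")
        if isinstance(cache_n, int) and not isinstance(cache_n, bool):
            details.cached_input_token_count = cache_n
        for name in TIMING_FIELDS:
            if _is_number(timings.get(name)):
                details.additional_counts[name] = timings[name]
    return details
```

One thing to check when porting this idea: several timing values are fractional (for example `prompt_ms`), so the chosen storage needs to hold non-integer numbers or round them deliberately.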

2. Surface in UsageOutput

The UsageOutput record already has CachedInputTokens — it just never gets populated for the OpenAI-compatible provider. Populating CachedInputTokenCount in UsageDetails will automatically flow through LlmSessionActor.EmitUsageOutput().

For the timing fields (prompt_ms, predicted_per_second, etc.), decide whether to:

  • Add dedicated properties to UsageOutput (clean, typed)
  • Use AdditionalCounts on UsageDetails (extensible, no protocol change)

3. Add timing metrics to headless --json envelope

The chat -p --json output (#611) currently includes usage.inputTokens/outputTokens/totalTokens. Extend it to include:

  • cachedInputTokens — cache hit count
  • promptMs — server-side prefill time
  • predictedPerSecond — output tok/s
  • ttftMs — client-side time to first text delta (measured in HeadlessChannel)
  • totalMs — client-side prompt-to-turn-completed wall time
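As a rough illustration of how the server-side and client-side metrics listed above could be combined, the sketch below builds the `usage` portion of the envelope. It is Python pseudocode for the shape of the data, not the HeadlessChannel implementation: `TurnTimer` and `build_usage_envelope` are hypothetical names, and emitting `null` for unavailable fields (rather than omitting them) is an assumption.

```python
import json
import time
from typing import Callable, Optional


class TurnTimer:
    """Hypothetical client-side timer: starts when the prompt is sent."""

    def __init__(self, clock: Callable[[], float] = time.monotonic):
        self._clock = clock
        self._start = clock()
        self._first_delta: Optional[float] = None
        self._end: Optional[float] = None

    def on_text_delta(self) -> None:
        # Only the first text delta counts toward TTFT.
        if self._first_delta is None:
            self._first_delta = self._clock()

    def on_turn_completed(self) -> None:
        self._end = self._clock()

    @property
    def ttft_ms(self) -> Optional[float]:
        if self._first_delta is None:
            return None
        return (self._first_delta - self._start) * 1000.0

    @property
    def total_ms(self) -> Optional[float]:
        if self._end is None:
            return None
        return (self._end - self._start) * 1000.0


def build_usage_envelope(usage: dict, cached_input_tokens: Optional[int],
                         timings: Optional[dict], timer: TurnTimer) -> dict:
    timings = timings or {}
    return {
        "inputTokens": usage.get("prompt_tokens"),
        "outputTokens": usage.get("completion_tokens"),
        "totalTokens": usage.get("total_tokens"),
        "cachedInputTokens": cached_input_tokens,
        "promptMs": timings.get("prompt_ms"),                    # server-side
        "predictedPerSecond": timings.get("predicted_per_second"),
        "ttftMs": timer.ttft_ms,                                 # client-side
        "totalMs": timer.total_ms,
    }


timer = TurnTimer()
timer.on_text_delta()
timer.on_turn_completed()
envelope = build_usage_envelope(
    {"prompt_tokens": 11, "completion_tokens": 5, "total_tokens": 16},
    cached_input_tokens=0,
    timings={"prompt_ms": 139.655, "predicted_per_second": 31.241},
    timer=timer,
)
print(json.dumps({"usage": envelope}, indent=2))
```

Keeping `promptMs` (server) and `ttftMs` (client) side by side makes it possible to see how much of the time to first token is network and client overhead rather than prefill.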

4. Graceful degradation

The timings object is llama.cpp-specific — other OpenAI-compatible servers (vLLM, TGI, Ollama) may not include it. Parsing must be optional: if timings is absent, all derived fields stay null.
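One way to keep the parsing optional is to funnel every read through a single defensive accessor that returns null for anything unexpected. The sketch below shows the idea in Python; `read_timing` is a hypothetical helper, and the example payloads are invented for illustration rather than captured from any particular server.

```python
from typing import Any, Optional


def read_timing(response: Any, name: str) -> Optional[float]:
    """Return timings[name] as a float, or None if it is missing or malformed."""
    if not isinstance(response, dict):
        return None
    timings = response.get("timings")
    if not isinstance(timings, dict):
        return None  # absent, null, or an unexpected type
    value = timings.get(name)
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        return None
    return float(value)


# Invented payloads covering the cases the parser should tolerate.
examples = {
    "with timings": {"usage": {}, "timings": {"prompt_ms": 139.655}},
    "no timings": {"usage": {}},
    "null timings": {"usage": {}, "timings": None},
    "malformed field": {"usage": {}, "timings": {"prompt_ms": "fast"}},
}

for label, payload in examples.items():
    print(label, read_timing(payload, "prompt_ms"))
```

With this shape, a server that omits `timings` takes the same code path as before the change, so existing behavior stays intact.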

Motivation

Acceptance criteria

  • ParseUsage() reads timings.cache_n → CachedInputTokenCount when present
  • ParseUsage() reads timing fields into AdditionalCounts (or dedicated properties)
  • chat -p --json output includes cachedInputTokens and timing metrics when available
  • Existing behavior unchanged when timings is absent (graceful degradation)
  • Unit tests for ParseUsage with and without timings object
  • quick-multi-turn-test.sh updated to assert cachedInputTokens on turn 2 (when using llama.cpp)

Labels: sessions (LLM session actor, turn lifecycle, pipelines)