Skip to content

feat(providers): parse llama.cpp timings for cache + performance metrics#615

Merged
Aaronontheweb merged 4 commits into
devfrom
feat/parse-llama-timings
Apr 12, 2026
Merged

feat(providers): parse llama.cpp timings for cache + performance metrics#615
Aaronontheweb merged 4 commits into
devfrom
feat/parse-llama-timings

Conversation

@Aaronontheweb

Copy link
Copy Markdown
Collaborator

Summary

Resolves #614. Netclaw's OpenAI-compatible provider now parses llama.cpp's timings object from chat completion responses, surfacing KV cache hit counts and server-side throughput/latency data that were previously ignored.

What's new

Provider-side parsing (OpenAiCompatibleChatClient.ParseUsage)

  • Reads timings.cache_nUsageDetails.CachedInputTokenCount
  • Stores prompt_ms, prompt_per_second, predicted_ms, predicted_per_second in UsageDetails.AdditionalCounts using integer-encoded keys (microseconds / ×100 scale factors, since M.E.AI types AdditionalCounts as AdditionalPropertiesDictionary<long>)
  • Graceful degradation: all fields stay null when the provider doesn't emit a timings object (OpenAI, Anthropic, vLLM, TGI)

Protocol surface (UsageOutput, SessionOutputDto)

  • New typed fields: PromptMs, PredictedPerSecond (decoded from AdditionalCounts), plus CachedInputTokens/ReasoningTokens which were already in UsageOutput but missing from the SignalR wire DTO
  • LlmSessionActor.EmitUsageOutput decodes the scale factors and populates UsageOutput
  • SessionOutputDtoMapper maps all new fields in both directions

CLI metrics (HeadlessChannel)

  • JSON envelope (chat -p --json) now includes usage.cachedInputTokens, usage.promptMs, usage.predictedPerSecond, ttftMs (client-side time-to-first-delta), and totalMs (client-side prompt→turn-completed wall time)
  • Plain-text [usage] line extended with cached=, prompt_ms=, tok_s= so the eval runner can parse metrics without requiring --json mode

Eval harness (evals/run-evals.sh)

  • New eval_metrics SQLite table captures per-prompt performance data
  • store_metrics() parses the [usage] line after every prompt
  • print_metrics_summary() prints per-category and overall performance stats at the end of the run

Verified end-to-end

Docker-isolated 2-turn test against live llama.cpp server (Qwen3.5-27B-UD-Q4_K_XL, dual R9700):

Metric Turn 1 (cold) Turn 2 (warm) Delta
cachedInputTokens 4,855 4,855 Same system prompt cached
promptMs (server prefill) 743.6ms 331.4ms 55% faster
ttftMs (client-side) 2,821.6ms 1,247.9ms 56% faster
totalMs 3,360.5ms 1,555.1ms 54% faster
predictedPerSecond 25.3 tok/s 25.23 tok/s Stable

One-iteration eval run on dev branch printed the full performance summary table across 23 prompts with per-category breakdowns.

Code review fixes (applied after initial pass)

  • Extracted magic string keys (prompt_us, predicted_tok_per_sec_x100, etc.) as internal const on OpenAiCompatibleChatClient with a cross-reference comment on the consumer side in LlmSessionActor
  • Rewrote TryGetLong/TryGetDouble with idiomatic JsonElement.TryGetInt64/TryGetDouble instead of the (value = ...) is var _ trick
  • store_metrics uses plain INSERT instead of INSERT OR REPLACE so PK collisions fail loudly
  • Removed a Fact(Skip = ...) live-server integration test — the same coverage exists hermetically via StreamingResponse_WithTimings_SurfacesCachedTokens

Test plan

  • dotnet build — all projects compile, 0 warnings
  • dotnet slopwatch analyze — 0 issues
  • dotnet test src/Netclaw.Daemon.Tests (ParseUsage + streaming) — 9/9 pass
  • dotnet test src/Netclaw.Cli.Tests — 413/413 pass
  • End-to-end Docker test: 2-turn headless run shows all timing fields populated in JSON output
  • evals/run-evals.sh with 1 iteration — performance metrics table prints correctly per-category and overall

Out of scope (follow-ups)

  • Multi-turn eval cases — PR feat(cli): add --resume and -p flags to netclaw chat for scripted multi-turn sessions #613 added chat -p --resume but multi-turn behavioral/performance cases aren't wired into run_all() yet. Needed to validate KV cache growth across turns, especially for tool-call flows where sub-turn serialization could silently break cache hits.
  • Caddy session affinity — round-robin between GPU0/GPU1 currently defeats the prompt cache 50% of the time for same-session follow-ups. The metrics captured here are what's needed to measure the delta when affinity is added.

…ics (#614)

The OpenAI-compatible client now reads the llama.cpp `timings` object
from chat completion responses, surfacing cache hit counts and
throughput data that were previously ignored.

- ParseUsage reads `timings.cache_n` → CachedInputTokenCount
- Throughput fields (prompt_ms, predicted_per_second, etc.) stored in
  UsageDetails.AdditionalCounts and mapped to typed properties on
  UsageOutput (PromptMs, PredictedPerSecond)
- HeadlessChannel JSON envelope adds client-side ttftMs and totalMs
  (Stopwatch-based), plus server-side promptMs/predictedPerSecond
- Graceful degradation: all timing fields are null when the provider
  does not include a timings object (vLLM, TGI, Ollama)
- 5 new ParseUsage unit tests covering with/without timings,
  partial timings, and graceful absence
…g integration tests

The initial commit parsed llama.cpp timings correctly but the fields
were dropped by the SessionOutputDto/Mapper SignalR wire layer. Add
CachedInputTokens, ReasoningTokens, PromptMs, PredictedPerSecond to
both the DTO and mapper in both directions.

Verified end-to-end in Docker: Turn 2 shows 55% faster prefill
(promptMs 743→331ms) and 56% faster TTFT (2821→1247ms) due to KV
cache warmth.

Also adds streaming integration test that proves timings survive the
full GetStreamingResponseAsync → ToChatResponse pipeline.
Extend the plain text [usage] line with cached token count and
server-side timing fields so the eval runner can parse them without
requiring --json mode.

- HeadlessChannel: [usage] line now includes cached=, prompt_ms=, tok_s=
- run-evals.sh: add eval_metrics SQLite table, parse_metrics() extracts
  timing from [usage] output after every prompt, print_metrics_summary()
  shows per-category and overall performance stats at the end of the run
Review cleanup after /simplify:
- Extract llama.cpp timing AdditionalCounts keys as internal const fields
  on OpenAiCompatibleChatClient; consumer side (LlmSessionActor) gets a
  cross-reference comment so the two sides stay in sync.
- Rewrite TryGetLong/TryGetDouble using idiomatic TryGetInt64/TryGetDouble
  pattern instead of the `(value = ...) is var _` trick. Also defends
  against unexpected float values for cache_n.
- Remove skipped LiveServer_StreamingResponse_IncludesTimings — covered
  by the hermetic StreamingResponse_WithTimings_SurfacesCachedTokens test.
- store_metrics uses plain INSERT so primary-key collisions fail loudly.
@Aaronontheweb Aaronontheweb enabled auto-merge (squash) April 12, 2026 19:55
@Aaronontheweb Aaronontheweb merged commit 69c3266 into dev Apr 12, 2026
4 checks passed
@Aaronontheweb Aaronontheweb deleted the feat/parse-llama-timings branch April 12, 2026 19:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

feat(providers): parse llama.cpp timings object for cache + performance metrics

1 participant