feat(providers): parse llama.cpp timings for cache + performance metrics by Aaronontheweb · Pull Request #615 · netclaw-dev/netclaw

Aaronontheweb · 2026-04-12T19:52:20Z

Summary

Resolves #614. Netclaw's OpenAI-compatible provider now parses llama.cpp's timings object from chat completion responses, surfacing KV cache hit counts and server-side throughput/latency data that were previously ignored.

What's new

Provider-side parsing (OpenAiCompatibleChatClient.ParseUsage)

Reads timings.cache_n → UsageDetails.CachedInputTokenCount
Stores prompt_ms, prompt_per_second, predicted_ms, predicted_per_second in UsageDetails.AdditionalCounts using integer-encoded keys (microseconds / ×100 scale factors, since M.E.AI types AdditionalCounts as AdditionalPropertiesDictionary<long>)
Graceful degradation: all fields stay null when the provider doesn't emit a timings object (OpenAI, Anthropic, vLLM, TGI)

Protocol surface (UsageOutput, SessionOutputDto)

New typed fields: PromptMs, PredictedPerSecond (decoded from AdditionalCounts), plus CachedInputTokens/ReasoningTokens which were already in UsageOutput but missing from the SignalR wire DTO
LlmSessionActor.EmitUsageOutput decodes the scale factors and populates UsageOutput
SessionOutputDtoMapper maps all new fields in both directions

CLI metrics (HeadlessChannel)

JSON envelope (chat -p --json) now includes usage.cachedInputTokens, usage.promptMs, usage.predictedPerSecond, ttftMs (client-side time-to-first-delta), and totalMs (client-side prompt→turn-completed wall time)
Plain-text [usage] line extended with cached=, prompt_ms=, tok_s= so the eval runner can parse metrics without requiring --json mode

Eval harness (evals/run-evals.sh)

New eval_metrics SQLite table captures per-prompt performance data
store_metrics() parses the [usage] line after every prompt
print_metrics_summary() prints per-category and overall performance stats at the end of the run

Verified end-to-end

Docker-isolated 2-turn test against live llama.cpp server (Qwen3.5-27B-UD-Q4_K_XL, dual R9700):

Metric	Turn 1 (cold)	Turn 2 (warm)	Delta
`cachedInputTokens`	4,855	4,855	Same system prompt cached
`promptMs` (server prefill)	743.6ms	331.4ms	55% faster
`ttftMs` (client-side)	2,821.6ms	1,247.9ms	56% faster
`totalMs`	3,360.5ms	1,555.1ms	54% faster
`predictedPerSecond`	25.3 tok/s	25.23 tok/s	Stable

One-iteration eval run on dev branch printed the full performance summary table across 23 prompts with per-category breakdowns.

Code review fixes (applied after initial pass)

Extracted magic string keys (prompt_us, predicted_tok_per_sec_x100, etc.) as internal const on OpenAiCompatibleChatClient with a cross-reference comment on the consumer side in LlmSessionActor
Rewrote TryGetLong/TryGetDouble with idiomatic JsonElement.TryGetInt64/TryGetDouble instead of the (value = ...) is var _ trick
store_metrics uses plain INSERT instead of INSERT OR REPLACE so PK collisions fail loudly
Removed a Fact(Skip = ...) live-server integration test — the same coverage exists hermetically via StreamingResponse_WithTimings_SurfacesCachedTokens

Test plan

dotnet build — all projects compile, 0 warnings
dotnet slopwatch analyze — 0 issues
dotnet test src/Netclaw.Daemon.Tests (ParseUsage + streaming) — 9/9 pass
dotnet test src/Netclaw.Cli.Tests — 413/413 pass
End-to-end Docker test: 2-turn headless run shows all timing fields populated in JSON output
evals/run-evals.sh with 1 iteration — performance metrics table prints correctly per-category and overall

Out of scope (follow-ups)

Multi-turn eval cases — PR feat(cli): add --resume and -p flags to netclaw chat for scripted multi-turn sessions #613 added chat -p --resume but multi-turn behavioral/performance cases aren't wired into run_all() yet. Needed to validate KV cache growth across turns, especially for tool-call flows where sub-turn serialization could silently break cache hits.
Caddy session affinity — round-robin between GPU0/GPU1 currently defeats the prompt cache 50% of the time for same-session follow-ups. The metrics captured here are what's needed to measure the delta when affinity is added.

…ics (#614) The OpenAI-compatible client now reads the llama.cpp `timings` object from chat completion responses, surfacing cache hit counts and throughput data that were previously ignored. - ParseUsage reads `timings.cache_n` → CachedInputTokenCount - Throughput fields (prompt_ms, predicted_per_second, etc.) stored in UsageDetails.AdditionalCounts and mapped to typed properties on UsageOutput (PromptMs, PredictedPerSecond) - HeadlessChannel JSON envelope adds client-side ttftMs and totalMs (Stopwatch-based), plus server-side promptMs/predictedPerSecond - Graceful degradation: all timing fields are null when the provider does not include a timings object (vLLM, TGI, Ollama) - 5 new ParseUsage unit tests covering with/without timings, partial timings, and graceful absence

…g integration tests The initial commit parsed llama.cpp timings correctly but the fields were dropped by the SessionOutputDto/Mapper SignalR wire layer. Add CachedInputTokens, ReasoningTokens, PromptMs, PredictedPerSecond to both the DTO and mapper in both directions. Verified end-to-end in Docker: Turn 2 shows 55% faster prefill (promptMs 743→331ms) and 56% faster TTFT (2821→1247ms) due to KV cache warmth. Also adds streaming integration test that proves timings survive the full GetStreamingResponseAsync → ToChatResponse pipeline.

Extend the plain text [usage] line with cached token count and server-side timing fields so the eval runner can parse them without requiring --json mode. - HeadlessChannel: [usage] line now includes cached=, prompt_ms=, tok_s= - run-evals.sh: add eval_metrics SQLite table, parse_metrics() extracts timing from [usage] output after every prompt, print_metrics_summary() shows per-category and overall performance stats at the end of the run

Review cleanup after /simplify: - Extract llama.cpp timing AdditionalCounts keys as internal const fields on OpenAiCompatibleChatClient; consumer side (LlmSessionActor) gets a cross-reference comment so the two sides stay in sync. - Rewrite TryGetLong/TryGetDouble using idiomatic TryGetInt64/TryGetDouble pattern instead of the `(value = ...) is var _` trick. Also defends against unexpected float values for cache_n. - Remove skipped LiveServer_StreamingResponse_IncludesTimings — covered by the hermetic StreamingResponse_WithTimings_SurfacesCachedTokens test. - store_metrics uses plain INSERT so primary-key collisions fail loudly.

Aaronontheweb added 4 commits April 12, 2026 18:37

Aaronontheweb enabled auto-merge (squash) April 12, 2026 19:55

Aaronontheweb merged commit 69c3266 into dev Apr 12, 2026
4 checks passed

Aaronontheweb deleted the feat/parse-llama-timings branch April 12, 2026 19:56

This was referenced Apr 12, 2026

Extend OpenAiCompatibleChatClient for vLLM backend compatibility #619

Closed

feat(providers): add Anthropic cache_control ephemeral markers for explicit prompt caching #621

Open

release: prepare 0.12.0 release notes and version bump #624

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(providers): parse llama.cpp timings for cache + performance metrics#615

feat(providers): parse llama.cpp timings for cache + performance metrics#615
Aaronontheweb merged 4 commits into
devfrom
feat/parse-llama-timings

Aaronontheweb commented Apr 12, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

Aaronontheweb commented Apr 12, 2026

Summary

What's new

Verified end-to-end

Code review fixes (applied after initial pass)

Test plan

Out of scope (follow-ups)

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant