feat(providers): parse llama.cpp timings for cache + performance metrics #615
Merged
…ics (#614)

The OpenAI-compatible client now reads the llama.cpp `timings` object from chat completion responses, surfacing cache hit counts and throughput data that were previously ignored.

- ParseUsage reads `timings.cache_n` → CachedInputTokenCount
- Throughput fields (prompt_ms, predicted_per_second, etc.) stored in UsageDetails.AdditionalCounts and mapped to typed properties on UsageOutput (PromptMs, PredictedPerSecond)
- HeadlessChannel JSON envelope adds client-side ttftMs and totalMs (Stopwatch-based), plus server-side promptMs/predictedPerSecond
- Graceful degradation: all timing fields are null when the provider does not include a timings object (vLLM, TGI, Ollama)
- 5 new ParseUsage unit tests covering with/without timings, partial timings, and graceful absence
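The PR summary notes that these throughput values are stored under integer-encoded keys (microseconds and ×100 scale factors) because `AdditionalCounts` only holds `long` values. A minimal shell sketch of that round trip follows, using awk for the arithmetic. The function names `encode_scaled` and `decode_scaled` and the sample values are hypothetical; they are not taken from the Netclaw code, which does this in C#.

```shell
#!/bin/sh
# Hypothetical helpers illustrating fixed-point encoding; not Netclaw code.

# encode_scaled VALUE FACTOR -> integer (VALUE * FACTOR, rounded to nearest)
encode_scaled() {
  awk -v v="$1" -v f="$2" 'BEGIN { printf "%.0f\n", v * f }'
}

# decode_scaled INTEGER FACTOR -> decimal with two fraction digits
decode_scaled() {
  awk -v n="$1" -v f="$2" 'BEGIN { printf "%.2f\n", n / f }'
}

# Milliseconds -> microseconds (x1000), as the key name prompt_us suggests.
prompt_us=$(encode_scaled 331.25 1000)
# Tokens per second x100, as the key name predicted_tok_per_sec_x100 suggests.
tok_x100=$(encode_scaled 24.57 100)

echo "prompt_us=$prompt_us"
echo "predicted_tok_per_sec_x100=$tok_x100"
echo "prompt_ms=$(decode_scaled "$prompt_us" 1000)"
echo "predicted_per_second=$(decode_scaled "$tok_x100" 100)"
```

The trade-off of this kind of encoding is bounded precision: anything finer than one microsecond, or one hundredth of a token per second, is rounded away before it reaches the consumer.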
…g integration tests

The initial commit parsed llama.cpp timings correctly, but the fields were dropped by the SessionOutputDto/Mapper SignalR wire layer. Add CachedInputTokens, ReasoningTokens, PromptMs, PredictedPerSecond to both the DTO and mapper in both directions.

Verified end-to-end in Docker: Turn 2 shows 55% faster prefill (promptMs 743→331 ms) and 56% faster TTFT (2821→1247 ms) due to KV cache warmth.

Also adds a streaming integration test that proves timings survive the full GetStreamingResponseAsync → ToChatResponse pipeline.
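The quoted speedups follow directly from the before/after timings in this commit message. As a quick check of the arithmetic, the sketch below computes the relative reduction, (before - after) / before. The helper name `pct_faster` is hypothetical.

```shell
#!/bin/sh
# pct_faster BEFORE AFTER -> percentage reduction, rounded to a whole number
pct_faster() {
  awk -v before="$1" -v after="$2" \
    'BEGIN { printf "%.0f\n", (before - after) / before * 100 }'
}

pct_faster 743 331     # server prefill (promptMs), prints 55
pct_faster 2821 1247   # client-side TTFT, prints 56
```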
Extend the plain text [usage] line with cached token count and server-side timing fields so the eval runner can parse them without requiring --json mode.

- HeadlessChannel: [usage] line now includes cached=, prompt_ms=, tok_s=
- run-evals.sh: add eval_metrics SQLite table; parse_metrics() extracts timing from [usage] output after every prompt; print_metrics_summary() shows per-category and overall performance stats at the end of the run
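To illustrate how a runner could pull these fields out of the plain text line without `--json`, here is a minimal sketch. Only the `cached=`, `prompt_ms=`, and `tok_s=` keys come from the commit message. The sample line (including `in=` and `out=`), the space-separated `key=value` layout, and the `usage_field` helper are assumptions, and the real `parse_metrics()` in `run-evals.sh` may work differently.

```shell
#!/bin/sh
# usage_field LINE KEY -> value of KEY=... in LINE, or empty if absent.
# Hypothetical helper; assumes space-separated key=value pairs.
usage_field() {
  printf '%s\n' "$1" | sed -n "s/.*[[:space:]]$2=\([^[:space:]]*\).*/\1/p"
}

# Assumed example line; the actual [usage] format may differ.
line='[usage] in=1200 out=85 cached=1024 prompt_ms=331.2 tok_s=24.5'

cached=$(usage_field "$line" cached)
prompt_ms=$(usage_field "$line" prompt_ms)
tok_s=$(usage_field "$line" tok_s)

echo "cached=$cached prompt_ms=$prompt_ms tok_s=$tok_s"

# Assumption: when timings are unavailable the field is simply absent.
# The helper then yields an empty string rather than failing.
missing=$(usage_field '[usage] in=1200 out=85' prompt_ms)
echo "missing='${missing}'"
```

Requiring whitespace before the key keeps a short key from matching the tail of a longer one, and an empty result gives the caller a simple way to treat missing timings as null.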
Review cleanup after /simplify:

- Extract llama.cpp timing AdditionalCounts keys as internal const fields on OpenAiCompatibleChatClient; consumer side (LlmSessionActor) gets a cross-reference comment so the two sides stay in sync.
- Rewrite TryGetLong/TryGetDouble using idiomatic TryGetInt64/TryGetDouble pattern instead of the `(value = ...) is var _` trick. Also defends against unexpected float values for cache_n.
- Remove skipped LiveServer_StreamingResponse_IncludesTimings, which is covered by the hermetic StreamingResponse_WithTimings_SurfacesCachedTokens test.
- store_metrics uses plain INSERT so primary-key collisions fail loudly.
Summary
Resolves #614. Netclaw's OpenAI-compatible provider now parses llama.cpp's `timings` object from chat completion responses, surfacing KV cache hit counts and server-side throughput/latency data that were previously ignored.

What's new

Provider-side parsing (`OpenAiCompatibleChatClient.ParseUsage`)
- `timings.cache_n` → `UsageDetails.CachedInputTokenCount`
- `prompt_ms`, `prompt_per_second`, `predicted_ms`, `predicted_per_second` stored in `UsageDetails.AdditionalCounts` using integer-encoded keys (microseconds / ×100 scale factors, since M.E.AI types `AdditionalCounts` as `AdditionalPropertiesDictionary<long>`)
- Timing fields stay null for providers that do not return a `timings` object (OpenAI, Anthropic, vLLM, TGI)

Protocol surface (`UsageOutput`, `SessionOutputDto`)
- New fields `PromptMs`, `PredictedPerSecond` (decoded from `AdditionalCounts`), plus `CachedInputTokens`/`ReasoningTokens`, which were already in `UsageOutput` but missing from the SignalR wire DTO
- `LlmSessionActor.EmitUsageOutput` decodes the scale factors and populates `UsageOutput`
- `SessionOutputDtoMapper` maps all new fields in both directions

CLI metrics (`HeadlessChannel`)
- JSON envelope (`chat -p --json`) now includes `usage.cachedInputTokens`, `usage.promptMs`, `usage.predictedPerSecond`, `ttftMs` (client-side time-to-first-delta), and `totalMs` (client-side prompt→turn-completed wall time)
- Plain text `[usage]` line extended with `cached=`, `prompt_ms=`, `tok_s=` so the eval runner can parse metrics without requiring `--json` mode

Eval harness (`evals/run-evals.sh`)
- `eval_metrics` SQLite table captures per-prompt performance data
- `store_metrics()` parses the `[usage]` line after every prompt
- `print_metrics_summary()` prints per-category and overall performance stats at the end of the run

Verified end-to-end
Docker-isolated 2-turn test against live llama.cpp server (Qwen3.5-27B-UD-Q4_K_XL, dual R9700):
[Table: `cachedInputTokens`, `promptMs` (server prefill), `ttftMs` (client-side), `totalMs`, `predictedPerSecond` across the two turns; values not preserved]

One-iteration eval run on dev branch printed the full performance summary table across 23 prompts with per-category breakdowns.
Code review fixes (applied after initial pass)
- Extracted the `AdditionalCounts` keys (`prompt_us`, `predicted_tok_per_sec_x100`, etc.) as `internal const` on `OpenAiCompatibleChatClient` with a cross-reference comment on the consumer side in `LlmSessionActor`
- Rewrote `TryGetLong`/`TryGetDouble` with idiomatic `JsonElement.TryGetInt64`/`TryGetDouble` instead of the `(value = ...) is var _` trick
- `store_metrics` uses plain `INSERT` instead of `INSERT OR REPLACE` so PK collisions fail loudly
- Removed the `Fact(Skip = ...)` live-server integration test; the same coverage exists hermetically via `StreamingResponse_WithTimings_SurfacesCachedTokens`

Test plan
- `dotnet build`: all projects compile, 0 warnings
- `dotnet slopwatch analyze`: 0 issues
- `dotnet test src/Netclaw.Daemon.Tests` (ParseUsage + streaming): 9/9 pass
- `dotnet test src/Netclaw.Cli.Tests`: 413/413 pass
- `evals/run-evals.sh` with 1 iteration: performance metrics table prints correctly per-category and overall

Out of scope (follow-ups)
- `chat -p --resume` but multi-turn behavioral/performance cases aren't wired into `run_all()` yet. Needed to validate KV cache growth across turns, especially for tool-call flows where sub-turn serialization could silently break cache hits.