Conversation
- fix broken tests :(
- add ollama profile update tests
- add tests to check for calcs
- hmm we need to add functionality to calc in the profiles
- add the ability to extract easily
- update olla and sherpa
- fix tests
- ahhhh more tests broken
- service_leak fails
- change the way proxy intercepts
- add lmstudio and openai
- use better json parsing logic
- Integrated health checking and routing improvements from main
- Added provider metrics extraction functionality
- Combined ProviderMetrics and RoutingDecision in RequestStats
- Resolved streaming context handling conflicts
- Created comprehensive provider-metrics.md documentation
- Updated API reference with new headers and metrics info
- Added provider metrics section to monitoring guide
- Updated concepts overview to include provider metrics
- Added provider metrics to mkdocs.yml navigation
…, adopting gjson and expr; also preallocates as many slices as we can for the stats work
Caution: Review failed. The pull request is closed.

Walkthrough

Adds a provider metrics extraction pipeline: new domain types and ports, a gjson/expr-based Extractor with a 10ms bound, profile-level metrics.extraction configs for multiple providers, proxy plumbing to capture the last streamed chunk and invoke extraction, startup wiring, logging/prometheus updates, docs, tests and related API signature adjustments.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    participant Client
    participant Proxy as Proxy Service
    participant Upstream as Provider
    participant Extractor as MetricsExtractor
    participant Stats as StatsCollector
    Client->>Proxy: Request (streaming)
    Proxy->>Upstream: Forward request
    Upstream-->>Proxy: Stream chunks
    loop Stream
        Proxy-->>Client: Write chunk
    end
    Proxy->>Extractor: ExtractFromChunk(lastChunk, providerType)
    Extractor-->>Proxy: ProviderMetrics (best effort)
    Proxy->>Stats: Record RequestStats (incl. ProviderMetrics)
    Proxy-->>Client: Response complete
```

```mermaid
sequenceDiagram
    autonumber
    participant App as App Startup
    participant ProfF as ProfileFactory
    participant Extr as MetricsExtractor
    participant ProxyF as Proxy Factory
    App->>ProfF: NewFactoryWithDefaults()
    ProfF-->>App: Profiles
    App->>Extr: NewExtractor(ProfF)
    loop For each profile
        App->>Extr: ValidateProfile(profile)
        Extr-->>App: ok/error
    end
    App->>ProxyF: NewFactory(stats, extractor, logger)
    ProxyF-->>App: Factory ready
```
Estimated code review effort: 🎯 5 (Critical) | ⏱️ ~90 minutes

Possibly related PRs
Actionable comments posted: 8
🔭 Outside diff range comments (1)
internal/adapter/proxy/olla/service.go (1)
593-606: Set X-Olla-Response-Time as an HTTP trailer after streaming completes.

Coding guidelines require X-Olla-Response-Time as a trailer. Ensure it’s declared before WriteHeader and assigned after streaming, when the latency is known.
Minimal changes:
- Declare the trailer before Line 576 (WriteHeader).
- Set the trailer value after Lines 611–613 (latency calculated).
```diff
-	// Copy response headers
+	// Copy response headers
 	for key, values := range resp.Header {
 		for _, value := range values {
 			w.Header().Add(key, value)
 		}
 	}
+
+	// Ensure trailer is declared before headers are written
+	w.Header().Add("Trailer", "X-Olla-Response-Time")
 	w.WriteHeader(resp.StatusCode)
@@
-	stats.EndTime = time.Now()
-	stats.Latency = duration.Milliseconds()
+	stats.EndTime = time.Now()
+	stats.Latency = duration.Milliseconds()
+	// Assign trailer value after latency is known
+	w.Header().Set("X-Olla-Response-Time", fmt.Sprintf("%dms", stats.Latency))
```

This keeps semantics correct for streaming and meets the documented contract.
🧹 Nitpick comments (61)
internal/adapter/unifier/default_unifier.go (1)
488-490: Pre‑allocate slices after de‑dup to exact sizes; tighten comment wording (Aussie English).

You de‑duplicate into maps and only then flatten into slices. Pre‑allocating here with rough heuristics risks under/over‑allocation and doesn’t help until the final append anyway. Initialise these slices as nil and pre‑allocate to the exact de‑duplicated sizes immediately before flattening. Also prefer Australian English and avoid abbreviations in comments.
Apply this change within the selected lines:
```diff
-	Aliases:         make([]domain.AliasEntry, 0, len(models)*2),   // thinking avg 2 aliases per model
-	SourceEndpoints: make([]domain.SourceEndpoint, 0, len(models)), // at least 1 source per model
-	Capabilities:    make([]string, 0, len(models)*5),              // should be ~5 capabilities per model
+	// Leave nil here; we pre‑allocate to the exact de‑duplicated size later to minimise reallocations.
+	Aliases:         nil,
+	SourceEndpoints: nil,
+	Capabilities:    nil,
```

And insert the following just before you start appending from the maps (above Line 531):
```go
// Pre‑allocate exactly to de‑duplicated sizes for better allocation behaviour.
merged.SourceEndpoints = make([]domain.SourceEndpoint, 0, len(sourceMap)) // optimisation: avoid growth during flatten
merged.Aliases = make([]domain.AliasEntry, 0, len(aliasMap))
merged.Capabilities = make([]string, 0, len(capSet))
```

Optional: If you want to keep capacity hints earlier, replace slang/US terms in comments (e.g. “avg”) with clearer Australian English (“average”) and consider extracting the heuristic factors as named constants to make the rationale explicit and maintainable.
internal/util/safe.go (1)
24-32: SafeInt32 clamp logic is correct; please add a doc comment and boundary tests.

Implementation is sound and avoids overflow/underflow. Add a doc comment (why it exists) and unit tests for MinInt32-1, MaxInt32+1, and in-range values.
Apply this diff to document the function:
```diff
-
+// SafeInt32 clamps a 64-bit value into the 32-bit signed range.
+// Why: prevents overflow when normalising provider metrics (e.g. token counts, durations) to int32 for lower memory use.
 func SafeInt32(value int64) int32 {
 	if value < math.MinInt32 {
 		return math.MinInt32
 	}
 	if value > math.MaxInt32 {
 		return math.MaxInt32
 	}
 	return int32(value)
 }
```

go.mod (1)
3-3: Go 1.24 target and new deps LGTM
- go directive set to 1.24 meets the toolchain guideline.
- expr and gjson additions align with the metrics extractor design.
Optionally consider adding a toolchain directive once your CI image is finalised (e.g. toolchain go1.24.x) to avoid local tool mismatches.
Also applies to: 7-15, 34-35
config/profiles/ollama.yaml (3)
178-182: Confirm extraction source for streaming; prefer last_chunk.

If metrics are extracted from the final streamed chunk, set the source accordingly (unless the extractor already routes to the last chunk internally). Using response_body may miss streamed-only fields.
Apply this diff if the extractor supports chunk-based extraction:
```diff
-      source: "response_body"
+      source: "last_chunk"
```
183-195: GJSON vs JSONPath: strip '$.' prefixes unless the extractor normalises them.

gjson expects paths like model, not $.model. If normalisation isn’t implemented in the extractor, these lookups will fail silently.
Apply this diff if normalisation is not already handled:
```diff
-        model: "$.model"
-        done: "$.done"
-        # Token counts
-        input_tokens: "$.prompt_eval_count"
-        output_tokens: "$.eval_count"
-        # Timing data (in nanoseconds from Ollama)
-        total_duration_ns: "$.total_duration"
-        load_duration_ns: "$.load_duration"
-        prompt_duration_ns: "$.prompt_eval_duration"
-        eval_duration_ns: "$.eval_duration"
+        model: "model"
+        done: "done"
+        # Token counts
+        input_tokens: "prompt_eval_count"
+        output_tokens: "eval_count"
+        # Timing data (in nanoseconds from Ollama)
+        total_duration_ns: "total_duration"
+        load_duration_ns: "load_duration"
+        prompt_duration_ns: "prompt_eval_duration"
+        eval_duration_ns: "eval_duration"
```
182-182: Fix YAML lint issues: trailing spaces and missing newline.

There are trailing spaces and a missing newline at EOF flagged by yamllint.
Apply these diffs:
The diff is whitespace-only (it removes the trailing spaces), so it does not render usefully here; trim the flagged lines and ensure the file ends with a newline after line 201.
Also applies to: 195-195, 201-201
internal/adapter/proxy/sherpa/ring_buffer.go (4)
5-7: Explain why, not what, in the comment.

Make the comment convey purpose (why we need this) to aid future maintainers.
Apply this diff:
```diff
-// SimpleRingBuffer implements a thread-safe ring buffer for storing the last N bytes
-// This is much simpler than hand-rolling complex buffer logic
+// SimpleRingBuffer stores only the last N bytes of a stream for cheap “last chunk”
+// access during metrics extraction in streaming paths. Thread-safe; optimised for
+// small capacities and many small writes.
```
13-19: Defend against negative capacities.

A negative capacity will panic at make(..., capacity). Clamp to zero to be defensive.
Apply this diff:
```diff
 func NewSimpleRingBuffer(capacity int) *SimpleRingBuffer {
+	if capacity < 0 {
+		capacity = 0
+	}
 	return &SimpleRingBuffer{
 		data:     make([]byte, 0, capacity),
 		capacity: capacity,
 	}
 }
```
21-55: Optional micro-optimisation: re-slice instead of copy in drop-old path.

Functionally correct. You can simplify the drop-old branch by re-slicing to the last keepBytes before appending.
Apply this diff:
```diff
-	keepBytes := rb.capacity - n
-	if keepBytes > 0 && len(rb.data) > keepBytes {
-		// Shift existing data
-		copy(rb.data[:keepBytes], rb.data[len(rb.data)-keepBytes:])
-		rb.data = rb.data[:keepBytes]
-	} else {
-		rb.data = rb.data[:0]
-	}
-	rb.data = append(rb.data, p...)
+	keepBytes := rb.capacity - n
+	if keepBytes > 0 && len(rb.data) > keepBytes {
+		// Keep the newest bytes, then append
+		rb.data = append(rb.data[:0], rb.data[len(rb.data)-keepBytes:]...)
+	} else {
+		rb.data = rb.data[:0]
+	}
+	rb.data = append(rb.data, p...)
```
57-69: Return empty slice instead of nil for Bytes().

Returning []byte{} avoids nil checks at call sites; semantics are equivalent for most uses.
Apply this diff:
```diff
 	if len(rb.data) == 0 {
-		return nil
+		return []byte{}
 	}
```

Additionally, we can add a compile-time assertion to make the io.Writer intent explicit (outside the selected range):
```go
// At file scope
// var _ io.Writer = (*SimpleRingBuffer)(nil)
```

config/profiles/lmstudio.yaml (2)
138-152: Empty ‘calculations’ block: make intent explicit or drop it.

An empty map is fine, but some config readers treat missing vs empty differently. If unused, either omit the block or make it explicit as an empty map for clarity.
Here’s a minimal tweak to make the intent explicit:
```diff
-  calculations:
-    # LM Studio doesn't provide timing data, so we can't calculate tokens/sec
+  calculations: {}
+  # LM Studio doesn't provide timing data, so we cannot calculate tokens/sec
```
152-152: Add a newline at EOF to satisfy linters and tooling.

Also avoid contractions in docs/config to keep the tone formal.
Apply:
```diff
-  # LM Studio doesn't provide timing data, so we can't calculate tokens/sec
+  # LM Studio doesn't provide timing data, so we cannot calculate tokens/sec
+
```

internal/adapter/proxy/core/base.go (1)
67-68: Consider functional options to avoid constructor churn.

As dependencies grow, positional parameters become brittle for call sites. A small options pattern keeps the API stable and self-documenting.
Example pattern (illustrative):
```go
type BaseProxyOption func(*BaseProxyComponents)

func WithMetricsExtractor(mx ports.MetricsExtractor) BaseProxyOption {
	return func(b *BaseProxyComponents) { b.MetricsExtractor = mx }
}

func NewBaseProxyComponents(ds ports.DiscoveryService, sel domain.EndpointSelector, sc ports.StatsCollector, logger logger.StyledLogger, opts ...BaseProxyOption) *BaseProxyComponents {
	b := &BaseProxyComponents{
		DiscoveryService: ds,
		Selector:         sel,
		StatsCollector:   sc,
		Logger:           logger,
		EventBus:         eventbus.New[ProxyEvent](),
	}
	for _, opt := range opts {
		opt(b)
	}
	return b
}
```

Call sites can then opt in: `NewBaseProxyComponents(ds, sel, sc, logger, WithMetricsExtractor(mx))`
Also applies to: 74-74
docs/content/concepts/overview.md (1)
75-84: Minor wording tweak.

“Provide” reads more formal than “give” in docs.
```diff
-Provider metrics give insights into model performance and resource usage.
+Provider metrics provide insights into model performance and resource usage.
```

docs/content/api-reference/overview.md (1)
91-94: Clarify routing header semantics with an example (operator-focused).

A short example helps operators sanity-check header values during incident response and makes intent obvious.
Apply this diff to add an example block:
````diff
 | `X-Olla-Response-Time` | Total processing time |
 | `X-Olla-Routing-Strategy` | Routing strategy used (when model routing is active) |
 | `X-Olla-Routing-Decision` | Routing decision made (routed/fallback/rejected) |
 | `X-Olla-Routing-Reason` | Human-readable reason for routing decision |
+
+Example:
+
+```
+X-Olla-Routing-Strategy: priority
+X-Olla-Routing-Decision: routed
+X-Olla-Routing-Reason: matched profile 'openai:fast' (latency)
+```
````

internal/adapter/discovery/repository.go (1)
58-60: Reword the comment to explain the rationale (why) rather than the action (what).

Small doc tweak improves clarity per our documentation guideline.
Apply this diff to both occurrences:
```diff
-// Pre-allocate with capacity of all endpoints (worst case)
+// Pre-allocate to avoid reallocations during filtering; worst case is when all endpoints match
```

Also applies to: 74-76
internal/adapter/proxy/benchmark_refactor_test.go (1)
71-71: Benchmarks updated to new NewService signatures; consider a variant with a no‑op extractor.

Change looks right. For completeness, consider adding a benchmark variant that passes a no‑op MetricsExtractor to measure overhead of the extraction path vs nil, ensuring both engines remain comparable under identical conditions.
Also applies to: 81-81
internal/adapter/proxy/proxy_olla_test.go (1)
32-32: Reduce duplication by centralising proxy construction in helpers.

Several tests still construct Olla services directly. Prefer routing these through the existing helpers (createTestOllaProxy/createTestOllaProxyWithConfig) or a new small helper that supplies a nil metrics extractor to keep all call sites consistent and reduce churn if the constructor changes again.
Apply this pattern to the modified sites (illustrated for one call; repeat similarly elsewhere):
```diff
-	proxy, err := olla.NewService(discovery, selector, config, collector, nil, createTestLogger())
+	proxy, err := createTestOllaProxyWithConfig(endpoints, config)
```

Alternatively, declare an explicitly typed nil MetricsExtractor variable to make the intent clear at the call site:
```diff
-	proxy, err := olla.NewService(discovery, selector, config, collector, nil, createTestLogger())
+	var noMetrics ports.MetricsExtractor
+	proxy, err := olla.NewService(discovery, selector, config, collector, noMetrics, createTestLogger())
```

Also applies to: 43-44, 61-61, 128-128, 211-211, 270-270, 456-456, 492-492, 545-545
internal/app/handlers/handler_proxy.go (2)
201-219: Log type consistency: emit tokens_per_sec as a numeric value (not a formatted string).

INFO logs currently render tokens_per_sec as a string, while DEBUG logs emit it as a numeric value. This inconsistency can complicate downstream parsing. Keep it numeric in both places.
Apply this diff:
```diff
-	if pm.TokensPerSecond > 0 {
-		infoFields = append(infoFields, "tokens_per_sec", fmt.Sprintf("%.1f", pm.TokensPerSecond))
-	}
+	if pm.TokensPerSecond > 0 {
+		infoFields = append(infoFields, "tokens_per_sec", pm.TokensPerSecond)
+	}
```
201-219: De-duplicate provider metrics field assembly via a small helper.

The metrics field-building logic is duplicated between INFO (short) and DEBUG (detailed) paths. Extract it into a helper to keep keys/types aligned and simplify future changes.
Apply these diffs to use a shared helper:
```diff
-	// Add provider metrics if available
-	if pr.stats.ProviderMetrics != nil {
-		pm := pr.stats.ProviderMetrics
-		if pm.InputTokens > 0 {
-			infoFields = append(infoFields, "input_tokens", pm.InputTokens)
-		}
-		if pm.OutputTokens > 0 {
-			infoFields = append(infoFields, "output_tokens", pm.OutputTokens)
-		}
-		if pm.TotalTokens > 0 {
-			infoFields = append(infoFields, "total_tokens", pm.TotalTokens)
-		}
-		if pm.TokensPerSecond > 0 {
-			infoFields = append(infoFields, "tokens_per_sec", pm.TokensPerSecond)
-		}
-		if pm.TTFTMs > 0 {
-			infoFields = append(infoFields, "ttft_ms", pm.TTFTMs)
-		}
-	}
+	// Add provider metrics if available
+	infoFields = appendProviderMetricsFields(infoFields, pr.stats.ProviderMetrics, false)
```

```diff
-	// Add provider metrics if available (detailed view)
-	if pr.stats.ProviderMetrics != nil {
-		pm := pr.stats.ProviderMetrics
-		if pm.InputTokens > 0 {
-			fields = append(fields, "input_tokens", pm.InputTokens)
-		}
-		if pm.OutputTokens > 0 {
-			fields = append(fields, "output_tokens", pm.OutputTokens)
-		}
-		if pm.TotalTokens > 0 {
-			fields = append(fields, "total_tokens", pm.TotalTokens)
-		}
-		if pm.TokensPerSecond > 0 {
-			fields = append(fields, "tokens_per_sec", pm.TokensPerSecond)
-		}
-		if pm.TTFTMs > 0 {
-			fields = append(fields, "ttft_ms", pm.TTFTMs)
-		}
-		if pm.PromptMs > 0 {
-			fields = append(fields, "prompt_ms", pm.PromptMs)
-		}
-		if pm.GenerationMs > 0 {
-			fields = append(fields, "generation_ms", pm.GenerationMs)
-		}
-		if pm.Model != "" {
-			fields = append(fields, "provider_model", pm.Model)
-		}
-		if pm.FinishReason != "" {
-			fields = append(fields, "finish_reason", pm.FinishReason)
-		}
-	}
+	// Add provider metrics if available (detailed view)
+	fields = appendProviderMetricsFields(fields, pr.stats.ProviderMetrics, true)
```

Add this helper in the same file (outside the edited ranges):
```go
// appendProviderMetricsFields appends provider metrics to the given field list.
// If detailed is true, include extended timings and metadata.
// We keep numeric types stable for easier downstream ingestion.
func appendProviderMetricsFields(fields []any, pm *domain.ProviderMetrics, detailed bool) []any {
	if pm == nil {
		return fields
	}
	if pm.InputTokens > 0 {
		fields = append(fields, "input_tokens", pm.InputTokens)
	}
	if pm.OutputTokens > 0 {
		fields = append(fields, "output_tokens", pm.OutputTokens)
	}
	if pm.TotalTokens > 0 {
		fields = append(fields, "total_tokens", pm.TotalTokens)
	}
	if pm.TokensPerSecond > 0 {
		fields = append(fields, "tokens_per_sec", pm.TokensPerSecond)
	}
	if pm.TTFTMs > 0 {
		fields = append(fields, "ttft_ms", pm.TTFTMs)
	}
	if detailed {
		if pm.PromptMs > 0 {
			fields = append(fields, "prompt_ms", pm.PromptMs)
		}
		if pm.GenerationMs > 0 {
			fields = append(fields, "generation_ms", pm.GenerationMs)
		}
		if pm.Model != "" {
			fields = append(fields, "provider_model", pm.Model)
		}
		if pm.FinishReason != "" {
			fields = append(fields, "finish_reason", pm.FinishReason)
		}
	}
	return fields
}
```

Also applies to: 250-281
internal/adapter/proxy/sherpa/service_retry.go (2)
149-157: Best-effort metrics extraction currently skipped on error paths.

The PR objective calls out best-effort extraction from the final streamed chunk. Here, provider metrics are only extracted on the success path. If streaming fails (but we still captured lastChunk), we return early and never attempt extraction.
Recommend extracting before error handling (so both success and error paths benefit) and removing the later call to avoid duplication.
Apply this diff:
```diff
-	bytesWritten, lastChunk, streamErr := s.streamResponseWithTimeout(r.Context(), upstreamCtx, w, resp, *buffer, rlog)
+	bytesWritten, lastChunk, streamErr := s.streamResponseWithTimeout(r.Context(), upstreamCtx, w, resp, *buffer, rlog)
 	stats.StreamingMs = time.Since(streamStart).Milliseconds()
 	stats.TotalBytes = bytesWritten
-	if streamErr != nil && !errors.Is(streamErr, context.Canceled) {
-		rlog.Error("streaming failed", "error", streamErr)
-		s.RecordFailure(ctx, endpoint, time.Since(stats.StartTime), streamErr)
-		return common.MakeUserFriendlyError(streamErr, time.Since(stats.StartTime), "streaming", s.configuration.GetResponseTimeout())
-	}
+	// Extract metrics from the last chunk even if the stream errored.
+	// Why: We want to surface provider timing/token metrics whenever possible to aid debugging and autoscaling decisions.
+	core.ExtractProviderMetrics(ctx, s.MetricsExtractor, lastChunk, endpoint, stats, rlog, "Sherpa")
+
+	if streamErr != nil && !errors.Is(streamErr, context.Canceled) {
+		rlog.Error("streaming failed", "error", streamErr)
+		s.RecordFailure(ctx, endpoint, time.Since(stats.StartTime), streamErr)
+		return common.MakeUserFriendlyError(streamErr, time.Since(stats.StartTime), "streaming", s.configuration.GetResponseTimeout())
+	}
@@
-	// Extract metrics from response if available
-	core.ExtractProviderMetrics(ctx, s.MetricsExtractor, lastChunk, endpoint, stats, rlog, "Sherpa")
+	// (metrics extracted above for both success and error paths)
```

Also applies to: 180-182
184-206: Preserve debug metrics logging even when ctx logger is absent.

At completion you log metrics only if a context logger is present. Elsewhere in this function you fall back to rlog. For consistency (and to avoid silent loss of useful diagnostics), add the same fallback here.
```diff
 	if ctxLogger != nil {
 		logFields = append(logFields, "request_id", middleware.GetRequestID(ctx))
 		ctxLogger.Debug("Sherpa proxy metrics", logFields...)
-	}
+	} else {
+		// Why: Keep completion diagnostics available even when no context logger is attached.
+		rlog.Debug("Sherpa proxy metrics", logFields...)
+	}
```

internal/adapter/proxy/sherpa/service_retry_test.go (1)
123-124: Optionally exercise metrics wiring in tests.

Consider adding a small fake MetricsExtractor in one test to assert it’s invoked (even when extraction yields nil). This would catch regressions in constructor wiring without coupling to real provider payloads.
If you’d like, I can sketch a minimal fake extractor and an assertion pattern that doesn’t depend on specific payloads.
internal/adapter/proxy/sherpa/service_streaming.go (1)
39-46: Parameterise ring buffer capacity rather than hard-coding 8KB.

Hard-coding 8192 works, but tying it to configuration (with a sensible default and an upper bound) improves flexibility across deployments and avoids magic numbers.
Apply this diff and helper:
```diff
 func (s *Service) streamResponseWithTimeout(clientCtx, upstreamCtx context.Context, w http.ResponseWriter, resp *http.Response, buffer []byte, rlog logger.StyledLogger) (int, []byte, error) {
 	state := &streamState{
-		lastReadTime:    time.Now(),
-		contentType:     resp.Header.Get(constants.HeaderContentType),
-		isStreaming:     core.AutoDetectStreamingMode(clientCtx, resp, s.configuration.GetProxyProfile()),
-		lastChunkBuffer: NewSimpleRingBuffer(8192), // Keep last 8KB for metrics
+		lastReadTime:    time.Now(),
+		contentType:     resp.Header.Get(constants.HeaderContentType),
+		isStreaming:     core.AutoDetectStreamingMode(clientCtx, resp, s.configuration.GetProxyProfile()),
+		lastChunkBuffer: NewSimpleRingBuffer(s.lastChunkCapacity()), // Why: capacity aligned with config to balance fidelity and memory
 	}
```
```go
// lastChunkCapacity determines how many recent bytes we retain for metric extraction.
// Why: Big enough to capture provider tail JSON (incl. stats), small enough to be cheap across many concurrent streams.
func (s *Service) lastChunkCapacity() int {
	capacity := s.configuration.GetStreamBufferSize()
	if capacity <= 0 {
		capacity = 8192 // default
	}
	// Clamp to a reasonable upper bound to prevent unbounded memory usage on misconfiguration.
	if capacity > 64*1024 {
		capacity = 64 * 1024
	}
	return capacity
}
```

internal/adapter/proxy/olla/service_retry.go (3)
188-190: Prefer version.Name over string literal "Olla".

Avoid hard-coding the engine name; use the centralised constant for consistency.
Apply:
```diff
-	core.ExtractProviderMetrics(ctx, s.MetricsExtractor, lastChunk, endpoint, stats, rlog, "Olla")
+	core.ExtractProviderMetrics(ctx, s.MetricsExtractor, lastChunk, endpoint, stats, rlog, version.Name)
```

Add import outside this hunk:
```go
import (
	// ...
	"github.com/thushan/olla/internal/version"
)
```
192-205: Always emit completion metrics even without a context logger.

At present, completion metrics are only logged when ctxLogger is present; otherwise nothing is emitted. For parity with earlier logging branches, fall back to rlog.
```diff
 	if ctxLogger != nil {
 		logFields = append(logFields, "request_id", middleware.GetRequestID(ctx))
 		ctxLogger.Debug("Olla proxy metrics", logFields...)
+	} else {
+		rlog.Debug("Olla proxy metrics", logFields...)
 	}
```

Also applies to: 211-214
183-187: Remove duplicate assignment to TotalBytes.

stats.TotalBytes is set both immediately after streaming and again after stats.EndTime; the second assignment is redundant.
```diff
 	stats.EndTime = time.Now()
 	stats.Latency = stats.EndTime.Sub(stats.StartTime).Milliseconds()
-	stats.TotalBytes = bytesWritten
```

internal/core/domain/profile_config.go (1)
10-15: Tighten the comment to explain the “why”, using Australian English.

The comment currently explains what. Prefer explaining why we expose metrics in profiles (e.g. to standardise extraction and feed scheduling/logging) and use Australian English.
```diff
-	// Metrics extraction configuration for provider responses
+	// Metrics extraction configuration for provider responses – enables us to surface
+	// provider timings and token counts without bespoke code per platform, improving
+	// observability and informing future scheduling/load-balancing behaviour.
```

config/profiles/openai.yaml (2)
116-116: Trim trailing space to satisfy yamllint.

There’s a trailing space after output_tokens.
```diff
-        output_tokens: "$.usage.completion_tokens" 
+        output_tokens: "$.usage.completion_tokens"
```
122-122: Add a newline at EOF.

YAML linters require a newline at the end of the file.
No code diff provided as it affects the final EOF marker; please ensure the file ends with a newline.

internal/adapter/metrics/extractor_test.go (2)
internal/adapter/metrics/extractor_test.go (2)
22-27: Return an error for unknown profiles to fail fast in testsReturning (nil, nil) can mask mis-typed provider names and make tests pass vacuously. Prefer returning a descriptive error.
func (f *mockProfileFactory) GetProfile(name string) (domain.InferenceProfile, error) { if p, ok := f.profiles[name]; ok { return p, nil } - return nil, nil + return nil, fmt.Errorf("mock profile not found: %s", name) }Add import:
import "fmt"
91-99: Run tests in parallel to speed up the suiteThese unit tests are independent; running them in parallel reduces wall time.
func TestExtractor_ExtractMetrics_OllamaResponse(t *testing.T) { + t.Parallel() @@ func TestExtractor_ExtractMetrics_InvalidJSON(t *testing.T) { + t.Parallel() @@ func TestExtractor_ExtractFromChunk(t *testing.T) { + t.Parallel() @@ func TestExtractor_ExtractFromHeaders(t *testing.T) { + t.Parallel() @@ func TestExtractor_ValidateProfile_InvalidPath(t *testing.T) { + t.Parallel()Also applies to: 170-173, 205-213, 240-247, 279-283
internal/core/ports/metrics.go (1)
10-21: Enrich interface docs to capture rationaleA short rationale helps future maintainers understand why this interface exists (standardising metrics extraction across providers, supporting streaming and full responses).
-// MetricsExtractor extracts provider-specific metrics from responses +// MetricsExtractor standardises provider-specific metrics extraction across +// both full responses and streamed chunks. This keeps proxy code lean and +// enables consistent logging/observability (and future scheduling heuristics) +// without per-provider plumbing. type MetricsExtractor interface { // ValidateProfile validates metrics configuration at startup ValidateProfile(profile domain.InferenceProfile) error @@ // ExtractFromChunk extracts metrics from a streaming chunk (final chunk for streaming responses) ExtractFromChunk(ctx context.Context, chunk []byte, providerName string) *domain.ProviderMetrics }docs/content/concepts/provider-metrics.md (7)
77-82: Add a language to the fenced code block (markdownlint MD040).Specify a language for the vLLM metrics snippet to satisfy linters and improve rendering.
-``` +```text vllm:prompt_tokens_total{model="llama3.2"} 50 vllm:generation_tokens_total{model="llama3.2"} 200 vllm:time_to_first_token_seconds{model="llama3.2"} 0.1 vllm:time_per_output_token_seconds{model="llama3.2"} 0.022--- `223-233`: **Add a language to the log snippet (markdownlint MD040).** Mark the log block as plain text for consistency with other snippets. ```diff -``` +```text 2025/08/15 10:00:00 DEBUG Sherpa proxy metrics endpoint=local-ollama latency_ms=5000 provider_total_ms=5000 provider_prompt_eval_ms=100 provider_eval_ms=4400 provider_prompt_tokens=50 provider_completion_tokens=200 provider_tokens_per_second=45.45--- `55-56`: **Remove duplicated provider names in headings and lead-in text.** Headings already name the provider; repeating in the first sentence is redundant and flagged by LanguageTool. Tighten the phrasing. ```diff -### LM Studio -LM Studio provides timing and token metrics: +### LM Studio +Provides timing and token metrics:-### vLLM -vLLM exposes detailed metrics via Prometheus endpoint: +### vLLM +Exposes detailed metrics via a Prometheus endpoint:Also applies to: 74-76
88-102: Align metric names with exported JSON field names.Code exposes tokens_per_sec and ttft_ms (per ProviderMetrics JSON tags). Using different names in docs causes confusion for operators and tooling.
-| `tokens_per_second` | Generation speed | tokens/sec | -| `time_to_first_token_ms` | Latency to first token | milliseconds | +| `tokens_per_sec` | Generation speed | tokens/sec | +| `ttft_ms` | Latency to first token (time to first token) | milliseconds |
220-223: Heading misleads: these are logs, not response headers.The example shows application logs, not HTTP response headers. Rename to avoid operational confusion.
-### Via Response Headers -Provider metrics are included in detailed debug logs when available: +### Via Logs +Provider metrics are included in detailed debug logs when available:
299-313: Grammar: add missing preposition and minor polish.Light edits make the guidance clearer and fix the “attempts” phrasing.
- condition: tokens_per_second < 10 + condition: tokens_per_sec < 10-3. Enable debug logging to see extraction attempts +3. Enable debug logging to see extraction attempts in logs
213-216: Cross-check derived metrics exposure.Docs promise time_per_token_ms; ProviderMetrics currently has no corresponding field, so this value won’t appear in structured logs or status unless mapped into GenerationMs or custom outputs. Consider either documenting it as “provider-specific derived metric (not universally exported)” or extending ProviderMetrics.
internal/adapter/proxy/core/metrics.go (3)
26-31: Defensive: guard against a nil stats pointer.While callers should pass a non-nil stats, a cheap check avoids a panic in edge cases and keeps this helper safe to reuse.
// Early return if we can't extract metrics - most common case first - if extractor == nil || len(lastChunk) == 0 || endpoint == nil || endpoint.Type == "" { + if stats == nil || extractor == nil || len(lastChunk) == 0 || endpoint == nil || endpoint.Type == "" { // Skip logging in hot path unless needed for debugging return }
44-49: Optional: include model or provider in debug to aid triage.Adding model/provider type helps correlate metrics without enabling verbose logs elsewhere.
- rlog.Debug("Metrics extracted", - "engine", engineName, - "tokens", pm.TotalTokens) + rlog.Debug("Metrics extracted", + "engine", engineName, + "provider", endpoint.Type, + "model", pm.Model, + "tokens", pm.TotalTokens)
74-88: Append ‘total_ms’ and ‘model_load_ms’ to logs for completeness.

Docs and operator expectations include total_ms and model_load_ms. Exposing them keeps logging consistent with the profile-derived metrics set.
```diff
 if pm.GenerationMs > 0 {
 	logFields = append(logFields, "generation_ms", pm.GenerationMs)
 }
+if pm.TotalMs > 0 {
+	logFields = append(logFields, "total_ms", pm.TotalMs)
+}
+if pm.ModelLoadMs > 0 {
+	logFields = append(logFields, "model_load_ms", pm.ModelLoadMs)
+}
 if pm.Model != "" {
 	logFields = append(logFields, "provider_model", pm.Model)
 }
```

internal/adapter/proxy/factory.go (1)
32-38: Optionally make transport defaults configurable rather than hard-coded.

MaxIdleConns, IdleConnTimeout, and MaxConnsPerHost are currently fixed here. Surfacing them via ProxyConfiguration (or profile) improves operability across different deployments.
I can draft the minimal config plumb-through if you want to make these tunables in this PR or a follow-up.
Also applies to: 41-52, 55-69, 91-93
docs/mkdocs.yml (1)
127-133: Remove trailing spaces to satisfy yamllint and avoid CI flakiness.

There are trailing spaces reported by yamllint on these lines. Safe to trim.
Apply this diff:
```diff
-      - Health Checking: concepts/health-checking.md 
-      - Proxy Engines: 
+      - Health Checking: concepts/health-checking.md
+      - Proxy Engines:
         - Overview: concepts/proxy-engines.md
-      - Proxy Profiles: concepts/proxy-profiles.md 
-      - Profile System: 
+      - Proxy Profiles: concepts/proxy-profiles.md
+      - Profile System:
         - Overview: concepts/profile-system.md
         - Provider Metrics: concepts/provider-metrics.md
```

internal/app/services/proxy.go (3)
92-126: Validate via interface; avoid downcasting to concrete extractor.

You already depend on the MetricsExtractor interface which exposes ValidateProfile. The downcast to *metrics.Extractor narrows the implementation unnecessarily and will silently skip validation for alternative implementations.
Apply this diff:
```diff
-	// Pre-validate profiles for metrics extraction
-	for _, profileName := range profileFactory.GetAvailableProfiles() {
-		if inferenceProfile, profileErr := profileFactory.GetProfile(profileName); profileErr == nil {
-			if extractor, ok := metricsExtractor.(*metrics.Extractor); ok {
-				if validationErr := extractor.ValidateProfile(inferenceProfile); validationErr != nil {
-					s.logger.Debug("Profile metrics validation failed",
-						"profile", profileName, "error", validationErr)
-				}
-			}
-		}
-	}
+	// Pre-validate profiles for metrics extraction
+	for _, profileName := range profileFactory.GetAvailableProfiles() {
+		if inferenceProfile, profileErr := profileFactory.GetProfile(profileName); profileErr == nil {
+			if validationErr := metricsExtractor.ValidateProfile(inferenceProfile); validationErr != nil {
+				s.logger.Debug("Profile metrics validation failed",
+					"profile", profileName, "error", validationErr)
+			}
+		}
+	}
```
100-102: Consider dropping log level to Debug to reduce startup noise.

Listing all available profiles on every start can be noisy in production. Debug level is typically sufficient.
```diff
-	s.logger.Info("Profile factory created successfully",
-		"profiles", profileFactory.GetAvailableProfiles())
+	s.logger.Debug("Profile factory created successfully",
+		"profiles", profileFactory.GetAvailableProfiles())
```
129-129: Use the field for consistency when constructing the proxy factory.

Functionally equivalent, but using the struct field avoids confusion with shadowed locals.
```diff
-	proxyFactory := proxy.NewFactory(s.statsCollector, metricsExtractor, s.logger)
+	proxyFactory := proxy.NewFactory(s.statsCollector, s.metricsExtractor, s.logger)
```

internal/adapter/metrics/benchmark_test.go (1)
49-52: Potential integer division in calculations; prefer float constants.

Depending on the expression engine, dividing integers may truncate, yielding 0 for small durations. Using float constants avoids truncation and better reflects tokens/sec and ms.
```diff
-	"tokens_per_second": "output_tokens / (eval_duration_ns / 1000000000)",
-	"ttft_ms":           "prompt_duration_ns / 1000000",
+	"tokens_per_second": "output_tokens / (eval_duration_ns / 1e9)",
+	"ttft_ms":           "prompt_duration_ns / 1e6",
```

internal/adapter/proxy/olla/streaming_helpers.go (2)
24-39: Include richer metadata when publishing client disconnect events.

Adding request/endpoint context significantly improves operability when diagnosing disconnects in multi-tenant environments.
For example, if available in scope:
```diff
-	s.PublishEvent(core.ProxyEvent{
-		Type: core.EventTypeClientDisconnect,
-		Metadata: core.ProxyEventMetadata{
-			BytesSent: int64(state.totalBytes),
-		},
-	})
+	s.PublishEvent(core.ProxyEvent{
+		Type: core.EventTypeClientDisconnect,
+		Metadata: core.ProxyEventMetadata{
+			BytesSent: int64(state.totalBytes),
+		},
+		// If accessible here:
+		// RequestID: stats.RequestID,
+		// Endpoint:  endpoint.URLString,
+	})
```
78-110: Handle client write errors as disconnects where appropriate.

Write errors like EPIPE/ECONNRESET typically indicate the client has gone. Treating them as disconnects avoids surfacing benign errors to callers.
```diff
-	if writeErr != nil {
-		rlog.Debug("write error during streaming", "error", writeErr, "bytes_written", state.totalBytes)
-		return writeErr
-	}
+	if writeErr != nil {
+		rlog.Debug("write error during streaming", "error", writeErr, "bytes_written", state.totalBytes)
+		// Treat as client disconnect and let the loop unwind gracefully
+		s.handleClientDisconnect(state, rlog)
+		return context.Canceled
+	}
```

Why: reduces noisy error propagation for expected client behaviour (navigating away, cancelling requests).
docs/content/configuration/practices/monitoring.md (4)
228-243: Histograms are fine, but ensure you observe raw per-request values (not rolled-up averages).

Observing aggregated averages from /internal/status into a Histogram skews the distribution. Ideally, observe per-request values at the point of completion. If you keep this example, consider adding a note that these are coarse-grained and meant for directional insights.

Also, you’ve defined prompt_tokens and completion_tokens but don’t observe them yet—see the next comment for a minimal extension.
253-261: Also export prompt_tokens and completion_tokens from provider stats.

To align with the docs and make dashboards useful, observe prompt and completion tokens when available. Example tweak:
```diff
 for endpoint_name, stats in data.get('proxy', {}).get('endpoints', {}).items():
     if 'avg_tokens_per_second' in stats:
         tokens_per_second.labels(
             endpoint=endpoint_name,
             model=stats.get('primary_model', 'unknown')
         ).observe(stats['avg_tokens_per_second'])
+    # If available, observe token usage as well.
+    if 'prompt_tokens' in stats:
+        prompt_tokens.labels(
+            endpoint=endpoint_name,
+            model=stats.get('primary_model', 'unknown')
+        ).observe(stats['prompt_tokens'])
+    if 'completion_tokens' in stats:
+        completion_tokens.labels(
+            endpoint=endpoint_name,
+            model=stats.get('primary_model', 'unknown')
+        ).observe(stats['completion_tokens'])
```

Caveat: if these are already aggregated across requests, consider using Gauges instead. Histograms yield most value on per-request observations.
279-281: Great addition of dashboard panels. Minor note on units.

Consider noting explicit units in panel descriptions (e.g., “tokens per second (tokens/s)”) and that token histograms represent counts per request. This reduces ambiguity for new users.
282-304: “Response-Time” is a trailer in streaming; call this out in docs.

X-Olla-Response-Time is sent as an HTTP trailer for streaming responses. Some clients don’t surface trailers by default. Add a short note with a curl example (e.g., curl --no-buffer --raw --trailers …) to avoid confusion.
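For reference, a self-contained Go sketch of the trailer round-trip (server declares the trailer before the status line, client only sees it after draining the body; the header name and value are taken from these docs, everything else is illustrative):

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
)

// startServer streams a body and sets X-Olla-Response-Time as an HTTP trailer.
func startServer() *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		// Trailers must be declared before the status line is written.
		w.Header().Set("Trailer", "X-Olla-Response-Time")
		w.WriteHeader(http.StatusOK)
		w.Write([]byte("streamed chunk"))
		// The trailer value is set after the body, once total time is known.
		w.Header().Set("X-Olla-Response-Time", "123ms")
	}))
}

// fetchTrailer returns the trailer value; resp.Trailer is only populated
// after the body has been read to EOF.
func fetchTrailer(url string) string {
	resp, err := http.Get(url)
	if err != nil {
		return ""
	}
	defer resp.Body.Close()
	io.Copy(io.Discard, resp.Body)
	return resp.Trailer.Get("X-Olla-Response-Time")
}

func main() {
	srv := startServer()
	defer srv.Close()
	fmt.Println(fetchTrailer(srv.URL))
}
```

This mirrors why curl needs `--raw`/trailer-aware flags: the value simply does not exist until the stream ends.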
internal/adapter/metrics/integration_test.go (2)
100-168: Performance threshold might be flaky on some CI runners.

A hard <50µs/op bound can intermittently fail on noisy or virtualised CI. Consider either:
- Relaxing the bound (e.g., <100µs), or
- Measuring a percentile across batches and asserting on p95/p99, or
- Gating with an env flag for CI.
```diff
-	assert.Less(t, perOp, 50*time.Microsecond,
-		"Extraction took %v per operation, expected < 50µs", perOp)
+	assert.Less(t, perOp, 100*time.Microsecond,
+		"Extraction took %v per operation, expected < 100µs", perOp)
```
216-225: Use strings.Builder to avoid quadratic concatenation in tests.

Not critical at size=1000, but easy to make linear and tidy.
```diff
 func generateLargeArray(size int) string {
-	result := ""
-	for i := 0; i < size; i++ {
-		if i > 0 {
-			result += ","
-		}
-		result += "1"
-	}
-	return result
+	var b strings.Builder
+	b.Grow(size * 2) // rough capacity
+	for i := 0; i < size; i++ {
+		if i > 0 {
+			b.WriteByte(',')
+		}
+		b.WriteByte('1')
+	}
+	return b.String()
 }
```

Note: add import "strings".

internal/core/domain/provider_metrics.go (1)
internal/core/domain/provider_metrics.go (1)
22-25: Align JSON tag with docs: tokens_per_second vs tokens_per_sec.

Docs and exporter examples refer to tokens_per_second. The struct currently exposes tokens_per_sec, which can cause confusion and mismatches in JSON consumers.
```diff
-	TokensPerSecond float32 `json:"tokens_per_sec,omitempty"`
+	TokensPerSecond float32 `json:"tokens_per_second,omitempty"`
```

If there are existing consumers relying on tokens_per_sec, consider dual tags via a custom marshaler or a transitional compatibility layer at the API boundary.
Would you like a follow-up PR to add a backward-compatible marshaler for a deprecation window?
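For illustration, a minimal sketch of such a deprecation-window marshaler — the wrapper type and helper are hypothetical, not part of the codebase; it simply emits both keys for the same value:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// providerMetricsCompat is a hypothetical wrapper showing how a custom
// MarshalJSON can emit both the legacy and the documented key during a
// deprecation window.
type providerMetricsCompat struct {
	TokensPerSecond float32
}

func (p providerMetricsCompat) MarshalJSON() ([]byte, error) {
	// map keys are marshalled in sorted order by encoding/json
	return json.Marshal(map[string]float32{
		"tokens_per_sec":    p.TokensPerSecond, // legacy key
		"tokens_per_second": p.TokensPerSecond, // documented key
	})
}

func marshalCompat(v float32) string {
	b, _ := json.Marshal(providerMetricsCompat{TokensPerSecond: v})
	return string(b)
}

func main() {
	fmt.Println(marshalCompat(45.5))
	// → {"tokens_per_sec":45.5,"tokens_per_second":45.5}
}
```

Consumers can then migrate to tokens_per_second at their own pace before the legacy key is dropped.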
internal/adapter/metrics/extractor.go (3)
121-151: Timeout enforcement still runs the full extraction; consider avoiding excess goroutines.

The goroutine enforces a caller timeout but cannot pre-empt doExtract. Under persistent timeouts, goroutines may pile up briefly until they complete. Given extraction is fast, this is likely OK, but you could:
- Run doExtract inline and rely on its micro-path optimisations, or
- Use a worker pool if you later add heavier calculations.
Not urgent; just calling out the trade-off.
225-254: Mapping covers key fields; consider explicit support for generation_ms/prompt_ms calculations.

If profiles start emitting calculated ms fields directly, map them explicitly to avoid relying solely on expr-derived paths. This keeps the mapping robust across providers.
315-324: Header extraction is a stub—either wire it or remove config to avoid surprises.

If profiles define headers in config, silently ignoring them can confuse operators. Happy to implement a minimal mapping (e.g., X-Provider-Input-Tokens → InputTokens) if you share the expected header names.
Would you like me to add a small, documented mapping for common providers and make it opt-in per profile?
### Provider Metrics (Debug Logs)

When available, provider-specific performance metrics are extracted from responses and included in debug logs:

| Metric | Description | Providers |
|--------|-------------|-----------|
| `provider_total_ms` | Total processing time | Ollama, LM Studio |
| `provider_prompt_tokens` | Tokens in prompt | All |
| `provider_completion_tokens` | Tokens generated | All |
| `provider_tokens_per_second` | Generation speed | Ollama, LM Studio |
| `provider_model` | Actual model used | All |

See [Provider Metrics](../concepts/provider-metrics.md) for detailed information.
💡 Verification agent
🧩 Analysis chain
Add units and verify metric key naming matches logs/struct tags.
Including units avoids ambiguity, and aligning names with runtime keys prevents confusion (note: ProviderMetrics JSON tag uses tokens_per_sec).
- Suggest adding units to descriptions:

```diff
-| `provider_total_ms` | Total processing time | Ollama, LM Studio |
-| `provider_prompt_tokens` | Tokens in prompt | All |
-| `provider_completion_tokens` | Tokens generated | All |
-| `provider_tokens_per_second` | Generation speed | Ollama, LM Studio |
+| `provider_total_ms` | Total processing time (ms) | Ollama, LM Studio |
+| `provider_prompt_tokens` | Tokens in prompt (count) | All |
+| `provider_completion_tokens` | Tokens generated (count) | All |
+| `provider_tokens_per_second` | Generation speed (tokens/s) | Ollama, LM Studio |
 | `provider_model` | Actual model used | All |
```

- Please also verify naming consistency between docs, logs, and JSON tags:
🏁 Script executed:
```shell
#!/bin/bash
# Check for potential mismatches between documented and runtime keys.
rg -nP -C2 '(tokens_per_second|tokens_per_sec|provider_tokens_per_second|provider_tokens_per_sec)'
```

Length of output: 14061
Align provider metric names with runtime keys and add units
Docs currently mix "provider_tokens_per_second" / "tokens_per_second" while runtime/logs/JSON use "tokens_per_sec" (and the extractor expects "tokens_per_second"). Add explicit units to the docs table and unify or document the mapping between external keys and internal JSON/log keys.
Files that need attention (examples found):
- docs/content/api-reference/overview.md — update table (add units + confirm key name)
- docs/content/concepts/provider-metrics.md — examples and calculations (tokens_per_second vs tokens_per_sec)
- docs/content/configuration/practices/monitoring.md — metric names used for Prometheus and examples
- internal/core/domain/provider_metrics.go — JSON tag: `` TokensPerSecond float32 `json:"tokens_per_sec,omitempty"` ``
- internal/adapter/metrics/extractor.go — handles `"tokens_per_second"` and maps to TokensPerSecond
- internal/adapter/proxy/core/metrics.go and internal/app/handlers/handler_proxy.go — log fields use `"tokens_per_sec"`
- config/profiles/*.yaml (ollama.yaml, openai.yaml) — use `tokens_per_second` in calculations
- test/scripts/streaming/test-streaming-responses.py — uses `tokens_per_second`
Suggested minimal doc change (add units):
```diff
-| `provider_total_ms` | Total processing time | Ollama, LM Studio |
-| `provider_prompt_tokens` | Tokens in prompt | All |
-| `provider_completion_tokens` | Tokens generated | All |
-| `provider_tokens_per_second` | Generation speed | Ollama, LM Studio |
+| `provider_total_ms` | Total processing time (ms) | Ollama, LM Studio |
+| `provider_prompt_tokens` | Tokens in prompt (count) | All |
+| `provider_completion_tokens` | Tokens generated (count) | All |
+| `provider_tokens_per_second` | Generation speed (tokens/s) | Ollama, LM Studio |
 | `provider_model` | Actual model used | All |
```

Recommendation: pick a single canonical external metric name (e.g. "provider_tokens_per_second" or match the runtime/log key "tokens_per_sec"), update docs and examples to match, and either:
- update the extractor/logging to use the chosen external name, or
- explicitly document the mapping (external → internal JSON tag) and accept both names in parsing.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```markdown
### Provider Metrics (Debug Logs)

When available, provider-specific performance metrics are extracted from responses and included in debug logs:

| Metric | Description | Providers |
|--------|-------------|-----------|
| `provider_total_ms` | Total processing time (ms) | Ollama, LM Studio |
| `provider_prompt_tokens` | Tokens in prompt (count) | All |
| `provider_completion_tokens` | Tokens generated (count) | All |
| `provider_tokens_per_second` | Generation speed (tokens/s) | Ollama, LM Studio |
| `provider_model` | Actual model used | All |

See [Provider Metrics](../concepts/provider-metrics.md) for detailed information.
```
```go
func createTestProfileFactory() ProfileFactory {
	return &mockProfileFactory{
		profiles: map[string]*mockProfile{
			"ollama": {
				name: "ollama",
				config: &domain.ProfileConfig{
					Metrics: domain.MetricsConfig{
						Extraction: domain.MetricsExtractionConfig{
							Enabled: true,
							Source:  "response_body",
							Format:  "json",
							Paths: map[string]string{
								"model":              "$.model",
								"done":               "$.done",
								"input_tokens":       "$.prompt_eval_count",
								"output_tokens":      "$.eval_count",
								"prompt_duration_ns": "$.prompt_eval_duration",
								"eval_duration_ns":   "$.eval_duration",
								"total_duration_ns":  "$.total_duration",
								"load_duration_ns":   "$.load_duration",
							},
							Calculations: map[string]string{
								"tokens_per_second": "output_tokens / (eval_duration_ns / 1000000000)",
								"ttft_ms":           "prompt_duration_ns / 1000000",
							},
						},
					},
				},
			},
		},
	}
}
```
💡 Verification agent
🧩 Analysis chain
Compilation blocker: missing mock types (mockProfileFactory/mockProfile) referenced here.
This benchmark won’t compile as-is because mockProfileFactory and mockProfile are not defined in this file or imported. Define minimal mocks that satisfy metrics.ProfileFactory and domain.InferenceProfile (or switch to using the real registry/profile.Factory with test fixtures).
I can generate mocks once we confirm the method set of domain.InferenceProfile. Please run the script below to reveal the interface/struct so we can implement correct stubs.
If domain.InferenceProfile is an interface with GetName()/GetConfig() methods, a minimal mock could look like:
+// --- test-only mocks to satisfy metrics.ProfileFactory ---
+type mockProfileFactory struct {
+ profiles map[string]*mockProfile
+}
+
+func (m *mockProfileFactory) GetProfile(profileType string) (domain.InferenceProfile, error) {
+ p, ok := m.profiles[profileType]
+ if !ok {
+ return nil, fmt.Errorf("profile not found: %s", profileType)
+ }
+ return p, nil
+}
+
+type mockProfile struct {
+ name string
+ config *domain.ProfileConfig
+}
+
+// Adjust the method set below to match domain.InferenceProfile
+func (m *mockProfile) GetName() string { return m.name }
+func (m *mockProfile) GetConfig() *domain.ProfileConfig { return m.config }

Replace/augment the methods to match the actual interface once discovered.
---
🏁 Script executed:
```shell
#!/bin/bash
set -euo pipefail
echo "Locate InferenceProfile type and its methods"
rg -nC3 -g '!**/vendor/**' -P '^\s*type\s+InferenceProfile\b' internal | sed -n '1,200p'
rg -nC2 -g '!**/vendor/**' -P '\bInferenceProfile\b' internal/core/domain | sed -n '1,200p'
```

Length of output: 1581
🏁 Script executed:
```shell
#!/bin/bash
set -euo pipefail
echo "Printing internal/core/domain/inference_profile.go"
sed -n '1,400p' internal/core/domain/inference_profile.go
```

Length of output: 3307
🏁 Script executed:
```shell
#!/bin/bash
set -euo pipefail
echo "Searching for PlatformProfile and ProfileFactory definitions..."
rg -nC3 -g '!**/vendor/**' -P '^\s*type\s+PlatformProfile\b' || true
rg -nC3 -g '!**/vendor/**' -P '^\s*type\s+ProfileFactory\b' || true
# Print the files that contain PlatformProfile / ProfileFactory if found
pf_file=$(rg -n --hidden -g '!**/vendor/**' -P '^\s*type\s+PlatformProfile\b' | head -n1 | cut -d: -f1 || true)
factory_file=$(rg -n --hidden -g '!**/vendor/**' -P '^\s*type\s+ProfileFactory\b' | head -n1 | cut -d: -f1 || true)
if [ -n "$pf_file" ]; then
  echo
  echo "=== $pf_file ==="
  sed -n '1,240p' "$pf_file"
else
  echo "PlatformProfile type not found by name."
fi
if [ -n "$factory_file" ]; then
  echo
  echo "=== $factory_file ==="
  sed -n '1,240p' "$factory_file"
else
  echo "ProfileFactory type not found by name."
fi
# Also search for any symbols named ProfileFactory (usages)
echo
echo "Usages of ProfileFactory (context):"
rg -nC2 -g '!**/vendor/**' '\bProfileFactory\b' || true
```

Length of output: 14149
Add test-only mocks that fully implement domain.InferenceProfile / ProfileFactory (fix compilation)
The benchmark references mockProfileFactory/mockProfile but InferenceProfile embeds PlatformProfile and exposes many methods — the current mocks must implement the full interface. Add a test-only mock implementation (or use the real registry factory with fixtures).
Files to update:
- internal/adapter/metrics/benchmark_test.go — add mocks and keep createTestProfileFactory
- internal/core/domain/inference_profile.go — reference for required method set
- internal/core/domain/profile.go — reference for embedded PlatformProfile methods
Patch to add (append to the test file). Ensure the test file imports fmt and time if not already present.
+// --- test-only mocks to satisfy metrics.ProfileFactory ---
+type mockProfileFactory struct {
+ profiles map[string]*mockProfile
+}
+
+func (m *mockProfileFactory) GetProfile(profileType string) (domain.InferenceProfile, error) {
+ p, ok := m.profiles[profileType]
+ if !ok {
+ return nil, fmt.Errorf("profile not found: %s", profileType)
+ }
+ return p, nil
+}
+
+type mockProfile struct {
+ name string
+ version string
+ config *domain.ProfileConfig
+ paths []string
+}
+
+// PlatformProfile methods
+func (m *mockProfile) GetName() string { return m.name }
+func (m *mockProfile) GetVersion() string { return m.version }
+func (m *mockProfile) GetModelDiscoveryURL(baseURL string) string { return "" }
+func (m *mockProfile) GetHealthCheckPath() string { return "" }
+func (m *mockProfile) IsOpenAPICompatible() bool { return false }
+func (m *mockProfile) GetRequestParsingRules() domain.RequestParsingRules {
+ return domain.RequestParsingRules{}
+}
+func (m *mockProfile) GetModelResponseFormat() domain.ModelResponseFormat { return domain.ModelResponseFormat{} }
+func (m *mockProfile) GetDetectionHints() domain.DetectionHints { return domain.DetectionHints{} }
+func (m *mockProfile) ParseModelsResponse(data []byte) ([]*domain.ModelInfo, error) {
+ return nil, nil
+}
+func (m *mockProfile) GetPaths() []string {
+ return m.paths
+}
+func (m *mockProfile) GetPath(index int) string {
+ if index < 0 || index >= len(m.paths) {
+ return ""
+ }
+ return m.paths[index]
+}
+
+// InferenceProfile methods
+func (m *mockProfile) GetTimeout() time.Duration { return 0 }
+func (m *mockProfile) GetMaxConcurrentRequests() int { return 1 }
+func (m *mockProfile) ValidateEndpoint(endpoint *domain.Endpoint) error { return nil }
+func (m *mockProfile) GetDefaultPriority() int { return 0 }
+func (m *mockProfile) GetConfig() *domain.ProfileConfig { return m.config }
+func (m *mockProfile) GetModelCapabilities(modelName string, registry domain.ModelRegistry) domain.ModelCapabilities {
+ return domain.ModelCapabilities{}
+}
+func (m *mockProfile) IsModelSupported(modelName string, registry domain.ModelRegistry) bool { return true }
+func (m *mockProfile) TransformModelName(fromName string, toFormat string) string { return fromName }
+func (m *mockProfile) GetResourceRequirements(modelName string, registry domain.ModelRegistry) domain.ResourceRequirements {
+ return domain.ResourceRequirements{}
+}
+func (m *mockProfile) GetOptimalConcurrency(modelName string) int { return 1 }
+func (m *mockProfile) GetRoutingStrategy() domain.RoutingStrategy { return domain.RoutingStrategy{} }
+func (m *mockProfile) ShouldBatchRequests() bool { return false }
+func (m *mockProfile) GetRequestTimeout(modelName string) time.Duration { return 0 }

Notes:
- Replace the stub returns where tests need specific behaviour (e.g. ParseModelsResponse, GetPaths).
- Add imports "fmt" and "time" to the test file if they are not present.
- Alternatively, wire the test to use the real profile factory from internal/adapter/registry/profile/factory.go with test fixtures if you prefer not to maintain test mocks.
Committable suggestion skipped: line range outside the PR's diff.
```diff
 // Stream with Olla's optimized streaming
 // Use r.Context() for client context and upstreamCtx for upstream context
-bytesWritten, streamErr := s.streamResponse(r.Context(), upstreamCtx, w, resp, *buffer, rlog)
+bytesWritten, lastChunk, streamErr := s.streamResponse(r.Context(), upstreamCtx, w, resp, *buffer, rlog)
```
💡 Verification agent
🧩 Analysis chain
Verify other call sites updated to new streamResponse signature
streamResponse now returns (int, []byte, error). Please ensure all usages were updated accordingly.
Run this script to locate call sites and the definition:
🏁 Script executed:
```shell
#!/bin/bash
set -euo pipefail
echo "== Find streamResponse definition =="
rg -n -C2 -g '!**/vendor/**' -P 'func\s+\(\s*\w+\s+\*\w+\s*\)\s*streamResponse\s*\('
echo
echo "== Find all call sites =="
rg -n -C2 -g '!**/vendor/**' -P '\bstreamResponse\s*\('
echo
echo "== Grep for assignments expecting 3 return values =="
rg -n -C2 -g '!**/vendor/**' -P '^\s*(?:[\w\[\]\*\s,]+:=\s*)?[^=\n]*streamResponse\(' | sed -n 'p'
```

Length of output: 4868
streamResponse signature verified — code updated; docs need one fix
- internal/adapter/proxy/olla/service.go — definition at line 647 returns (int, []byte, error); call at line 594 uses `bytesWritten, lastChunk, streamErr := s.streamResponse(...)`
- internal/adapter/proxy/olla/service_retry.go — call at line 157 uses `bytesWritten, lastChunk, streamErr := s.streamResponse(...)`
- docs/content/development/architecture.md — line 213 still shows `return s.streamResponse(w, resp)`; update the example to reflect the new (int, []byte, error) return values.
No other call sites found in the repo (excluding vendor).
🤖 Prompt for AI Agents
In internal/adapter/proxy/olla/service_retry.go around line 157 the call to
s.streamResponse was updated to return (int, []byte, error) but the docs example
in docs/content/development/architecture.md at line 213 still shows the old
signature; update the docs example to match the new call signature by showing
the three return values (e.g., bytesWritten, lastChunk, err :=
s.streamResponse(...)) and include minimal handling of the new lastChunk and err
returns (check/propagate err and document what lastChunk contains) so the
example compiles and reflects the actual function contract.
Actionable comments posted: 1
🔭 Outside diff range comments (1)
internal/adapter/proxy/olla/service.go (1)
567-576: Action required — declare Trailer and set final X-Olla-Response-Time trailer.

Short summary:
- core.SetResponseHeaders(...) does set X-Olla-Request-ID, X-Olla-Endpoint, X-Olla-Backend-Type and X-Olla-Model (confirmed in internal/adapter/proxy/core/common.go). It also currently sets X-Olla-Response-Time immediately (time.Since(stats.StartTime)).
- There is no explicit Trailer declaration before WriteHeader in internal/adapter/proxy/olla/service.go and no final trailer write after streaming completes (service.go calls SetResponseHeaders, copies resp headers, then WriteHeader, then streams and computes final duration).
What to fix (files / locations):
- internal/adapter/proxy/olla/service.go — handleSuccessfulResponse
- After core.SetResponseHeaders(w, stats, endpoint) and before calling w.WriteHeader(...), declare the trailer: w.Header().Add(constants.HeaderTrailer, constants.HeaderXOllaResponseTime).
- After streaming completes and final duration is computed, set the trailer value: w.Header().Set(constants.HeaderXOllaResponseTime, duration.String()).
- (Optional) internal/adapter/proxy/core/common.go — consider whether SetResponseHeaders should avoid setting X-Olla-Response-Time early for streaming responses to prevent duplicate header+trailer values.
Suggested minimal patch (illustrative):
```diff
 core.SetResponseHeaders(w, stats, endpoint)
+// Declare we'll send the final response time as a trailer (must be set before WriteHeader)
+w.Header().Add(constants.HeaderTrailer, constants.HeaderXOllaResponseTime)
 
 // Copy response headers
 for key, values := range resp.Header {
@@
 w.WriteHeader(resp.StatusCode)
@@
-	// stats update
-	duration := time.Since(stats.StartTime)
-	s.RecordSuccess(endpoint, duration.Milliseconds(), int64(bytesWritten))
+	// stats update
+	duration := time.Since(stats.StartTime)
+	// Set final response-time trailer value
+	w.Header().Set(constants.HeaderXOllaResponseTime, duration.String())
+	s.RecordSuccess(endpoint, duration.Milliseconds(), int64(bytesWritten))
```

Reason: HTTP trailers must be declared (Trailer header) before WriteHeader; the final X-Olla-Response-Time value can then be set after streaming completes so clients receive the accurate total time.
♻️ Duplicate comments (4)
internal/adapter/proxy/olla/service.go (1)
646-684: Timer Reset race fixed correctly (drain-before-Reset pattern).

Good fix. Stopping and draining the timer channel before Reset removes the spurious wake-up race noted previously. This aligns with net/http timer best practice for long-lived streaming loops.
docs/content/concepts/provider-metrics.md (2)
118-123: Use “gjson-compatible JSON paths”, not “JSONPath expressions”.

The extractor accepts gjson-compatible paths and normalises JSONPath-style prefixes. Wording this as “JSONPath expressions” is misleading.
Apply this diff:
```diff
-  paths:  # JSONPath expressions to extract raw values
+  paths:  # JSON paths (gjson-compatible) to extract raw values
```
266-268: Tweak gjson description: it queries JSON, it doesn’t “parse JSONPath”.

gjson provides fast JSON querying; we normalise JSONPath-like prefixes before querying.
Apply this diff:
```diff
-- **[gjson](https://github.com/tidwall/gjson)**: For JSON path parsing (7.6x faster than encoding/json)
+- **[gjson](https://github.com/tidwall/gjson)**: For fast JSON querying (7.6x faster than encoding/json)
```

config/profiles/ollama.yaml (1)
198-203: Zero-guard and multiply-first precision for tokens_per_second — looks good.

The expression now guards divide-by-zero and multiplies before dividing, which improves precision and avoids integer truncation. Nice one.
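For readers less familiar with the expr-level calculation, the same guard-and-multiply-first pattern sketched in plain Go (helper name is illustrative; the inputs mirror Ollama's nanosecond duration fields):

```go
package main

import "fmt"

// tokensPerSecond mirrors the profile calculation: guard the divisor, and
// multiply before dividing so sub-second nanosecond durations don't
// truncate to zero under integer arithmetic.
func tokensPerSecond(outputTokens, evalDurationNs int64) float64 {
	if evalDurationNs == 0 {
		return 0 // zero-guard: no duration means no meaningful rate
	}
	return float64(outputTokens) * 1e9 / float64(evalDurationNs)
}

func main() {
	fmt.Println(tokensPerSecond(200, 4_400_000_000)) // ~45.45 tokens/s
	fmt.Println(tokensPerSecond(200, 0))             // guarded: 0
}
```

Dividing first (`evalDurationNs / 1e9`) under integer semantics would floor a 400ms eval to 0 and blow up the division, which is exactly what the revised expression avoids.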
🧹 Nitpick comments (17)
internal/adapter/proxy/olla/service.go (1)
593-606: Metrics extraction timing/context: avoid skipping extraction on client-cancel.

Nice capture of lastChunk and post-stream extraction. However, using the client ctx means extraction is skipped if the client disconnects (ctx cancelled) even when we have the final chunk. Prefer using the upstream context to improve best-effort extraction.
Additionally, per coding guidelines, ensure X-Olla-Response-Time is emitted as a trailer. If core.SetResponseHeaders pre-declares the Trailer header, we can safely set the trailer value here once the duration is known.
Apply this diff locally:
```diff
-	// Extract metrics from response if available
-	core.ExtractProviderMetrics(ctx, s.MetricsExtractor, lastChunk, endpoint, stats, rlog, "Olla")
+	// Extract metrics from response if available.
+	// Use upstreamCtx so provider metrics extraction is not suppressed by client cancellation.
+	core.ExtractProviderMetrics(upstreamCtx, s.MetricsExtractor, lastChunk, endpoint, stats, rlog, "Olla")
```
```diff
@@
 stats.EndTime = time.Now()
 stats.Latency = duration.Milliseconds()
+// Populate response-time trailer (must be declared earlier).
+w.Header().Set("X-Olla-Response-Time", fmt.Sprintf("%d", stats.Latency))
```

docs/content/concepts/provider-metrics.md (7)
127-131: Prefer Australian English and clarify normalisation rules.

Use “normalised/normalisation” (en-AU), and keep the note concise.
Apply this diff:
```diff
-1. **`paths`**: Maps field names to JSON path expressions for extracting values from the provider's response
-   - Supports both JSONPath notation (`$.field.subfield`) and gjson notation (`field.subfield`)
-   - JSONPath prefixes are automatically normalized: `$.` is trimmed, `$` becomes empty string
+1. **`paths`**: Maps field names to JSON path expressions for extracting values from the provider's response
+   - Supports both JSONPath notation (`$.field.subfield`) and gjson notation (`field.subfield`)
+   - JSONPath prefixes are automatically normalised: `$.` is trimmed; `$` is treated as the root
```
`150-170`: **Align Australian English in comments (“normalised”).**

Consistent en-AU spelling helps keep docs uniform.
Apply this diff:
```diff
-    model: "$.model"    # JSONPath notation (normalized to "model")
+    model: "$.model"    # JSONPath notation (normalised to "model")
```
`269-275`: **Use en-AU (“Normalisation”) and fix spelling in bullets.**

Keep spelling consistent.
Apply this diff:
```diff
-**JSONPath Normalization**: Olla automatically normalizes JSONPath-style prefixes for gjson compatibility:
-- `$.foo.bar` → `foo.bar` (leading `$.` is trimmed)
-- `$` → `` (root selector is converted to empty string)
-- `foo.bar` → `foo.bar` (already normalized paths are unchanged)
+**JSONPath Normalisation**: Olla automatically normalises JSONPath-style prefixes for gjson compatibility:
+- `$.foo.bar` → `foo.bar` (leading `$.` is trimmed)
+- `$` → `` (root selector becomes the empty string)
+- `foo.bar` → `foo.bar` (already normalised paths are unchanged)
```
`77-82`: **Specify a language for fenced code block (markdownlint MD040).**

Add a language hint (text) for the Prometheus sample to satisfy linters and improve rendering.
Apply this diff:
````diff
-```
+```text
 vllm:prompt_tokens_total{model="llama3.2"} 50
 vllm:generation_tokens_total{model="llama3.2"} 200
 vllm:time_to_first_token_seconds{model="llama3.2"} 0.1
 vllm:time_per_output_token_seconds{model="llama3.2"} 0.022
````

---

`226-236`: **Specify a language for the log block (markdownlint MD040).**

Use text for log snippets.

Apply this diff:

````diff
-```
+```text
 2025/08/15 10:00:00 DEBUG Sherpa proxy metrics endpoint=local-ollama latency_ms=5000 provider_total_ms=5000 provider_prompt_eval_ms=100 provider_eval_ms=4400 provider_prompt_tokens=50 provider_completion_tokens=200 provider_tokens_per_second=45.45
````

---

`329-332`: **Clarify phrasing: “extraction attempts in logs”.**

Minor grammar improvement.

Apply this diff:

```diff
-3. Enable debug logging to see extraction attempts
+3. Enable debug logging to see extraction attempts in logs
```
`55-57`: **Avoid repeating the heading text in the first sentence.**

Tighten the prose to prevent duplication (“LM Studio LM Studio …”).
Apply this diff:
```diff
-### LM Studio
-LM Studio provides timing and token metrics:
+### LM Studio
+Provides timing and token metrics:
```

docs/content/development/architecture.md (1)
`213-231`: **Use the buffer pool in the Sherpa example for consistency and explain why.**

The struct shows a bufferPool, but the snippet allocates a fresh 32KB slice. Consider showing pooled buffers here as well, and add a brief “why” comment (reduced GC churn on hot path) to align with en-AU guidance.
Proposed snippet adjustment:
```diff
-	buffer := make([]byte, 32*1024)
-	bytesWritten, lastChunk, err := s.streamResponse(ctx, ctx, w, resp, buffer, logger)
+	// Use pooled buffer to reduce GC pressure on the hot path
+	buf := s.bufferPool.Get()
+	defer s.bufferPool.Put(buf)
+	bytesWritten, lastChunk, err := s.streamResponse(ctx, ctx, w, resp, *buf, logger)
```

config/profiles/openai.yaml (1)
`105-125`: **LGTM overall; remove trailing space and add newline at EOF.**

Config structure and expressions look correct. Please tidy formatting.
Apply this diff:
```diff
-        output_tokens: "$.usage.completion_tokens" 
+        output_tokens: "$.usage.completion_tokens"
```

Also ensure the file ends with a single newline (POSIX-friendly).
config/profiles/lmstudio.yaml (1)
`138-154`: **LGTM; add newline at end of file.**

Mappings and calculation are consistent with the extractor behaviour. Add a final newline to satisfy linters and tooling.
config/profiles/vllm.yaml (1)
`226-246`: **LGTM; add newline at end of file.**

The mappings and guarded calculation look good for OpenAI-compatible responses. Add a trailing newline to satisfy linters.
config/profiles/ollama.yaml (1)
`182-182`: **YAML lint: strip trailing spaces and add newline at EOF**

Clean up the minor lint issues.
Apply this diff:
```diff
-        
+
@@
-      
+
@@
-        model_load_ms: "load_duration_ns / 1000000"
+        model_load_ms: "load_duration_ns / 1000000"
+
```

Also applies to: 196-196, 203-203
internal/adapter/metrics/extractor.go (4)
`191-194`: **Don’t abort header extraction on invalid JSON**

If the body isn’t valid JSON, we return early and skip header-based extraction. Better to continue so header-only providers/configs still work.
Apply this diff:
```diff
-	if !jsonResult.Exists() {
-		return nil
-	}
+	if !jsonResult.Exists() {
+		// Invalid/empty JSON — skip body extraction but still allow header-based extraction.
+	}
```
`291-300`: **Support additional calculated fields (“prompt_ms”, “generation_ms”)**

If profiles add expressions for these, they’re currently ignored. Map them when present to keep calculations flexible.
Apply this diff:
```diff
 switch field {
 case "tokens_per_second":
 	metrics.TokensPerSecond = float32(val)
 case "ttft_ms":
 	metrics.TTFTMs = util.SafeInt32(int64(val))
 case "total_ms":
 	metrics.TotalMs = util.SafeInt32(int64(val))
 case "model_load_ms":
 	metrics.ModelLoadMs = util.SafeInt32(int64(val))
+case "prompt_ms":
+	metrics.PromptMs = util.SafeInt32(int64(val))
+case "generation_ms":
+	metrics.GenerationMs = util.SafeInt32(int64(val))
 }
```
`337-342`: **Typo in comment (“leasing” → “leading”)**

Minor typo; keep comments tidy.
Apply this diff:
```diff
-	// remove the leasing "$." or "$"
+	// remove the leading "$." or "$"
```
`344-349`: **Pre-compile regexes once (micro-optimisation, not on hot path)**

These regexes are recreated each call. While this runs at validation/start-up, you can still pre-compile them once to reduce allocations.
Apply this change outside the selected range:
```go
// top-level (package scope)
var (
	bracketIdxRe = regexp.MustCompile(`\[(\d+)\]`)
	bracketKeyRe = regexp.MustCompile(`\['([^']+)'\]`)
)
```

Then inside convertToGjsonPath:
```diff
-	s = regexp.MustCompile(`\[(\d+)\]`).ReplaceAllString(s, `.$1`)
+	s = bracketIdxRe.ReplaceAllString(s, `.$1`)
-	s = regexp.MustCompile(`\['([^']+)'\]`).ReplaceAllString(s, `.$1`)
+	s = bracketKeyRe.ReplaceAllString(s, `.$1`)
```
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (10)
- config/profiles/lmstudio.yaml (1 hunks)
- config/profiles/ollama.yaml (1 hunks)
- config/profiles/openai.yaml (1 hunks)
- config/profiles/vllm.yaml (1 hunks)
- docs/content/api-reference/overview.md (1 hunks)
- docs/content/concepts/provider-metrics.md (1 hunks)
- docs/content/development/architecture.md (2 hunks)
- internal/adapter/metrics/extractor.go (1 hunks)
- internal/adapter/proxy/olla/service.go (5 hunks)
- internal/integration/providers/vllm_integration_test.go (0 hunks)
💤 Files with no reviewable changes (1)
- internal/integration/providers/vllm_integration_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
- docs/content/api-reference/overview.md
🧰 Additional context used
📓 Path-based instructions (2)
**/*.{go,md}
📄 CodeRabbit Inference Engine (CLAUDE.md)
Use Australian English for comments and documentation, and write comments explaining why rather than what
Files:
- internal/adapter/proxy/olla/service.go
- internal/adapter/metrics/extractor.go
- docs/content/development/architecture.md
- docs/content/concepts/provider-metrics.md
{internal/app/handlers/handler_proxy.go,internal/adapter/proxy/sherpa/service.go,internal/adapter/proxy/olla/service.go}
📄 CodeRabbit Inference Engine (CLAUDE.md)
Ensure responses set the headers: X-Olla-Endpoint, X-Olla-Model, X-Olla-Backend-Type, X-Olla-Request-ID, and X-Olla-Response-Time (as a trailer)
Files:
internal/adapter/proxy/olla/service.go
🧠 Learnings (1)
📚 Learning: 2025-08-11T11:48:06.628Z
Learnt from: CR
PR: thushan/olla#0
File: CLAUDE.md:0-0
Timestamp: 2025-08-11T11:48:06.628Z
Learning: Applies to internal/app/handlers/handler_proxy.go : Proxy routes must be served under the `/olla/` prefix
Applied to files:
docs/content/development/architecture.md
🧬 Code Graph Analysis (2)
internal/adapter/proxy/olla/service.go (6)
- internal/core/ports/metrics.go (1)
  - MetricsExtractor (11-21)
- internal/adapter/proxy/core/base.go (1)
  - NewBaseProxyComponents (63-78)
- internal/adapter/proxy/core/metrics.go (1)
  - ExtractProviderMetrics (17-50)
- internal/adapter/proxy/sherpa/service.go (1)
  - Service (65-72)
- internal/logger/styled.go (1)
  - StyledLogger (12-35)
- internal/adapter/proxy/core/streaming.go (1)
  - AutoDetectStreamingMode (14-45)
internal/adapter/metrics/extractor.go (4)
- internal/core/domain/inference_profile.go (1)
  - InferenceProfile (8-48)
- internal/logger/styled.go (1)
  - StyledLogger (12-35)
- internal/core/domain/provider_metrics.go (2)
  - ProviderMetrics (5-26)
  - MetricsExtractionConfig (29-43)
- internal/util/safe.go (1)
  - SafeInt32 (24-32)
🪛 YAMLlint (1.37.1)
config/profiles/openai.yaml
[error] 116-116: trailing spaces
(trailing-spaces)
[error] 125-125: no new line character at the end of file
(new-line-at-end-of-file)
config/profiles/vllm.yaml
[error] 246-246: no new line character at the end of file
(new-line-at-end-of-file)
config/profiles/lmstudio.yaml
[error] 154-154: no new line character at the end of file
(new-line-at-end-of-file)
config/profiles/ollama.yaml
[error] 182-182: trailing spaces
(trailing-spaces)
[error] 196-196: trailing spaces
(trailing-spaces)
[error] 203-203: no new line character at the end of file
(new-line-at-end-of-file)
🪛 LanguageTool
docs/content/concepts/provider-metrics.md
[duplication] ~25-~25: Possible typo: you repeated a word.
Context: ...mance data ## Supported Providers ### Ollama Ollama provides comprehensive metrics in its r...
(ENGLISH_WORD_REPEAT_RULE)
[grammar] ~55-~55: This phrase is duplicated. You should probably use “LM Studio” only once.
Context: ... "total_tokens": 250 } } ### LM Studio LM Studio provides timing and token metrics: ...
(PHRASE_REPETITION)
[duplication] ~74-~74: Possible typo: you repeated a word.
Context: ... "predicted_ms": 4400 } } ``` ### vLLM vLLM exposes detailed metrics via Prometheus...
(ENGLISH_WORD_REPEAT_RULE)
[uncategorized] ~127-~127: Loose punctuation mark.
Context: ... ``` ### Key Components 1. paths: Maps field names to JSON path expressio...
(UNLIKELY_OPENING_PUNCTUATION)
[grammar] ~330-~330: Probably a preposition is missing after ‘attempts’.
Context: ... Enable debug logging to see extraction attempts 4. Ensure response format matches expected struct...
(ATD_VERBS_TO_COLLOCATION)
🪛 markdownlint-cli2 (0.17.2)
docs/content/concepts/provider-metrics.md
77-77: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
226-226: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Cross-Platform Build Validation (windows-latest)
🔇 Additional comments (4)
internal/adapter/proxy/olla/service.go (2)
`147-149`: **Wiring in MetricsExtractor is correct; confirm startup validation is invoked**

The constructor now accepts a MetricsExtractor. Please verify that metrics profiles are validated on startup (ports.MetricsExtractor.ValidateProfile) so misconfigurations fail fast rather than at request time.
Would you like me to scan the codebase for ValidateProfile calls and add a startup validation path if missing?
`167-168`: **Base components propagation looks good**

Passing the MetricsExtractor through to core.NewBaseProxyComponents is the right integration point. Nil-safe by design if extractor is absent.
config/profiles/ollama.yaml (1)
`186-187`: **Key name alignment: extractor maps “done”, config defines “is_complete”**

The extractor currently expects the field name “done” (maps to IsComplete), while this config defines “is_complete”. Without code changes, IsComplete won’t be populated. I’m proposing a code-side fix to support both keys (see extractor.go comment). If you prefer to change config instead, rename the key to “done”.
Would you like to keep “is_complete” for readability and map it in code (recommended), or rename this key to “done”?
internal/adapter/metrics/extractor.go (1)
`22-26`: **Extraction timeout and async design — LGTM**

The 10ms bound and async extraction prevent request-path blockage while still capturing metrics, with counters to track failures. Sensible trade-off.
This PR includes improved stats for LLM queries; we pull the stats out of each provider's payload instead of manually calculating them.
The addition of the Metrics Extraction Framework means we can (technically) have a best effort extraction from the final response chunk.
Supports:
`eval_duration`, `prompt_eval_duration`, `prompt_tokens` & `completion_tokens`. These are all wrapped into the provider profile.
For example,
`ollama.yaml` has this (right now):

It's got support for basic math operators, which should suffice. This therefore adds the following metrics:
Metrics Available:
Performance Impact:
Memory and CPU profiling helped and took time. Doing so also meant we tuned a few other things...
This means now, logs will also tell you at the request completed flag:

In the future we'll make use of these in a new load balancer. For now, just logging them.
Summary by CodeRabbit
New Features
Documentation
Refactor