Conversation
- fix broken tests :(
- add ollama profile update tests
- add tests to check for calcs
- hmm we need to add functionality to calc in the profiles
- add the ability to extract easily
- update olla and sherpa
- fix tests
- ahhhh more tests broken
- service_leak fails
- change the way proxy intercepts
- add lmstudio and openai
- use better json parsing logic
- Integrated health checking and routing improvements from main
- Added provider metrics extraction functionality
- Combined ProviderMetrics and RoutingDecision in RequestStats
- Resolved streaming context handling conflicts
- Created comprehensive provider-metrics.md documentation
- Updated API reference with new headers and metrics info
- Added provider metrics section to monitoring guide
- Updated concepts overview to include provider metrics
- Added provider metrics to mkdocs.yml navigation
…, adopting gjson and expr; also preallocates as many slices as we can for the stats work
Caution: Review failed. The pull request is closed.

Walkthrough

Adds a provider metrics extraction pipeline: new domain types and ports, a gjson/expr-based Extractor with a 10ms bound, profile-level metrics.extraction configs for multiple providers, proxy plumbing to capture the last streamed chunk and invoke extraction, startup wiring, logging/prometheus updates, docs, tests and related API signature adjustments.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    participant Client
    participant Proxy as Proxy Service
    participant Upstream as Provider
    participant Extractor as MetricsExtractor
    participant Stats as StatsCollector
    Client->>Proxy: Request (streaming)
    Proxy->>Upstream: Forward request
    Upstream-->>Proxy: Stream chunks
    loop Stream
        Proxy-->>Client: Write chunk
    end
    Proxy->>Extractor: ExtractFromChunk(lastChunk, providerType)
    Extractor-->>Proxy: ProviderMetrics (best effort)
    Proxy->>Stats: Record RequestStats (incl. ProviderMetrics)
    Proxy-->>Client: Response complete
```

```mermaid
sequenceDiagram
    autonumber
    participant App as App Startup
    participant ProfF as ProfileFactory
    participant Extr as MetricsExtractor
    participant ProxyF as Proxy Factory
    App->>ProfF: NewFactoryWithDefaults()
    ProfF-->>App: Profiles
    App->>Extr: NewExtractor(ProfF)
    loop For each profile
        App->>Extr: ValidateProfile(profile)
        Extr-->>App: ok/error
    end
    App->>ProxyF: NewFactory(stats, extractor, logger)
    ProxyF-->>App: Factory ready
```
Estimated code review effort: 🎯 5 (Critical) | ⏱️ ~90 minutes

Possibly related PRs
Actionable comments posted: 8
🔭 Outside diff range comments (1)
internal/adapter/proxy/olla/service.go (1)
593-606: Set X-Olla-Response-Time as an HTTP trailer after streaming completes.

Coding guidelines require X-Olla-Response-Time as a trailer. Ensure it’s declared before WriteHeader and assigned after streaming, when the latency is known.
Minimal changes:
- Declare the trailer before Line 576 (WriteHeader).
- Set the trailer value after Lines 611–613 (latency calculated).
```diff
-	// Copy response headers
+	// Copy response headers
 	for key, values := range resp.Header {
 		for _, value := range values {
 			w.Header().Add(key, value)
 		}
 	}
+
+	// Ensure trailer is declared before headers are written
+	w.Header().Add("Trailer", "X-Olla-Response-Time")
 	w.WriteHeader(resp.StatusCode)
@@
-	stats.EndTime = time.Now()
-	stats.Latency = duration.Milliseconds()
+	stats.EndTime = time.Now()
+	stats.Latency = duration.Milliseconds()
+	// Assign trailer value after latency is known
+	w.Header().Set("X-Olla-Response-Time", fmt.Sprintf("%dms", stats.Latency))
```

This keeps semantics correct for streaming and meets the documented contract.
🧹 Nitpick comments (61)
internal/adapter/unifier/default_unifier.go (1)
488-490: Pre‑allocate slices after de‑dup to exact sizes; tighten comment wording (Aussie English).

You de‑duplicate into maps and only then flatten into slices. Pre‑allocating here with rough heuristics risks under/over‑allocation and doesn’t help until the final append anyway. Initialise these slices as nil and pre‑allocate to the exact de‑duplicated sizes immediately before flattening. Also prefer Australian English and avoid abbreviations in comments.
Apply this change within the selected lines:
```diff
-	Aliases:         make([]domain.AliasEntry, 0, len(models)*2),   // thinking avg 2 aliases per model
-	SourceEndpoints: make([]domain.SourceEndpoint, 0, len(models)), // at least 1 source per model
-	Capabilities:    make([]string, 0, len(models)*5),              // should be ~5 capabilities per model
+	// Leave nil here; we pre‑allocate to the exact de‑duplicated size later to minimise reallocations.
+	Aliases:         nil,
+	SourceEndpoints: nil,
+	Capabilities:    nil,
```

And insert the following just before you start appending from the maps (above Line 531):
```go
// Pre‑allocate exactly to de‑duplicated sizes for better allocation behaviour.
merged.SourceEndpoints = make([]domain.SourceEndpoint, 0, len(sourceMap)) // optimisation: avoid growth during flatten
merged.Aliases = make([]domain.AliasEntry, 0, len(aliasMap))
merged.Capabilities = make([]string, 0, len(capSet))
```

Optional: If you want to keep capacity hints earlier, replace slang/US terms in comments (e.g. “avg”) with clearer Australian English (“average”) and consider extracting the heuristic factors as named constants to make the rationale explicit and maintainable.
internal/util/safe.go (1)
24-32: SafeInt32 clamp logic is correct; please add a doc comment and boundary tests.

Implementation is sound and avoids overflow/underflow. Add a doc comment (why it exists) and unit tests for MinInt32-1, MaxInt32+1, and in-range values.
Apply this diff to document the function:
```diff
-
+// SafeInt32 clamps a 64-bit value into the 32-bit signed range.
+// Why: prevents overflow when normalising provider metrics (e.g. token counts, durations) to int32 for lower memory use.
 func SafeInt32(value int64) int32 {
 	if value < math.MinInt32 {
 		return math.MinInt32
 	}
 	if value > math.MaxInt32 {
 		return math.MaxInt32
 	}
 	return int32(value)
 }
```

go.mod (1)
3-3: Go 1.24 target and new deps LGTM
- go directive set to 1.24 meets the toolchain guideline.
- expr and gjson additions align with the metrics extractor design.
Optionally consider adding a toolchain directive once your CI image is finalised (e.g. toolchain go1.24.x) to avoid local tool mismatches.
Also applies to: 7-15, 34-35
config/profiles/ollama.yaml (3)
178-182: Confirm extraction source for streaming; prefer last_chunk.

If metrics are extracted from the final streamed chunk, set the source accordingly (unless the extractor already routes to the last chunk internally). Using response_body may miss streamed-only fields.
Apply this diff if the extractor supports chunk-based extraction:
```diff
-      source: "response_body"
+      source: "last_chunk"
```
183-195: GJSON vs JSONPath: strip '$.' prefixes unless the extractor normalises them.

gjson expects paths like model, not $.model. If normalisation isn’t implemented in the extractor, these lookups will fail silently.
Apply this diff if normalisation is not already handled:
```diff
-        model: "$.model"
-        done: "$.done"
-        # Token counts
-        input_tokens: "$.prompt_eval_count"
-        output_tokens: "$.eval_count"
-        # Timing data (in nanoseconds from Ollama)
-        total_duration_ns: "$.total_duration"
-        load_duration_ns: "$.load_duration"
-        prompt_duration_ns: "$.prompt_eval_duration"
-        eval_duration_ns: "$.eval_duration"
+        model: "model"
+        done: "done"
+        # Token counts
+        input_tokens: "prompt_eval_count"
+        output_tokens: "eval_count"
+        # Timing data (in nanoseconds from Ollama)
+        total_duration_ns: "total_duration"
+        load_duration_ns: "load_duration"
+        prompt_duration_ns: "prompt_eval_duration"
+        eval_duration_ns: "eval_duration"
```
182-182: Fix YAML lint issues: trailing spaces and missing newline.

There are trailing spaces and a missing newline at EOF flagged by yamllint.
Apply these diffs:
The diff is whitespace-only (it removes the trailing spaces), so it does not render usefully here; trim the flagged lines and ensure the file ends with a newline after line 201.
Also applies to: 195-195, 201-201
internal/adapter/proxy/sherpa/ring_buffer.go (4)
5-7: Explain why, not what, in the comment.

Make the comment convey purpose (why we need this) to aid future maintainers.
Apply this diff:
```diff
-// SimpleRingBuffer implements a thread-safe ring buffer for storing the last N bytes
-// This is much simpler than hand-rolling complex buffer logic
+// SimpleRingBuffer stores only the last N bytes of a stream for cheap “last chunk”
+// access during metrics extraction in streaming paths. Thread-safe; optimised for
+// small capacities and many small writes.
```
13-19: Defend against negative capacities.

A negative capacity will panic at make(..., capacity). Clamp to zero to be defensive.
Apply this diff:
```diff
 func NewSimpleRingBuffer(capacity int) *SimpleRingBuffer {
+	if capacity < 0 {
+		capacity = 0
+	}
 	return &SimpleRingBuffer{
 		data:     make([]byte, 0, capacity),
 		capacity: capacity,
 	}
 }
```
21-55: Optional micro-optimisation: re-slice instead of copy in drop-old path.

Functionally correct. You can simplify the drop-old branch by re-slicing to the last keepBytes before appending.
Apply this diff:
```diff
-	keepBytes := rb.capacity - n
-	if keepBytes > 0 && len(rb.data) > keepBytes {
-		// Shift existing data
-		copy(rb.data[:keepBytes], rb.data[len(rb.data)-keepBytes:])
-		rb.data = rb.data[:keepBytes]
-	} else {
-		rb.data = rb.data[:0]
-	}
-	rb.data = append(rb.data, p...)
+	keepBytes := rb.capacity - n
+	if keepBytes > 0 && len(rb.data) > keepBytes {
+		// Keep the newest bytes, then append
+		rb.data = append(rb.data[:0], rb.data[len(rb.data)-keepBytes:]...)
+	} else {
+		rb.data = rb.data[:0]
+	}
+	rb.data = append(rb.data, p...)
```
57-69: Return empty slice instead of nil for Bytes().

Returning []byte{} avoids nil checks at call sites; semantics are equivalent for most uses.
Apply this diff:
```diff
 	if len(rb.data) == 0 {
-		return nil
+		return []byte{}
 	}
```

Additionally, we can add a compile-time assertion to make the io.Writer intent explicit (outside the selected range):
```go
// At file scope
// var _ io.Writer = (*SimpleRingBuffer)(nil)
```

config/profiles/lmstudio.yaml (2)
138-152: Empty ‘calculations’ block: make intent explicit or drop it.

An empty map is fine, but some config readers treat missing vs empty differently. If unused, either omit the block or make it explicit as an empty map for clarity.
Here’s a minimal tweak to make the intent explicit:
```diff
-  calculations:
-    # LM Studio doesn't provide timing data, so we can't calculate tokens/sec
+  calculations: {}
+  # LM Studio doesn't provide timing data, so we cannot calculate tokens/sec
```
152-152: Add a newline at EOF to satisfy linters and tooling.

Also avoid contractions in docs/config to keep the tone formal.
Apply:
```diff
-  # LM Studio doesn't provide timing data, so we can't calculate tokens/sec
+  # LM Studio doesn't provide timing data, so we cannot calculate tokens/sec
+
```

internal/adapter/proxy/core/base.go (1)
67-68: Consider functional options to avoid constructor churn.

As dependencies grow, positional parameters become brittle for call sites. A small options pattern keeps the API stable and self-documenting.
Example pattern (illustrative):
```go
type BaseProxyOption func(*BaseProxyComponents)

func WithMetricsExtractor(mx ports.MetricsExtractor) BaseProxyOption {
	return func(b *BaseProxyComponents) { b.MetricsExtractor = mx }
}

func NewBaseProxyComponents(ds ports.DiscoveryService, sel domain.EndpointSelector, sc ports.StatsCollector, logger logger.StyledLogger, opts ...BaseProxyOption) *BaseProxyComponents {
	b := &BaseProxyComponents{
		DiscoveryService: ds,
		Selector:         sel,
		StatsCollector:   sc,
		Logger:           logger,
		EventBus:         eventbus.New[ProxyEvent](),
	}
	for _, opt := range opts {
		opt(b)
	}
	return b
}
```

Call sites can then opt in: `NewBaseProxyComponents(ds, sel, sc, logger, WithMetricsExtractor(mx))`
Also applies to: 74-74
docs/content/concepts/overview.md (1)
75-84: Minor wording tweak.

“Provide” reads more formal than “give” in docs.
```diff
-Provider metrics give insights into model performance and resource usage.
+Provider metrics provide insights into model performance and resource usage.
```

docs/content/api-reference/overview.md (1)
91-94: Clarify routing header semantics with an example (operator-focused).

A short example helps operators sanity-check header values during incident response and makes intent obvious.
Apply this diff to add an example block:
````diff
 | `X-Olla-Response-Time` | Total processing time |
 | `X-Olla-Routing-Strategy` | Routing strategy used (when model routing is active) |
 | `X-Olla-Routing-Decision` | Routing decision made (routed/fallback/rejected) |
 | `X-Olla-Routing-Reason` | Human-readable reason for routing decision |
+
+Example:
+
+```
+X-Olla-Routing-Strategy: priority
+X-Olla-Routing-Decision: routed
+X-Olla-Routing-Reason: matched profile 'openai:fast' (latency)
+```
````

internal/adapter/discovery/repository.go (1)
58-60: Reword the comment to explain the rationale (why) rather than the action (what).

Small doc tweak improves clarity per our documentation guideline.
Apply this diff to both occurrences:
```diff
-// Pre-allocate with capacity of all endpoints (worst case)
+// Pre-allocate to avoid reallocations during filtering; worst case is when all endpoints match
```

Also applies to: 74-76
internal/adapter/proxy/benchmark_refactor_test.go (1)
71-71: Benchmarks updated to new NewService signatures; consider a variant with a no‑op extractor.

Change looks right. For completeness, consider adding a benchmark variant that passes a no‑op MetricsExtractor to measure overhead of the extraction path vs nil, ensuring both engines remain comparable under identical conditions.
Also applies to: 81-81
internal/adapter/proxy/proxy_olla_test.go (1)
32-32: Reduce duplication by centralising proxy construction in helpers.

Several tests still construct Olla services directly. Prefer routing these through the existing helpers (createTestOllaProxy/createTestOllaProxyWithConfig) or a new small helper that supplies a nil metrics extractor to keep all call sites consistent and reduce churn if the constructor changes again.
Apply this pattern to the modified sites (illustrated for one call; repeat similarly elsewhere):
```diff
-	proxy, err := olla.NewService(discovery, selector, config, collector, nil, createTestLogger())
+	proxy, err := createTestOllaProxyWithConfig(endpoints, config)
```

Alternatively, declare an explicitly typed nil MetricsExtractor variable to make the intent clear at the call site:
```diff
-	proxy, err := olla.NewService(discovery, selector, config, collector, nil, createTestLogger())
+	var noMetrics ports.MetricsExtractor
+	proxy, err := olla.NewService(discovery, selector, config, collector, noMetrics, createTestLogger())
```

Also applies to: 43-44, 61-61, 128-128, 211-211, 270-270, 456-456, 492-492, 545-545
internal/app/handlers/handler_proxy.go (2)
201-219: Log type consistency: emit tokens_per_sec as a numeric value (not a formatted string).

INFO logs currently render tokens_per_sec as a string, while DEBUG logs emit it as a numeric value. This inconsistency can complicate downstream parsing. Keep it numeric in both places.
Apply this diff:
```diff
-	if pm.TokensPerSecond > 0 {
-		infoFields = append(infoFields, "tokens_per_sec", fmt.Sprintf("%.1f", pm.TokensPerSecond))
-	}
+	if pm.TokensPerSecond > 0 {
+		infoFields = append(infoFields, "tokens_per_sec", pm.TokensPerSecond)
+	}
```
201-219: De-duplicate provider metrics field assembly via a small helper.

The metrics field-building logic is duplicated between INFO (short) and DEBUG (detailed) paths. Extract it into a helper to keep keys/types aligned and simplify future changes.
Apply these diffs to use a shared helper:
```diff
-	// Add provider metrics if available
-	if pr.stats.ProviderMetrics != nil {
-		pm := pr.stats.ProviderMetrics
-		if pm.InputTokens > 0 {
-			infoFields = append(infoFields, "input_tokens", pm.InputTokens)
-		}
-		if pm.OutputTokens > 0 {
-			infoFields = append(infoFields, "output_tokens", pm.OutputTokens)
-		}
-		if pm.TotalTokens > 0 {
-			infoFields = append(infoFields, "total_tokens", pm.TotalTokens)
-		}
-		if pm.TokensPerSecond > 0 {
-			infoFields = append(infoFields, "tokens_per_sec", pm.TokensPerSecond)
-		}
-		if pm.TTFTMs > 0 {
-			infoFields = append(infoFields, "ttft_ms", pm.TTFTMs)
-		}
-	}
+	// Add provider metrics if available
+	infoFields = appendProviderMetricsFields(infoFields, pr.stats.ProviderMetrics, false)
```

```diff
-	// Add provider metrics if available (detailed view)
-	if pr.stats.ProviderMetrics != nil {
-		pm := pr.stats.ProviderMetrics
-		if pm.InputTokens > 0 {
-			fields = append(fields, "input_tokens", pm.InputTokens)
-		}
-		if pm.OutputTokens > 0 {
-			fields = append(fields, "output_tokens", pm.OutputTokens)
-		}
-		if pm.TotalTokens > 0 {
-			fields = append(fields, "total_tokens", pm.TotalTokens)
-		}
-		if pm.TokensPerSecond > 0 {
-			fields = append(fields, "tokens_per_sec", pm.TokensPerSecond)
-		}
-		if pm.TTFTMs > 0 {
-			fields = append(fields, "ttft_ms", pm.TTFTMs)
-		}
-		if pm.PromptMs > 0 {
-			fields = append(fields, "prompt_ms", pm.PromptMs)
-		}
-		if pm.GenerationMs > 0 {
-			fields = append(fields, "generation_ms", pm.GenerationMs)
-		}
-		if pm.Model != "" {
-			fields = append(fields, "provider_model", pm.Model)
-		}
-		if pm.FinishReason != "" {
-			fields = append(fields, "finish_reason", pm.FinishReason)
-		}
-	}
+	// Add provider metrics if available (detailed view)
+	fields = appendProviderMetricsFields(fields, pr.stats.ProviderMetrics, true)
```

Add this helper in the same file (outside the edited ranges):
```go
// appendProviderMetricsFields appends provider metrics to the given field list.
// If detailed is true, include extended timings and metadata.
// We keep numeric types stable for easier downstream ingestion.
func appendProviderMetricsFields(fields []any, pm *domain.ProviderMetrics, detailed bool) []any {
	if pm == nil {
		return fields
	}
	if pm.InputTokens > 0 {
		fields = append(fields, "input_tokens", pm.InputTokens)
	}
	if pm.OutputTokens > 0 {
		fields = append(fields, "output_tokens", pm.OutputTokens)
	}
	if pm.TotalTokens > 0 {
		fields = append(fields, "total_tokens", pm.TotalTokens)
	}
	if pm.TokensPerSecond > 0 {
		fields = append(fields, "tokens_per_sec", pm.TokensPerSecond)
	}
	if pm.TTFTMs > 0 {
		fields = append(fields, "ttft_ms", pm.TTFTMs)
	}
	if detailed {
		if pm.PromptMs > 0 {
			fields = append(fields, "prompt_ms", pm.PromptMs)
		}
		if pm.GenerationMs > 0 {
			fields = append(fields, "generation_ms", pm.GenerationMs)
		}
		if pm.Model != "" {
			fields = append(fields, "provider_model", pm.Model)
		}
		if pm.FinishReason != "" {
			fields = append(fields, "finish_reason", pm.FinishReason)
		}
	}
	return fields
}
```

Also applies to: 250-281
internal/adapter/proxy/sherpa/service_retry.go (2)
149-157: Best-effort metrics extraction currently skipped on error paths.

The PR objective calls out best-effort extraction from the final streamed chunk. Here, provider metrics are only extracted on the success path. If streaming fails (but we still captured lastChunk), we return early and never attempt extraction.
Recommend extracting before error handling (so both success and error paths benefit) and removing the later call to avoid duplication.
Apply this diff:
```diff
-	bytesWritten, lastChunk, streamErr := s.streamResponseWithTimeout(r.Context(), upstreamCtx, w, resp, *buffer, rlog)
+	bytesWritten, lastChunk, streamErr := s.streamResponseWithTimeout(r.Context(), upstreamCtx, w, resp, *buffer, rlog)
 	stats.StreamingMs = time.Since(streamStart).Milliseconds()
 	stats.TotalBytes = bytesWritten
-	if streamErr != nil && !errors.Is(streamErr, context.Canceled) {
-		rlog.Error("streaming failed", "error", streamErr)
-		s.RecordFailure(ctx, endpoint, time.Since(stats.StartTime), streamErr)
-		return common.MakeUserFriendlyError(streamErr, time.Since(stats.StartTime), "streaming", s.configuration.GetResponseTimeout())
-	}
+	// Extract metrics from the last chunk even if the stream errored.
+	// Why: We want to surface provider timing/token metrics whenever possible to aid debugging and autoscaling decisions.
+	core.ExtractProviderMetrics(ctx, s.MetricsExtractor, lastChunk, endpoint, stats, rlog, "Sherpa")
+
+	if streamErr != nil && !errors.Is(streamErr, context.Canceled) {
+		rlog.Error("streaming failed", "error", streamErr)
+		s.RecordFailure(ctx, endpoint, time.Since(stats.StartTime), streamErr)
+		return common.MakeUserFriendlyError(streamErr, time.Since(stats.StartTime), "streaming", s.configuration.GetResponseTimeout())
+	}
@@
-	// Extract metrics from response if available
-	core.ExtractProviderMetrics(ctx, s.MetricsExtractor, lastChunk, endpoint, stats, rlog, "Sherpa")
+	// (metrics extracted above for both success and error paths)
```

Also applies to: 180-182
184-206: Preserve debug metrics logging even when ctx logger is absent.

At completion you log metrics only if a context logger is present. Elsewhere in this function you fall back to rlog. For consistency (and to avoid silent loss of useful diagnostics), add the same fallback here.
```diff
 	if ctxLogger != nil {
 		logFields = append(logFields, "request_id", middleware.GetRequestID(ctx))
 		ctxLogger.Debug("Sherpa proxy metrics", logFields...)
-	}
+	} else {
+		// Why: Keep completion diagnostics available even when no context logger is attached.
+		rlog.Debug("Sherpa proxy metrics", logFields...)
+	}
```

internal/adapter/proxy/sherpa/service_retry_test.go (1)
123-124: Optionally exercise metrics wiring in tests.

Consider adding a small fake MetricsExtractor in one test to assert it’s invoked (even when extraction yields nil). This would catch regressions in constructor wiring without coupling to real provider payloads.
If you’d like, I can sketch a minimal fake extractor and an assertion pattern that doesn’t depend on specific payloads.
internal/adapter/proxy/sherpa/service_streaming.go (1)
39-46: Parameterise ring buffer capacity rather than hard-coding 8KB.

Hard-coding 8192 works, but tying it to configuration (with a sensible default and an upper bound) improves flexibility across deployments and avoids magic numbers.
Apply this diff and helper:
```diff
 func (s *Service) streamResponseWithTimeout(clientCtx, upstreamCtx context.Context, w http.ResponseWriter, resp *http.Response, buffer []byte, rlog logger.StyledLogger) (int, []byte, error) {
 	state := &streamState{
-		lastReadTime:    time.Now(),
-		contentType:     resp.Header.Get(constants.HeaderContentType),
-		isStreaming:     core.AutoDetectStreamingMode(clientCtx, resp, s.configuration.GetProxyProfile()),
-		lastChunkBuffer: NewSimpleRingBuffer(8192), // Keep last 8KB for metrics
+		lastReadTime:    time.Now(),
+		contentType:     resp.Header.Get(constants.HeaderContentType),
+		isStreaming:     core.AutoDetectStreamingMode(clientCtx, resp, s.configuration.GetProxyProfile()),
+		lastChunkBuffer: NewSimpleRingBuffer(s.lastChunkCapacity()), // Why: capacity aligned with config to balance fidelity and memory
 	}
```
```go
// lastChunkCapacity determines how many recent bytes we retain for metric extraction.
// Why: Big enough to capture provider tail JSON (incl. stats), small enough to be cheap across many concurrent streams.
func (s *Service) lastChunkCapacity() int {
	capacity := s.configuration.GetStreamBufferSize()
	if capacity <= 0 {
		capacity = 8192 // default
	}
	// Clamp to a reasonable upper bound to prevent unbounded memory usage on misconfiguration.
	if capacity > 64*1024 {
		capacity = 64 * 1024
	}
	return capacity
}
```

internal/adapter/proxy/olla/service_retry.go (3)
188-190: Prefer version.Name over string literal "Olla".

Avoid hard-coding the engine name; use the centralised constant for consistency.
Apply:
```diff
-	core.ExtractProviderMetrics(ctx, s.MetricsExtractor, lastChunk, endpoint, stats, rlog, "Olla")
+	core.ExtractProviderMetrics(ctx, s.MetricsExtractor, lastChunk, endpoint, stats, rlog, version.Name)
```

Add import outside this hunk:
```go
import (
	// ...
	"github.com/thushan/olla/internal/version"
)
```
192-205: Always emit completion metrics even without a context logger.

At present, completion metrics are only logged when ctxLogger is present; otherwise nothing is emitted. For parity with earlier logging branches, fall back to rlog.
```diff
 	if ctxLogger != nil {
 		logFields = append(logFields, "request_id", middleware.GetRequestID(ctx))
 		ctxLogger.Debug("Olla proxy metrics", logFields...)
+	} else {
+		rlog.Debug("Olla proxy metrics", logFields...)
 	}
```

Also applies to: 211-214
183-187: Remove duplicate assignment to TotalBytes.

stats.TotalBytes is set both immediately after streaming and again after stats.EndTime; the second assignment is redundant.
```diff
 	stats.EndTime = time.Now()
 	stats.Latency = stats.EndTime.Sub(stats.StartTime).Milliseconds()
-	stats.TotalBytes = bytesWritten
```

internal/core/domain/profile_config.go (1)
10-15: Tighten the comment to explain the “why”, using Australian English.

The comment currently explains what. Prefer explaining why we expose metrics in profiles (e.g. to standardise extraction and feed scheduling/logging) and use Australian English.
```diff
-	// Metrics extraction configuration for provider responses
+	// Metrics extraction configuration for provider responses – enables us to surface
+	// provider timings and token counts without bespoke code per platform, improving
+	// observability and informing future scheduling/load-balancing behaviour.
```

config/profiles/openai.yaml (2)
116-116: Trim trailing space to satisfy yamllint.

There’s a trailing space after output_tokens.
```diff
-        output_tokens: "$.usage.completion_tokens" 
+        output_tokens: "$.usage.completion_tokens"
```
122-122: Add a newline at EOF.

YAML linters require a newline at the end of the file.
No code diff provided as it affects the final EOF marker; please ensure the file ends with a newline.

internal/adapter/metrics/extractor_test.go (2)
internal/adapter/metrics/extractor_test.go (2)
22-27: Return an error for unknown profiles to fail fast in testsReturning (nil, nil) can mask mis-typed provider names and make tests pass vacuously. Prefer returning a descriptive error.
func (f *mockProfileFactory) GetProfile(name string) (domain.InferenceProfile, error) { if p, ok := f.profiles[name]; ok { return p, nil } - return nil, nil + return nil, fmt.Errorf("mock profile not found: %s", name) }Add import:
import "fmt"
91-99: Run tests in parallel to speed up the suiteThese unit tests are independent; running them in parallel reduces wall time.
func TestExtractor_ExtractMetrics_OllamaResponse(t *testing.T) { + t.Parallel() @@ func TestExtractor_ExtractMetrics_InvalidJSON(t *testing.T) { + t.Parallel() @@ func TestExtractor_ExtractFromChunk(t *testing.T) { + t.Parallel() @@ func TestExtractor_ExtractFromHeaders(t *testing.T) { + t.Parallel() @@ func TestExtractor_ValidateProfile_InvalidPath(t *testing.T) { + t.Parallel()Also applies to: 170-173, 205-213, 240-247, 279-283
internal/core/ports/metrics.go (1)
10-21: Enrich interface docs to capture rationaleA short rationale helps future maintainers understand why this interface exists (standardising metrics extraction across providers, supporting streaming and full responses).
-// MetricsExtractor extracts provider-specific metrics from responses +// MetricsExtractor standardises provider-specific metrics extraction across +// both full responses and streamed chunks. This keeps proxy code lean and +// enables consistent logging/observability (and future scheduling heuristics) +// without per-provider plumbing. type MetricsExtractor interface { // ValidateProfile validates metrics configuration at startup ValidateProfile(profile domain.InferenceProfile) error @@ // ExtractFromChunk extracts metrics from a streaming chunk (final chunk for streaming responses) ExtractFromChunk(ctx context.Context, chunk []byte, providerName string) *domain.ProviderMetrics }docs/content/concepts/provider-metrics.md (7)
77-82: Add a language to the fenced code block (markdownlint MD040).Specify a language for the vLLM metrics snippet to satisfy linters and improve rendering.
-``` +```text vllm:prompt_tokens_total{model="llama3.2"} 50 vllm:generation_tokens_total{model="llama3.2"} 200 vllm:time_to_first_token_seconds{model="llama3.2"} 0.1 vllm:time_per_output_token_seconds{model="llama3.2"} 0.022--- `223-233`: **Add a language to the log snippet (markdownlint MD040).** Mark the log block as plain text for consistency with other snippets. ```diff -``` +```text 2025/08/15 10:00:00 DEBUG Sherpa proxy metrics endpoint=local-ollama latency_ms=5000 provider_total_ms=5000 provider_prompt_eval_ms=100 provider_eval_ms=4400 provider_prompt_tokens=50 provider_completion_tokens=200 provider_tokens_per_second=45.45--- `55-56`: **Remove duplicated provider names in headings and lead-in text.** Headings already name the provider; repeating in the first sentence is redundant and flagged by LanguageTool. Tighten the phrasing. ```diff -### LM Studio -LM Studio provides timing and token metrics: +### LM Studio +Provides timing and token metrics:-### vLLM -vLLM exposes detailed metrics via Prometheus endpoint: +### vLLM +Exposes detailed metrics via a Prometheus endpoint:Also applies to: 74-76
88-102: Align metric names with exported JSON field names.Code exposes tokens_per_sec and ttft_ms (per ProviderMetrics JSON tags). Using different names in docs causes confusion for operators and tooling.
-| `tokens_per_second` | Generation speed | tokens/sec | -| `time_to_first_token_ms` | Latency to first token | milliseconds | +| `tokens_per_sec` | Generation speed | tokens/sec | +| `ttft_ms` | Latency to first token (time to first token) | milliseconds |
220-223: Heading misleads: these are logs, not response headers.The example shows application logs, not HTTP response headers. Rename to avoid operational confusion.
-### Via Response Headers -Provider metrics are included in detailed debug logs when available: +### Via Logs +Provider metrics are included in detailed debug logs when available:
299-313: Grammar: add missing preposition and minor polish.Light edits make the guidance clearer and fix the “attempts” phrasing.
- condition: tokens_per_second < 10 + condition: tokens_per_sec < 10-3. Enable debug logging to see extraction attempts +3. Enable debug logging to see extraction attempts in logs
213-216: Cross-check derived metrics exposure.Docs promise time_per_token_ms; ProviderMetrics currently has no corresponding field, so this value won’t appear in structured logs or status unless mapped into GenerationMs or custom outputs. Consider either documenting it as “provider-specific derived metric (not universally exported)” or extending ProviderMetrics.
internal/adapter/proxy/core/metrics.go (3)
26-31: Defensive: guard against a nil stats pointer.While callers should pass a non-nil stats, a cheap check avoids a panic in edge cases and keeps this helper safe to reuse.
// Early return if we can't extract metrics - most common case first - if extractor == nil || len(lastChunk) == 0 || endpoint == nil || endpoint.Type == "" { + if stats == nil || extractor == nil || len(lastChunk) == 0 || endpoint == nil || endpoint.Type == "" { // Skip logging in hot path unless needed for debugging return }
44-49: Optional: include model or provider in debug to aid triage.Adding model/provider type helps correlate metrics without enabling verbose logs elsewhere.
- rlog.Debug("Metrics extracted", - "engine", engineName, - "tokens", pm.TotalTokens) + rlog.Debug("Metrics extracted", + "engine", engineName, + "provider", endpoint.Type, + "model", pm.Model, + "tokens", pm.TotalTokens)
74-88: Append ‘total_ms’ and ‘model_load_ms’ to logs for completeness.

Docs and operator expectations include total_ms and model_load_ms. Exposing them keeps logging consistent with the profile-derived metrics set.
```diff
 if pm.GenerationMs > 0 {
 	logFields = append(logFields, "generation_ms", pm.GenerationMs)
 }
+if pm.TotalMs > 0 {
+	logFields = append(logFields, "total_ms", pm.TotalMs)
+}
+if pm.ModelLoadMs > 0 {
+	logFields = append(logFields, "model_load_ms", pm.ModelLoadMs)
+}
 if pm.Model != "" {
 	logFields = append(logFields, "provider_model", pm.Model)
 }
```

internal/adapter/proxy/factory.go (1)
32-38: Optionally make transport defaults configurable rather than hard-coded.

MaxIdleConns, IdleConnTimeout, and MaxConnsPerHost are currently fixed here. Surfacing them via ProxyConfiguration (or profile) improves operability across different deployments.
I can draft the minimal config plumb-through if you want to make these tunables in this PR or a follow-up.
Also applies to: 41-52, 55-69, 91-93
docs/mkdocs.yml (1)
127-133: Remove trailing spaces to satisfy yamllint and avoid CI flakiness.

There are trailing spaces reported by yamllint on these lines. Safe to trim.
Apply this diff:
```diff
-      - Health Checking: concepts/health-checking.md 
-      - Proxy Engines: 
+      - Health Checking: concepts/health-checking.md
+      - Proxy Engines:
         - Overview: concepts/proxy-engines.md
-      - Proxy Profiles: concepts/proxy-profiles.md 
-      - Profile System: 
+      - Proxy Profiles: concepts/proxy-profiles.md
+      - Profile System:
         - Overview: concepts/profile-system.md
         - Provider Metrics: concepts/provider-metrics.md
```

internal/app/services/proxy.go (3)
92-126: Validate via interface; avoid downcasting to concrete extractor.

You already depend on the MetricsExtractor interface which exposes ValidateProfile. The downcast to *metrics.Extractor narrows the implementation unnecessarily and will silently skip validation for alternative implementations.
Apply this diff:
```diff
-	// Pre-validate profiles for metrics extraction
-	for _, profileName := range profileFactory.GetAvailableProfiles() {
-		if inferenceProfile, profileErr := profileFactory.GetProfile(profileName); profileErr == nil {
-			if extractor, ok := metricsExtractor.(*metrics.Extractor); ok {
-				if validationErr := extractor.ValidateProfile(inferenceProfile); validationErr != nil {
-					s.logger.Debug("Profile metrics validation failed",
-						"profile", profileName, "error", validationErr)
-				}
-			}
-		}
-	}
+	// Pre-validate profiles for metrics extraction
+	for _, profileName := range profileFactory.GetAvailableProfiles() {
+		if inferenceProfile, profileErr := profileFactory.GetProfile(profileName); profileErr == nil {
+			if validationErr := metricsExtractor.ValidateProfile(inferenceProfile); validationErr != nil {
+				s.logger.Debug("Profile metrics validation failed",
+					"profile", profileName, "error", validationErr)
+			}
+		}
+	}
```
100-102: Consider dropping log level to Debug to reduce startup noise.

Listing all available profiles on every start can be noisy in production. Debug level is typically sufficient.
```diff
-	s.logger.Info("Profile factory created successfully",
-		"profiles", profileFactory.GetAvailableProfiles())
+	s.logger.Debug("Profile factory created successfully",
+		"profiles", profileFactory.GetAvailableProfiles())
```
129-129: Use the field for consistency when constructing the proxy factory.

Functionally equivalent, but using the struct field avoids confusion with shadowed locals.
```diff
-	proxyFactory := proxy.NewFactory(s.statsCollector, metricsExtractor, s.logger)
+	proxyFactory := proxy.NewFactory(s.statsCollector, s.metricsExtractor, s.logger)
```

internal/adapter/metrics/benchmark_test.go (1)
49-52: Potential integer division in calculations; prefer float constants.

Depending on the expression engine, dividing integers may truncate, yielding 0 for small durations. Using float constants avoids truncation and better reflects tokens/sec and ms.
```diff
-	"tokens_per_second": "output_tokens / (eval_duration_ns / 1000000000)",
-	"ttft_ms":           "prompt_duration_ns / 1000000",
+	"tokens_per_second": "output_tokens / (eval_duration_ns / 1e9)",
+	"ttft_ms":           "prompt_duration_ns / 1e6",
```

internal/adapter/proxy/olla/streaming_helpers.go (2)
24-39: Include richer metadata when publishing client disconnect events.

Adding request/endpoint context significantly improves operability when diagnosing disconnects in multi-tenant environments.
For example, if available in scope:
```diff
-	s.PublishEvent(core.ProxyEvent{
-		Type: core.EventTypeClientDisconnect,
-		Metadata: core.ProxyEventMetadata{
-			BytesSent: int64(state.totalBytes),
-		},
-	})
+	s.PublishEvent(core.ProxyEvent{
+		Type: core.EventTypeClientDisconnect,
+		Metadata: core.ProxyEventMetadata{
+			BytesSent: int64(state.totalBytes),
+		},
+		// If accessible here:
+		// RequestID: stats.RequestID,
+		// Endpoint:  endpoint.URLString,
+	})
```
78-110: Handle client write errors as disconnects where appropriate.

Write errors like EPIPE/ECONNRESET typically indicate the client has gone. Treating them as disconnects avoids surfacing benign errors to callers.
```diff
-	if writeErr != nil {
-		rlog.Debug("write error during streaming", "error", writeErr, "bytes_written", state.totalBytes)
-		return writeErr
-	}
+	if writeErr != nil {
+		rlog.Debug("write error during streaming", "error", writeErr, "bytes_written", state.totalBytes)
+		// Treat as client disconnect and let the loop unwind gracefully
+		s.handleClientDisconnect(state, rlog)
+		return context.Canceled
+	}
```

Why: reduces noisy error propagation for expected client behaviour (navigating away, cancelling requests).
docs/content/configuration/practices/monitoring.md (4)
228-243: Histograms are fine, but ensure you observe raw per-request values (not rolled-up averages).

Observing aggregated averages from /internal/status into a Histogram skews the distribution. Ideally, observe per-request values at the point of completion. If you keep this example, consider adding a note that these are coarse-grained and meant for directional insights.

Also, you’ve defined prompt_tokens and completion_tokens but don’t observe them yet—see the next comment for a minimal extension.
253-261: Also export prompt_tokens and completion_tokens from provider stats.

To align with the docs and make dashboards useful, observe prompt and completion tokens when available. Example tweak:
```diff
 for endpoint_name, stats in data.get('proxy', {}).get('endpoints', {}).items():
     if 'avg_tokens_per_second' in stats:
         tokens_per_second.labels(
             endpoint=endpoint_name,
             model=stats.get('primary_model', 'unknown')
         ).observe(stats['avg_tokens_per_second'])
+    # If available, observe token usage as well.
+    if 'prompt_tokens' in stats:
+        prompt_tokens.labels(
+            endpoint=endpoint_name,
+            model=stats.get('primary_model', 'unknown')
+        ).observe(stats['prompt_tokens'])
+    if 'completion_tokens' in stats:
+        completion_tokens.labels(
+            endpoint=endpoint_name,
+            model=stats.get('primary_model', 'unknown')
+        ).observe(stats['completion_tokens'])
```

Caveat: if these are already aggregated across requests, consider using Gauges instead. Histograms yield most value on per-request observations.
279-281: Great addition of dashboard panels. Minor note on units.

Consider noting explicit units in panel descriptions (e.g., “tokens per second (tokens/s)”) and that token histograms represent counts per request. This reduces ambiguity for new users.
282-304: “Response-Time” is a trailer in streaming; call this out in docs.

X-Olla-Response-Time is sent as an HTTP trailer for streaming responses. Some clients don’t surface trailers by default. Add a short note with a curl example (e.g., curl --no-buffer --raw --trailers …) to avoid confusion.
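For reference, a self-contained Go sketch of the trailer round-trip (server declares the trailer before the status line, client only sees it after draining the body; the header name and value are taken from these docs, everything else is illustrative):

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
)

// startServer streams a body and sets X-Olla-Response-Time as an HTTP trailer.
func startServer() *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		// Trailers must be declared before the status line is written.
		w.Header().Set("Trailer", "X-Olla-Response-Time")
		w.WriteHeader(http.StatusOK)
		w.Write([]byte("streamed chunk"))
		// The trailer value is set after the body, once total time is known.
		w.Header().Set("X-Olla-Response-Time", "123ms")
	}))
}

// fetchTrailer returns the trailer value; resp.Trailer is only populated
// after the body has been read to EOF.
func fetchTrailer(url string) string {
	resp, err := http.Get(url)
	if err != nil {
		return ""
	}
	defer resp.Body.Close()
	io.Copy(io.Discard, resp.Body)
	return resp.Trailer.Get("X-Olla-Response-Time")
}

func main() {
	srv := startServer()
	defer srv.Close()
	fmt.Println(fetchTrailer(srv.URL))
}
```

This mirrors why curl needs `--raw`/trailer-aware flags: the value simply does not exist until the stream ends.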
internal/adapter/metrics/integration_test.go (2)
100-168: Performance threshold might be flaky on some CI runners.

A hard <50µs/op bound can intermittently fail on noisy or virtualised CI. Consider either:
- Relaxing the bound (e.g., <100µs), or
- Measuring a percentile across batches and asserting on p95/p99, or
- Gating with an env flag for CI.
```diff
-	assert.Less(t, perOp, 50*time.Microsecond,
-		"Extraction took %v per operation, expected < 50µs", perOp)
+	assert.Less(t, perOp, 100*time.Microsecond,
+		"Extraction took %v per operation, expected < 100µs", perOp)
```
216-225: Use strings.Builder to avoid quadratic concatenation in tests.

Not critical at size=1000, but easy to make linear and tidy.
```diff
 func generateLargeArray(size int) string {
-	result := ""
-	for i := 0; i < size; i++ {
-		if i > 0 {
-			result += ","
-		}
-		result += "1"
-	}
-	return result
+	var b strings.Builder
+	b.Grow(size * 2) // rough capacity
+	for i := 0; i < size; i++ {
+		if i > 0 {
+			b.WriteByte(',')
+		}
+		b.WriteByte('1')
+	}
+	return b.String()
 }
```

Note: add import "strings".

internal/core/domain/provider_metrics.go (1)
internal/core/domain/provider_metrics.go (1)
22-25: Align JSON tag with docs: tokens_per_second vs tokens_per_sec.

Docs and exporter examples refer to tokens_per_second. The struct currently exposes tokens_per_sec, which can cause confusion and mismatches in JSON consumers.
```diff
-	TokensPerSecond float32 `json:"tokens_per_sec,omitempty"`
+	TokensPerSecond float32 `json:"tokens_per_second,omitempty"`
```

If there are existing consumers relying on tokens_per_sec, consider dual tags via a custom marshaler or a transitional compatibility layer at the API boundary.
Would you like a follow-up PR to add a backward-compatible marshaler for a deprecation window?
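For illustration, a minimal sketch of such a deprecation-window marshaler — the wrapper type and helper are hypothetical, not part of the codebase; it simply emits both keys for the same value:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// providerMetricsCompat is a hypothetical wrapper showing how a custom
// MarshalJSON can emit both the legacy and the documented key during a
// deprecation window.
type providerMetricsCompat struct {
	TokensPerSecond float32
}

func (p providerMetricsCompat) MarshalJSON() ([]byte, error) {
	// map keys are marshalled in sorted order by encoding/json
	return json.Marshal(map[string]float32{
		"tokens_per_sec":    p.TokensPerSecond, // legacy key
		"tokens_per_second": p.TokensPerSecond, // documented key
	})
}

func marshalCompat(v float32) string {
	b, _ := json.Marshal(providerMetricsCompat{TokensPerSecond: v})
	return string(b)
}

func main() {
	fmt.Println(marshalCompat(45.5))
	// → {"tokens_per_sec":45.5,"tokens_per_second":45.5}
}
```

Consumers can then migrate to tokens_per_second at their own pace before the legacy key is dropped.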
internal/adapter/metrics/extractor.go (3)
121-151: Timeout enforcement still runs the full extraction; consider avoiding excess goroutines.

The goroutine enforces a caller timeout but cannot pre-empt doExtract. Under persistent timeouts, goroutines may pile up briefly until they complete. Given extraction is fast, this is likely OK, but you could:
- Run doExtract inline and rely on its micro-path optimisations, or
- Use a worker pool if you later add heavier calculations.
Not urgent; just calling out the trade-off.
225-254: Mapping covers key fields; consider explicit support for generation_ms/prompt_ms calculations.

If profiles start emitting calculated ms fields directly, map them explicitly to avoid relying solely on expr-derived paths. This keeps the mapping robust across providers.
315-324: Header extraction is a stub—either wire it or remove config to avoid surprises.

If profiles define headers in config, silently ignoring them can confuse operators. Happy to implement a minimal mapping (e.g., X-Provider-Input-Tokens → InputTokens) if you share the expected header names.
Would you like me to add a small, documented mapping for common providers and make it opt-in per profile?
### Provider Metrics (Debug Logs)

When available, provider-specific performance metrics are extracted from responses and included in debug logs:

| Metric | Description | Providers |
|--------|-------------|-----------|
| `provider_total_ms` | Total processing time | Ollama, LM Studio |
| `provider_prompt_tokens` | Tokens in prompt | All |
| `provider_completion_tokens` | Tokens generated | All |
| `provider_tokens_per_second` | Generation speed | Ollama, LM Studio |
| `provider_model` | Actual model used | All |

See [Provider Metrics](../concepts/provider-metrics.md) for detailed information.
💡 Verification agent
🧩 Analysis chain
Add units and verify metric key naming matches logs/struct tags.
Including units avoids ambiguity, and aligning names with runtime keys prevents confusion (note: ProviderMetrics JSON tag uses tokens_per_sec).
- Suggest adding units to descriptions:

```diff
-| `provider_total_ms` | Total processing time | Ollama, LM Studio |
-| `provider_prompt_tokens` | Tokens in prompt | All |
-| `provider_completion_tokens` | Tokens generated | All |
-| `provider_tokens_per_second` | Generation speed | Ollama, LM Studio |
+| `provider_total_ms` | Total processing time (ms) | Ollama, LM Studio |
+| `provider_prompt_tokens` | Tokens in prompt (count) | All |
+| `provider_completion_tokens` | Tokens generated (count) | All |
+| `provider_tokens_per_second` | Generation speed (tokens/s) | Ollama, LM Studio |
 | `provider_model` | Actual model used | All |
```

- Please also verify naming consistency between docs, logs, and JSON tags:
🏁 Script executed:
```shell
#!/bin/bash
# Check for potential mismatches between documented and runtime keys.
rg -nP -C2 '(tokens_per_second|tokens_per_sec|provider_tokens_per_second|provider_tokens_per_sec)'
```

Length of output: 14061
Align provider metric names with runtime keys and add units
Docs currently mix "provider_tokens_per_second" / "tokens_per_second" while runtime/logs/JSON use "tokens_per_sec" (and the extractor expects "tokens_per_second"). Add explicit units to the docs table and unify or document the mapping between external keys and internal JSON/log keys.
Files that need attention (examples found):
- docs/content/api-reference/overview.md — update table (add units + confirm key name)
- docs/content/concepts/provider-metrics.md — examples and calculations (tokens_per_second vs tokens_per_sec)
- docs/content/configuration/practices/monitoring.md — metric names used for Prometheus and examples
- internal/core/domain/provider_metrics.go — JSON tag: `` TokensPerSecond float32 `json:"tokens_per_sec,omitempty"` ``
- internal/adapter/metrics/extractor.go — handles `"tokens_per_second"` and maps to TokensPerSecond
- internal/adapter/proxy/core/metrics.go and internal/app/handlers/handler_proxy.go — log fields use `"tokens_per_sec"`
- config/profiles/*.yaml (ollama.yaml, openai.yaml) — use `tokens_per_second` in calculations
- test/scripts/streaming/test-streaming-responses.py — uses `tokens_per_second`
Suggested minimal doc change (add units):
```diff
-| `provider_total_ms` | Total processing time | Ollama, LM Studio |
-| `provider_prompt_tokens` | Tokens in prompt | All |
-| `provider_completion_tokens` | Tokens generated | All |
-| `provider_tokens_per_second` | Generation speed | Ollama, LM Studio |
+| `provider_total_ms` | Total processing time (ms) | Ollama, LM Studio |
+| `provider_prompt_tokens` | Tokens in prompt (count) | All |
+| `provider_completion_tokens` | Tokens generated (count) | All |
+| `provider_tokens_per_second` | Generation speed (tokens/s) | Ollama, LM Studio |
 | `provider_model` | Actual model used | All |
```

Recommendation: pick a single canonical external metric name (e.g. "provider_tokens_per_second" or match the runtime/log key "tokens_per_sec"), update docs and examples to match, and either:
- update the extractor/logging to use the chosen external name, or
- explicitly document the mapping (external → internal JSON tag) and accept both names in parsing.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```markdown
### Provider Metrics (Debug Logs)

When available, provider-specific performance metrics are extracted from responses and included in debug logs:

| Metric | Description | Providers |
|--------|-------------|-----------|
| `provider_total_ms` | Total processing time (ms) | Ollama, LM Studio |
| `provider_prompt_tokens` | Tokens in prompt (count) | All |
| `provider_completion_tokens` | Tokens generated (count) | All |
| `provider_tokens_per_second` | Generation speed (tokens/s) | Ollama, LM Studio |
| `provider_model` | Actual model used | All |

See [Provider Metrics](../concepts/provider-metrics.md) for detailed information.
```
```go
func createTestProfileFactory() ProfileFactory {
	return &mockProfileFactory{
		profiles: map[string]*mockProfile{
			"ollama": {
				name: "ollama",
				config: &domain.ProfileConfig{
					Metrics: domain.MetricsConfig{
						Extraction: domain.MetricsExtractionConfig{
							Enabled: true,
							Source:  "response_body",
							Format:  "json",
							Paths: map[string]string{
								"model":              "$.model",
								"done":               "$.done",
								"input_tokens":       "$.prompt_eval_count",
								"output_tokens":      "$.eval_count",
								"prompt_duration_ns": "$.prompt_eval_duration",
								"eval_duration_ns":   "$.eval_duration",
								"total_duration_ns":  "$.total_duration",
								"load_duration_ns":   "$.load_duration",
							},
							Calculations: map[string]string{
								"tokens_per_second": "output_tokens / (eval_duration_ns / 1000000000)",
								"ttft_ms":           "prompt_duration_ns / 1000000",
							},
						},
					},
				},
			},
		},
	}
}
```
💡 Verification agent
🧩 Analysis chain
Compilation blocker: missing mock types (mockProfileFactory/mockProfile) referenced here.
This benchmark won’t compile as-is because mockProfileFactory and mockProfile are not defined in this file or imported. Define minimal mocks that satisfy metrics.ProfileFactory and domain.InferenceProfile (or switch to using the real registry/profile.Factory with test fixtures).
I can generate mocks once we confirm the method set of domain.InferenceProfile. Please run the script below to reveal the interface/struct so we can implement correct stubs.
If domain.InferenceProfile is an interface with GetName()/GetConfig() methods, a minimal mock could look like:
+// --- test-only mocks to satisfy metrics.ProfileFactory ---
+type mockProfileFactory struct {
+ profiles map[string]*mockProfile
+}
+
+func (m *mockProfileFactory) GetProfile(profileType string) (domain.InferenceProfile, error) {
+ p, ok := m.profiles[profileType]
+ if !ok {
+ return nil, fmt.Errorf("profile not found: %s", profileType)
+ }
+ return p, nil
+}
+
+type mockProfile struct {
+ name string
+ config *domain.ProfileConfig
+}
+
+// Adjust the method set below to match domain.InferenceProfile
+func (m *mockProfile) GetName() string { return m.name }
+func (m *mockProfile) GetConfig() *domain.ProfileConfig { return m.config }

Replace/augment the methods to match the actual interface once discovered.
---
🏁 Script executed:
```shell
#!/bin/bash
set -euo pipefail
echo "Locate InferenceProfile type and its methods"
rg -nC3 -g '!**/vendor/**' -P '^\s*type\s+InferenceProfile\b' internal | sed -n '1,200p'
rg -nC2 -g '!**/vendor/**' -P '\bInferenceProfile\b' internal/core/domain | sed -n '1,200p'
```

Length of output: 1581
🏁 Script executed:
```shell
#!/bin/bash
set -euo pipefail
echo "Printing internal/core/domain/inference_profile.go"
sed -n '1,400p' internal/core/domain/inference_profile.go
```

Length of output: 3307
🏁 Script executed:
```shell
#!/bin/bash
set -euo pipefail
echo "Searching for PlatformProfile and ProfileFactory definitions..."
rg -nC3 -g '!**/vendor/**' -P '^\s*type\s+PlatformProfile\b' || true
rg -nC3 -g '!**/vendor/**' -P '^\s*type\s+ProfileFactory\b' || true
# Print the files that contain PlatformProfile / ProfileFactory if found
pf_file=$(rg -n --hidden -g '!**/vendor/**' -P '^\s*type\s+PlatformProfile\b' | head -n1 | cut -d: -f1 || true)
factory_file=$(rg -n --hidden -g '!**/vendor/**' -P '^\s*type\s+ProfileFactory\b' | head -n1 | cut -d: -f1 || true)
if [ -n "$pf_file" ]; then
  echo
  echo "=== $pf_file ==="
  sed -n '1,240p' "$pf_file"
else
  echo "PlatformProfile type not found by name."
fi
if [ -n "$factory_file" ]; then
  echo
  echo "=== $factory_file ==="
  sed -n '1,240p' "$factory_file"
else
  echo "ProfileFactory type not found by name."
fi
# Also search for any symbols named ProfileFactory (usages)
echo
echo "Usages of ProfileFactory (context):"
rg -nC2 -g '!**/vendor/**' '\bProfileFactory\b' || true
```

Length of output: 14149
Add test-only mocks that fully implement domain.InferenceProfile / ProfileFactory (fix compilation)
The benchmark references mockProfileFactory/mockProfile but InferenceProfile embeds PlatformProfile and exposes many methods — the current mocks must implement the full interface. Add a test-only mock implementation (or use the real registry factory with fixtures).
Files to update:
- internal/adapter/metrics/benchmark_test.go — add mocks and keep createTestProfileFactory
- internal/core/domain/inference_profile.go — reference for required method set
- internal/core/domain/profile.go — reference for embedded PlatformProfile methods
Patch to add (append to the test file). Ensure the test file imports fmt and time if not already present.
+// --- test-only mocks to satisfy metrics.ProfileFactory ---
+type mockProfileFactory struct {
+ profiles map[string]*mockProfile
+}
+
+func (m *mockProfileFactory) GetProfile(profileType string) (domain.InferenceProfile, error) {
+ p, ok := m.profiles[profileType]
+ if !ok {
+ return nil, fmt.Errorf("profile not found: %s", profileType)
+ }
+ return p, nil
+}
+
+type mockProfile struct {
+ name string
+ version string
+ config *domain.ProfileConfig
+ paths []string
+}
+
+// PlatformProfile methods
+func (m *mockProfile) GetName() string { return m.name }
+func (m *mockProfile) GetVersion() string { return m.version }
+func (m *mockProfile) GetModelDiscoveryURL(baseURL string) string { return "" }
+func (m *mockProfile) GetHealthCheckPath() string { return "" }
+func (m *mockProfile) IsOpenAPICompatible() bool { return false }
+func (m *mockProfile) GetRequestParsingRules() domain.RequestParsingRules {
+ return domain.RequestParsingRules{}
+}
+func (m *mockProfile) GetModelResponseFormat() domain.ModelResponseFormat { return domain.ModelResponseFormat{} }
+func (m *mockProfile) GetDetectionHints() domain.DetectionHints { return domain.DetectionHints{} }
+func (m *mockProfile) ParseModelsResponse(data []byte) ([]*domain.ModelInfo, error) {
+ return nil, nil
+}
+func (m *mockProfile) GetPaths() []string {
+ return m.paths
+}
+func (m *mockProfile) GetPath(index int) string {
+ if index < 0 || index >= len(m.paths) {
+ return ""
+ }
+ return m.paths[index]
+}
+
+// InferenceProfile methods
+func (m *mockProfile) GetTimeout() time.Duration { return 0 }
+func (m *mockProfile) GetMaxConcurrentRequests() int { return 1 }
+func (m *mockProfile) ValidateEndpoint(endpoint *domain.Endpoint) error { return nil }
+func (m *mockProfile) GetDefaultPriority() int { return 0 }
+func (m *mockProfile) GetConfig() *domain.ProfileConfig { return m.config }
+func (m *mockProfile) GetModelCapabilities(modelName string, registry domain.ModelRegistry) domain.ModelCapabilities {
+ return domain.ModelCapabilities{}
+}
+func (m *mockProfile) IsModelSupported(modelName string, registry domain.ModelRegistry) bool { return true }
+func (m *mockProfile) TransformModelName(fromName string, toFormat string) string { return fromName }
+func (m *mockProfile) GetResourceRequirements(modelName string, registry domain.ModelRegistry) domain.ResourceRequirements {
+ return domain.ResourceRequirements{}
+}
+func (m *mockProfile) GetOptimalConcurrency(modelName string) int { return 1 }
+func (m *mockProfile) GetRoutingStrategy() domain.RoutingStrategy { return domain.RoutingStrategy{} }
+func (m *mockProfile) ShouldBatchRequests() bool { return false }
+func (m *mockProfile) GetRequestTimeout(modelName string) time.Duration { return 0 }

Notes:
- Replace the stub returns where tests need specific behaviour (e.g. ParseModelsResponse, GetPaths).
- Add imports "fmt" and "time" to the test file if they are not present.
- Alternatively, wire the test to use the real profile factory from internal/adapter/registry/profile/factory.go with test fixtures if you prefer not to maintain test mocks.
Committable suggestion skipped: line range outside the PR's diff.
```diff
 // Stream with Olla's optimized streaming
 // Use r.Context() for client context and upstreamCtx for upstream context
-bytesWritten, streamErr := s.streamResponse(r.Context(), upstreamCtx, w, resp, *buffer, rlog)
+bytesWritten, lastChunk, streamErr := s.streamResponse(r.Context(), upstreamCtx, w, resp, *buffer, rlog)
```
💡 Verification agent
🧩 Analysis chain
Verify other call sites updated to new streamResponse signature
streamResponse now returns (int, []byte, error). Please ensure all usages were updated accordingly.
Run this script to locate call sites and the definition:
🏁 Script executed:
```shell
#!/bin/bash
set -euo pipefail
echo "== Find streamResponse definition =="
rg -n -C2 -g '!**/vendor/**' -P 'func\s+\(\s*\w+\s+\*\w+\s*\)\s*streamResponse\s*\('
echo
echo "== Find all call sites =="
rg -n -C2 -g '!**/vendor/**' -P '\bstreamResponse\s*\('
echo
echo "== Grep for assignments expecting 3 return values =="
rg -n -C2 -g '!**/vendor/**' -P '^\s*(?:[\w\[\]\*\s,]+:=\s*)?[^=\n]*streamResponse\(' | sed -n 'p'
```

Length of output: 4868
streamResponse signature verified — code updated; docs need one fix
- internal/adapter/proxy/olla/service.go — definition at line 647 returns (int, []byte, error); call at line 594 uses `bytesWritten, lastChunk, streamErr := s.streamResponse(...)`
- internal/adapter/proxy/olla/service_retry.go — call at line 157 uses `bytesWritten, lastChunk, streamErr := s.streamResponse(...)`
- docs/content/development/architecture.md — line 213 still shows `return s.streamResponse(w, resp)`; update the example to reflect the new (int, []byte, error) return values.
No other call sites found in the repo (excluding vendor).
🤖 Prompt for AI Agents
In internal/adapter/proxy/olla/service_retry.go around line 157 the call to
s.streamResponse was updated to return (int, []byte, error) but the docs example
in docs/content/development/architecture.md at line 213 still shows the old
signature; update the docs example to match the new call signature by showing
the three return values (e.g., bytesWritten, lastChunk, err :=
s.streamResponse(...)) and include minimal handling of the new lastChunk and err
returns (check/propagate err and document what lastChunk contains) so the
example compiles and reflects the actual function contract.
Actionable comments posted: 1
🔭 Outside diff range comments (1)
internal/adapter/proxy/olla/service.go (1)
567-576: Action required — declare Trailer and set final X-Olla-Response-Time trailer.

Short summary:
- core.SetResponseHeaders(...) does set X-Olla-Request-ID, X-Olla-Endpoint, X-Olla-Backend-Type and X-Olla-Model (confirmed in internal/adapter/proxy/core/common.go). It also currently sets X-Olla-Response-Time immediately (time.Since(stats.StartTime)).
- There is no explicit Trailer declaration before WriteHeader in internal/adapter/proxy/olla/service.go and no final trailer write after streaming completes (service.go calls SetResponseHeaders, copies resp headers, then WriteHeader, then streams and computes final duration).
What to fix (files / locations):
- internal/adapter/proxy/olla/service.go — handleSuccessfulResponse
- After core.SetResponseHeaders(w, stats, endpoint) and before calling w.WriteHeader(...), declare the trailer: w.Header().Add(constants.HeaderTrailer, constants.HeaderXOllaResponseTime).
- After streaming completes and final duration is computed, set the trailer value: w.Header().Set(constants.HeaderXOllaResponseTime, duration.String()).
- (Optional) internal/adapter/proxy/core/common.go — consider whether SetResponseHeaders should avoid setting X-Olla-Response-Time early for streaming responses to prevent duplicate header+trailer values.
Suggested minimal patch (illustrative):
```diff
 core.SetResponseHeaders(w, stats, endpoint)
+// Declare we'll send the final response time as a trailer (must be set before WriteHeader)
+w.Header().Add(constants.HeaderTrailer, constants.HeaderXOllaResponseTime)
 
 // Copy response headers
 for key, values := range resp.Header {
@@
 w.WriteHeader(resp.StatusCode)
@@
-	// stats update
-	duration := time.Since(stats.StartTime)
-	s.RecordSuccess(endpoint, duration.Milliseconds(), int64(bytesWritten))
+	// stats update
+	duration := time.Since(stats.StartTime)
+	// Set final response-time trailer value
+	w.Header().Set(constants.HeaderXOllaResponseTime, duration.String())
+	s.RecordSuccess(endpoint, duration.Milliseconds(), int64(bytesWritten))
```

Reason: HTTP trailers must be declared (Trailer header) before WriteHeader; the final X-Olla-Response-Time value can then be set after streaming completes so clients receive the accurate total time.
♻️ Duplicate comments (4)
internal/adapter/proxy/olla/service.go (1)
646-684: Timer Reset race fixed correctly (drain-before-Reset pattern).

Good fix. Stopping and draining the timer channel before Reset removes the spurious wake-up race noted previously. This aligns with net/http timer best practice for long-lived streaming loops.
docs/content/concepts/provider-metrics.md (2)
118-123: Use “gjson-compatible JSON paths”, not “JSONPath expressions”.

The extractor accepts gjson-compatible paths and normalises JSONPath-style prefixes. Wording this as “JSONPath expressions” is misleading.
Apply this diff:
```diff
-  paths:  # JSONPath expressions to extract raw values
+  paths:  # JSON paths (gjson-compatible) to extract raw values
```
266-268: Tweak gjson description: it queries JSON, it doesn’t “parse JSONPath”.

gjson provides fast JSON querying; we normalise JSONPath-like prefixes before querying.
Apply this diff:
```diff
-- **[gjson](https://github.com/tidwall/gjson)**: For JSON path parsing (7.6x faster than encoding/json)
+- **[gjson](https://github.com/tidwall/gjson)**: For fast JSON querying (7.6x faster than encoding/json)
```

config/profiles/ollama.yaml (1)
198-203: Zero-guard and multiply-first precision for tokens_per_second — looks good.

The expression now guards divide-by-zero and multiplies before dividing, which improves precision and avoids integer truncation. Nice one.
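For readers less familiar with the expr-level calculation, the same guard-and-multiply-first pattern sketched in plain Go (helper name is illustrative; the inputs mirror Ollama's nanosecond duration fields):

```go
package main

import "fmt"

// tokensPerSecond mirrors the profile calculation: guard the divisor, and
// multiply before dividing so sub-second nanosecond durations don't
// truncate to zero under integer arithmetic.
func tokensPerSecond(outputTokens, evalDurationNs int64) float64 {
	if evalDurationNs == 0 {
		return 0 // zero-guard: no duration means no meaningful rate
	}
	return float64(outputTokens) * 1e9 / float64(evalDurationNs)
}

func main() {
	fmt.Println(tokensPerSecond(200, 4_400_000_000)) // ~45.45 tokens/s
	fmt.Println(tokensPerSecond(200, 0))             // guarded: 0
}
```

Dividing first (`evalDurationNs / 1e9`) under integer semantics would floor a 400ms eval to 0 and blow up the division, which is exactly what the revised expression avoids.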
🧹 Nitpick comments (17)
internal/adapter/proxy/olla/service.go (1)
593-606: Metrics extraction timing/context: avoid skipping extraction on client-cancel.

Nice capture of lastChunk and post-stream extraction. However, using the client ctx means extraction is skipped if the client disconnects (ctx cancelled) even when we have the final chunk. Prefer using the upstream context to improve best-effort extraction.
Additionally, per coding guidelines, ensure X-Olla-Response-Time is emitted as a trailer. If core.SetResponseHeaders pre-declares the Trailer header, we can safely set the trailer value here once the duration is known.
Apply this diff locally:
```diff
-	// Extract metrics from response if available
-	core.ExtractProviderMetrics(ctx, s.MetricsExtractor, lastChunk, endpoint, stats, rlog, "Olla")
+	// Extract metrics from response if available.
+	// Use upstreamCtx so provider metrics extraction is not suppressed by client cancellation.
+	core.ExtractProviderMetrics(upstreamCtx, s.MetricsExtractor, lastChunk, endpoint, stats, rlog, "Olla")
```
```diff
@@
 stats.EndTime = time.Now()
 stats.Latency = duration.Milliseconds()
+// Populate response-time trailer (must be declared earlier).
+w.Header().Set("X-Olla-Response-Time", fmt.Sprintf("%d", stats.Latency))
```

docs/content/concepts/provider-metrics.md (7)
127-131: Prefer Australian English and clarify normalisation rules.

Use “normalised/normalisation” (en-AU), and keep the note concise.
Apply this diff:
```diff
-1. **`paths`**: Maps field names to JSON path expressions for extracting values from the provider's response
-   - Supports both JSONPath notation (`$.field.subfield`) and gjson notation (`field.subfield`)
-   - JSONPath prefixes are automatically normalized: `$.` is trimmed, `$` becomes empty string
+1. **`paths`**: Maps field names to JSON path expressions for extracting values from the provider's response
+   - Supports both JSONPath notation (`$.field.subfield`) and gjson notation (`field.subfield`)
+   - JSONPath prefixes are automatically normalised: `$.` is trimmed; `$` is treated as the root
```
`150-170`: **Align Australian English in comments (“normalised”).**

Consistent en-AU spelling helps keep docs uniform.
Apply this diff:
```diff
-    model: "$.model"    # JSONPath notation (normalized to "model")
+    model: "$.model"    # JSONPath notation (normalised to "model")
```
`269-275`: **Use en-AU (“Normalisation”) and fix spelling in bullets.**

Keep spelling consistent.
Apply this diff:
```diff
-**JSONPath Normalization**: Olla automatically normalizes JSONPath-style prefixes for gjson compatibility:
-- `$.foo.bar` → `foo.bar` (leading `$.` is trimmed)
-- `$` → `` (root selector is converted to empty string)
-- `foo.bar` → `foo.bar` (already normalized paths are unchanged)
+**JSONPath Normalisation**: Olla automatically normalises JSONPath-style prefixes for gjson compatibility:
+- `$.foo.bar` → `foo.bar` (leading `$.` is trimmed)
+- `$` → `` (root selector becomes the empty string)
+- `foo.bar` → `foo.bar` (already normalised paths are unchanged)
```
`77-82`: **Specify a language for fenced code block (markdownlint MD040).**

Add a language hint (text) for the Prometheus sample to satisfy linters and improve rendering.
Apply this diff:
````diff
-```
+```text
 vllm:prompt_tokens_total{model="llama3.2"} 50
 vllm:generation_tokens_total{model="llama3.2"} 200
 vllm:time_to_first_token_seconds{model="llama3.2"} 0.1
 vllm:time_per_output_token_seconds{model="llama3.2"} 0.022
````

---

`226-236`: **Specify a language for the log block (markdownlint MD040).**

Use text for log snippets.

Apply this diff:

````diff
-```
+```text
 2025/08/15 10:00:00 DEBUG Sherpa proxy metrics endpoint=local-ollama latency_ms=5000 provider_total_ms=5000 provider_prompt_eval_ms=100 provider_eval_ms=4400 provider_prompt_tokens=50 provider_completion_tokens=200 provider_tokens_per_second=45.45
````

---

`329-332`: **Clarify phrasing: “extraction attempts in logs”.**

Minor grammar improvement.

Apply this diff:

```diff
-3. Enable debug logging to see extraction attempts
+3. Enable debug logging to see extraction attempts in logs
```
`55-57`: **Avoid repeating the heading text in the first sentence.**

Tighten the prose to prevent duplication (“LM Studio LM Studio …”).
Apply this diff:
```diff
-### LM Studio
-LM Studio provides timing and token metrics:
+### LM Studio
+Provides timing and token metrics:
```

docs/content/development/architecture.md (1)
`213-231`: **Use the buffer pool in the Sherpa example for consistency and explain why.**

The struct shows a bufferPool, but the snippet allocates a fresh 32KB slice. Consider showing pooled buffers here as well, and add a brief “why” comment (reduced GC churn on hot path) to align with en-AU guidance.
Proposed snippet adjustment:
```diff
-	buffer := make([]byte, 32*1024)
-	bytesWritten, lastChunk, err := s.streamResponse(ctx, ctx, w, resp, buffer, logger)
+	// Use pooled buffer to reduce GC pressure on the hot path
+	buf := s.bufferPool.Get()
+	defer s.bufferPool.Put(buf)
+	bytesWritten, lastChunk, err := s.streamResponse(ctx, ctx, w, resp, *buf, logger)
```

config/profiles/openai.yaml (1)
`105-125`: **LGTM overall; remove trailing space and add newline at EOF.**

Config structure and expressions look correct. Please tidy formatting.
Apply this diff:
```diff
-        output_tokens: "$.usage.completion_tokens" 
+        output_tokens: "$.usage.completion_tokens"
```

Also ensure the file ends with a single newline (POSIX-friendly).
config/profiles/lmstudio.yaml (1)
`138-154`: **LGTM; add newline at end of file.**

Mappings and calculation are consistent with the extractor behaviour. Add a final newline to satisfy linters and tooling.
config/profiles/vllm.yaml (1)
`226-246`: **LGTM; add newline at end of file.**

The mappings and guarded calculation look good for OpenAI-compatible responses. Add a trailing newline to satisfy linters.
config/profiles/ollama.yaml (1)
`182-182`: **YAML lint: strip trailing spaces and add newline at EOF**

Clean up the minor lint issues.
Apply this diff:
```diff
-        
+
@@
-      
+
@@
-        model_load_ms: "load_duration_ns / 1000000"
+        model_load_ms: "load_duration_ns / 1000000"
+
```

Also applies to: 196-196, 203-203
internal/adapter/metrics/extractor.go (4)
`191-194`: **Don’t abort header extraction on invalid JSON**

If the body isn’t valid JSON, we return early and skip header-based extraction. Better to continue so header-only providers/configs still work.
Apply this diff:
```diff
-	if !jsonResult.Exists() {
-		return nil
-	}
+	if !jsonResult.Exists() {
+		// Invalid/empty JSON — skip body extraction but still allow header-based extraction.
+	}
```
`291-300`: **Support additional calculated fields (“prompt_ms”, “generation_ms”)**

If profiles add expressions for these, they’re currently ignored. Map them when present to keep calculations flexible.
Apply this diff:
```diff
 switch field {
 case "tokens_per_second":
 	metrics.TokensPerSecond = float32(val)
 case "ttft_ms":
 	metrics.TTFTMs = util.SafeInt32(int64(val))
 case "total_ms":
 	metrics.TotalMs = util.SafeInt32(int64(val))
 case "model_load_ms":
 	metrics.ModelLoadMs = util.SafeInt32(int64(val))
+case "prompt_ms":
+	metrics.PromptMs = util.SafeInt32(int64(val))
+case "generation_ms":
+	metrics.GenerationMs = util.SafeInt32(int64(val))
 }
```
`337-342`: **Typo in comment (“leasing” → “leading”)**

Minor typo; keep comments tidy.
Apply this diff:
```diff
-	// remove the leasing "$." or "$"
+	// remove the leading "$." or "$"
```
`344-349`: **Pre-compile regexes once (micro-optimisation, not on hot path)**

These regexes are recreated each call. While this runs at validation/start-up, you can still pre-compile them once to reduce allocations.
Apply this change outside the selected range:
```go
// top-level (package scope)
var (
	bracketIdxRe = regexp.MustCompile(`\[(\d+)\]`)
	bracketKeyRe = regexp.MustCompile(`\['([^']+)'\]`)
)
```

Then inside convertToGjsonPath:
```diff
-	s = regexp.MustCompile(`\[(\d+)\]`).ReplaceAllString(s, `.$1`)
+	s = bracketIdxRe.ReplaceAllString(s, `.$1`)
-	s = regexp.MustCompile(`\['([^']+)'\]`).ReplaceAllString(s, `.$1`)
+	s = bracketKeyRe.ReplaceAllString(s, `.$1`)
```
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (10)
- config/profiles/lmstudio.yaml (1 hunks)
- config/profiles/ollama.yaml (1 hunks)
- config/profiles/openai.yaml (1 hunks)
- config/profiles/vllm.yaml (1 hunks)
- docs/content/api-reference/overview.md (1 hunks)
- docs/content/concepts/provider-metrics.md (1 hunks)
- docs/content/development/architecture.md (2 hunks)
- internal/adapter/metrics/extractor.go (1 hunks)
- internal/adapter/proxy/olla/service.go (5 hunks)
- internal/integration/providers/vllm_integration_test.go (0 hunks)
💤 Files with no reviewable changes (1)
- internal/integration/providers/vllm_integration_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
- docs/content/api-reference/overview.md
🧰 Additional context used
📓 Path-based instructions (2)
**/*.{go,md}
📄 CodeRabbit Inference Engine (CLAUDE.md)
Use Australian English for comments and documentation, and write comments explaining why rather than what
Files:
- internal/adapter/proxy/olla/service.go
- internal/adapter/metrics/extractor.go
- docs/content/development/architecture.md
- docs/content/concepts/provider-metrics.md
{internal/app/handlers/handler_proxy.go,internal/adapter/proxy/sherpa/service.go,internal/adapter/proxy/olla/service.go}
📄 CodeRabbit Inference Engine (CLAUDE.md)
Ensure responses set the headers: X-Olla-Endpoint, X-Olla-Model, X-Olla-Backend-Type, X-Olla-Request-ID, and X-Olla-Response-Time (as a trailer)
Files:
internal/adapter/proxy/olla/service.go
🧠 Learnings (1)
📚 Learning: 2025-08-11T11:48:06.628Z
Learnt from: CR
PR: thushan/olla#0
File: CLAUDE.md:0-0
Timestamp: 2025-08-11T11:48:06.628Z
Learning: Applies to internal/app/handlers/handler_proxy.go : Proxy routes must be served under the `/olla/` prefix
Applied to files:
docs/content/development/architecture.md
🧬 Code Graph Analysis (2)
internal/adapter/proxy/olla/service.go (6)
- internal/core/ports/metrics.go (1)
  - MetricsExtractor (11-21)
- internal/adapter/proxy/core/base.go (1)
  - NewBaseProxyComponents (63-78)
- internal/adapter/proxy/core/metrics.go (1)
  - ExtractProviderMetrics (17-50)
- internal/adapter/proxy/sherpa/service.go (1)
  - Service (65-72)
- internal/logger/styled.go (1)
  - StyledLogger (12-35)
- internal/adapter/proxy/core/streaming.go (1)
  - AutoDetectStreamingMode (14-45)
internal/adapter/metrics/extractor.go (4)
- internal/core/domain/inference_profile.go (1)
  - InferenceProfile (8-48)
- internal/logger/styled.go (1)
  - StyledLogger (12-35)
- internal/core/domain/provider_metrics.go (2)
  - ProviderMetrics (5-26)
  - MetricsExtractionConfig (29-43)
- internal/util/safe.go (1)
  - SafeInt32 (24-32)
🪛 YAMLlint (1.37.1)
config/profiles/openai.yaml
[error] 116-116: trailing spaces
(trailing-spaces)
[error] 125-125: no new line character at the end of file
(new-line-at-end-of-file)
config/profiles/vllm.yaml
[error] 246-246: no new line character at the end of file
(new-line-at-end-of-file)
config/profiles/lmstudio.yaml
[error] 154-154: no new line character at the end of file
(new-line-at-end-of-file)
config/profiles/ollama.yaml
[error] 182-182: trailing spaces
(trailing-spaces)
[error] 196-196: trailing spaces
(trailing-spaces)
[error] 203-203: no new line character at the end of file
(new-line-at-end-of-file)
🪛 LanguageTool
docs/content/concepts/provider-metrics.md
[duplication] ~25-~25: Possible typo: you repeated a word.
Context: ...mance data ## Supported Providers ### Ollama Ollama provides comprehensive metrics in its r...
(ENGLISH_WORD_REPEAT_RULE)
[grammar] ~55-~55: This phrase is duplicated. You should probably use “LM Studio” only once.
Context: ... "total_tokens": 250 } } ### LM Studio LM Studio provides timing and token metrics: ...
(PHRASE_REPETITION)
[duplication] ~74-~74: Possible typo: you repeated a word.
Context: ... "predicted_ms": 4400 } } ``` ### vLLM vLLM exposes detailed metrics via Prometheus...
(ENGLISH_WORD_REPEAT_RULE)
[uncategorized] ~127-~127: Loose punctuation mark.
Context: ... ``` ### Key Components 1. paths: Maps field names to JSON path expressio...
(UNLIKELY_OPENING_PUNCTUATION)
[grammar] ~330-~330: Probably a preposition is missing after ‘attempts’.
Context: ... Enable debug logging to see extraction attempts 4. Ensure response format matches expected struct...
(ATD_VERBS_TO_COLLOCATION)
🪛 markdownlint-cli2 (0.17.2)
docs/content/concepts/provider-metrics.md
77-77: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
226-226: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Cross-Platform Build Validation (windows-latest)
🔇 Additional comments (4)
internal/adapter/proxy/olla/service.go (2)
`147-149`: **Wiring in MetricsExtractor is correct; confirm startup validation is invoked**

The constructor now accepts a MetricsExtractor. Please verify that metrics profiles are validated on startup (ports.MetricsExtractor.ValidateProfile) so misconfigurations fail fast rather than at request time.
Would you like me to scan the codebase for ValidateProfile calls and add a startup validation path if missing?
`167-168`: **Base components propagation looks good**

Passing the MetricsExtractor through to core.NewBaseProxyComponents is the right integration point. Nil-safe by design if extractor is absent.
config/profiles/ollama.yaml (1)
`186-187`: **Key name alignment: extractor maps “done”, config defines “is_complete”**

The extractor currently expects the field name “done” (maps to IsComplete), while this config defines “is_complete”. Without code changes, IsComplete won’t be populated. I’m proposing a code-side fix to support both keys (see extractor.go comment). If you prefer to change config instead, rename the key to “done”.
Would you like to keep “is_complete” for readability and map it in code (recommended), or rename this key to “done”?
internal/adapter/metrics/extractor.go (1)
`22-26`: **Extraction timeout and async design — LGTM**

The 10ms bound and async extraction prevent request-path blockage while still capturing metrics, with counters to track failures. Sensible trade-off.
This PR includes improved stats for LLM queries; we pull the stats out of each provider's payload instead of manually calculating them.
The addition of the Metrics Extraction Framework means we can (technically) have a best effort extraction from the final response chunk.
Supports:
`eval_duration`, `prompt_eval_duration`, `prompt_tokens` & `completion_tokens`. These are all wrapped into the provider profile.
For example,
`ollama.yaml` has this (right now):

It's got support for basic math operators, which should suffice. This therefore adds the following metrics:
Metrics Available:
Performance Impact:
Memory and CPU profiling helped and took time. Doing so also meant we tuned a few other things...
This means now, logs will also tell you at the request completed flag:

In the future we'll make use of these in a new load balancer. For now, just logging them.
Summary by CodeRabbit
New Features
Documentation
Refactor