feat: add prometheus metrics to track token and latency #432

Closed

rootfs wants to merge 5 commits into envoyproxy:main from rootfs:prom-ref

Conversation


@rootfs rootfs commented Feb 26, 2025

Commit Message
Add prometheus metrics to measure request count and latency,
and token count by backend and model.

Related Issues/PRs (if applicable)
#316

Special notes for reviewers (if applicable)
This is a refactoring based on the previous PR. Note: since the metrics are only applied at the processor level, token-level latency metrics won't be available at the moment. They need to be implemented at the translator level.

Tested on an ollama backend:

# curl localhost:9190/metrics
# HELP aigateway_first_token_latency_seconds Time to receive first token in streaming responses
# TYPE aigateway_first_token_latency_seconds histogram
aigateway_first_token_latency_seconds_bucket{backend="ollama",model="phi4",le="0.1"} 0
aigateway_first_token_latency_seconds_bucket{backend="ollama",model="phi4",le="0.25"} 0
aigateway_first_token_latency_seconds_bucket{backend="ollama",model="phi4",le="0.5"} 0
aigateway_first_token_latency_seconds_bucket{backend="ollama",model="phi4",le="1"} 0
aigateway_first_token_latency_seconds_bucket{backend="ollama",model="phi4",le="2.5"} 1
aigateway_first_token_latency_seconds_bucket{backend="ollama",model="phi4",le="5"} 1
aigateway_first_token_latency_seconds_bucket{backend="ollama",model="phi4",le="10"} 1
aigateway_first_token_latency_seconds_bucket{backend="ollama",model="phi4",le="+Inf"} 2
aigateway_first_token_latency_seconds_sum{backend="ollama",model="phi4"} 16.131805063999998
aigateway_first_token_latency_seconds_count{backend="ollama",model="phi4"} 2
# HELP aigateway_model_tokens_total Total number of tokens processed by model and type
# TYPE aigateway_model_tokens_total counter
aigateway_model_tokens_total{backend="ollama",model="phi4",type="completion"} 521
aigateway_model_tokens_total{backend="ollama",model="phi4",type="prompt"} 17
aigateway_model_tokens_total{backend="ollama",model="phi4",type="total"} 538
# HELP aigateway_requests_total Total number of requests processed
# TYPE aigateway_requests_total counter
aigateway_requests_total{backend="ollama",model="phi4",status="success"} 2
# HELP aigateway_total_latency_seconds Time spent processing request
# TYPE aigateway_total_latency_seconds histogram
aigateway_total_latency_seconds_bucket{backend="ollama",model="phi4",status="success",le="0.1"} 0
aigateway_total_latency_seconds_bucket{backend="ollama",model="phi4",status="success",le="0.5"} 0
aigateway_total_latency_seconds_bucket{backend="ollama",model="phi4",status="success",le="1"} 0
aigateway_total_latency_seconds_bucket{backend="ollama",model="phi4",status="success",le="2.5"} 1
aigateway_total_latency_seconds_bucket{backend="ollama",model="phi4",status="success",le="5"} 1
aigateway_total_latency_seconds_bucket{backend="ollama",model="phi4",status="success",le="10"} 1
aigateway_total_latency_seconds_bucket{backend="ollama",model="phi4",status="success",le="20"} 2
aigateway_total_latency_seconds_bucket{backend="ollama",model="phi4",status="success",le="30"} 2
aigateway_total_latency_seconds_bucket{backend="ollama",model="phi4",status="success",le="60"} 2
aigateway_total_latency_seconds_bucket{backend="ollama",model="phi4",status="success",le="+Inf"} 2
aigateway_total_latency_seconds_sum{backend="ollama",model="phi4",status="success"} 16.132017332
aigateway_total_latency_seconds_count{backend="ollama",model="phi4",status="success"} 2
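
For reference, dashboards can derive quantiles and rates from the series above. A couple of illustrative PromQL queries (assuming only the metric names and labels shown in this output):

```promql
# 95th-percentile time-to-first-token per backend/model over 5m windows
histogram_quantile(0.95,
  sum by (backend, model, le) (rate(aigateway_first_token_latency_seconds_bucket[5m])))

# Completion-token throughput per model
sum by (model) (rate(aigateway_model_tokens_total{type="completion"}[5m]))
```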

@rootfs rootfs requested a review from a team as a code owner February 26, 2025 14:06
rootfs (Author) commented Feb 26, 2025

@mathetake @nacx @yuzisun This is a cleanup of #316. As requested, this time the metrics are added to the processor. In this implementation, request latency can be counted, but token latency (especially inter-token latency) is not there yet. We need to discuss what interface in the translator can be abstracted to support that type of metric.

**Commit Message**
Add prometheus metrics to measure request count and latency,
and token count by backend and model.

Signed-off-by: Huamin Chen <hchen@redhat.com>

@mathetake mathetake left a comment


finally getting close to what i wanted/expected to see!

@mathetake

Signed-off-by: Huamin Chen <hchen@redhat.com>
rootfs (Author) commented Feb 26, 2025

finally getting close to what i wanted/expected to see!

@mathetake good to know :D

Signed-off-by: Huamin Chen <hchen@redhat.com>
Signed-off-by: Huamin Chen <hchen@redhat.com>
Signed-off-by: Huamin Chen <hchen@redhat.com>
<-ctx.Done()
s.GracefulStop()

shutdownCtx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
Member

Not sure what this is for. Why 5 sec and close? Can you just remove it, use the ctx (the signal handler is already handled), and pass it to Shutdown below?

Member

The ctx is already done (a couple of lines above), so probably we shouldn't use it here? I don't know if we need this timeout, but the idea is to not use the already completed `ctx`.

Member

oh yeah

"info",
"log level for the external processor. One of 'debug', 'info', 'warn', or 'error'.",
)
fs.StringVar(&flags.metricsAddr, "metricsAddr", ":9190", "HTTP address for the metrics server.")
Member

needs test

}

// startMetricsServer starts the HTTP server for Prometheus metrics.
func startMetricsServer(addr string, logger *slog.Logger) *http.Server {
Member

needs unit tests

Comment on lines +1 to 3
// Copyright Envoy AI Gateway Authors.
// SPDX-License-Identifier: Apache-2.0.
// The full text of the Apache license is available in the LICENSE file at
Member

🙅‍♂️

Member

unnecessary change ?

Comment on lines +1 to +2
// Copyright Envoy AI Gateway Authors.
// SPDX-License-Identifier: Apache-2.0.
Member

🙅

@mathetake

let's make sure `make precommit` is not failing before pushing ;)

model, body, err := parseOpenAIChatCompletionBody(rawBody)
if err != nil {
return nil, fmt.Errorf("failed to parse request body: %w", err)
return nil, c.recordErrorAndReturn("failed to parse request body: %w", err)
Member

So, we still have the problem of having the metrics-recording code everywhere, which is error-prone. Every time we evolve the logic of the processor in the future, we'll have to remember to add the metric recording and keep copying this code.

I've asked, at least 3 times in previous reviews, why not use a deferred statement to have this code only once. That would cover all current and future cases, and we'd never miss adding the metric recording.

Instead of just ignoring the comment again, if you really feel strongly against adopting the suggestion, provide an answer explaining why the current approach is better, and more maintainable, than the suggested one.

}

// TestMetricsAreRecorded tests that metrics are properly recorded.
func TestMetricsAreRecorded(t *testing.T) {
Member

This test is the manifestation of the issue of not using an interface for the metrics:

Since you're using the concrete type and not an interface, you cannot properly mock the behavior. The result is that the processor unit test is completely coupled to a concrete implementation of the metrics. The processor unit test needs to know whether a particular metric is a counter, a histogram, etc. Everything is coupled.

I asked at least 2 times in past reviews to change that to an interface that could be mocked. Going with that recommendation would allow us to:

  • Properly mock the methods so that the processor unit test can just check that the right methods were called.
  • Decouple the processor unit tests from the concrete implementation of the metrics, making the code much easier to maintain and evolve. The metrics part can keep evolving independently as the project matures.
  • Scope the testing of the metrics to just the metrics package, keeping everything properly encapsulated and properly tested.

Instead of just ignoring the comment again, if you really feel strongly against adopting the suggestion, provide an answer explaining why the current approach is better, and more maintainable, than the suggested one.

@mathetake

Let's address @nacx's comments, not only mine; otherwise I don't think I can approve and merge.

@mathetake

OK, several people asked about this and I think we have to prioritize. @rootfs, if you cannot iterate on this anymore, me or @nacx will rework it to completion. Thanks for the work so far anyway.

mathetake pushed a commit that referenced this pull request Mar 5, 2025
**Commit Message**

extproc: add GenAI metrics to track token usage and latency

Adds GenAI metrics according to the OpenTelemetry Semantic Conventions
for Generative AI Metrics [1].
Note those metrics are still in experimental phase and may still be
subject to change.

1: https://opentelemetry.io/docs/specs/semconv/gen-ai/gen-ai-metrics/

**Related Issues/PRs (if applicable)**

This is a follow-up of
#432, implementing the
remaining review comments. 

---------

Signed-off-by: Huamin Chen <hchen@redhat.com>
Signed-off-by: Ignasi Barrera <ignasi@tetrate.io>
@mathetake

superseded by #459

@mathetake mathetake closed this Mar 5, 2025
aabchoo pushed a commit that referenced this pull request Mar 14, 2025