feat: add prometheus metrics to track token and latency #432

Closed

rootfs wants to merge 5 commits into envoyproxy:main from rootfs:prom-ref

Conversation


@rootfs rootfs commented Feb 26, 2025

Commit Message
Add prometheus metrics to measure request count and latency,
and token count by backend and model.

Related Issues/PRs (if applicable)
#316

Special notes for reviewers (if applicable)
This is a refactoring based on the previous PR. Note: since the metrics are only applied at the processor level, token-level latency metrics won't be available at the moment. They need to be implemented at the translator level.

Tested on an ollama backend:

# curl localhost:9190/metrics
# HELP aigateway_first_token_latency_seconds Time to receive first token in streaming responses
# TYPE aigateway_first_token_latency_seconds histogram
aigateway_first_token_latency_seconds_bucket{backend="ollama",model="phi4",le="0.1"} 0
aigateway_first_token_latency_seconds_bucket{backend="ollama",model="phi4",le="0.25"} 0
aigateway_first_token_latency_seconds_bucket{backend="ollama",model="phi4",le="0.5"} 0
aigateway_first_token_latency_seconds_bucket{backend="ollama",model="phi4",le="1"} 0
aigateway_first_token_latency_seconds_bucket{backend="ollama",model="phi4",le="2.5"} 1
aigateway_first_token_latency_seconds_bucket{backend="ollama",model="phi4",le="5"} 1
aigateway_first_token_latency_seconds_bucket{backend="ollama",model="phi4",le="10"} 1
aigateway_first_token_latency_seconds_bucket{backend="ollama",model="phi4",le="+Inf"} 2
aigateway_first_token_latency_seconds_sum{backend="ollama",model="phi4"} 16.131805063999998
aigateway_first_token_latency_seconds_count{backend="ollama",model="phi4"} 2
# HELP aigateway_model_tokens_total Total number of tokens processed by model and type
# TYPE aigateway_model_tokens_total counter
aigateway_model_tokens_total{backend="ollama",model="phi4",type="completion"} 521
aigateway_model_tokens_total{backend="ollama",model="phi4",type="prompt"} 17
aigateway_model_tokens_total{backend="ollama",model="phi4",type="total"} 538
# HELP aigateway_requests_total Total number of requests processed
# TYPE aigateway_requests_total counter
aigateway_requests_total{backend="ollama",model="phi4",status="success"} 2
# HELP aigateway_total_latency_seconds Time spent processing request
# TYPE aigateway_total_latency_seconds histogram
aigateway_total_latency_seconds_bucket{backend="ollama",model="phi4",status="success",le="0.1"} 0
aigateway_total_latency_seconds_bucket{backend="ollama",model="phi4",status="success",le="0.5"} 0
aigateway_total_latency_seconds_bucket{backend="ollama",model="phi4",status="success",le="1"} 0
aigateway_total_latency_seconds_bucket{backend="ollama",model="phi4",status="success",le="2.5"} 1
aigateway_total_latency_seconds_bucket{backend="ollama",model="phi4",status="success",le="5"} 1
aigateway_total_latency_seconds_bucket{backend="ollama",model="phi4",status="success",le="10"} 1
aigateway_total_latency_seconds_bucket{backend="ollama",model="phi4",status="success",le="20"} 2
aigateway_total_latency_seconds_bucket{backend="ollama",model="phi4",status="success",le="30"} 2
aigateway_total_latency_seconds_bucket{backend="ollama",model="phi4",status="success",le="60"} 2
aigateway_total_latency_seconds_bucket{backend="ollama",model="phi4",status="success",le="+Inf"} 2
aigateway_total_latency_seconds_sum{backend="ollama",model="phi4",status="success"} 16.132017332
aigateway_total_latency_seconds_count{backend="ollama",model="phi4",status="success"} 2
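
For reference, dashboards can derive quantiles and rates from the series above. A couple of illustrative PromQL queries (assuming only the metric names and labels shown in this output):

```promql
# 95th-percentile time-to-first-token per backend/model over 5m windows
histogram_quantile(0.95,
  sum by (backend, model, le) (rate(aigateway_first_token_latency_seconds_bucket[5m])))

# Completion-token throughput per model
sum by (model) (rate(aigateway_model_tokens_total{type="completion"}[5m]))
```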

@rootfs rootfs requested a review from a team as a code owner February 26, 2025 14:06
rootfs (Author) commented Feb 26, 2025

@mathetake @nacx @yuzisun This is a cleanup of #316. As requested, this time the metrics are added to the processor. In this implementation, request latency can be counted, but token latency (especially inter-token latency) is not there yet. We need to discuss what interface in the translator can be abstracted to support that type of metric.

**Commit Message**
Add prometheus metrics to measure request count and latency,
and token count by backend and model.

Signed-off-by: Huamin Chen <hchen@redhat.com>

@mathetake mathetake left a comment


finally getting close to what i wanted/expected to see!

@mathetake

Signed-off-by: Huamin Chen <hchen@redhat.com>
rootfs (Author) commented Feb 26, 2025

finally getting close to what i wanted/expected to see!

@mathetake good to know :D

Signed-off-by: Huamin Chen <hchen@redhat.com>
Signed-off-by: Huamin Chen <hchen@redhat.com>
Signed-off-by: Huamin Chen <hchen@redhat.com>
<-ctx.Done()
s.GracefulStop()

shutdownCtx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
Member

Not sure what this is for. Why 5 sec and close? Can you just remove it, use the ctx (the signal handler is already handled), and pass it to Shutdown below?

Member

The ctx is already done (a couple of lines above), so probably we shouldn't use it here? I don't know if we need this timeout, but the idea is to not use the already completed `ctx`.

Member

oh yeah

"info",
"log level for the external processor. One of 'debug', 'info', 'warn', or 'error'.",
)
fs.StringVar(&flags.metricsAddr, "metricsAddr", ":9190", "HTTP address for the metrics server.")
Member

needs test

}

// startMetricsServer starts the HTTP server for Prometheus metrics.
func startMetricsServer(addr string, logger *slog.Logger) *http.Server {
Member

needs unit tests

Comment on lines +1 to 3
// Copyright Envoy AI Gateway Authors.
// SPDX-License-Identifier: Apache-2.0.
// The full text of the Apache license is available in the LICENSE file at
Member

🙅‍♂️

Member

unnecessary change ?

Comment on lines +1 to +2
// Copyright Envoy AI Gateway Authors.
// SPDX-License-Identifier: Apache-2.0.
Member

🙅

@mathetake

let's make sure `make precommit` is not failing before pushing ;)

model, body, err := parseOpenAIChatCompletionBody(rawBody)
if err != nil {
return nil, fmt.Errorf("failed to parse request body: %w", err)
return nil, c.recordErrorAndReturn("failed to parse request body: %w", err)
Member

So, we still have the problem of having the metrics-recording code everywhere, which is error-prone. Every time we evolve the logic of the processor in the future, we'll have to remember to add the metric recording and keep copying this code.

I've asked, at least 3 times in previous reviews, why not use a deferred statement to have this code only once. That would cover all current and future cases, and we'd never miss adding the metric recording.

Instead of just ignoring the comment again, if you really feel strongly against adopting the suggestion, provide an answer explaining why the current approach is better, and more maintainable, than the suggested one.

}

// TestMetricsAreRecorded tests that metrics are properly recorded.
func TestMetricsAreRecorded(t *testing.T) {
Member

This test is the manifestation of the issue of not using an interface for the metrics:

Since you're using the concrete type and not an interface, you cannot properly mock the behavior. The result is that the processor unit test is completely coupled to a concrete implementation of the metrics. The processor unit test needs to know whether a particular metric is a counter, a histogram, etc. Everything is coupled.

I asked at least 2 times in past reviews to change that to an interface that could be mocked. Going with that recommendation would allow us to:

  • Properly mock the methods so that the processor unit test can just check that the right methods were called.
  • Decouple the processor unit tests from the concrete implementation of the metrics, making the code much easier to maintain and evolve. The metrics part can keep evolving independently as the project matures.
  • Scope the testing of the metrics to just the metrics package, keeping everything properly encapsulated and properly tested.

Instead of just ignoring the comment again, if you really feel strongly against adopting the suggestion, provide an answer explaining why the current approach is better, and more maintainable, than the suggested one.

@mathetake

Let's address @nacx's comments, not only mine; otherwise I don't think I can approve and merge.

@mathetake

OK, several people asked about this and I think we have to prioritize. @rootfs, if you cannot iterate on this anymore, me or @nacx will rework it to completion. Thanks for the work so far anyway.

mathetake pushed a commit that referenced this pull request Mar 5, 2025
**Commit Message**

extproc: add GenAI metrics to track token usage and latency

Adds GenAI metrics according to the OpenTelemetry Semantic Conventions
for Generative AI Metrics [1].
Note those metrics are still in experimental phase and may still be
subject to change.

1: https://opentelemetry.io/docs/specs/semconv/gen-ai/gen-ai-metrics/

**Related Issues/PRs (if applicable)**

This is a follow-up of
#432, implementing the
remaining review comments. 

---------

Signed-off-by: Huamin Chen <hchen@redhat.com>
Signed-off-by: Ignasi Barrera <ignasi@tetrate.io>
@mathetake

superseded by #459

@mathetake mathetake closed this Mar 5, 2025
aabchoo pushed a commit that referenced this pull request Mar 14, 2025