feat: add prometheus metrics to track token and latency #432
rootfs wants to merge 5 commits into envoyproxy:main
Conversation
@mathetake @nacx @yuzisun This is a cleanup of #316. As requested, this time the metrics are added to the processor. In this implementation the request latency can be counted, but token latency (especially inter-token latency) is not there yet. We need to discuss what interface in the translator can be abstracted to support that type of metric.
**Commit Message** Add prometheus metrics to measure request count and latency, and token count by backend and model. Signed-off-by: Huamin Chen <hchen@redhat.com>
mathetake left a comment:
finally getting close to what i wanted/expected to see!
fwiw some of them overlap with https://github.com/open-telemetry/semantic-conventions/blob/main/docs/gen-ai/gen-ai-metrics.md cc @yuzisun
Signed-off-by: Huamin Chen <hchen@redhat.com>
@mathetake good to know :D
```go
	<-ctx.Done()
	s.GracefulStop()

	shutdownCtx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
```
not sure what this is for. Why 5 seconds and close? Can you just remove it and use the ctx (the signal handling is already done) and pass it to Shutdown below?
The ctx is already done (a couple of lines above), so we probably shouldn't use it here? I don't know if we need this timeout, but the idea is to not use the already completed `ctx`.
```go
		"info",
		"log level for the external processor. One of 'debug', 'info', 'warn', or 'error'.",
	)
	fs.StringVar(&flags.metricsAddr, "metricsAddr", ":9190", "HTTP address for the metrics server.")
}

// startMetricsServer starts the HTTP server for Prometheus metrics.
func startMetricsServer(addr string, logger *slog.Logger) *http.Server {
```
```go
// Copyright Envoy AI Gateway Authors.
// SPDX-License-Identifier: Apache-2.0.
// The full text of the Apache license is available in the LICENSE file at
// Copyright Envoy AI Gateway Authors.
// SPDX-License-Identifier: Apache-2.0.
```

let's make sure
```diff
 	model, body, err := parseOpenAIChatCompletionBody(rawBody)
 	if err != nil {
-		return nil, fmt.Errorf("failed to parse request body: %w", err)
+		return nil, c.recordErrorAndReturn("failed to parse request body: %w", err)
```
So, we still have the problem of having the metrics-recording code everywhere, which is error-prone. Every time we evolve the processor logic in the future, we'll have to remember to add the metric recording and keep copying this code.
I've asked, at least 3 times in previous reviews, why not use a deferred statement to have this code only once. That would cover all current and future cases and we'd never miss adding the metric recording.
Instead of just ignoring the comment again, if you really feel strongly against adopting the suggestion, provide an answer explaining why the current approach is better, and more maintainable, than the suggested one.
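The deferred variant being suggested could look roughly like this sketch (with hypothetical names, not the PR's actual types): one `defer` observes the named return value, so success or failure is recorded in exactly one place no matter which branch returns.

```go
package main

import (
	"errors"
	"fmt"
)

// metrics is a stand-in for the processor's metrics recorder.
type metrics struct {
	successes, failures int
}

func (m *metrics) RecordRequestCompletion(err error) {
	if err != nil {
		m.failures++
		return
	}
	m.successes++
}

// processBody shows the pattern: the deferred call reads the named return
// value err, so every current and future early return is covered without
// sprinkling recording calls at each return site.
func processBody(m *metrics, rawBody []byte) (err error) {
	defer func() { m.RecordRequestCompletion(err) }()

	if len(rawBody) == 0 {
		return errors.New("failed to parse request body: empty")
	}
	// ... parse, translate, route, etc.
	return nil
}

func main() {
	m := &metrics{}
	_ = processBody(m, nil)          // records a failure
	_ = processBody(m, []byte("{}")) // records a success
	fmt.Println(m.successes, m.failures) // prints "1 1"
}
```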
```go
}

// TestMetricsAreRecorded tests that metrics are properly recorded.
func TestMetricsAreRecorded(t *testing.T) {
```
This test is the manifestation of the issue of not using an interface for the metrics:
Since you're using the concrete type and not an interface, you cannot properly mock the behavior. The result is that the processor unit test is now completely coupled to a concrete implementation of the metrics. The processor unit test needs to know whether a particular metric is a counter, a histogram, etc. Everything is coupled.
I asked at least 2 times in past reviews to change that to an interface that could be mocked. Going for that recommendation would allow us to:
- Properly mock the methods so that the processor unit test can just check that the right methods were called.
- Decouple the processor unit tests from the concrete implementation of the metrics, making the code much easier to maintain and evolve. The metrics part can keep evolving independently as the project matures.
- Scope the testing of the metrics to just the metrics package, keeping everything properly encapsulated and properly tested.
Instead of just ignoring the comment again, if you really feel strongly against adopting the suggestion, provide an answer explaining why the current approach is better, and more maintainable, than the suggested one.
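The interface-based design being asked for might look like this sketch (all names are hypothetical): the processor depends only on the interface, and the unit test substitutes a recording mock instead of a prometheus-backed implementation.

```go
package main

import "fmt"

// ChatCompletionMetrics is the abstraction the processor would depend on;
// the prometheus-backed implementation would live in the metrics package.
type ChatCompletionMetrics interface {
	StartRequest(model, backend string)
	RecordTokenUsage(input, output int)
	RecordRequestCompletion(success bool)
}

// mockMetrics records which methods were called, so a processor unit test
// can assert on the call sequence without knowing whether the real metric
// behind each method is a counter or a histogram.
type mockMetrics struct {
	calls []string
}

func (m *mockMetrics) StartRequest(model, backend string) {
	m.calls = append(m.calls, fmt.Sprintf("start(%s,%s)", model, backend))
}

func (m *mockMetrics) RecordTokenUsage(input, output int) {
	m.calls = append(m.calls, fmt.Sprintf("tokens(%d,%d)", input, output))
}

func (m *mockMetrics) RecordRequestCompletion(success bool) {
	m.calls = append(m.calls, fmt.Sprintf("done(%v)", success))
}

// handleRequest is a stand-in for the processor logic under test.
func handleRequest(m ChatCompletionMetrics) {
	m.StartRequest("llama3", "ollama")
	m.RecordTokenUsage(12, 34)
	m.RecordRequestCompletion(true)
}

func main() {
	mock := &mockMetrics{}
	handleRequest(mock)
	fmt.Println(mock.calls) // [start(llama3,ollama) tokens(12,34) done(true)]
}
```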
Let's address @nacx's comments, not only mine; otherwise i don't think i can approve and merge
**Commit Message**

extproc: add GenAI metrics to track token usage and latency

Adds GenAI metrics according to the OpenTelemetry Semantic Conventions for Generative AI Metrics [1]. Note those metrics are still in experimental phase and may still be subject to change.

1: https://opentelemetry.io/docs/specs/semconv/gen-ai/gen-ai-metrics/

**Related Issues/PRs (if applicable)**

This is a follow-up of #432, implementing the remaining review comments.

---------

Signed-off-by: Huamin Chen <hchen@redhat.com>
Signed-off-by: Ignasi Barrera <ignasi@tetrate.io>
superseded by #459
Commit Message
Add prometheus metrics to measure request count and latency,
and token count by backend and model.
Related Issues/PRs (if applicable)
#316
Special notes for reviewers (if applicable)
This is a refactoring based on the previous PR. Note, since the metrics are only applied to the processor, token-level latency metrics won't be available at the moment. They need to be implemented at the translator level.
Tested on an ollama backend: