feat: add extproc prometheus metrics#316

Closed
rootfs wants to merge 2 commits into envoyproxy:main from rootfs:prom

Conversation

@rootfs
@rootfs rootfs commented Feb 10, 2025

Commit Message
Add prometheus metrics to extproc to track data plane metrics. These metrics help understand request-level and token-level usage across models and backends.

Related Issues/PRs (if applicable)

Special notes for reviewers (if applicable)

Below is an example output from extproc:

# curl localhost:9190/metrics  
# HELP aigateway_first_token_latency_seconds Time to receive first token in streaming responses
# TYPE aigateway_first_token_latency_seconds histogram
aigateway_first_token_latency_seconds_bucket{backend="ollama",model="phi4",le="0.1"} 0
aigateway_first_token_latency_seconds_bucket{backend="ollama",model="phi4",le="0.25"} 0
aigateway_first_token_latency_seconds_bucket{backend="ollama",model="phi4",le="0.5"} 0
aigateway_first_token_latency_seconds_bucket{backend="ollama",model="phi4",le="1"} 1
aigateway_first_token_latency_seconds_bucket{backend="ollama",model="phi4",le="2.5"} 1
aigateway_first_token_latency_seconds_bucket{backend="ollama",model="phi4",le="5"} 1
aigateway_first_token_latency_seconds_bucket{backend="ollama",model="phi4",le="10"} 1
aigateway_first_token_latency_seconds_bucket{backend="ollama",model="phi4",le="+Inf"} 1
aigateway_first_token_latency_seconds_sum{backend="ollama",model="phi4"} 0.566842292
aigateway_first_token_latency_seconds_count{backend="ollama",model="phi4"} 1
# HELP aigateway_inter_token_latency_seconds Time between consecutive tokens in streaming responses
# TYPE aigateway_inter_token_latency_seconds histogram
aigateway_inter_token_latency_seconds_bucket{backend="ollama",model="phi4",le="0.1"} 7
aigateway_inter_token_latency_seconds_bucket{backend="ollama",model="phi4",le="0.25"} 7
aigateway_inter_token_latency_seconds_bucket{backend="ollama",model="phi4",le="0.5"} 7
aigateway_inter_token_latency_seconds_bucket{backend="ollama",model="phi4",le="1"} 7
aigateway_inter_token_latency_seconds_bucket{backend="ollama",model="phi4",le="2.5"} 7
aigateway_inter_token_latency_seconds_bucket{backend="ollama",model="phi4",le="5"} 7
aigateway_inter_token_latency_seconds_bucket{backend="ollama",model="phi4",le="10"} 7
aigateway_inter_token_latency_seconds_bucket{backend="ollama",model="phi4",le="+Inf"} 7
aigateway_inter_token_latency_seconds_sum{backend="ollama",model="phi4"} 4.837e-05
aigateway_inter_token_latency_seconds_count{backend="ollama",model="phi4"} 7
# HELP aigateway_model_tokens_total Total number of tokens processed by model and type
# TYPE aigateway_model_tokens_total counter
aigateway_model_tokens_total{backend="ollama",model="phi4",type="completion"} 1048
aigateway_model_tokens_total{backend="ollama",model="phi4",type="prompt"} 34
aigateway_model_tokens_total{backend="ollama",model="phi4",type="total"} 1082
# HELP aigateway_requests_total Total number of requests processed
# TYPE aigateway_requests_total counter
aigateway_requests_total{backend="ollama",model="phi4",status="success"} 2
# HELP aigateway_total_latency_seconds Time spent processing request
# TYPE aigateway_total_latency_seconds histogram
aigateway_total_latency_seconds_bucket{backend="ollama",model="phi4",status="success",le="0.1"} 0
aigateway_total_latency_seconds_bucket{backend="ollama",model="phi4",status="success",le="0.5"} 0
aigateway_total_latency_seconds_bucket{backend="ollama",model="phi4",status="success",le="1"} 1
aigateway_total_latency_seconds_bucket{backend="ollama",model="phi4",status="success",le="2.5"} 1
aigateway_total_latency_seconds_bucket{backend="ollama",model="phi4",status="success",le="5"} 1
aigateway_total_latency_seconds_bucket{backend="ollama",model="phi4",status="success",le="10"} 1
aigateway_total_latency_seconds_bucket{backend="ollama",model="phi4",status="success",le="20"} 2
aigateway_total_latency_seconds_bucket{backend="ollama",model="phi4",status="success",le="30"} 2
aigateway_total_latency_seconds_bucket{backend="ollama",model="phi4",status="success",le="60"} 2
aigateway_total_latency_seconds_bucket{backend="ollama",model="phi4",status="success",le="+Inf"} 2
aigateway_total_latency_seconds_sum{backend="ollama",model="phi4",status="success"} 14.449215641
aigateway_total_latency_seconds_count{backend="ollama",model="phi4",status="success"} 2

@rootfs rootfs requested a review from a team as a code owner February 10, 2025 18:25
@rootfs rootfs changed the title feat: add aigateway prometheus metrics to track model, request, and token usage feat: add extproc prometheus metrics Feb 10, 2025
Member

@mathetake mathetake left a comment

thanks! Left comments mostly about the testability

@mathetake
Member

I have only tested on OpenAI compatible platforms. If anybody can test aws bedrock, that'll be great

this is why you need at least unit tests ;) anyway, the coverage check is failing as expected

model, body, err := p.config.bodyParser(path, rawBody)
if err != nil {
// Add to metrics tracker
metrics.RequestsTotal.WithLabelValues("unknown", "unknown", "error").Inc()
Member

I see that we're setting some metrics when there are errors, but not on all of them. Is that intentional? If not, this looks a bit error-prone: it is easy to miss setting the error metric when adding new conditionals to the code in the future...

WDYT about refactoring the metrics functionality out into a processor that could wrap this one? Something like an InstrumentedProcessor that wraps this one and emits the metrics when needed. This way it's easier to always emit the error metrics (when the inner processor returns an error) and not miss any.
Also, the backend and model are published in the headers, so they should be accessible to the wrapping processor.

I don't know if given the current code and execution flows it is possible, but if it is, it would probably be cleaner

Author

The metrics are created in their own factory; when the extproc process starts, the metrics are already there. If there is a malformed body, tracking it in the metrics is more helpful than dropping it silently or burying it in logs.

Member

@nacx nacx Feb 14, 2025

Agree, but this does not answer my question. Let me re-ask:

I see that we're setting some metrics when there are errors, but not on all of them. Is it intentional?

As an example, in processor.go:102 you're incrementing the error counter, but in lines 109, 115, and 127 (those are examples; there are many others) you're not incrementing that error count.
Could you explain why? Under which circumstances should the error count be incremented, and under which should it not? There needs to be proper reasoning for that, also reflected in the comments, so that future changes to the processor can properly account for metrics.

You also did not answer the question about the decorator pattern instead of embedding everything in the current logic. Could you also give your thoughts on that? Using a decorator pattern would help a lot to avoid missing metrics in conditional cases (such as errors, etc.).

Member

+1

Author

@nacx I see what you meant. Yes, all errors should be tracked. The codebase is changing rapidly, so these are easy to miss :D I'll make sure the metrics are in all the places you mentioned during the next rebase.

Member

This is not resolved

@rootfs
Author

rootfs commented Feb 14, 2025

@mathetake PR updated with metrics builder and more test coverage. PTAL and I'll rebase. Thanks

return nil, nil, nil, fmt.Errorf("failed to marshal body: %w", err)
}
setContentLength(headerMutation, mut.Body)
o.requestStart = time.Now()
Contributor

Move this line up front, as there are early returns. Also, shouldn't the request start be counted from the request headers rather than the body, @mathetake?

Author

I don't see a requestHeader call here, but this can definitely move there once that API is implemented @mathetake

Member

I agree with @yuzisun. The main issue is that there is a mix of lifecycles here:

  • The request lifecycle is owned by the processor. That is the one that has the OnRequestHeaders, OnResponseBody, etc.
  • This is "translator code", which basically transforms request/responses from/to the specific selected backend formats.

With this, I think there are a couple of options:

  • If we want the translator code to be recording metrics, then some context about the initial request (such as the request creation time) should be passed to it when it's being instantiated.
  • If we want to keep the translator focused on just translating (this is probably what I think would be cleaner), we should think about a way the translator could provide information about internal processing to the calling processor, so that the processor can emit the right metrics.

// Since we are only interested in the time between tokens, and openai streaming is by chunk,
// we can calculate the time between tokens by the time between the last token and the current token, divided by the number of tokens.
// And in some cases, the number of tokens can be 0, so we need to check for that.
div := tokenUsage.OutputTokens
Contributor

need to confirm whether the token count reported is cumulative or just the output tokens in the chunk; it may differ between backends

@mathetake
Member

@rootfs would it be possible to resolve the conflicts and address the existing comments? @nacx and I really would like to see this finish

@rootfs
Author

rootfs commented Feb 24, 2025

@mathetake sure thing, back to work today, will rebase soon.

@rootfs rootfs force-pushed the prom branch 2 times, most recently from d24a773 to 7e9800c on February 25, 2025 15:58

// ProcessRequestBody implements [Processor.ProcessRequestBody].
func (c *chatCompletionProcessor) ProcessRequestBody(ctx context.Context, rawBody *extprocv3.HttpBody) (res *extprocv3.ProcessingResponse, err error) {
c.requestStart = time.Now() // Track the start time of the request
Member

Should we set the requestStart in the OnRequestHeaders method?

Author

i don't see this method yet

Member

Sorry, I meant ProcessRequestHeaders

@rootfs rootfs force-pushed the prom branch 2 times, most recently from 23d6403 to 7141702 on February 25, 2025 20:58
**Commit Message**
Add prometheus metrics to extproc to track data plane metrics.
These metrics help understand requests or token level usage
on models and backends.

Signed-off-by: Huamin Chen <hchen@redhat.com>
@rootfs
Author

rootfs commented Feb 25, 2025

@mathetake @nacx @yuzisun at the moment there are still open questions about the entry point for recording the start of the request headers. I don't see a request-header entry point in the main branch yet, but the current code is flexible enough to move to new entry points once you have them.

Signed-off-by: Huamin Chen <hchen@redhat.com>
@mathetake
Member

mathetake commented Feb 25, 2025

  • I would strongly suggest not having a metrics context tied to the translator. The metric "sink" should be coupled with the kind of Processor, not the translator. To do so,
    1. Define ChatCompletionsMetrics and make it owned by ChatCompletionsProcessor.
    2. Pass it from the processor to the OpenAIChatCompletionTranslator methods as an argument, or pass it to NewChatCompletionOpenAIToOpenAITranslator and NewChatCompletionOpenAIToAWSBedrockTranslator. Do not initialize the metrics in the translator's constructor. (extproc: renames translator interface #425 will be merged in a sec)
  • Then, the request time initialization should start in the RequestHeaders method

mathetake added a commit that referenced this pull request Feb 25, 2025
**Commit Message**

After #325, the abstraction over :path is no longer implemented in the
translator; it moved to the choice of processor. As a result, the
translator interface has become effectively tied to a specific endpoint,
notably /v1/chat/completion. This renames the `translator.Translator`
interface accordingly, which not only removes the unnecessary code path
but also facilitates the implementation of additional features like metrics.

**Related Issues/PRs (if applicable)**

Follow up on #325 and contribute to #316

---------

Signed-off-by: Takeshi Yoneda <t.y.mathetake@gmail.com>
fs.StringVar(&flags.promAddr,
"promAddr",
":9190",
"address for prometheus metrics, default is :9190",
Member

The flag library already includes the default when printing the flags. If we keep it here, there will be duplicate text.

Suggested change
"address for prometheus metrics, default is :9190",
"address for prometheus metrics",

go func() {
<-ctx.Done()
s.GracefulStop()
if err := promServer.Shutdown(ctx); err != nil {
Member

Probably it's better to use context.Background() here, as ctx is already terminated at this point.

// cost is the cost of the request that is accumulated during the processing of the response.
costs translator.LLMTokenUsage
// for metrics
metrics *metrics.TokenMetrics
Member

Let's extract TokenMetrics as an interface and keep the implementation type private to the metrics package. Keep the NewTokenMetrics() constructor, but probably just return the interface in the signature. This will make it more flexible and easier to test by injecting mocks tailored to the needs of each test.

c.modelName = "unknown"
model, body, err := parseOpenAIChatCompletionBody(rawBody)
if err != nil {
c.metrics.UpdateRequestMetrics(c.backendName, c.modelName, "error")
Member

Can you remove these error metrics from every conditional and instead just do it once on a deferred function at the beginning of the method, as suggested in the previous review?
Doing it in a deferred function ensures that we will not miss recording the error metric in the future if we add other error blocks and forget to copy the metric recording. It is overall less error-prone.

}

// Track the backend and model name for metrics
c.metrics.StartRequest(c.backendName, c.modelName)
Member

This should be probably moved to the ProcessRequestHeaders method.

selectedBackendHeaderKey: "x-ai-gateway-backend-key",
modelNameHeaderKey: "x-ai-gateway-model-key",
}, requestHeaders: headers, logger: slog.Default(), translator: mt}
}, requestHeaders: headers, logger: slog.Default(), translator: mt, metrics: metrics.NewTokenMetrics()}
Member

I think this test file should contain assertions to verify that the right metrics methods were called.

If you turn the metrics property into an interface as suggested above, you can use a mock implementation that just records the number of times each method is called, and after each individual test you can verify that the right methods have been called the right number of times.

This way the unit tests will check that the processor behaves as expected, but also that it calls the logic to record metrics and is not missing any, or calling it more than once by mistake (this would catch missing metric recording in error blocks or duplicate metric recording in translator/processor, and make the code more robust and more future-proof).

return nil, nil, nil, fmt.Errorf("failed to marshal body: %w", err)
}
setContentLength(headerMutation, mut.Body)
o.tokenMetrics.StartRequest(o.backendName, o.modelName)
Member

If we use the same token-metrics instance here as the one used by the processor, then the request recording has already started and the translator shouldn't need to do this.

Comment on lines +56 to +57
backendName string
modelName string
Member

Could these fields be moved to the TokenMetrics implementation, with methods to set them?
I think it makes sense and keeps the processor state cleaner; plus, these values will already be available to the methods called by the translator, so we won't need to have them in the translator either. I think that would be cleaner.
@mathetake WDYT?

Comment on lines +34 to +37
requestStart time.Time
modelName string
backendName string
metrics *metrics.Metrics
Member

Are these actually used?

In any case, when you see the same properties/state being used in N different places (processor, translator, this type), it is a sign that things could be better encapsulated.

These fields should probably all be moved inside the TokenMetrics implementation, and the tokenMetrics instance should be the only thing that is shared and passed between processor, translator, etc.

}
headerMutation = &extprocv3.HeaderMutation{}
setContentLength(headerMutation, mut.Body)
o.tokenMetrics.UpdateTokenMetrics(o.backendName, o.modelName, tokenUsage.OutputTokens, tokenUsage.InputTokens, tokenUsage.TotalTokens)
Member

As mentioned in the previous review, these metrics are already recorded by the calling processor in its ProcessResponseBody method, once the translator returns.

It looks like the values will be incremented twice, which is not correct.

This is why it is important to have the tests I mention above, to make sure we're only recording metrics when needed, and not doing it N times by mistake.

This also applies to the other translators.

@rootfs
Author

rootfs commented Feb 26, 2025

  • I would strongly suggest not having a metrics context tied to the translator. The metric "sink" should be coupled with the kind of Processor, not the translator. To do so,

    1. Define ChatCompletionsMetrics and make it owned by ChatCompletionsProcessor.
    2. Pass it from the processor to the OpenAIChatCompletionTranslator methods as an argument, or pass it to NewChatCompletionOpenAIToOpenAITranslator and NewChatCompletionOpenAIToAWSBedrockTranslator. Do not initialize the metrics in the translator's constructor. (extproc: renames translator interface #425 will be merged in a sec)
  • Then, the request time initialization should start in the RequestHeaders method

@mathetake ok, that'll be a major refactor. I am going to do it in a clean way then. Closing this PR and will submit a new one based on the updated interface.

@rootfs rootfs closed this Feb 26, 2025
@mathetake
Member

No reason to close the PR though ...

@nacx
Member

nacx commented Feb 26, 2025

Agree. No need to close the PR. You're doing an amazing job on this one, keeping up with all the reviews, etc. We can merge this and tackle that comment in a follow-up PR!
