extproc: add genai metrics to track token usage and latency#459

Merged
mathetake merged 10 commits into envoyproxy:main from nacx:genai-metrics
Mar 5, 2025

Conversation

@nacx
Member

@nacx nacx commented Mar 5, 2025

Commit Message

extproc: add GenAI metrics to track token usage and latency

Adds GenAI metrics according to the OpenTelemetry Semantic Conventions for Generative AI Metrics [1].
Note that these metrics are still experimental and may be subject to change.

1: https://opentelemetry.io/docs/specs/semconv/gen-ai/gen-ai-metrics/

Related Issues/PRs (if applicable)

This is a follow-up of #432, implementing the remaining review comments.

Special notes for reviewers (if applicable)

This PR keeps all the commits from the original PR intact; the only additions are the last two commits: 8a09826 and 3f5dde2

The first commit:

  • Records the metrics in a deferred function to make the recording less error-prone.
  • Creates interfaces for the metrics to decouple the metrics logic from the processor and tests.
  • Removes the global state on metrics and injects the metrics instance in the processor at startup time.
  • Refactors the metrics to align with the OpenTelemetry Semantic Conventions for GenAI.

The second commit contains:

A refactoring of the above to use the OpenTelemetry SDK instead of the Prometheus one, decoupling the core from Prometheus. I left this in a separate commit because I'm not sure whether we care about this.

Example metrics:

# HELP gen_ai_client_token_usage Number of tokens processed.
# TYPE gen_ai_client_token_usage histogram
gen_ai_client_token_usage_bucket{gen_ai_operation_name="chat",gen_ai_request_model="something",gen_ai_response_model="something",gen_ai_system="aws.bedrock",gen_ai_token_type="input",le="1"} 1
gen_ai_client_token_usage_bucket{gen_ai_operation_name="chat",gen_ai_request_model="something",gen_ai_response_model="something",gen_ai_system="aws.bedrock",gen_ai_token_type="input",le="4"} 1
gen_ai_client_token_usage_bucket{gen_ai_operation_name="chat",gen_ai_request_model="something",gen_ai_response_model="something",gen_ai_system="aws.bedrock",gen_ai_token_type="input",le="16"} 2
gen_ai_client_token_usage_bucket{gen_ai_operation_name="chat",gen_ai_request_model="something",gen_ai_response_model="something",gen_ai_system="aws.bedrock",gen_ai_token_type="input",le="64"} 3
gen_ai_client_token_usage_bucket{gen_ai_operation_name="chat",gen_ai_request_model="something",gen_ai_response_model="something",gen_ai_system="aws.bedrock",gen_ai_token_type="input",le="256"} 3
gen_ai_client_token_usage_bucket{gen_ai_operation_name="chat",gen_ai_request_model="something",gen_ai_response_model="something",gen_ai_system="aws.bedrock",gen_ai_token_type="input",le="1024"} 3
gen_ai_client_token_usage_bucket{gen_ai_operation_name="chat",gen_ai_request_model="something",gen_ai_response_model="something",gen_ai_system="aws.bedrock",gen_ai_token_type="input",le="4096"} 3
gen_ai_client_token_usage_bucket{gen_ai_operation_name="chat",gen_ai_request_model="something",gen_ai_response_model="something",gen_ai_system="aws.bedrock",gen_ai_token_type="input",le="16384"} 3
gen_ai_client_token_usage_bucket{gen_ai_operation_name="chat",gen_ai_request_model="something",gen_ai_response_model="something",gen_ai_system="aws.bedrock",gen_ai_token_type="input",le="65536"} 3
gen_ai_client_token_usage_bucket{gen_ai_operation_name="chat",gen_ai_request_model="something",gen_ai_response_model="something",gen_ai_system="aws.bedrock",gen_ai_token_type="input",le="262144"} 3
gen_ai_client_token_usage_bucket{gen_ai_operation_name="chat",gen_ai_request_model="something",gen_ai_response_model="something",gen_ai_system="aws.bedrock",gen_ai_token_type="input",le="1.048576e+06"} 3
gen_ai_client_token_usage_bucket{gen_ai_operation_name="chat",gen_ai_request_model="something",gen_ai_response_model="something",gen_ai_system="aws.bedrock",gen_ai_token_type="input",le="4.194304e+06"} 3
gen_ai_client_token_usage_bucket{gen_ai_operation_name="chat",gen_ai_request_model="something",gen_ai_response_model="something",gen_ai_system="aws.bedrock",gen_ai_token_type="input",le="1.6777216e+07"} 3
gen_ai_client_token_usage_bucket{gen_ai_operation_name="chat",gen_ai_request_model="something",gen_ai_response_model="something",gen_ai_system="aws.bedrock",gen_ai_token_type="input",le="6.7108864e+07"} 3
gen_ai_client_token_usage_bucket{gen_ai_operation_name="chat",gen_ai_request_model="something",gen_ai_response_model="something",gen_ai_system="aws.bedrock",gen_ai_token_type="input",le="+Inf"} 3
gen_ai_client_token_usage_sum{gen_ai_operation_name="chat",gen_ai_request_model="something",gen_ai_response_model="something",gen_ai_system="aws.bedrock",gen_ai_token_type="input"} 51
gen_ai_client_token_usage_count{gen_ai_operation_name="chat",gen_ai_request_model="something",gen_ai_response_model="something",gen_ai_system="aws.bedrock",gen_ai_token_type="input"} 3
# HELP gen_ai_server_request_duration Time spent processing request.
# TYPE gen_ai_server_request_duration histogram
gen_ai_server_request_duration_bucket{error_type="",gen_ai_operation_name="chat",gen_ai_request_model="something",gen_ai_response_model="something",gen_ai_system="aws.bedrock",le="0.01"} 2
gen_ai_server_request_duration_bucket{error_type="",gen_ai_operation_name="chat",gen_ai_request_model="something",gen_ai_response_model="something",gen_ai_system="aws.bedrock",le="0.02"} 2
gen_ai_server_request_duration_bucket{error_type="",gen_ai_operation_name="chat",gen_ai_request_model="something",gen_ai_response_model="something",gen_ai_system="aws.bedrock",le="0.04"} 2
gen_ai_server_request_duration_bucket{error_type="",gen_ai_operation_name="chat",gen_ai_request_model="something",gen_ai_response_model="something",gen_ai_system="aws.bedrock",le="0.08"} 2
gen_ai_server_request_duration_bucket{error_type="",gen_ai_operation_name="chat",gen_ai_request_model="something",gen_ai_response_model="something",gen_ai_system="aws.bedrock",le="0.16"} 2
gen_ai_server_request_duration_bucket{error_type="",gen_ai_operation_name="chat",gen_ai_request_model="something",gen_ai_response_model="something",gen_ai_system="aws.bedrock",le="0.32"} 2
gen_ai_server_request_duration_bucket{error_type="",gen_ai_operation_name="chat",gen_ai_request_model="something",gen_ai_response_model="something",gen_ai_system="aws.bedrock",le="0.64"} 2
gen_ai_server_request_duration_bucket{error_type="",gen_ai_operation_name="chat",gen_ai_request_model="something",gen_ai_response_model="something",gen_ai_system="aws.bedrock",le="1.28"} 2
gen_ai_server_request_duration_bucket{error_type="",gen_ai_operation_name="chat",gen_ai_request_model="something",gen_ai_response_model="something",gen_ai_system="aws.bedrock",le="2.56"} 3
gen_ai_server_request_duration_bucket{error_type="",gen_ai_operation_name="chat",gen_ai_request_model="something",gen_ai_response_model="something",gen_ai_system="aws.bedrock",le="5.12"} 3
gen_ai_server_request_duration_bucket{error_type="",gen_ai_operation_name="chat",gen_ai_request_model="something",gen_ai_response_model="something",gen_ai_system="aws.bedrock",le="10.24"} 3
gen_ai_server_request_duration_bucket{error_type="",gen_ai_operation_name="chat",gen_ai_request_model="something",gen_ai_response_model="something",gen_ai_system="aws.bedrock",le="20.48"} 3
gen_ai_server_request_duration_bucket{error_type="",gen_ai_operation_name="chat",gen_ai_request_model="something",gen_ai_response_model="something",gen_ai_system="aws.bedrock",le="40.96"} 3
gen_ai_server_request_duration_bucket{error_type="",gen_ai_operation_name="chat",gen_ai_request_model="something",gen_ai_response_model="something",gen_ai_system="aws.bedrock",le="81.92"} 3
gen_ai_server_request_duration_bucket{error_type="",gen_ai_operation_name="chat",gen_ai_request_model="something",gen_ai_response_model="something",gen_ai_system="aws.bedrock",le="+Inf"} 3
gen_ai_server_request_duration_sum{error_type="",gen_ai_operation_name="chat",gen_ai_request_model="something",gen_ai_response_model="something",gen_ai_system="aws.bedrock"} 1.413312417
gen_ai_server_request_duration_count{error_type="",gen_ai_operation_name="chat",gen_ai_request_model="something",gen_ai_response_model="something",gen_ai_system="aws.bedrock"} 3
# HELP gen_ai_server_time_to_first_token Time to receive first token in streaming responses.
# TYPE gen_ai_server_time_to_first_token histogram
gen_ai_server_time_to_first_token_bucket{gen_ai_operation_name="chat",gen_ai_request_model="something",gen_ai_response_model="something",gen_ai_system="aws.bedrock",le="0.001"} 0
gen_ai_server_time_to_first_token_bucket{gen_ai_operation_name="chat",gen_ai_request_model="something",gen_ai_response_model="something",gen_ai_system="aws.bedrock",le="0.005"} 2
gen_ai_server_time_to_first_token_bucket{gen_ai_operation_name="chat",gen_ai_request_model="something",gen_ai_response_model="something",gen_ai_system="aws.bedrock",le="0.01"} 2
gen_ai_server_time_to_first_token_bucket{gen_ai_operation_name="chat",gen_ai_request_model="something",gen_ai_response_model="something",gen_ai_system="aws.bedrock",le="0.02"} 2
gen_ai_server_time_to_first_token_bucket{gen_ai_operation_name="chat",gen_ai_request_model="something",gen_ai_response_model="something",gen_ai_system="aws.bedrock",le="0.04"} 2
gen_ai_server_time_to_first_token_bucket{gen_ai_operation_name="chat",gen_ai_request_model="something",gen_ai_response_model="something",gen_ai_system="aws.bedrock",le="0.06"} 2
gen_ai_server_time_to_first_token_bucket{gen_ai_operation_name="chat",gen_ai_request_model="something",gen_ai_response_model="something",gen_ai_system="aws.bedrock",le="0.08"} 2
gen_ai_server_time_to_first_token_bucket{gen_ai_operation_name="chat",gen_ai_request_model="something",gen_ai_response_model="something",gen_ai_system="aws.bedrock",le="0.1"} 2
gen_ai_server_time_to_first_token_bucket{gen_ai_operation_name="chat",gen_ai_request_model="something",gen_ai_response_model="something",gen_ai_system="aws.bedrock",le="0.25"} 2
gen_ai_server_time_to_first_token_bucket{gen_ai_operation_name="chat",gen_ai_request_model="something",gen_ai_response_model="something",gen_ai_system="aws.bedrock",le="0.5"} 2
gen_ai_server_time_to_first_token_bucket{gen_ai_operation_name="chat",gen_ai_request_model="something",gen_ai_response_model="something",gen_ai_system="aws.bedrock",le="0.75"} 2
gen_ai_server_time_to_first_token_bucket{gen_ai_operation_name="chat",gen_ai_request_model="something",gen_ai_response_model="something",gen_ai_system="aws.bedrock",le="1"} 2
gen_ai_server_time_to_first_token_bucket{gen_ai_operation_name="chat",gen_ai_request_model="something",gen_ai_response_model="something",gen_ai_system="aws.bedrock",le="2.5"} 3
gen_ai_server_time_to_first_token_bucket{gen_ai_operation_name="chat",gen_ai_request_model="something",gen_ai_response_model="something",gen_ai_system="aws.bedrock",le="5"} 3
gen_ai_server_time_to_first_token_bucket{gen_ai_operation_name="chat",gen_ai_request_model="something",gen_ai_response_model="something",gen_ai_system="aws.bedrock",le="7.5"} 3
gen_ai_server_time_to_first_token_bucket{gen_ai_operation_name="chat",gen_ai_request_model="something",gen_ai_response_model="something",gen_ai_system="aws.bedrock",le="10"} 3
gen_ai_server_time_to_first_token_bucket{gen_ai_operation_name="chat",gen_ai_request_model="something",gen_ai_response_model="something",gen_ai_system="aws.bedrock",le="+Inf"} 3
gen_ai_server_time_to_first_token_sum{gen_ai_operation_name="chat",gen_ai_request_model="something",gen_ai_response_model="something",gen_ai_system="aws.bedrock"} 1.4131814169999999
gen_ai_server_time_to_first_token_count{gen_ai_operation_name="chat",gen_ai_request_model="something",gen_ai_response_model="something",gen_ai_system="aws.bedrock"} 3

@nacx nacx requested a review from a team as a code owner March 5, 2025 10:01
rootfs and others added 7 commits March 5, 2025 16:27
**Commit Message**
Add prometheus metrics to measure request count and latency,
and token count by backend and model.

Signed-off-by: Huamin Chen <hchen@redhat.com>
Signed-off-by: Ignasi Barrera <ignasi@tetrate.io>
Member

@mathetake mathetake left a comment


💯

}

// startMetricsServer starts the HTTP server for Prometheus metrics.
func startMetricsServer(addr string, logger *slog.Logger) (*http.Server, metric.Meter) {
Member


can you add a unit test for this

Member Author


Done


var _ extprocv3.ExternalProcessor_ProcessServer = &mockExternalProcessingStream{}

type mockChatCompletionMetrics struct {
Member


let's add comments like elsewhere

Suggested change
type mockChatCompletionMetrics struct {
// mockChatCompletionMetrics implements ...
type mockChatCompletionMetrics struct {

Member Author


Done

Signed-off-by: Ignasi Barrera <ignasi@tetrate.io>
@nacx nacx requested a review from mathetake March 5, 2025 16:32
@mathetake mathetake enabled auto-merge (squash) March 5, 2025 16:32
@mathetake mathetake merged commit ccf13b8 into envoyproxy:main Mar 5, 2025
15 checks passed
yuzisun pushed a commit that referenced this pull request Mar 8, 2025
**Commit Message**

This changes the stat collection behavior so that token latency metrics
are only recorded on stream=true requests. This was brought up in an
offline discussion; otherwise the metrics don't make sense.

**Related Issues/PRs (if applicable)**

#459

Signed-off-by: Takeshi Yoneda <t.y.mathetake@gmail.com>
aabchoo pushed a commit that referenced this pull request Mar 14, 2025
**Commit Message**

extproc: add GenAI metrics to track token usage and latency

Adds GenAI metrics according to the OpenTelemetry Semantic Conventions
for Generative AI Metrics [1].
Note those metrics are still in experimental phase and may still be
subject to change.

1: https://opentelemetry.io/docs/specs/semconv/gen-ai/gen-ai-metrics/

**Related Issues/PRs (if applicable)**

This is a follow-up of
#432, implementing the
remaining review comments.

---------

Signed-off-by: Huamin Chen <hchen@redhat.com>
Signed-off-by: Ignasi Barrera <ignasi@tetrate.io>
aabchoo added a commit that referenced this pull request Mar 14, 2025
**Commit Message**

PR to backport `mockChatCompletionMetrics`, chat completion stream fix,
and openai content type.

Including:
- #459 (468 uses mock components introduced here)
- #468 
- #486

---------

Signed-off-by: Huamin Chen <hchen@redhat.com>
Signed-off-by: Ignasi Barrera <ignasi@tetrate.io>
Signed-off-by: Takeshi Yoneda <t.y.mathetake@gmail.com>
Signed-off-by: Aaron Choo <achoo30@bloomberg.net>
Co-authored-by: Ignasi Barrera <ignasi@tetrate.io>
Co-authored-by: Takeshi Yoneda <t.y.mathetake@gmail.com>
Co-authored-by: Dan Sun <dsun20@bloomberg.net>