aigw: default session.id request header mapping #1808

Merged
mathetake merged 2 commits into envoyproxy:main from codefromthecrypt:normalize-req-attrs
Jan 26, 2026
@codefromthecrypt
Contributor

@codefromthecrypt codefromthecrypt commented Jan 23, 2026

Description

Default span/log request‑header mappings to agent-session-id:session.id so agent frameworks like Goose get session correlation with zero config, while still allowing explicit overrides (different mapping or empty to disable). Metrics never default to session IDs because they are high cardinality.

The default mapping lives in the new environment variable OTEL_AIGW_REQUEST_HEADER_ATTRIBUTES, so anyone who does not want agent-session-id:session.id should set OTEL_AIGW_REQUEST_HEADER_ATTRIBUTES= (empty string) to clear it.

Refactor request‑header mapping handling so defaults/merging live only in extproc; aigw and controller/helm just pass flags through. Ordering is normalized everywhere (request → span → metrics → log) and docs/examples describe defaults without explicitly setting agent-session-id:session.id.
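As a rough illustration of the header:attribute mapping format described above, here is a minimal Go sketch. The function name parseHeaderAttributes, the comma separator between pairs, and the x-tenant-id:tenant.id pair are assumptions made for this example; this is not the gateway's actual parser.

```go
package main

import (
	"fmt"
	"strings"
)

// parseHeaderAttributes parses comma-separated "header:attribute" pairs,
// e.g. "agent-session-id:session.id,x-tenant-id:tenant.id".
// Hypothetical helper for illustration only; the separator and the
// validation rules are assumptions, not the gateway's implementation.
func parseHeaderAttributes(s string) (map[string]string, error) {
	out := map[string]string{}
	for _, pair := range strings.Split(s, ",") {
		pair = strings.TrimSpace(pair)
		if pair == "" {
			continue // tolerate empty input and trailing commas
		}
		header, attr, ok := strings.Cut(pair, ":")
		header, attr = strings.TrimSpace(header), strings.TrimSpace(attr)
		if !ok || header == "" || attr == "" {
			return nil, fmt.Errorf("invalid mapping %q: want header:attribute", pair)
		}
		// HTTP header names are case-insensitive, so normalize the key.
		out[strings.ToLower(header)] = attr
	}
	return out, nil
}

func main() {
	m, err := parseHeaderAttributes("agent-session-id:session.id, X-Tenant-ID:tenant.id")
	if err != nil {
		panic(err)
	}
	fmt.Println(m["agent-session-id"], m["x-tenant-id"])
}
```

An empty input yields an empty map in this sketch, which matches the idea that an empty value means "no mappings".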

Related Issues/PRs (if applicable)

Related: #1797

Special notes for reviewers (if applicable)

Ran the examples/goose with OTEL console env and --debug.

Since goose now propagates agent-session-id by default, we can see in the telemetry agent-session-id=20260123_19:

MCP span (tools/list) showing session.id set from the header:

{"Name":"ListTools","SpanContext":{"TraceID":"6412e470a771ede413f3318b984b65f5","SpanID":"bf7a7b2b20ce85a9","TraceFlags":"01","TraceState":"","Remote":false},"Parent":{"TraceID":"00000000000000000000000000000000","SpanID":"0000000000000000","TraceFlags":"00","TraceState":"","Remote":false},"SpanKind":3,"StartTime":"2026-01-23T16:01:20.669792+09:00","EndTime":"2026-01-23T16:01:21.313477459+09:00","Attributes":[{"Key":"mcp.protocol.version","Value":{"Type":"STRING","Value":"2025-06-18"}},{"Key":"mcp.transport","Value":{"Type":"STRING","Value":"http"}},{"Key":"mcp.request.id","Value":{"Type":"STRING","Value":"{1}"}},{"Key":"mcp.method.name","Value":{"Type":"STRING","Value":"tools/list"}},{"Key":"session.id","Value":{"Type":"STRING","Value":"20260123_19"}}],"Events":[{"Name":"route to backend","Attributes":[{"Key":"mcp.backend.name","Value":{"Type":"STRING","Value":"kiwi"}},{"Key":"mcp.session.id","Value":{"Type":"STRING","Value":"f9e80f73-bc48-4797-afae-045ef0e57e7d"}},{"Key":"mcp.session.new","Value":{"Type":"BOOL","Value":false}}],"DroppedAttributeCount":0,"Time":"2026-01-23T16:01:21.303264+09:00"}],"Links":null,"Status":{"Code":"Ok","Description":""},"DroppedAttributes":0,"DroppedEvents":0,"DroppedLinks":0,"ChildSpanCount":0,"Resource":[{"Key":"service.name","Value":{"Type":"STRING","Value":"ai-gateway"}},{"Key":"telemetry.sdk.language","Value":{"Type":"STRING","Value":"go"}},{"Key":"telemetry.sdk.name","Value":{"Type":"STRING","Value":"opentelemetry"}},{"Key":"telemetry.sdk.version","Value":{"Type":"STRING","Value":"1.39.0"}}],"InstrumentationScope":{"Name":"envoyproxy/ai-gateway","Version":"","SchemaURL":"","Attributes":null},"InstrumentationLibrary":{"Name":"envoyproxy/ai-gateway","Version":"","SchemaURL":"","Attributes":null}}

MCP access log showing session.id on a tool call:

{"bytes_received":341,"bytes_sent":8720,"connection_termination_details":null,"downstream_local_address":"127.0.0.1:10088","downstream_remote_address":"127.0.0.1:50643","duration":1247,"jsonrpc.request.id":"4","mcp.method.name":"tools/call","mcp.provider.name":"kiwi","mcp.session.id":"f9e80f73-bc48-4797-afae-045ef0e57e7d","method":"POST","request.path":"/","response_code":200,"session.id":"20260123_19","start_time":"2026-01-23T07:01:33.553Z","upstream_cluster":"httproute/default/ai-eg-mcp-br-mcp-route-kiwi/rule/0","upstream_host":"146.75.115.52:443","upstream_local_address":"192.168.23.60:50644","upstream_transport_failure_reason":null,"user-agent":"Go-http-client/1.1","x-envoy-origin-path":"/mcp","x-envoy-upstream-service-time":"613","x-forwarded-for":null,"x-request-id":"bd29074f-3ab0-41b3-a184-e0ec87a3809b"}

LLM access log showing session.id on a chat completion:

{"bytes_received":14807,"bytes_sent":47214,"connection_termination_details":null,"downstream_local_address":"127.0.0.1:1975","downstream_remote_address":"127.0.0.1:50651","duration":3560,"gen_ai.provider.name":"default/openai/route/aigw-run/rule/0/ref/0","gen_ai.request.model":"qwen3:1.7b","gen_ai.response.model":"qwen3:1.7b","gen_ai.usage.input_tokens":3227,"gen_ai.usage.output_tokens":253,"method":"POST","request.path":"/v1/chat/completions","response_code":200,"session.id":"20260123_19","start_time":"2026-01-23T07:01:29.980Z","upstream_cluster":"httproute/default/aigw-run/rule/0","upstream_host":"127.0.0.1:11434","upstream_local_address":"127.0.0.1:50653","upstream_transport_failure_reason":null,"user-agent":null,"x-envoy-origin-path":"/v1/chat/completions","x-envoy-upstream-service-time":null,"x-forwarded-for":"192.168.23.60","x-request-id":"2b430167-040d-43ef-a48e-de0ebaa0fcdc"}

Minor improvements:

  • Normalize all example headers/attributes and the order of traces, metrics, and logs
  • Add AIGW_DEBUG so docker compose examples can actually show debug output
  • Align data‑plane tests to use the aigw func‑e download location instead of re-downloading

@codefromthecrypt
Contributor Author

After this I would like to switch to OTLP, hopefully once EG finishes merging my outstanding PR 🤞

@codecov-commenter

codecov-commenter commented Jan 23, 2026

Codecov Report

❌ Patch coverage is 96.72131% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 84.12%. Comparing base (6c351f0) to head (084c668).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| internal/extensionserver/extensionserver.go | 66.66% | 1 Missing and 1 partial ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1808      +/-   ##
==========================================
+ Coverage   84.08%   84.12%   +0.04%     
==========================================
  Files         118      119       +1     
  Lines       13235    13283      +48     
==========================================
+ Hits        11128    11174      +46     
- Misses       1434     1435       +1     
- Partials      673      674       +1     

@codefromthecrypt codefromthecrypt marked this pull request as ready for review January 23, 2026 08:21
@codefromthecrypt codefromthecrypt requested a review from a team as a code owner January 23, 2026 08:21
@dosubot dosubot bot added the size:XXL This PR changes 1000+ lines, ignoring generated files. label Jan 23, 2026
@dosubot dosubot bot added size:XL This PR changes 500-999 lines, ignoring generated files. and removed size:XXL This PR changes 1000+ lines, ignoring generated files. labels Jan 23, 2026
@codefromthecrypt codefromthecrypt force-pushed the normalize-req-attrs branch 3 times, most recently from 95bd3b8 to 6dc63e3 Compare January 26, 2026 05:47
Contributor Author

@codefromthecrypt codefromthecrypt left a comment

notes

- AIGW_DEBUG
# session.id is used in logs and traces (not metrics; high-cardinality)
- OTEL_AIGW_REQUEST_HEADER_ATTRIBUTES=x-user-id:user.id
- OTEL_AIGW_LOG_REQUEST_HEADER_ATTRIBUTES=x-session-id:session.id
Contributor Author

spans and logs are fine with request scope, so that's why session.id makes sense

  // cmdRun corresponds to `aigw run` command.
  cmdRun struct {
- Debug bool `help:"Enable debug logging emitted to stderr."`
+ Debug bool `env:"AIGW_DEBUG" help:"Enable debug logging emitted to stderr."`
Contributor Author

docker compose up cannot add args, so our instructions were busted. env is the easy way out
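To show why an environment variable helps when arguments cannot be appended, here is a minimal standard-library sketch of a debug flag that falls back to AIGW_DEBUG. The helper names parseDebug and envBool are hypothetical, and the real command declares this through the struct tag shown in the diff rather than through the flag package.

```go
package main

import (
	"flag"
	"fmt"
	"os"
	"strconv"
)

// parseDebug interprets an environment value as a boolean.
// Unset, empty, or unparsable values are treated as false.
// Hypothetical helper for illustration only.
func parseDebug(value string) bool {
	v, err := strconv.ParseBool(value)
	return err == nil && v
}

// envBool reads the named environment variable as a boolean.
func envBool(name string) bool {
	return parseDebug(os.Getenv(name))
}

func main() {
	// The environment supplies the default, so a container that cannot
	// receive extra arguments can still enable debug output, while an
	// explicit -debug flag on the command line takes precedence.
	debug := flag.Bool("debug", envBool("AIGW_DEBUG"), "Enable debug logging emitted to stderr.")
	flag.Parse()
	fmt.Fprintln(os.Stderr, "debug enabled:", *debug)
}
```

With this shape, a compose file only needs an AIGW_DEBUG entry under its environment section to turn on debug output.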

  // ```
  // Then, with the following BackendTrafficPolicy of Envoy Gateway, you can have three
- // rate limit buckets for each unique x-user-id header value. One bucket is for the input token,
+ // rate limit buckets for each unique x-tenant-id header value. One bucket is for the input token,
Contributor Author

Our examples used a combination of x-user-id, x-team-id, and x-tenant; the most common attribute was x-tenant-id, so I settled on it for a coarse-grained example.

Member

@mathetake mathetake left a comment

Not following on why the pointer to String is necessary but harmless it seems

@mathetake
Member

E2E rate limit failure seems legit

model","gen_ai_token_type":"cached_input"},"value":[1769407055.349,"20"]},{"metric":{"gen_ai_request_model":"rate-limit-funky-model","gen_ai_token_type":"input"},"value":[1769407055.349,"10000"]},{"metric":{"gen_ai_request_model":"rate-limit-funky-model","gen_ai_token_type":"output"},"value":[1769407055.349,"10020"]}]}}
2026-01-26T05:57:35.3561114Z     token_ratelimit_test.go:196: 
2026-01-26T05:57:35.3562199Z         	Error Trace:	/home/runner/work/ai-gateway/ai-gateway/tests/e2e/token_ratelimit_test.go:196
2026-01-26T05:57:35.3564076Z         	            				/opt/hostedtoolcache/go/1.25.6/x64/src/runtime/asm_amd64.s:1693
2026-01-26T05:57:35.3564840Z         	Error:      	Should be true
2026-01-26T05:57:35.3565581Z         	Test:       	Test_Examples_TokenRateLimit
2026-01-26T05:57:35.3566340Z         	Messages:   	team_id should be present in the metric
2026-01-26T05:59:35.1544412Z     token_ratelimit_test.go:153: 
2026-01-26T05:59:35.1547945Z         	Error Trace:	/home/runner/work/ai-gateway/ai-gateway/tests/e2e/token_ratelimit_test.go:153
2026-01-26T05:59:35.1556398Z         	Error:      	Condition never satisfied
2026-01-26T05:59:35.1560438Z         	Test:       	Test_Examples_TokenRateLimit
2026-01-26T05:59:35.2244768Z Warning: Immediate deletion does not wait for confirmation that the running resource has been terminated. The resource may continue to run on the cluster indefinitely.
2026-01-26T05:59:35.2391895Z namespace "redis-system" force deleted
2026-01-26T05:59:35.2494692Z service "redis" force deleted from redis-system namespace
2026-01-26T05:59:35.2533523Z deployment.apps "redis" force deleted from redis-system namespace
2026-01-26T05:59:40.4901923Z Warning: Immediate deletion does not wait for confirmation that the running resource has been terminated. The resource may continue to run on the cluster indefinitely.
2026-01-26T05:59:40.5063737Z gatewayclass.gateway.networking.k8s.io "envoy-ai-gateway-token-ratelimit" force deleted
2026-01-26T05:59:40.5155514Z gateway.gateway.networking.k8s.io "envoy-ai-gateway-token-ratelimit" force deleted from default namespace
2026-01-26T05:59:40.5247621Z aigatewayroute.aigateway.envoyproxy.io "envoy-ai-gateway-token-ratelimit" force deleted from default namespace
2026-01-26T05:59:40.5356794Z aiservicebackend.aigateway.envoyproxy.io "envoy-ai-gateway-token-ratelimit-testupstream" force deleted from default namespace
2026-01-26T05:59:40.5484656Z backend.gateway.envoyproxy.io "envoy-ai-gateway-token-ratelimit-testupstream" force deleted from default namespace
2026-01-26T05:59:40.5612382Z backendtrafficpolicy.gateway.envoyproxy.io "envoy-ai-gateway-token-ratelimit-policy" force deleted from default namespace
2026-01-26T05:59:40.5733996Z deployment.apps "envoy-ai-gateway-token-ratelimit-tesetupstream" force deleted from default namespace
2026-01-26T05:59:40.6086493Z service "envoy-ai-gateway-token-ratelimit-tesetupstream" force deleted from default namespace
2026-01-26T05:59:40.6154408Z envoyproxy.gateway.envoyproxy.io "envoy-ai-gateway-token-ratelimit" force deleted from default namespace
2026-01-26T05:59:40.6533064Z --- FAIL: Test_Examples_TokenRateLimit (171.31s)

@codefromthecrypt
Contributor Author

> Not following on why the pointer to String is necessary but harmless it seems

It is there to tell the difference between "not set" and "set". For example, if you want no defaults, you set the header mapping to empty. Without handling that tri-state (unset, empty, non-empty), it would be hard to unset everything.
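The distinction between unset and empty can be sketched in Go with a *string. The names resolveMapping and optionalEnv are hypothetical, and the x-session-id override in main is just a sample value; only the default agent-session-id:session.id and the variable name come from this PR's description.

```go
package main

import (
	"fmt"
	"os"
)

// defaultMapping is the default named in this PR's description.
const defaultMapping = "agent-session-id:session.id"

// resolveMapping applies the default only when nothing was configured.
//   - nil pointer       => not set, use the default
//   - pointer to ""     => explicitly cleared, no mappings
//   - pointer to "a:b"  => explicit override
//
// Hypothetical helper for illustration only.
func resolveMapping(configured *string) string {
	if configured == nil {
		return defaultMapping
	}
	return *configured
}

// optionalEnv returns nil when the variable is absent, and a pointer to its
// value (possibly empty) when it is present. os.LookupEnv is what makes
// "VAR=" distinguishable from "VAR not defined".
func optionalEnv(name string) *string {
	if v, ok := os.LookupEnv(name); ok {
		return &v
	}
	return nil
}

func main() {
	cleared := ""
	override := "x-session-id:session.id"
	fmt.Printf("unset:    %q\n", resolveMapping(nil))
	fmt.Printf("cleared:  %q\n", resolveMapping(&cleared))
	fmt.Printf("override: %q\n", resolveMapping(&override))
	fmt.Printf("from env: %q\n", resolveMapping(optionalEnv("OTEL_AIGW_REQUEST_HEADER_ATTRIBUTES")))
}
```

A plain string could not express the cleared case, because its zero value would look the same as "never configured".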

Signed-off-by: Adrian Cole <adrian@tetrate.io>
Signed-off-by: Adrian Cole <adrian@tetrate.io>
@codefromthecrypt
Contributor Author

> E2E rate limit failure seems legit

Yep, sorry about that, I missed a find/replace.

@codefromthecrypt
Contributor Author

updated the PR desc on how to clear the default (via OTEL_AIGW_REQUEST_HEADER_ATTRIBUTES= empty string)

@mathetake mathetake enabled auto-merge (squash) January 26, 2026 21:49
@mathetake mathetake merged commit 3b16648 into envoyproxy:main Jan 26, 2026
36 checks passed

Labels

size:XL This PR changes 500-999 lines, ignoring generated files.

3 participants