Skip to content

otel: Segregate client and server RPCInfo used for metrics and traces#9081

Merged
mbissa merged 7 commits into
grpc:masterfrom
mbissa:otel-context-overwritting-fix
Apr 28, 2026
Merged

otel: Segregate client and server RPCInfo used for metrics and traces#9081
mbissa merged 7 commits into
grpc:masterfrom
mbissa:otel-context-overwritting-fix

Conversation

@mbissa

@mbissa mbissa commented Apr 21, 2026

Copy link
Copy Markdown
Contributor

Fixes #9053
Client and Server metrics and traces handling code during RPC was using the same RPCInfo. This lead to overwriting of context when an application was acting as server and then initiating other grpc call as a client. This PR segregates the rpcInfoKey context key into clientRPCInfoKey and serverRPCInfoKey so that it is not reused/overwritten between server and client code.

RELEASE NOTES:

  • otel: Segregate client and server RPCInfo used for metrics and traces.

@mbissa mbissa added this to the 1.81 Release milestone Apr 21, 2026
@mbissa

mbissa commented Apr 21, 2026

Copy link
Copy Markdown
Contributor Author

/gemini review

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request resolves telemetry metadata collisions in relay scenarios by separating client and server RPC information within the context using distinct keys and helper functions. It also includes end-to-end tests to ensure that metrics and traces are correctly isolated when a single context is shared between server and client roles. The review feedback suggests further refining getOrCreateCallInfo to check for method changes during context reuse and identifies several redundant calls to context-setting functions that can be removed to optimize performance and reduce unnecessary allocations.

Comment thread stats/opentelemetry/client_metrics.go Outdated
Comment thread stats/opentelemetry/client_metrics.go Outdated
Comment thread stats/opentelemetry/client_tracing.go Outdated
Comment thread stats/opentelemetry/server_metrics.go Outdated
Comment thread stats/opentelemetry/server_tracing.go Outdated
@codecov

codecov Bot commented Apr 21, 2026

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 83.18%. Comparing base (a21508e) to head (5a353b5).
⚠️ Report is 35 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #9081      +/-   ##
==========================================
+ Coverage   83.06%   83.18%   +0.11%     
==========================================
  Files         413      413              
  Lines       33135    33484     +349     
==========================================
+ Hits        27523    27852     +329     
- Misses       4199     4218      +19     
- Partials     1413     1414       +1     
Files with missing lines Coverage Δ
stats/opentelemetry/client_metrics.go 88.59% <100.00%> (-0.38%) ⬇️
stats/opentelemetry/client_tracing.go 89.09% <100.00%> (ø)
stats/opentelemetry/opentelemetry.go 78.24% <100.00%> (+2.14%) ⬆️
stats/opentelemetry/server_metrics.go 88.09% <100.00%> (+0.09%) ⬆️
stats/opentelemetry/server_tracing.go 76.00% <100.00%> (ø)

... and 52 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@arjan-bal arjan-bal left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I noticed you went with the first approach using separate client and server keys. Just wanted to quickly circle back to the second option: updating the logic from getOrCreateRPCAttemptInfo to createRPCAttemptInfo.

Did you run into any specific issues or blockers when looking into that second approach? I just want to make sure we've fully explored or ruled out that path since it avoids relying on the context chain restrictions.

@arjan-bal arjan-bal assigned mbissa and unassigned arjan-bal Apr 22, 2026
@arjan-bal

Copy link
Copy Markdown
Contributor

Please update the release notes to describe user visible behaviour, omitting the internal implementation. For e.g: Fix bug causing .....

@mbissa

mbissa commented Apr 22, 2026

Copy link
Copy Markdown
Contributor Author

I noticed you went with the first approach using separate client and server keys. Just wanted to quickly circle back to the second option: updating the logic from getOrCreateRPCAttemptInfo to createRPCAttemptInfo.

Did you run into any specific issues or blockers when looking into that second approach? I just want to make sure we've fully explored or ruled out that path since it avoids relying on the context chain restrictions.

The general principle here is that we need segregated rpcInfo on opposite sides of an RPC (Server vs. Client), but we must allow sibling stats handlers (like metrics and tracing) on the same side to share that state.

Furthermore, in cases of relay scenarios, we need both states to coexist in the context without risking one corrupting or shadowing the other when we have these RPCs running in parallel.

Using segregated context keys intuitively solves both requirements. Also in the discussion with leads, the inclination was towards this approach, as it aligns much better with our existing patterns for handling metadata.

@mbissa mbissa assigned arjan-bal and unassigned mbissa Apr 22, 2026
Comment thread stats/opentelemetry/opentelemetry.go Outdated
Comment thread stats/opentelemetry/opentelemetry.go Outdated
Comment thread stats/opentelemetry/opentelemetry.go Outdated
Comment thread stats/opentelemetry/opentelemetry.go Outdated
Comment thread stats/opentelemetry/opentelemetry.go Outdated
Comment thread stats/opentelemetry/opentelemetry.go Outdated
@arjan-bal arjan-bal assigned mbissa and unassigned arjan-bal Apr 23, 2026
Comment thread stats/opentelemetry/e2e_test.go Outdated
Comment thread stats/opentelemetry/e2e_test.go Outdated
Comment thread stats/opentelemetry/e2e_test.go Outdated
Comment thread stats/opentelemetry/e2e_test.go Outdated
Comment thread stats/opentelemetry/e2e_test.go Outdated
Comment thread stats/opentelemetry/e2e_test.go Outdated
@mbissa mbissa assigned arjan-bal and unassigned mbissa Apr 23, 2026
Comment thread stats/opentelemetry/e2e_test.go Outdated
return nil, fmt.Errorf("failed to create relay client: %v", err)
}
defer relayCC.Close()
err = relayCC.Invoke(ctx, "/grpc.testing.TestService/UnregisteredCall", in, &testpb.SimpleResponse{})

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a specific reason for using the low-level Invoke method instead of creating a test service client to call UnaryCall or EmptyCall? Typically, Invoke is reserved for generated code.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The intent was just to have a different method name for the outbound call to easily detect attribute leakage. I can switch the outbound call to use the generated client to call EmptyCall (or another available method) instead of using Invoke with an unregistered name. That would keep the names distinct while using the generated client.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would suggest using EmptyCall and UnaryCall. Since both are unary methods, the implementation size would remain the same.

There are other tests using Invoke in this file, most probably because they're testing the case where the server doesn't have a handler for method being invoked by the client.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.

Comment thread stats/opentelemetry/e2e_test.go
Comment thread stats/opentelemetry/e2e_test.go Outdated
Comment thread stats/opentelemetry/e2e_test.go Outdated
@arjan-bal arjan-bal assigned mbissa and unassigned arjan-bal Apr 24, 2026
@mbissa mbissa assigned arjan-bal and unassigned mbissa Apr 28, 2026
@arjan-bal arjan-bal assigned mbissa and unassigned arjan-bal Apr 28, 2026
@arjan-bal

Copy link
Copy Markdown
Contributor

In the release notes, you can omit internal details and focus only on the user visible impact, i.e. "prevent outbound client calls from overwriting telemetry metadata in relay scenarios".

@mbissa mbissa enabled auto-merge (squash) April 28, 2026 11:47

@arjan-bal arjan-bal left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. We can cherry-pick this into the 1.81.x branch.

@mbissa mbissa merged commit 3d82ab3 into grpc:master Apr 28, 2026
14 checks passed
mbissa added a commit that referenced this pull request May 4, 2026
Original PR: #9081

RELEASE NOTES:
* otel: Segregate client and server RPCInfo used for metrics and traces.
eleboucher pushed a commit to eleboucher/talos-mcp that referenced this pull request May 18, 2026
This PR contains the following updates:

| Package | Change | [Age](https://docs.renovatebot.com/merge-confidence/) | [Confidence](https://docs.renovatebot.com/merge-confidence/) |
|---|---|---|---|
| [google.golang.org/grpc](https://github.com/grpc/grpc-go) | `v1.81.0` → `v1.81.1` | ![age](https://developer.mend.io/api/mc/badges/age/go/google.golang.org%2fgrpc/v1.81.1?slim=true) | ![confidence](https://developer.mend.io/api/mc/badges/confidence/go/google.golang.org%2fgrpc/v1.81.0/v1.81.1?slim=true) |

---

### Release Notes

<details>
<summary>grpc/grpc-go (google.golang.org/grpc)</summary>

### [`v1.81.1`](https://github.com/grpc/grpc-go/releases/tag/v1.81.1): Release 1.81.1

[Compare Source](grpc/grpc-go@v1.81.0...v1.81.1)

### Security

- xds/rbac: Fix a potential authorization bypass caused by incorrectly falling through URI/DNS SANs to Subject Distinguished Name (DN) when matching the authenticated principal name. With this fix, only the first non-empty identity source will be used, as per [gRFC A41](https://github.com/grpc/proposal/blob/master/A41-xds-rbac.md). ([#&#8203;9111](grpc/grpc-go#9111))
  - Special Thanks: [@&#8203;al4an444](https://github.com/al4an444)

### Bug Fixes

- otel: Segregate client and server RPC information used for metrics and traces, to avoid one overwriting the other. ([#&#8203;9081](grpc/grpc-go#9081))

</details>

---

### Configuration

📅 **Schedule**: Branch creation - At any time (no schedule defined), Automerge - At any time (no schedule defined).

🚦 **Automerge**: Disabled by config. Please merge this manually once you are satisfied.

♻ **Rebasing**: Whenever PR becomes conflicted, or you tick the rebase/retry checkbox.

🔕 **Ignore**: Close this PR and you won't be reminded about this update again.

---

 - [ ] <!-- rebase-check -->If you want to rebase/retry this PR, check this box

---

This PR has been generated by [Renovate Bot](https://github.com/renovatebot/renovate).
<!--renovate-debug:eyJjcmVhdGVkSW5WZXIiOiI0My4xMDEuMSIsInVwZGF0ZWRJblZlciI6IjQzLjEwMS4xIiwidGFyZ2V0QnJhbmNoIjoibWFpbiIsImxhYmVscyI6WyJ0eXBlL3BhdGNoIl19-->

Reviewed-on: https://git.erwanleboucher.dev/eleboucher/talos-mcp/pulls/2
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

grpc Server Metrics rpcInfo gets overwritten by grpc client invocations

2 participants