xds: leaf clusters provide the handshake info instead of top level cluster by eshitachandwani · Pull Request #8956 · grpc/grpc-go

eshitachandwani · 2026-03-06T09:52:22Z

This PR is part of gRFC A74. The changes in this PR are:

- Ensures the handshake uses the security configuration defined at the leaf cluster level, rather than defaulting to the top-level aggregate cluster configuration.
- Previously, errors hit while priority.UpdateClientConnState updated the clusterimpl's state were silently suppressed. These errors are now propagated, triggering a Transient Failure (TF) state when an error is returned.
- Adds a test case to verify that leaf cluster security configurations take precedence over the top-level aggregate cluster. The test uses a top-level cluster with an invalid SAN matcher (which passes xDS validation but fails at the handshake level) and a leaf cluster with a valid configuration. RPCs now succeed because the leaf config is used; the test fails on master and passes with this PR.
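
The error-propagation change can be illustrated with a minimal, self-contained sketch. It does not use the grpc-go packages: `parent`, `connState`, `updateSuppressed`, `updatePropagated`, and `errInvalidSecurityConfig` are hypothetical names invented here, and which component actually reports TF in grpc-go is outside the scope of this sketch.

```go
package main

import (
	"errors"
	"fmt"
)

// connState is a hypothetical stand-in for a connectivity state.
type connState string

const (
	stateReady            connState = "READY"
	stateTransientFailure connState = "TRANSIENT_FAILURE"
)

// errInvalidSecurityConfig is a hypothetical error a child policy might return.
var errInvalidSecurityConfig = errors.New("invalid security config")

// parent is a hypothetical parent policy that forwards updates to one child.
type parent struct {
	childUpdate func() error
	state       connState
}

// updateSuppressed mirrors the old behavior: the child's error is dropped,
// so the caller never learns that the update failed.
func (p *parent) updateSuppressed() error {
	_ = p.childUpdate()
	return nil
}

// updatePropagated mirrors the new behavior: the error is returned to the
// caller, and this sketch records TRANSIENT_FAILURE as the resulting state.
func (p *parent) updatePropagated() error {
	if err := p.childUpdate(); err != nil {
		p.state = stateTransientFailure
		return fmt.Errorf("child update failed: %w", err)
	}
	p.state = stateReady
	return nil
}

func main() {
	p := &parent{
		childUpdate: func() error { return errInvalidSecurityConfig },
		state:       stateReady,
	}
	fmt.Println(p.updateSuppressed(), p.state)
	fmt.Println(p.updatePropagated(), p.state)
}
```

With the suppressing variant the parent stays in its previous state and returns nil; with the propagating variant the wrapped error reaches the caller, which is what lets the channel surface the failure.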

RELEASE NOTES:

xds: Fixed an issue where the security config from the top-level aggregate cluster was used for the handshake instead of the leaf cluster's.

codecov · 2026-03-06T09:56:36Z

Codecov Report

❌ Patch coverage is 85.71429% with 11 lines in your changes missing coverage. Please review.
✅ Project coverage is 83.25%. Comparing base (8360b4c) to head (c05225b).
⚠️ Report is 7 commits behind head on master.

Files with missing lines	Patch %	Lines
internal/xds/balancer/clusterimpl/clusterimpl.go	87.09%	4 Missing and 4 partials ⚠️
credentials/xds/xds.go	33.33%	1 Missing and 1 partial ⚠️
internal/xds/balancer/cdsbalancer/cdsbalancer.go	66.66%	1 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #8956      +/-   ##
==========================================
+ Coverage   83.23%   83.25%   +0.02%     
==========================================
  Files         410      410              
  Lines       32572    32576       +4     
==========================================
+ Hits        27111    27122      +11     
+ Misses       4066     4063       -3     
+ Partials     1395     1391       -4

Files with missing lines	Coverage Δ
internal/credentials/xds/handshake_info.go	`93.24% <100.00%> (ø)`
internal/xds/balancer/priority/balancer_child.go	`91.42% <100.00%> (+0.65%)`	⬆️
internal/xds/balancer/cdsbalancer/cdsbalancer.go	`62.62% <66.66%> (-7.52%)`	⬇️
credentials/xds/xds.go	`88.88% <33.33%> (-2.13%)`	⬇️
internal/xds/balancer/clusterimpl/clusterimpl.go	`86.34% <87.09%> (+0.22%)`	⬆️

... and 26 files with indirect coverage changes

easwars · 2026-03-06T23:58:55Z

+		// If the security config is invalid, for example, if the provider
+		// instance is not found in the bootstrap config, we need to put the
+		// channel in transient failure.
+		return fmt.Errorf("received Cluster resource that contains invalid security config: %v", err)


Do we need to return ErrBadResolverState here?

I am not sure. My understanding was that we return ErrBadResolverState when something the resolver has created or added is not right, for example when xdsConfig is not added as an attribute, or something like that. But in this case, if handleSecurityConfig returns an error, that means the security config or the CDS resource sent by the management server is incorrect. That has nothing to do with the resolver, which is why I did not return ErrBadResolverState here.
WDYT?

I was of the view that any data an LB policy gets from its parent is considered name resolver data, and therefore if the LB policy doesn't like it, it should return ErrBadResolverState. But looking at the code and reading the comments in resolver.go, it looks like it shouldn't matter for the xDS resolver, since it is a watch-based resolver and not a polling resolver.

// If an error is returned, the resolver should try to resolve the
// target again. The resolver should use a backoff timer to prevent
// overloading the server with requests. If a resolver is certain that
// reresolving will not change the result, e.g. because it is
// a watch-based resolver, returned errors can be ignored.

As long as none of the LB policies in our tree actually look at the error value returned by their child and act differently based on whether it is ErrBadResolverState or not, we should be fine. I checked the code, and it doesn't look like any policy does this. But you should also double check this. Thanks.

The only place I could see ErrBadResolverState being used is in the lazy balancer, which is used in ringhash. It looks like it will call re-resolve when it receives ErrBadResolverState. But as you said, returning ErrBadResolverState might not make sense because the xDS resolver is watch-based: if it got a bad security config once, it can only get a good one when the management server sends a new update on its own. Let me know what you think.
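
To make the concern in this thread concrete, here is a minimal sketch of a parent that treats one sentinel error specially. It is not the lazy balancer's code: `errBadResolverState` is a local stand-in for balancer.ErrBadResolverState, and `resolver` and `forwardUpdate` are hypothetical names used only for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// errBadResolverState is a local stand-in for balancer.ErrBadResolverState.
var errBadResolverState = errors.New("bad resolver state")

// resolver is a hypothetical resolver handle that counts re-resolution requests.
type resolver struct{ resolveNowCalls int }

func (r *resolver) ResolveNow() { r.resolveNowCalls++ }

// forwardUpdate passes an update to the child and asks for re-resolution only
// when the child reports the sentinel error. Any other error is returned
// unchanged and triggers no re-resolution.
func forwardUpdate(childUpdate func() error, r *resolver) error {
	err := childUpdate()
	if errors.Is(err, errBadResolverState) {
		r.ResolveNow()
	}
	return err
}

func main() {
	r := &resolver{}

	// A plain error, such as an invalid security config, does not re-resolve.
	_ = forwardUpdate(func() error { return errors.New("invalid security config") }, r)
	fmt.Println("after plain error:", r.resolveNowCalls)

	// The sentinel error does.
	_ = forwardUpdate(func() error { return errBadResolverState }, r)
	fmt.Println("after sentinel error:", r.resolveNowCalls)
}
```

For a watch-based resolver the extra ResolveNow call is expected to change nothing, which is why the choice of error value matters little in this case.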

easwars · 2026-03-10T19:33:59Z

+	hiPtr := xdsinternal.GetHandshakeInfo(chi.Attributes)
+	hi := (*xdsinternal.HandshakeInfo)(hiPtr.Load())


I'm pretty sure we don't need this type conversion, since GetHandshakeInfo now returns a *atomic.Pointer[HandshakeInfo], whose Load method already returns a *HandshakeInfo.

This can be replaced with a single line:
hi := xdsinternal.GetHandshakeInfo(chi.Attributes).Load()
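
A small standalone sketch of why the conversion becomes redundant, using only the standard library. The `handshakeInfo` type and the helper names are hypothetical; only `atomic.LoadPointer`, `atomic.StorePointer`, and `atomic.Pointer[T]` are real APIs.

```go
package main

import (
	"fmt"
	"sync/atomic"
	"unsafe"
)

// handshakeInfo is a hypothetical stand-in for the real HandshakeInfo type.
type handshakeInfo struct{ cluster string }

// loadViaUnsafe shows the older pattern: the stored value is an untyped
// unsafe.Pointer, so the caller must convert it back explicitly.
func loadViaUnsafe(p *unsafe.Pointer) *handshakeInfo {
	return (*handshakeInfo)(atomic.LoadPointer(p))
}

// loadViaGeneric shows the newer pattern: atomic.Pointer[T].Load already
// returns *T, so no conversion is needed.
func loadViaGeneric(p *atomic.Pointer[handshakeInfo]) *handshakeInfo {
	return p.Load()
}

// clusterNames stores the same value both ways and reads it back.
func clusterNames() (string, string) {
	hi := &handshakeInfo{cluster: "leaf"}

	var raw unsafe.Pointer
	atomic.StorePointer(&raw, unsafe.Pointer(hi))

	var typed atomic.Pointer[handshakeInfo]
	typed.Store(hi)

	return loadViaUnsafe(&raw).cluster, loadViaGeneric(&typed).cluster
}

func main() {
	a, b := clusterNames()
	fmt.Println(a, b)
}
```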

easwars · 2026-03-10T19:35:13Z


 // GetHandshakeInfo returns a pointer to the *HandshakeInfo stored in attr.
-func GetHandshakeInfo(attr *attributes.Attributes) *unsafe.Pointer {
+func GetHandshakeInfo(attr *attributes.Attributes) *atomic.Pointer[HandshakeInfo] {


While you are here, do you mind removing the Get prefix from this function's name?

Go doesn't allow a function and a type in the same package to share a name, and since the struct is also called HandshakeInfo, simply removing the Get prefix is not possible. I changed it to HandshakeInfoFromAttribute. Let me know what you think.
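
The naming constraint can be seen in a short sketch. The `attributes` map and the key string are hypothetical simplifications; the real accessor works on *attributes.Attributes and returns a *atomic.Pointer[HandshakeInfo], which this sketch does not reproduce.

```go
package main

import "fmt"

// HandshakeInfo is the type. Package-level identifiers share one namespace,
// so adding `func HandshakeInfo(...)` to this package would be rejected by
// the compiler as a redeclaration.
type HandshakeInfo struct{ Cluster string }

// attributes is a hypothetical stand-in for an attribute bag.
type attributes map[string]any

// HandshakeInfoFromAttribute avoids the clash while still dropping the Get
// prefix. It returns nil when nothing is stored under the key.
func HandshakeInfoFromAttribute(a attributes) *HandshakeInfo {
	hi, _ := a["handshake-info"].(*HandshakeInfo)
	return hi
}

func main() {
	a := attributes{"handshake-info": &HandshakeInfo{Cluster: "leaf"}}
	fmt.Println(HandshakeInfoFromAttribute(a).Cluster)
	fmt.Println(HandshakeInfoFromAttribute(attributes{}) == nil)
}
```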

easwars · 2026-03-10T19:37:01Z

+	if !b.xdsCredsInUse {
+		return nil
+	}
+	var xdsHI *xds.HandshakeInfo


Nit: We can get rid of this local variable and inline the call to xds.NewHandshakeInfo within the call to b.xdsHIPtr.Store.

easwars · 2026-03-10T19:49:56Z

easwars

LGTM, modulo minor comments

easwars · 2026-03-10T19:50:55Z

/gemini review

gemini-code-assist

Code Review

This pull request refactors security configuration handling for xDS aggregate clusters, moving the logic from the top-level cdsbalancer to the leaf clusterimpl balancer, aligning with gRFC A74. It also addresses an issue with priority.UpdateClientConnState error propagation. However, two potential nil pointer dereference vulnerabilities were identified that could lead to a Denial of Service (panic) when retrieving certificate provider configurations and handshake information without proper validation. A comprehensive test case has been added to validate the new behavior with aggregate clusters.

gemini-code-assist · 2026-03-10T20:00:46Z

+	hiPtr := xdsinternal.GetHandshakeInfo(chi.Attributes)
+	hi := (*xdsinternal.HandshakeInfo)(hiPtr.Load())


In ClientHandshake, a nil pointer dereference vulnerability can occur if xdsinternal.GetHandshakeInfo(chi.Attributes) returns nil, leading to a panic. This happens if handshakeAttrKey is missing. A check for nil hiPtr should be added to fall back to the default credentials. Additionally, the type cast to (*xdsinternal.HandshakeInfo) is redundant as hiPtr.Load() already returns the correct type.

Suggested change

-	hiPtr := xdsinternal.GetHandshakeInfo(chi.Attributes)
-	hi := (*xdsinternal.HandshakeInfo)(hiPtr.Load())
+	hiPtr := xdsinternal.GetHandshakeInfo(chi.Attributes)
+	if hiPtr == nil {
+		return c.fallback.ClientHandshake(ctx, authority, rawConn)
+	}
+	hi := (*xdsinternal.HandshakeInfo)(hiPtr.Load())
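
The guard suggested above can be sketched in isolation as follows. Everything here is hypothetical scaffolding (`attrs`, `handshakeKey`, `lookup`, `newAttrs`, and the string results standing in for real handshakes). The second nil check, on the loaded value, is an addition of this sketch and not something the review asks for.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// handshakeInfo is a hypothetical stand-in for the real HandshakeInfo type.
type handshakeInfo struct{ cluster string }

// attrs is a hypothetical attribute bag keyed by string.
type attrs map[string]*atomic.Pointer[handshakeInfo]

const handshakeKey = "handshake-info"

// lookup returns nil when the key is absent, which is the case the review
// comment warns about.
func lookup(a attrs) *atomic.Pointer[handshakeInfo] { return a[handshakeKey] }

// newAttrs builds an attribute bag holding handshake info for one cluster.
func newAttrs(cluster string) attrs {
	p := new(atomic.Pointer[handshakeInfo])
	p.Store(&handshakeInfo{cluster: cluster})
	return attrs{handshakeKey: p}
}

// handshake takes the xDS path only when both the pointer holder and the
// value stored in it are non-nil. Otherwise it uses the fallback path
// instead of dereferencing nil.
func handshake(a attrs) string {
	hiPtr := lookup(a)
	if hiPtr == nil {
		return "fallback"
	}
	hi := hiPtr.Load() // already *handshakeInfo, no conversion needed
	if hi == nil {
		return "fallback"
	}
	return "xds:" + hi.cluster
}

func main() {
	fmt.Println(handshake(attrs{}))
	fmt.Println(handshake(newAttrs("leaf")))
}
```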

eshitachandwani and others added 2 commits March 6, 2026 15:08

security change

1c162d4

Merge branch 'master' into move_security_to_clusterimpl

c222254

eshitachandwani added this to the 1.81 Release milestone Mar 6, 2026

eshitachandwani added the Type: Bug label Mar 6, 2026

eshitachandwani requested a review from easwars March 6, 2026 10:28

eshitachandwani assigned easwars Mar 6, 2026

easwars reviewed Mar 7, 2026

easwars assigned eshitachandwani and unassigned easwars Mar 7, 2026

eshitachandwani added 2 commits March 9, 2026 10:45

address review coments

0087171

Merge remote-tracking branch 'upstream/master' into move_security_to_…

57ad815

…clusterimpl

eshitachandwani requested a review from easwars March 9, 2026 05:54

eshitachandwani assigned easwars and unassigned eshitachandwani Mar 9, 2026

eshitachandwani mentioned this pull request Mar 9, 2026

xds: Add SNI related field in handshake info #8965

Merged

easwars reviewed Mar 10, 2026

easwars approved these changes Mar 10, 2026

easwars assigned eshitachandwani and unassigned easwars Mar 10, 2026

gemini-code-assist Bot reviewed Mar 10, 2026

make changes

c05225b

eshitachandwani merged commit fd53961 into grpc:master Mar 11, 2026
14 checks passed

eshitachandwani deleted the move_security_to_clusterimpl branch March 11, 2026 04:30
