Skip to content

xds: leaf clusters provide the handshake info instead of top level cluster#8956

Merged
eshitachandwani merged 5 commits into
grpc:masterfrom
eshitachandwani:move_security_to_clusterimpl
Mar 11, 2026
Merged

xds: leaf clusters provide the handshake info instead of top level cluster#8956
eshitachandwani merged 5 commits into
grpc:masterfrom
eshitachandwani:move_security_to_clusterimpl

Conversation

@eshitachandwani

@eshitachandwani eshitachandwani commented Mar 6, 2026

Copy link
Copy Markdown
Member

Fixes: #8599

This PR is part of gRFC A74. The changes in this PR are :

  1. Ensures the handshake uses the security configuration defined at the leaf cluster level, rather than defaulting to the top-level aggregate cluster configuration.
  2. Previously, errors returned by priority.UpdateClientConnState to update the clusterimpl's state were silently suppressed. This has been changed to ensure these errors are properly propagated, triggering a Transient Failure (TF) state when an error is returned.
  3. Added a test case to verify that leaf cluster security configurations take precedence over the top-level aggregate cluster. The test uses a top-level cluster with an invalid SAN matcher (which passes xDS validation but fails at the handshake level) and a leaf cluster with a valid configuration. Confirmed that RPCs now succeed by correctly utilizing the leaf config; verified the test fails on master but passes with this PR.

RELEASE NOTES:

  • xds: Fixed an issue where security config from the top-level aggregate cluster were used instead of the leaf cluster for handshake.

@codecov

codecov Bot commented Mar 6, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 85.71429% with 11 lines in your changes missing coverage. Please review.
✅ Project coverage is 83.25%. Comparing base (8360b4c) to head (c05225b).
⚠️ Report is 7 commits behind head on master.

Files with missing lines Patch % Lines
internal/xds/balancer/clusterimpl/clusterimpl.go 87.09% 4 Missing and 4 partials ⚠️
credentials/xds/xds.go 33.33% 1 Missing and 1 partial ⚠️
internal/xds/balancer/cdsbalancer/cdsbalancer.go 66.66% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #8956      +/-   ##
==========================================
+ Coverage   83.23%   83.25%   +0.02%     
==========================================
  Files         410      410              
  Lines       32572    32576       +4     
==========================================
+ Hits        27111    27122      +11     
+ Misses       4066     4063       -3     
+ Partials     1395     1391       -4     
Files with missing lines Coverage Δ
internal/credentials/xds/handshake_info.go 93.24% <100.00%> (ø)
internal/xds/balancer/priority/balancer_child.go 91.42% <100.00%> (+0.65%) ⬆️
internal/xds/balancer/cdsbalancer/cdsbalancer.go 62.62% <66.66%> (-7.52%) ⬇️
credentials/xds/xds.go 88.88% <33.33%> (-2.13%) ⬇️
internal/xds/balancer/clusterimpl/clusterimpl.go 86.34% <87.09%> (+0.22%) ⬆️

... and 26 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Comment thread internal/xds/balancer/clusterimpl/clusterimpl.go Outdated
Comment thread internal/xds/balancer/clusterimpl/clusterimpl.go Outdated
Comment thread internal/xds/balancer/clusterimpl/clusterimpl.go Outdated
// If the security config is invalid, for example, if the provider
// instance is not found in the bootstrap config, we need to put the
// channel in transient failure.
return fmt.Errorf("received Cluster resource that contains invalid security config: %v", err)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to return ErrBadResolver state here?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure. My understanding was that we return ErrBadResolver when something that resolver has created or added is not right. Like xdsConfig is not added as an attribute of something like that. But in this case is handleSecurityConfig returns an error that mean the security config or the CDS sent by the management server is not correct or has error. That has nothing to do with resolver. That is why I did not return ErrBadResolver here.
WDYT?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was under the view that any data that an LB policy gets from its parent is considered name resolver data and therefore if the LB policy doesn't like it, it should return ErrBadResolverState. But looking at the code and reading the comments in resolver.go, it looks like it shouldn't matter for the xDS resolver since it is a watch based resolver and not a polling resolver.

	// If an error is returned, the resolver should try to resolve the
	// target again. The resolver should use a backoff timer to prevent
	// overloading the server with requests. If a resolver is certain that
	// reresolving will not change the result, e.g. because it is
	// a watch-based resolver, returned errors can be ignored.

As long as none of the LB policies in our tree actually looks into the error value returned by its child, and acts differently based on whether it is ErrBadResolverState or not, we should be fine. I checked the code, and it doesn't look like there is any policy that does this. But you should also double check this. Thanks.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only place I could see the ErrBadResolverState being used is in lazy balancer here . And the lazy balancer is used in ringhash. It looks like it will call re-resolve when receiving ErrBadResolverState. But as you said returning ErrBadResolverState might not make sense because xDS resolver is watch based and if it got a bad security config once, only when mgmt server sends a new update itself, it might have a good security config. Let me know what you think.

Comment thread internal/xds/balancer/clusterimpl/clusterimpl.go Outdated
Comment thread internal/xds/balancer/clusterimpl/clusterimpl.go Outdated
Comment thread internal/xds/balancer/clusterimpl/tests/balancer_test.go Outdated
Comment thread internal/xds/balancer/priority/balancer_child.go
@easwars easwars assigned eshitachandwani and unassigned easwars Mar 7, 2026
Comment thread credentials/xds/xds.go Outdated
Comment on lines +116 to +117
hiPtr := xdsinternal.GetHandshakeInfo(chi.Attributes)
hi := (*xdsinternal.HandshakeInfo)(hiPtr.Load())

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm pretty sure we don't need this type assertion, since GetHandshakeInfo returns an atomic.Pointer[HandshakeInfo].

This can be replaced with a single line:
hi := xdsinternal.GetHandshakeInfo(chi.Attributes).Load()

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.


// GetHandshakeInfo returns a pointer to the *HandshakeInfo stored in attr.
func GetHandshakeInfo(attr *attributes.Attributes) *unsafe.Pointer {
func GetHandshakeInfo(attr *attributes.Attributes) *atomic.Pointer[HandshakeInfo] {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While you are here, do you mind removing the Get prefix from this function's name.

@eshitachandwani eshitachandwani Mar 10, 2026

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Go doesnt allow function and variable to have same name and since the struct is also called HandshakeInfo , simply removing the Get prefix is not allowed. Changed to HandshakeInfoFromAttribute , let me know what you think?

if !b.xdsCredsInUse {
return nil
}
var xdsHI *xds.HandshakeInfo

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: We can get rid of this local variable and inline the call to xds.NewHandshakeInfo within the call to b.xdsHIPtr.Store.

// If the security config is invalid, for example, if the provider
// instance is not found in the bootstrap config, we need to put the
// channel in transient failure.
return fmt.Errorf("received Cluster resource that contains invalid security config: %v", err)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was under the view that any data that an LB policy gets from its parent is considered name resolver data and therefore if the LB policy doesn't like it, it should return ErrBadResolverState. But looking at the code and reading the comments in resolver.go, it looks like it shouldn't matter for the xDS resolver since it is a watch based resolver and not a polling resolver.

	// If an error is returned, the resolver should try to resolve the
	// target again. The resolver should use a backoff timer to prevent
	// overloading the server with requests. If a resolver is certain that
	// reresolving will not change the result, e.g. because it is
	// a watch-based resolver, returned errors can be ignored.

As long as none of the LB policies in our tree actually looks into the error value returned by its child, and acts differently based on whether it is ErrBadResolverState or not, we should be fine. I checked the code, and it doesn't look like there is any policy that does this. But you should also double check this. Thanks.

@easwars easwars left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, modulo minor comments

@easwars

easwars commented Mar 10, 2026

Copy link
Copy Markdown
Contributor

/gemini review

@easwars easwars assigned eshitachandwani and unassigned easwars Mar 10, 2026

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request refactors security configuration handling for xDS aggregate clusters, moving the logic from the top-level cdsbalancer to the leaf clusterimpl balancer, aligning with gRFC A74. It also addresses an issue with priority.UpdateClientConnState error propagation. However, two potential nil pointer dereference vulnerabilities were identified that could lead to a Denial of Service (panic) when retrieving certificate provider configurations and handshake information without proper validation. A comprehensive test case has been added to validate the new behavior with aggregate clusters.

Comment thread internal/xds/balancer/clusterimpl/clusterimpl.go
Comment thread credentials/xds/xds.go Outdated
Comment on lines +116 to +117
hiPtr := xdsinternal.GetHandshakeInfo(chi.Attributes)
hi := (*xdsinternal.HandshakeInfo)(hiPtr.Load())

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

security-high high

In ClientHandshake, a nil pointer dereference vulnerability can occur if xdsinternal.GetHandshakeInfo(chi.Attributes) returns nil, leading to a panic. This happens if handshakeAttrKey is missing. A check for nil hiPtr should be added to fall back to the default credentials. Additionally, the type cast to (*xdsinternal.HandshakeInfo) is redundant as hiPtr.Load() already returns the correct type.

Suggested change
hiPtr := xdsinternal.GetHandshakeInfo(chi.Attributes)
hi := (*xdsinternal.HandshakeInfo)(hiPtr.Load())
hiPtr := xdsinternal.GetHandshakeInfo(chi.Attributes)
if hiPtr == nil {
return c.fallback.ClientHandshake(ctx, authority, rawConn)
}
hi := (*xdsinternal.HandshakeInfo)(hiPtr.Load())

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DOne.

@eshitachandwani eshitachandwani merged commit fd53961 into grpc:master Mar 11, 2026
14 checks passed
@eshitachandwani eshitachandwani deleted the move_security_to_clusterimpl branch March 11, 2026 04:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

cds: leaf cluster security config must not be inherited from aggregate cluster

2 participants