xds: leaf clusters provide the handshake info instead of top level cluster#8956
Conversation
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## master #8956 +/- ##
==========================================
+ Coverage 83.23% 83.25% +0.02%
==========================================
Files 410 410
Lines 32572 32576 +4
==========================================
+ Hits 27111 27122 +11
+ Misses 4066 4063 -3
+ Partials 1395 1391 -4
🚀 New features to boost your workflow:
|
| // If the security config is invalid, for example, if the provider | ||
| // instance is not found in the bootstrap config, we need to put the | ||
| // channel in transient failure. | ||
| return fmt.Errorf("received Cluster resource that contains invalid security config: %v", err) |
There was a problem hiding this comment.
Do we need to return ErrBadResolver state here?
There was a problem hiding this comment.
I am not sure. My understanding was that we return ErrBadResolver when something that resolver has created or added is not right. Like xdsConfig is not added as an attribute of something like that. But in this case is handleSecurityConfig returns an error that mean the security config or the CDS sent by the management server is not correct or has error. That has nothing to do with resolver. That is why I did not return ErrBadResolver here.
WDYT?
There was a problem hiding this comment.
I was under the view that any data that an LB policy gets from its parent is considered name resolver data and therefore if the LB policy doesn't like it, it should return ErrBadResolverState. But looking at the code and reading the comments in resolver.go, it looks like it shouldn't matter for the xDS resolver since it is a watch based resolver and not a polling resolver.
// If an error is returned, the resolver should try to resolve the
// target again. The resolver should use a backoff timer to prevent
// overloading the server with requests. If a resolver is certain that
// reresolving will not change the result, e.g. because it is
// a watch-based resolver, returned errors can be ignored.
As long as none of the LB policies in our tree actually looks into the error value returned by its child, and acts differently based on whether it is ErrBadResolverState or not, we should be fine. I checked the code, and it doesn't look like there is any policy that does this. But you should also double check this. Thanks.
There was a problem hiding this comment.
The only place I could see the ErrBadResolverState being used is in lazy balancer here . And the lazy balancer is used in ringhash. It looks like it will call re-resolve when receiving ErrBadResolverState. But as you said returning ErrBadResolverState might not make sense because xDS resolver is watch based and if it got a bad security config once, only when mgmt server sends a new update itself, it might have a good security config. Let me know what you think.
| hiPtr := xdsinternal.GetHandshakeInfo(chi.Attributes) | ||
| hi := (*xdsinternal.HandshakeInfo)(hiPtr.Load()) |
There was a problem hiding this comment.
I'm pretty sure we don't need this type assertion, since GetHandshakeInfo returns an atomic.Pointer[HandshakeInfo].
This can be replaced with a single line:
hi := xdsinternal.GetHandshakeInfo(chi.Attributes).Load()
|
|
||
| // GetHandshakeInfo returns a pointer to the *HandshakeInfo stored in attr. | ||
| func GetHandshakeInfo(attr *attributes.Attributes) *unsafe.Pointer { | ||
| func GetHandshakeInfo(attr *attributes.Attributes) *atomic.Pointer[HandshakeInfo] { |
There was a problem hiding this comment.
While you are here, do you mind removing the Get prefix from this function's name.
There was a problem hiding this comment.
Go doesnt allow function and variable to have same name and since the struct is also called HandshakeInfo , simply removing the Get prefix is not allowed. Changed to HandshakeInfoFromAttribute , let me know what you think?
| if !b.xdsCredsInUse { | ||
| return nil | ||
| } | ||
| var xdsHI *xds.HandshakeInfo |
There was a problem hiding this comment.
Nit: We can get rid of this local variable and inline the call to xds.NewHandshakeInfo within the call to b.xdsHIPtr.Store.
| // If the security config is invalid, for example, if the provider | ||
| // instance is not found in the bootstrap config, we need to put the | ||
| // channel in transient failure. | ||
| return fmt.Errorf("received Cluster resource that contains invalid security config: %v", err) |
There was a problem hiding this comment.
I was under the view that any data that an LB policy gets from its parent is considered name resolver data and therefore if the LB policy doesn't like it, it should return ErrBadResolverState. But looking at the code and reading the comments in resolver.go, it looks like it shouldn't matter for the xDS resolver since it is a watch based resolver and not a polling resolver.
// If an error is returned, the resolver should try to resolve the
// target again. The resolver should use a backoff timer to prevent
// overloading the server with requests. If a resolver is certain that
// reresolving will not change the result, e.g. because it is
// a watch-based resolver, returned errors can be ignored.
As long as none of the LB policies in our tree actually looks into the error value returned by its child, and acts differently based on whether it is ErrBadResolverState or not, we should be fine. I checked the code, and it doesn't look like there is any policy that does this. But you should also double check this. Thanks.
easwars
left a comment
There was a problem hiding this comment.
LGTM, modulo minor comments
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request refactors security configuration handling for xDS aggregate clusters, moving the logic from the top-level cdsbalancer to the leaf clusterimpl balancer, aligning with gRFC A74. It also addresses an issue with priority.UpdateClientConnState error propagation. However, two potential nil pointer dereference vulnerabilities were identified that could lead to a Denial of Service (panic) when retrieving certificate provider configurations and handshake information without proper validation. A comprehensive test case has been added to validate the new behavior with aggregate clusters.
| hiPtr := xdsinternal.GetHandshakeInfo(chi.Attributes) | ||
| hi := (*xdsinternal.HandshakeInfo)(hiPtr.Load()) |
There was a problem hiding this comment.
In ClientHandshake, a nil pointer dereference vulnerability can occur if xdsinternal.GetHandshakeInfo(chi.Attributes) returns nil, leading to a panic. This happens if handshakeAttrKey is missing. A check for nil hiPtr should be added to fall back to the default credentials. Additionally, the type cast to (*xdsinternal.HandshakeInfo) is redundant as hiPtr.Load() already returns the correct type.
| hiPtr := xdsinternal.GetHandshakeInfo(chi.Attributes) | |
| hi := (*xdsinternal.HandshakeInfo)(hiPtr.Load()) | |
| hiPtr := xdsinternal.GetHandshakeInfo(chi.Attributes) | |
| if hiPtr == nil { | |
| return c.fallback.ClientHandshake(ctx, authority, rawConn) | |
| } | |
| hi := (*xdsinternal.HandshakeInfo)(hiPtr.Load()) |
Fixes: #8599
This PR is part of gRFC A74. The changes in this PR are :
priority.UpdateClientConnStateto update the clusterimpl's state were silently suppressed. This has been changed to ensure these errors are properly propagated, triggering a Transient Failure (TF) state when an error is returned.RELEASE NOTES: