Conversation
|
Fixes #56923 |
|
Converting to draft as I stress test this and remove the debug log lines |
5937554 to
13cadf9
Compare
Signed-off-by: Keith Mattix II <keithmattix@microsoft.com>
2913c40 to
fac0725
Compare
Signed-off-by: Keith Mattix II <keithmattix@microsoft.com>
fac0725 to
e15dc42
Compare
|
/retest |
Signed-off-by: Keith Mattix II <keithmattix@microsoft.com>
keithmattix
left a comment
There was a problem hiding this comment.
Commented the PR a bit for easier review
| existed := i.dependencyState.collectionDependencies.InsertContains(d.id) | ||
| i.mu.Unlock() | ||
| // For any new collections we depend on, start watching them if its the first time we have watched them. | ||
| if !i.dependencyState.collectionDependencies.InsertContains(d.id) { |
There was a problem hiding this comment.
This was previous non-locked contested access leading to the data races on postsubmit
| EndpointCount int | ||
| } | ||
|
|
||
| type EventMatcher struct { |
There was a problem hiding this comment.
Just makes tests a bit less flaky
| func(hc krt.HandlerContext) *network.ID { | ||
| nw := ptr.OrEmpty(krt.FetchOne(hc, globalNetworks.LocalSystemNamespace.AsCollection())) | ||
| return network.ID(nw) | ||
| return ptr.Of(network.ID(nw)) |
There was a problem hiding this comment.
Using a pointer to better detect when we haven't ascertained the remote cluster's network yet
| func(hc krt.HandlerContext) network.ID { | ||
| nwPtr := krt.FetchOne(ctx, globalNetworks.RemoteSystemNamespaceNetworks, krt.FilterIndex(globalNetworks.SystemNamespaceNetworkByCluster, c.ID)) | ||
| func(hc krt.HandlerContext) *network.ID { | ||
| nwPtr := krt.FetchOne(hc, globalNetworks.RemoteSystemNamespaceNetworks, krt.FilterIndex(globalNetworks.RemoteSystemNamespaceNetworkByCluster, c.ID)) |
There was a problem hiding this comment.
Make sure we're using the correct handler context
| SystemNamespaceNetworkByCluster krt.Index[cluster.ID, krt.Singleton[string]] | ||
| NetworkGateways krt.Collection[NetworkGateway] | ||
| GatewaysByNetwork krt.Index[network.ID, NetworkGateway] | ||
| RemoteSystemNamespaceNetworkByCluster krt.Index[cluster.ID, krt.Singleton[string]] |
There was a problem hiding this comment.
Most of the diff is due to this; it's a rename to make clearer to callers that this is only remote network data
| trafficType string | ||
| podAssertion func(s *ambientTestServer) | ||
| svcAssertion func(s *ambientTestServer) | ||
| podAssertion func(s *ambientTestServer, c cluster.ID) |
There was a problem hiding this comment.
Used with the matching added later on so we can be more specific about what we DON'T want to see
Please provide a description of this PR:
The merge logic in the join/nestedjoin collections had a TOCTOU bug where we were looking at the present state of its set of collections to understand what the state was at the time the original event was triggered. This could cause panics like what was observed in #56923 if the event was e.g. a stale update, but the key had already been removed from all of our collections.To fix this, we move to a key based event handling model where we utilize our merge cache as the sole source of truth for diffing purposes. This means we may miss some intermediate steps, but the end result should be equivalent.Reverted the above in the second commit since it doesn't seem to play well with split horizon for some reason. Now, this PR just fixes some bugs and flakes