Skip to content

Fix Data Races#56926

Closed
keithmattix wants to merge 3 commits intoistio:masterfrom
keithmattix:fix-nestedjoin-bugs
Closed

Fix Data Races#56926
keithmattix wants to merge 3 commits intoistio:masterfrom
keithmattix:fix-nestedjoin-bugs

Conversation

@keithmattix
Copy link
Copy Markdown
Contributor

@keithmattix keithmattix commented Jul 8, 2025

Please provide a description of this PR:
The merge logic in the join/nestedjoin collections had a TOCTOU bug where we were looking at the present state of its set of collections to understand what the state was at the time the original event was triggered. This could cause panics like what was observed in #56923 if the event was e.g. a stale update, but the key had already been removed from all of our collections.

To fix this, we move to a key based event handling model where we utilize our merge cache as the sole source of truth for diffing purposes. This means we may miss some intermediate steps, but the end result should be equivalent.

Reverted the above in the second commit since it doesn't seem to play well with split horizon for some reason. Now, this PR just fixes some bugs and flakes

@keithmattix keithmattix requested review from a team as code owners July 8, 2025 22:21
@istio-policy-bot istio-policy-bot added area/ambient Issues related to ambient mesh release-notes-none Indicates a PR that does not require release notes. labels Jul 8, 2025
@istio-testing istio-testing added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Jul 8, 2025
@keithmattix
Copy link
Copy Markdown
Contributor Author

keithmattix commented Jul 8, 2025

Fixes #56923

@keithmattix keithmattix added the cherrypick/release-1.27 Set this label on a PR to auto-merge it to the release-1.27 branch label Jul 9, 2025
@istio-testing istio-testing added needs-rebase Indicates a PR needs to be rebased before being merged size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jul 10, 2025
@keithmattix keithmattix marked this pull request as draft July 10, 2025 21:31
@istio-testing istio-testing added the do-not-merge/work-in-progress Block merging of a PR because it isn't ready yet. label Jul 10, 2025
@keithmattix
Copy link
Copy Markdown
Contributor Author

Converting to draft as I stress test this and remove the debug log lines

@keithmattix keithmattix force-pushed the fix-nestedjoin-bugs branch from 5937554 to 13cadf9 Compare July 11, 2025 00:03
@istio-testing istio-testing removed the needs-rebase Indicates a PR needs to be rebased before being merged label Jul 11, 2025
@keithmattix keithmattix marked this pull request as ready for review July 11, 2025 00:05
@istio-testing istio-testing removed the do-not-merge/work-in-progress Block merging of a PR because it isn't ready yet. label Jul 11, 2025
Signed-off-by: Keith Mattix II <keithmattix@microsoft.com>
@keithmattix keithmattix force-pushed the fix-nestedjoin-bugs branch from 2913c40 to fac0725 Compare July 11, 2025 00:22
Signed-off-by: Keith Mattix II <keithmattix@microsoft.com>
@keithmattix keithmattix force-pushed the fix-nestedjoin-bugs branch from fac0725 to e15dc42 Compare July 11, 2025 00:23
@zirain
Copy link
Copy Markdown
Member

zirain commented Jul 11, 2025

/retest

Signed-off-by: Keith Mattix II <keithmattix@microsoft.com>
@istio-testing istio-testing added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Jul 11, 2025
Copy link
Copy Markdown
Contributor Author

@keithmattix keithmattix left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Commented the PR a bit for easier review

existed := i.dependencyState.collectionDependencies.InsertContains(d.id)
i.mu.Unlock()
// For any new collections we depend on, start watching them if its the first time we have watched them.
if !i.dependencyState.collectionDependencies.InsertContains(d.id) {
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was previous non-locked contested access leading to the data races on postsubmit

EndpointCount int
}

type EventMatcher struct {
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just makes tests a bit less flaky

func(hc krt.HandlerContext) *network.ID {
nw := ptr.OrEmpty(krt.FetchOne(hc, globalNetworks.LocalSystemNamespace.AsCollection()))
return network.ID(nw)
return ptr.Of(network.ID(nw))
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using a pointer to better detect when we haven't ascertained the remote cluster's network yet

func(hc krt.HandlerContext) network.ID {
nwPtr := krt.FetchOne(ctx, globalNetworks.RemoteSystemNamespaceNetworks, krt.FilterIndex(globalNetworks.SystemNamespaceNetworkByCluster, c.ID))
func(hc krt.HandlerContext) *network.ID {
nwPtr := krt.FetchOne(hc, globalNetworks.RemoteSystemNamespaceNetworks, krt.FilterIndex(globalNetworks.RemoteSystemNamespaceNetworkByCluster, c.ID))
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make sure we're using the correct handler context

SystemNamespaceNetworkByCluster krt.Index[cluster.ID, krt.Singleton[string]]
NetworkGateways krt.Collection[NetworkGateway]
GatewaysByNetwork krt.Index[network.ID, NetworkGateway]
RemoteSystemNamespaceNetworkByCluster krt.Index[cluster.ID, krt.Singleton[string]]
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Most of the diff is due to this; it's a rename to make clearer to callers that this is only remote network data

trafficType string
podAssertion func(s *ambientTestServer)
svcAssertion func(s *ambientTestServer)
podAssertion func(s *ambientTestServer, c cluster.ID)
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Used with the matching added later on so we can be more specific about what we DON'T want to see

@keithmattix keithmattix changed the title Move to a key-based reconciliation model for merged collections Fix Data Races Jul 11, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/ambient Issues related to ambient mesh cherrypick/release-1.27 Set this label on a PR to auto-merge it to the release-1.27 branch release-notes-none Indicates a PR that does not require release notes. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants