-
Notifications
You must be signed in to change notification settings - Fork 18.9k
libnetwork/networkdb: make TestNetworkDBIslands not flaky #50343
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR fixes flakiness in the TestNetworkDBIslands integration test and corrects the cluster rejoin logic to distinguish nodes by both IP and port.
- Renamed and updated
TestNetworkDBIslandsto accept nodes marked as left or failed in any combination. - Switched bootstrap parsing in
rejoinClusterBootStrapto usenetip.ParseAddrPortand compare full address-port pairs.
Reviewed Changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| libnetwork/networkdb/networkdb_test.go | Renamed test, unified left/failed count checks, and updated log messages |
| libnetwork/networkdb/cluster.go | Replaced net.SplitHostPort/net.ParseIP usage with netip parsing and full port comparisons |
The rejoinClusterBootStrap periodic task rejoins with the bootstrap nodes if none of them are members of the cluster. It correlates the cluster nodes with the bootstrap list by comparing IP addresses, ignoring ports. In normal operation this works out fine as every node has a unique IP address, but in unit tests every node listens on a distinct port of 127.0.0.1. This situation causes the check to incorrectly filter out all nodes from the list, mistaking them for the local node. Filter out the local node using pointer equality of the *node to avoid any ambiguity. Correlate the remote nodes by IP:port so that the check behaves the same in tests and in production. Signed-off-by: Cory Snider <csnider@mirantis.com>
With rejoinClusterBootStrap fixed in tests, split clusters should reliably self-heal in tests as well as production. Work around the other source of flakiness in TestNetworkDBIslands: timing out waiting for a failed node to transition to gracefully left. This flake happens when one of the leaving nodes sends its NodeLeft message to the other leaving node, and the second is shut down before it has a chance to rebroadcast the message to the remaining nodes. The proper fix would be to leverage memberlist's own bookkeeping instead of duplicating it poorly with user messages, but doing so requires a change in the memberlist module. Instead have the test check that the sum of failed+left nodes is expected instead of waiting for all nodes to have failed==3 && left==0. Signed-off-by: Cory Snider <csnider@mirantis.com>
d8032a7 to
aff444d
Compare
| // bootstrap IPs are usually IP:port from the Join | ||
| var bootstrapIP net.IP | ||
| ipStr, _, err := net.SplitHostPort(bootIP) | ||
| bootstrapIP, err := netip.ParseAddrPort(bootIP) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TIL netip.ParseAddrPort, nice!
(Looks like it only is for ip-address:port, not hostname:port though, so I guess in many of our cases, we'd still need net.SplitHostPort or variants thereof)
|
Had to kick CI because GitHub was mis-behaving. This one is unrelated (also known flaky test); |
vvoland
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks!
thaJeztah
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
- What I did
- How I did it
I identified a bug in the rejoin logic that TestNetworkDBIslands is supposed to be validating which only manifests in tests as that is the only time where more than one node can be advertising the same IP address, only differing by port. I updated the logic to use both the IP address and port number when identifying nodes so that it behaves the same in tests as in production. I also updated the test itself to be more tolerant of nodes being spuriously marked as failed.
- How to verify it
go test -count=...- Human readable description for the release notes
- A picture of a cute animal (not mandatory but encouraged)