Skip to content

Conversation

@corhere
Copy link
Contributor

@corhere corhere commented Jul 7, 2025

- What I did

- How I did it
I identified a bug in the rejoin logic that TestNetworkDBIslands is supposed to be validating which only manifests in tests as that is the only time where more than one node can be advertising the same IP address, only differing by port. I updated the logic to use both the IP address and port number when identifying nodes so that it behaves the same in tests as in production. I also updated the test itself to be more tolerant of nodes being spuriously marked as failed.

- How to verify it
go test -count=...

- Human readable description for the release notes

- A picture of a cute animal (not mandatory but encouraged)

Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR fixes flakiness in the TestNetworkDBIslands integration test and corrects the cluster rejoin logic to distinguish nodes by both IP and port.

  • Renamed and updated TestNetworkDBIslands to accept nodes marked as left or failed in any combination.
  • Switched bootstrap parsing in rejoinClusterBootStrap to use netip.ParseAddrPort and compare full address-port pairs.

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File Description
libnetwork/networkdb/networkdb_test.go Renamed test, unified left/failed count checks, and updated log messages
libnetwork/networkdb/cluster.go Replaced net.SplitHostPort/net.ParseIP usage with netip parsing and full port comparisons

corhere added 2 commits July 7, 2025 14:09
The rejoinClusterBootStrap periodic task rejoins with the bootstrap
nodes if none of them are members of the cluster. It correlates the
cluster nodes with the bootstrap list by comparing IP addresses,
ignoring ports. In normal operation this works out fine as every node
has a unique IP address, but in unit tests every node listens on a
distinct port of 127.0.0.1. This situation causes the check to
incorrectly filter out all nodes from the list, mistaking them for the
local node.

Filter out the local node using pointer equality of the *node to avoid
any ambiguity. Correlate the remote nodes by IP:port so that the check
behaves the same in tests and in production.

Signed-off-by: Cory Snider <csnider@mirantis.com>
With rejoinClusterBootStrap fixed in tests, split clusters should
reliably self-heal in tests as well as production. Work around the other
source of flakiness in TestNetworkDBIslands: timing out waiting for a
failed node to transition to gracefully left. This flake happens when
one of the leaving nodes sends its NodeLeft message to the other leaving
node, and the second is shut down before it has a chance to rebroadcast
the message to the remaining nodes. The proper fix would be to leverage
memberlist's own bookkeeping instead of duplicating it poorly with user
messages, but doing so requires a change in the memberlist module.
Instead have the test check that the sum of failed+left nodes is
expected instead of waiting for all nodes to have failed==3 && left==0.

Signed-off-by: Cory Snider <csnider@mirantis.com>
@corhere corhere force-pushed the libn/networkdb-rejoin-tests branch from d8032a7 to aff444d Compare July 7, 2025 18:09
@thaJeztah thaJeztah changed the title libnetwork/networkdb: make TestNetworkDBIslands not flay libnetwork/networkdb: make TestNetworkDBIslands not flaky Jul 8, 2025
// bootstrap IPs are usually IP:port from the Join
var bootstrapIP net.IP
ipStr, _, err := net.SplitHostPort(bootIP)
bootstrapIP, err := netip.ParseAddrPort(bootIP)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TIL netip.ParseAddrPort, nice!

(Looks like it only is for ip-address:port, not hostname:port though, so I guess in many of our cases, we'd still need net.SplitHostPort or variants thereof)

@thaJeztah
Copy link
Member

Had to kick CI because GitHub was mis-behaving.

This one is unrelated (also known flaky test);

=== FAIL: github.com/docker/docker/integration/networking TestFlakyPortMappedHairpinWindows (13.19s)
    nat_windows_test.go:110: assertion failed: error is not nil: Post "http://%2F%2F.%2Fpipe%2Fdocker_engine/v1.51/containers/40851ef9e2d5547472cea9120b7399a533fcec0b200119986b448189fbabc510/start": context deadline exceeded
    panic.go:636: assertion failed: error is not nil: Error response from daemon: error while removing network: network clientnet has active endpoints (name:"quirky_wilbur" id:"4c0041f5bd55")

Copy link
Contributor

@vvoland vvoland left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks!

Copy link
Member

@thaJeztah thaJeztah left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@thaJeztah thaJeztah merged commit 1de9f39 into moby:master Jul 8, 2025
350 of 359 checks passed
@corhere corhere deleted the libn/networkdb-rejoin-tests branch July 24, 2025 16:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Flaky test: libnetwork/TestNetworkDBIslands

3 participants