Fix flaky libnetwork/networkdb tests by rvolosatovs · Pull Request #42625 · moby/moby

rvolosatovs · 2021-07-12T15:48:25Z

- What I did
Make TestNetworkDBIslands run faster and respect -timeout go test flag value
(Partially?) address #42459
Fix TestNetworkDBNodeJoinLeaveIteration and TestNetworkDBCRUDTableEntries as well, since they now fail constantly and break CI. Potentially, that can be extracted into another PR, which this one would be blocked on, but I don't think it's worth it, since the changes are minimal, let me know if you think otherwise.

- How I did it

Make rejoin intervals configurable, specify a lower interval during test and account for test.Deadline value
Update github.com/hashicorp/memberlist. Note that currently used version dates back to 2017.
There have been a few improvements and bugfixes made since, in particular, the logic handling left/failed nodes was improved.

- How to verify it
I believe this commit is of most interest hashicorp/memberlist@237d410, but I have not made a thorough analysis, whether this is indeed the "fix" we need.

Note, I have encountered the following failure only once in ~100 runs:

=== Failed
=== FAIL: libnetwork/networkdb TestNetworkDBIslands (22.03s)
time="2021-07-12T15:12:17Z" level=info msg="New memberlist node - Node:node1 will use memberlist nodeID:f135e2881fc7 with config:&{NodeID:f135e2881fc7 Hostname:node1 BindAddr:0.0.0.0 AdvertiseAddr: BindPort:10001 Keys:[] PacketBufferSize:1400 reapEntryInterval:1800000000000 reapNetworkInterval:1825000000000 rejoinClusterDuration:1000000000 rejoinClusterInterval:6000000000 StatsPrintPeriod:5m0s HealthPrintPeriod:1m0s}"
time="2021-07-12T15:12:17Z" level=info msg="Node f135e2881fc7/172.17.0.2, joined gossip cluster"
time="2021-07-12T15:12:17Z" level=info msg="Node f135e2881fc7/172.17.0.2, added to nodes list"
time="2021-07-12T15:12:17Z" level=info msg="New memberlist node - Node:node2 will use memberlist nodeID:12d1d234a539 with config:&{NodeID:12d1d234a539 Hostname:node2 BindAddr:0.0.0.0 AdvertiseAddr: BindPort:10002 Keys:[] PacketBufferSize:1400 reapEntryInterval:1800000000000 reapNetworkInterval:1825000000000 rejoinClusterDuration:1000000000 rejoinClusterInterval:6000000000 StatsPrintPeriod:5m0s HealthPrintPeriod:1m0s}"
time="2021-07-12T15:12:17Z" level=info msg="Node 12d1d234a539/172.17.0.2, joined gossip cluster"
time="2021-07-12T15:12:17Z" level=info msg="Node 12d1d234a539/172.17.0.2, added to nodes list"
time="2021-07-12T15:12:17Z" level=info msg="The new bootstrap node list is:[localhost:10001]"
time="2021-07-12T15:12:17Z" level=debug msg="memberlist: Stream connection from=[::1]:36282"
time="2021-07-12T15:12:17Z" level=debug msg="memberlist: Initiating push/pull sync with:  [::1]:10001"
time="2021-07-12T15:12:17Z" level=info msg="Node 12d1d234a539/172.17.0.2, joined gossip cluster"
time="2021-07-12T15:12:17Z" level=info msg="Node 12d1d234a539/172.17.0.2, added to nodes list"
time="2021-07-12T15:12:17Z" level=info msg="Node f135e2881fc7/172.17.0.2, joined gossip cluster"
time="2021-07-12T15:12:17Z" level=info msg="Node f135e2881fc7/172.17.0.2, added to nodes list"
time="2021-07-12T15:12:17Z" level=debug msg="memberlist: Initiating push/pull sync with:  127.0.0.1:10001"
time="2021-07-12T15:12:17Z" level=debug msg="memberlist: Stream connection from=127.0.0.1:46710"
time="2021-07-12T15:12:18Z" level=info msg="New memberlist node - Node:node3 will use memberlist nodeID:e9889fb524ad with config:&{NodeID:e9889fb524ad Hostname:node3 BindAddr:0.0.0.0 AdvertiseAddr: BindPort:10003 Keys:[] PacketBufferSize:1400 reapEntryInterval:1800000000000 reapNetworkInterval:1825000000000 rejoinClusterDuration:1000000000 rejoinClusterInterval:6000000000 StatsPrintPeriod:5m0s HealthPrintPeriod:1m0s}"
time="2021-07-12T15:12:18Z" level=info msg="Node e9889fb524ad/172.17.0.2, joined gossip cluster"
time="2021-07-12T15:12:18Z" level=info msg="Node e9889fb524ad/172.17.0.2, added to nodes list"
time="2021-07-12T15:12:18Z" level=info msg="The new bootstrap node list is:[localhost:10002]"
time="2021-07-12T15:12:18Z" level=debug msg="memberlist: Initiating push/pull sync with:  [::1]:10002"
time="2021-07-12T15:12:18Z" level=debug msg="memberlist: Stream connection from=[::1]:58118"
time="2021-07-12T15:12:18Z" level=info msg="Node e9889fb524ad/172.17.0.2, joined gossip cluster"
time="2021-07-12T15:12:18Z" level=info msg="Node e9889fb524ad/172.17.0.2, added to nodes list"
time="2021-07-12T15:12:18Z" level=info msg="Node f135e2881fc7/172.17.0.2, joined gossip cluster"
time="2021-07-12T15:12:18Z" level=info msg="Node f135e2881fc7/172.17.0.2, added to nodes list"
time="2021-07-12T15:12:18Z" level=info msg="Node 12d1d234a539/172.17.0.2, joined gossip cluster"
time="2021-07-12T15:12:18Z" level=info msg="Node 12d1d234a539/172.17.0.2, added to nodes list"
time="2021-07-12T15:12:18Z" level=debug msg="memberlist: Initiating push/pull sync with:  127.0.0.1:10002"
time="2021-07-12T15:12:18Z" level=debug msg="memberlist: Stream connection from=127.0.0.1:43476"
time="2021-07-12T15:12:18Z" level=info msg="Node e9889fb524ad/172.17.0.2, joined gossip cluster"
time="2021-07-12T15:12:18Z" level=info msg="Node e9889fb524ad/172.17.0.2, added to nodes list"
time="2021-07-12T15:12:18Z" level=info msg="New memberlist node - Node:node4 will use memberlist nodeID:dd853f955091 with config:&{NodeID:dd853f955091 Hostname:node4 BindAddr:0.0.0.0 AdvertiseAddr: BindPort:10004 Keys:[] PacketBufferSize:1400 reapEntryInterval:1800000000000 reapNetworkInterval:1825000000000 rejoinClusterDuration:1000000000 rejoinClusterInterval:6000000000 StatsPrintPeriod:5m0s HealthPrintPeriod:1m0s}"
time="2021-07-12T15:12:18Z" level=info msg="Node dd853f955091/172.17.0.2, joined gossip cluster"
time="2021-07-12T15:12:18Z" level=info msg="Node dd853f955091/172.17.0.2, added to nodes list"
time="2021-07-12T15:12:18Z" level=info msg="The new bootstrap node list is:[localhost:10003]"
time="2021-07-12T15:12:18Z" level=debug msg="memberlist: Initiating push/pull sync with:  [::1]:10003"
time="2021-07-12T15:12:18Z" level=debug msg="memberlist: Stream connection from=[::1]:39720"
time="2021-07-12T15:12:18Z" level=info msg="Node dd853f955091/172.17.0.2, joined gossip cluster"
time="2021-07-12T15:12:18Z" level=info msg="Node dd853f955091/172.17.0.2, added to nodes list"
time="2021-07-12T15:12:18Z" level=info msg="Node 12d1d234a539/172.17.0.2, joined gossip cluster"
time="2021-07-12T15:12:18Z" level=info msg="Node 12d1d234a539/172.17.0.2, added to nodes list"
time="2021-07-12T15:12:18Z" level=info msg="Node e9889fb524ad/172.17.0.2, joined gossip cluster"
time="2021-07-12T15:12:18Z" level=info msg="Node e9889fb524ad/172.17.0.2, added to nodes list"
time="2021-07-12T15:12:18Z" level=info msg="Node f135e2881fc7/172.17.0.2, joined gossip cluster"
time="2021-07-12T15:12:18Z" level=info msg="Node f135e2881fc7/172.17.0.2, added to nodes list"
time="2021-07-12T15:12:18Z" level=debug msg="memberlist: Initiating push/pull sync with:  127.0.0.1:10003"
time="2021-07-12T15:12:18Z" level=debug msg="memberlist: Stream connection from=127.0.0.1:34386"
time="2021-07-12T15:12:18Z" level=info msg="Node dd853f955091/172.17.0.2, joined gossip cluster"
time="2021-07-12T15:12:18Z" level=info msg="Node dd853f955091/172.17.0.2, joined gossip cluster"
time="2021-07-12T15:12:18Z" level=info msg="Node dd853f955091/172.17.0.2, added to nodes list"
time="2021-07-12T15:12:18Z" level=info msg="Node dd853f955091/172.17.0.2, added to nodes list"
time="2021-07-12T15:12:18Z" level=info msg="New memberlist node - Node:node5 will use memberlist nodeID:644827c34b56 with config:&{NodeID:644827c34b56 Hostname:node5 BindAddr:0.0.0.0 AdvertiseAddr: BindPort:10005 Keys:[] PacketBufferSize:1400 reapEntryInterval:1800000000000 reapNetworkInterval:1825000000000 rejoinClusterDuration:1000000000 rejoinClusterInterval:6000000000 StatsPrintPeriod:5m0s HealthPrintPeriod:1m0s}"
time="2021-07-12T15:12:18Z" level=info msg="Node 644827c34b56/172.17.0.2, joined gossip cluster"
time="2021-07-12T15:12:18Z" level=info msg="Node 644827c34b56/172.17.0.2, added to nodes list"
time="2021-07-12T15:12:18Z" level=info msg="The new bootstrap node list is:[localhost:10004]"
time="2021-07-12T15:12:18Z" level=debug msg="memberlist: Initiating push/pull sync with:  [::1]:10004"
time="2021-07-12T15:12:18Z" level=debug msg="memberlist: Stream connection from=[::1]:35756"
time="2021-07-12T15:12:18Z" level=info msg="Node 644827c34b56/172.17.0.2, joined gossip cluster"
time="2021-07-12T15:12:18Z" level=info msg="Node 644827c34b56/172.17.0.2, added to nodes list"
time="2021-07-12T15:12:18Z" level=info msg="Node e9889fb524ad/172.17.0.2, joined gossip cluster"
time="2021-07-12T15:12:18Z" level=info msg="Node e9889fb524ad/172.17.0.2, added to nodes list"
time="2021-07-12T15:12:18Z" level=info msg="Node f135e2881fc7/172.17.0.2, joined gossip cluster"
time="2021-07-12T15:12:18Z" level=info msg="Node f135e2881fc7/172.17.0.2, added to nodes list"
time="2021-07-12T15:12:18Z" level=info msg="Node 12d1d234a539/172.17.0.2, joined gossip cluster"
time="2021-07-12T15:12:18Z" level=info msg="Node 12d1d234a539/172.17.0.2, added to nodes list"
time="2021-07-12T15:12:18Z" level=info msg="Node dd853f955091/172.17.0.2, joined gossip cluster"
time="2021-07-12T15:12:18Z" level=info msg="Node dd853f955091/172.17.0.2, added to nodes list"
time="2021-07-12T15:12:18Z" level=debug msg="memberlist: Initiating push/pull sync with:  127.0.0.1:10004"
time="2021-07-12T15:12:18Z" level=debug msg="memberlist: Stream connection from=127.0.0.1:43650"
time="2021-07-12T15:12:19Z" level=info msg="Node 644827c34b56/172.17.0.2, joined gossip cluster"
time="2021-07-12T15:12:19Z" level=info msg="Node 644827c34b56/172.17.0.2, joined gossip cluster"
time="2021-07-12T15:12:19Z" level=info msg="Node 644827c34b56/172.17.0.2, added to nodes list"
time="2021-07-12T15:12:19Z" level=info msg="Node 644827c34b56/172.17.0.2, added to nodes list"
time="2021-07-12T15:12:23Z" level=debug msg="rejoinClusterBootStrap did not find any valid IP"
time="2021-07-12T15:12:24Z" level=debug msg="rejoinClusterBootStrap did not find any valid IP"
time="2021-07-12T15:12:24Z" level=debug msg="rejoinClusterBootStrap did not find any valid IP"
time="2021-07-12T15:12:24Z" level=debug msg="rejoinClusterBootStrap did not find any valid IP"
time="2021-07-12T15:12:29Z" level=debug msg="rejoinClusterBootStrap did not find any valid IP"
time="2021-07-12T15:12:30Z" level=debug msg="rejoinClusterBootStrap did not find any valid IP"
time="2021-07-12T15:12:30Z" level=debug msg="rejoinClusterBootStrap did not find any valid IP"
time="2021-07-12T15:12:30Z" level=debug msg="rejoinClusterBootStrap did not find any valid IP"
time="2021-07-12T15:12:35Z" level=debug msg="rejoinClusterBootStrap did not find any valid IP"
time="2021-07-12T15:12:36Z" level=debug msg="rejoinClusterBootStrap did not find any valid IP"
time="2021-07-12T15:12:36Z" level=debug msg="rejoinClusterBootStrap did not find any valid IP"
time="2021-07-12T15:12:36Z" level=debug msg="rejoinClusterBootStrap did not find any valid IP"
    networkdb_test.go:839: timeout hit after 20s: node3:Waiting for cluser peers to be established

That might be just an unfortunate slowness of my system, but it may be another source of flakiness, which has to be investigated still. I propose not to close the original issue just yet, and wait for a few days/weeks to see if the test is still flaky and make the decision then.

- Description for the changelog

- A picture of a cute animal (not mandatory but encouraged)

rvolosatovs · 2021-07-12T16:11:11Z

CI Failure:

[2021-07-12T15:59:32.921Z] === Failed

[2021-07-12T15:59:32.921Z] === FAIL: github.com/docker/docker/libnetwork/networkdb TestNetworkDBCRUDTableEntries (7.63s)

[2021-07-12T15:59:32.921Z]     networkdb_test.go:304: Entry existence verification test failed for node2(c079a6e0a803)

[2021-07-12T15:59:32.921Z] 2021/07/12 15:58:01 Closing DB instances...

~~Might be~~ caused by the update ~~, will investigate~~

I managed to consistently reproduce the error locally with updated dependency (i.e. each run, which effectively breaks CI)
Fixed it in 0e6706d

rvolosatovs · 2021-07-12T16:56:09Z

libnetwork/networkdb/networkdb_test.go

See commit message, perhaps someone more familiar with the codebase can confirm that we don't need to alter the existing networkdb implementation to account for any potential change in memberlist?
@thaJeztah ?

Note, that if this test was flaky before, then we should not have anything to worry about (that would mean that the update did not cause the issue, but perhaps just made it fail more often, due to e.g. improved performance)

Not very familiar with this either, but seems to make sense.

Perhaps @arkodg (if he's still around) would know

thaJeztah · 2021-07-12T16:56:47Z

TestNetworkDBCRUDTableEntries

There's another test that is known to be flaky, and showed a similar error, so it's possible that it may not be related (but good to check); #42484

rvolosatovs · 2021-07-12T17:10:26Z

Another failure:

[2021-07-12T17:04:41.212Z] === Failed

[2021-07-12T17:04:41.212Z] === FAIL: github.com/docker/docker/libnetwork/networkdb TestNetworkDBNodeJoinLeaveIteration (7.04s)

[2021-07-12T17:04:41.212Z]     networkdb_test.go:523: The networkNodes list has to have be 2 instead of 1 - [a2b0de7a54bf]

[2021-07-12T17:04:41.212Z] 

[2021-07-12T17:04:41.212Z] 

[2021-07-12T17:04:41.212Z] DONE 2027 tests, 69 skipped, 1 failure in 314.563s

Looks like another race condition, I have a strong feeling that something like 0e6706d would fix it as well.
The test is definitely flaky, since it usually passes locally, but I managed to reproduce it.

This allows the rejoin intervals to be chosen according to the context within which the component is used, and, in particular, this allows lower intervals to be used within TestNetworkDBIslands test. Signed-off-by: Roman Volosatovs <roman.volosatovs@docker.com>

Upstream update fixes the issue where left node would be marked as failed, which caused `TestNetworkDBIslands` to occasionally fail. Signed-off-by: Roman Volosatovs <roman.volosatovs@docker.com>

`github.com/hashicorp/memberlist` update caused `TestNetworkDBCRUDTableEntries` to occasionally fail, because the test would try to check whether an entry write is propagated to all nodes, but it would not wait for all nodes to be available before performing the write. It could be that the failure is caused simply by improved performance of the dependency - it could also be that some connectivity guarantee the test depended on is not provided by the dependency anymore. The same fix is applied to `TestNetworkDBNodeJoinLeaveIteration` due to same issue. Signed-off-by: Roman Volosatovs <roman.volosatovs@docker.com>

thaJeztah

SGTM (but not an expert on this 😅)

thaJeztah · 2021-07-16T16:47:07Z

libnetwork/networkdb/networkdb_test.go

Not very familiar with this either, but seems to make sense.

Perhaps @arkodg (if he's still around) would know

rvolosatovs force-pushed the fix_network_db_islands branch from 682063f to 3e3b4e9 Compare July 12, 2021 15:49

rvolosatovs force-pushed the fix_network_db_islands branch from 78e92be to 0e6706d Compare July 12, 2021 16:53

rvolosatovs commented Jul 12, 2021

View reviewed changes

rvolosatovs force-pushed the fix_network_db_islands branch from 0e6706d to f250696 Compare July 12, 2021 17:20

rvolosatovs changed the title ~~Fix flaky TestNetworkDBIslands~~ Fix flaky libnetwork/networkdb tests Jul 12, 2021

thaJeztah added status/2-code-review area/networking Networking area/testing labels Jul 12, 2021

Roman Volosatovs added 3 commits July 12, 2021 19:25

vendor: hashicorp/memberlist, google/btree (dep) update

cdd04a9

Upstream update fixes the issue where left node would be marked as failed, which caused `TestNetworkDBIslands` to occasionally fail. Signed-off-by: Roman Volosatovs <roman.volosatovs@docker.com>

rvolosatovs force-pushed the fix_network_db_islands branch from f250696 to 2837fba Compare July 12, 2021 17:26

rvolosatovs marked this pull request as ready for review July 12, 2021 18:42

thaJeztah mentioned this pull request Jul 13, 2021

Update hcsshim vendor to v0.8.20 #42611

Merged

thaJeztah approved these changes Jul 16, 2021

View reviewed changes

justincormack approved these changes Jul 19, 2021

View reviewed changes

justincormack merged commit 3aa7d80 into moby:master Jul 19, 2021

rvolosatovs deleted the fix_network_db_islands branch July 19, 2021 13:55

thaJeztah added this to the 21.xx milestone Jul 20, 2021

thaJeztah mentioned this pull request Jul 21, 2021

storage-driver: promote overlay2, make Btrfs and ZFS opt-in #42661

Merged

thaJeztah mentioned this pull request Jul 29, 2021

Flaky test: TestNetworkDBNodeJoinLeaveIteration #42698

Closed

rvolosatovs mentioned this pull request Aug 1, 2021

libnetwork: wait until t.Deadline() instead of hardcoded value #42703

Merged

thaJeztah mentioned this pull request Aug 7, 2021

API: discard /system/df type parameter pre-1.42 #42716

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix flaky libnetwork/networkdb tests#42625

Fix flaky libnetwork/networkdb tests#42625
justincormack merged 3 commits intomoby:masterfrom
rvolosatovs:fix_network_db_islands

rvolosatovs commented Jul 12, 2021 •

edited

Loading

Uh oh!

rvolosatovs commented Jul 12, 2021 •

edited

Loading

Uh oh!

rvolosatovs Jul 12, 2021 •

edited

Loading

Uh oh!

thaJeztah Jul 16, 2021

Uh oh!

thaJeztah commented Jul 12, 2021

Uh oh!

rvolosatovs commented Jul 12, 2021 •

edited

Loading

Uh oh!

thaJeztah left a comment

Uh oh!

thaJeztah Jul 16, 2021

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

rvolosatovs commented Jul 12, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

rvolosatovs commented Jul 12, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

rvolosatovs Jul 12, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

thaJeztah Jul 16, 2021

Choose a reason for hiding this comment

Uh oh!

thaJeztah commented Jul 12, 2021

Uh oh!

rvolosatovs commented Jul 12, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

thaJeztah left a comment

Choose a reason for hiding this comment

Uh oh!

thaJeztah Jul 16, 2021

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

rvolosatovs commented Jul 12, 2021 •

edited

Loading

rvolosatovs commented Jul 12, 2021 •

edited

Loading

rvolosatovs Jul 12, 2021 •

edited

Loading

rvolosatovs commented Jul 12, 2021 •

edited

Loading