Skip to content

Fix flaky libnetwork/networkdb tests#42625

Merged
justincormack merged 3 commits intomoby:masterfrom
rvolosatovs:fix_network_db_islands
Jul 19, 2021
Merged

Fix flaky libnetwork/networkdb tests#42625
justincormack merged 3 commits intomoby:masterfrom
rvolosatovs:fix_network_db_islands

Conversation

@rvolosatovs
Copy link
Copy Markdown

@rvolosatovs rvolosatovs commented Jul 12, 2021

- What I did
Make TestNetworkDBIslands run faster and respect -timeout go test flag value
(Partially?) address #42459
Fix TestNetworkDBNodeJoinLeaveIteration and TestNetworkDBCRUDTableEntries as well, since they now fail constantly and break CI. Potentially, that can be extracted into another PR, which this one would be blocked on, but I don't think it's worth it, since the changes are minimal, let me know if you think otherwise.

- How I did it

  • Make rejoin intervals configurable, specify a lower interval during test and account for test.Deadline value
  • Update github.com/hashicorp/memberlist. Note that currently used version dates back to 2017.
    There have been a few improvements and bugfixes made since, in particular, the logic handling left/failed nodes was improved.

- How to verify it
I believe this commit is of most interest hashicorp/memberlist@237d410, but I have not made a thorough analysis, whether this is indeed the "fix" we need.

Note, I have encountered the following failure only once in ~100 runs:

=== Failed
=== FAIL: libnetwork/networkdb TestNetworkDBIslands (22.03s)
time="2021-07-12T15:12:17Z" level=info msg="New memberlist node - Node:node1 will use memberlist nodeID:f135e2881fc7 with config:&{NodeID:f135e2881fc7 Hostname:node1 BindAddr:0.0.0.0 AdvertiseAddr: BindPort:10001 Keys:[] PacketBufferSize:1400 reapEntryInterval:1800000000000 reapNetworkInterval:1825000000000 rejoinClusterDuration:1000000000 rejoinClusterInterval:6000000000 StatsPrintPeriod:5m0s HealthPrintPeriod:1m0s}"
time="2021-07-12T15:12:17Z" level=info msg="Node f135e2881fc7/172.17.0.2, joined gossip cluster"
time="2021-07-12T15:12:17Z" level=info msg="Node f135e2881fc7/172.17.0.2, added to nodes list"
time="2021-07-12T15:12:17Z" level=info msg="New memberlist node - Node:node2 will use memberlist nodeID:12d1d234a539 with config:&{NodeID:12d1d234a539 Hostname:node2 BindAddr:0.0.0.0 AdvertiseAddr: BindPort:10002 Keys:[] PacketBufferSize:1400 reapEntryInterval:1800000000000 reapNetworkInterval:1825000000000 rejoinClusterDuration:1000000000 rejoinClusterInterval:6000000000 StatsPrintPeriod:5m0s HealthPrintPeriod:1m0s}"
time="2021-07-12T15:12:17Z" level=info msg="Node 12d1d234a539/172.17.0.2, joined gossip cluster"
time="2021-07-12T15:12:17Z" level=info msg="Node 12d1d234a539/172.17.0.2, added to nodes list"
time="2021-07-12T15:12:17Z" level=info msg="The new bootstrap node list is:[localhost:10001]"
time="2021-07-12T15:12:17Z" level=debug msg="memberlist: Stream connection from=[::1]:36282"
time="2021-07-12T15:12:17Z" level=debug msg="memberlist: Initiating push/pull sync with:  [::1]:10001"
time="2021-07-12T15:12:17Z" level=info msg="Node 12d1d234a539/172.17.0.2, joined gossip cluster"
time="2021-07-12T15:12:17Z" level=info msg="Node 12d1d234a539/172.17.0.2, added to nodes list"
time="2021-07-12T15:12:17Z" level=info msg="Node f135e2881fc7/172.17.0.2, joined gossip cluster"
time="2021-07-12T15:12:17Z" level=info msg="Node f135e2881fc7/172.17.0.2, added to nodes list"
time="2021-07-12T15:12:17Z" level=debug msg="memberlist: Initiating push/pull sync with:  127.0.0.1:10001"
time="2021-07-12T15:12:17Z" level=debug msg="memberlist: Stream connection from=127.0.0.1:46710"
time="2021-07-12T15:12:18Z" level=info msg="New memberlist node - Node:node3 will use memberlist nodeID:e9889fb524ad with config:&{NodeID:e9889fb524ad Hostname:node3 BindAddr:0.0.0.0 AdvertiseAddr: BindPort:10003 Keys:[] PacketBufferSize:1400 reapEntryInterval:1800000000000 reapNetworkInterval:1825000000000 rejoinClusterDuration:1000000000 rejoinClusterInterval:6000000000 StatsPrintPeriod:5m0s HealthPrintPeriod:1m0s}"
time="2021-07-12T15:12:18Z" level=info msg="Node e9889fb524ad/172.17.0.2, joined gossip cluster"
time="2021-07-12T15:12:18Z" level=info msg="Node e9889fb524ad/172.17.0.2, added to nodes list"
time="2021-07-12T15:12:18Z" level=info msg="The new bootstrap node list is:[localhost:10002]"
time="2021-07-12T15:12:18Z" level=debug msg="memberlist: Initiating push/pull sync with:  [::1]:10002"
time="2021-07-12T15:12:18Z" level=debug msg="memberlist: Stream connection from=[::1]:58118"
time="2021-07-12T15:12:18Z" level=info msg="Node e9889fb524ad/172.17.0.2, joined gossip cluster"
time="2021-07-12T15:12:18Z" level=info msg="Node e9889fb524ad/172.17.0.2, added to nodes list"
time="2021-07-12T15:12:18Z" level=info msg="Node f135e2881fc7/172.17.0.2, joined gossip cluster"
time="2021-07-12T15:12:18Z" level=info msg="Node f135e2881fc7/172.17.0.2, added to nodes list"
time="2021-07-12T15:12:18Z" level=info msg="Node 12d1d234a539/172.17.0.2, joined gossip cluster"
time="2021-07-12T15:12:18Z" level=info msg="Node 12d1d234a539/172.17.0.2, added to nodes list"
time="2021-07-12T15:12:18Z" level=debug msg="memberlist: Initiating push/pull sync with:  127.0.0.1:10002"
time="2021-07-12T15:12:18Z" level=debug msg="memberlist: Stream connection from=127.0.0.1:43476"
time="2021-07-12T15:12:18Z" level=info msg="Node e9889fb524ad/172.17.0.2, joined gossip cluster"
time="2021-07-12T15:12:18Z" level=info msg="Node e9889fb524ad/172.17.0.2, added to nodes list"
time="2021-07-12T15:12:18Z" level=info msg="New memberlist node - Node:node4 will use memberlist nodeID:dd853f955091 with config:&{NodeID:dd853f955091 Hostname:node4 BindAddr:0.0.0.0 AdvertiseAddr: BindPort:10004 Keys:[] PacketBufferSize:1400 reapEntryInterval:1800000000000 reapNetworkInterval:1825000000000 rejoinClusterDuration:1000000000 rejoinClusterInterval:6000000000 StatsPrintPeriod:5m0s HealthPrintPeriod:1m0s}"
time="2021-07-12T15:12:18Z" level=info msg="Node dd853f955091/172.17.0.2, joined gossip cluster"
time="2021-07-12T15:12:18Z" level=info msg="Node dd853f955091/172.17.0.2, added to nodes list"
time="2021-07-12T15:12:18Z" level=info msg="The new bootstrap node list is:[localhost:10003]"
time="2021-07-12T15:12:18Z" level=debug msg="memberlist: Initiating push/pull sync with:  [::1]:10003"
time="2021-07-12T15:12:18Z" level=debug msg="memberlist: Stream connection from=[::1]:39720"
time="2021-07-12T15:12:18Z" level=info msg="Node dd853f955091/172.17.0.2, joined gossip cluster"
time="2021-07-12T15:12:18Z" level=info msg="Node dd853f955091/172.17.0.2, added to nodes list"
time="2021-07-12T15:12:18Z" level=info msg="Node 12d1d234a539/172.17.0.2, joined gossip cluster"
time="2021-07-12T15:12:18Z" level=info msg="Node 12d1d234a539/172.17.0.2, added to nodes list"
time="2021-07-12T15:12:18Z" level=info msg="Node e9889fb524ad/172.17.0.2, joined gossip cluster"
time="2021-07-12T15:12:18Z" level=info msg="Node e9889fb524ad/172.17.0.2, added to nodes list"
time="2021-07-12T15:12:18Z" level=info msg="Node f135e2881fc7/172.17.0.2, joined gossip cluster"
time="2021-07-12T15:12:18Z" level=info msg="Node f135e2881fc7/172.17.0.2, added to nodes list"
time="2021-07-12T15:12:18Z" level=debug msg="memberlist: Initiating push/pull sync with:  127.0.0.1:10003"
time="2021-07-12T15:12:18Z" level=debug msg="memberlist: Stream connection from=127.0.0.1:34386"
time="2021-07-12T15:12:18Z" level=info msg="Node dd853f955091/172.17.0.2, joined gossip cluster"
time="2021-07-12T15:12:18Z" level=info msg="Node dd853f955091/172.17.0.2, joined gossip cluster"
time="2021-07-12T15:12:18Z" level=info msg="Node dd853f955091/172.17.0.2, added to nodes list"
time="2021-07-12T15:12:18Z" level=info msg="Node dd853f955091/172.17.0.2, added to nodes list"
time="2021-07-12T15:12:18Z" level=info msg="New memberlist node - Node:node5 will use memberlist nodeID:644827c34b56 with config:&{NodeID:644827c34b56 Hostname:node5 BindAddr:0.0.0.0 AdvertiseAddr: BindPort:10005 Keys:[] PacketBufferSize:1400 reapEntryInterval:1800000000000 reapNetworkInterval:1825000000000 rejoinClusterDuration:1000000000 rejoinClusterInterval:6000000000 StatsPrintPeriod:5m0s HealthPrintPeriod:1m0s}"
time="2021-07-12T15:12:18Z" level=info msg="Node 644827c34b56/172.17.0.2, joined gossip cluster"
time="2021-07-12T15:12:18Z" level=info msg="Node 644827c34b56/172.17.0.2, added to nodes list"
time="2021-07-12T15:12:18Z" level=info msg="The new bootstrap node list is:[localhost:10004]"
time="2021-07-12T15:12:18Z" level=debug msg="memberlist: Initiating push/pull sync with:  [::1]:10004"
time="2021-07-12T15:12:18Z" level=debug msg="memberlist: Stream connection from=[::1]:35756"
time="2021-07-12T15:12:18Z" level=info msg="Node 644827c34b56/172.17.0.2, joined gossip cluster"
time="2021-07-12T15:12:18Z" level=info msg="Node 644827c34b56/172.17.0.2, added to nodes list"
time="2021-07-12T15:12:18Z" level=info msg="Node e9889fb524ad/172.17.0.2, joined gossip cluster"
time="2021-07-12T15:12:18Z" level=info msg="Node e9889fb524ad/172.17.0.2, added to nodes list"
time="2021-07-12T15:12:18Z" level=info msg="Node f135e2881fc7/172.17.0.2, joined gossip cluster"
time="2021-07-12T15:12:18Z" level=info msg="Node f135e2881fc7/172.17.0.2, added to nodes list"
time="2021-07-12T15:12:18Z" level=info msg="Node 12d1d234a539/172.17.0.2, joined gossip cluster"
time="2021-07-12T15:12:18Z" level=info msg="Node 12d1d234a539/172.17.0.2, added to nodes list"
time="2021-07-12T15:12:18Z" level=info msg="Node dd853f955091/172.17.0.2, joined gossip cluster"
time="2021-07-12T15:12:18Z" level=info msg="Node dd853f955091/172.17.0.2, added to nodes list"
time="2021-07-12T15:12:18Z" level=debug msg="memberlist: Initiating push/pull sync with:  127.0.0.1:10004"
time="2021-07-12T15:12:18Z" level=debug msg="memberlist: Stream connection from=127.0.0.1:43650"
time="2021-07-12T15:12:19Z" level=info msg="Node 644827c34b56/172.17.0.2, joined gossip cluster"
time="2021-07-12T15:12:19Z" level=info msg="Node 644827c34b56/172.17.0.2, joined gossip cluster"
time="2021-07-12T15:12:19Z" level=info msg="Node 644827c34b56/172.17.0.2, added to nodes list"
time="2021-07-12T15:12:19Z" level=info msg="Node 644827c34b56/172.17.0.2, added to nodes list"
time="2021-07-12T15:12:23Z" level=debug msg="rejoinClusterBootStrap did not find any valid IP"
time="2021-07-12T15:12:24Z" level=debug msg="rejoinClusterBootStrap did not find any valid IP"
time="2021-07-12T15:12:24Z" level=debug msg="rejoinClusterBootStrap did not find any valid IP"
time="2021-07-12T15:12:24Z" level=debug msg="rejoinClusterBootStrap did not find any valid IP"
time="2021-07-12T15:12:29Z" level=debug msg="rejoinClusterBootStrap did not find any valid IP"
time="2021-07-12T15:12:30Z" level=debug msg="rejoinClusterBootStrap did not find any valid IP"
time="2021-07-12T15:12:30Z" level=debug msg="rejoinClusterBootStrap did not find any valid IP"
time="2021-07-12T15:12:30Z" level=debug msg="rejoinClusterBootStrap did not find any valid IP"
time="2021-07-12T15:12:35Z" level=debug msg="rejoinClusterBootStrap did not find any valid IP"
time="2021-07-12T15:12:36Z" level=debug msg="rejoinClusterBootStrap did not find any valid IP"
time="2021-07-12T15:12:36Z" level=debug msg="rejoinClusterBootStrap did not find any valid IP"
time="2021-07-12T15:12:36Z" level=debug msg="rejoinClusterBootStrap did not find any valid IP"
    networkdb_test.go:839: timeout hit after 20s: node3:Waiting for cluser peers to be established

That might be just an unfortunate slowness of my system, but it may be another source of flakiness, which has to be investigated still. I propose not to close the original issue just yet, and wait for a few days/weeks to see if the test is still flaky and make the decision then.

- Description for the changelog

- A picture of a cute animal (not mandatory but encouraged)

@rvolosatovs rvolosatovs force-pushed the fix_network_db_islands branch from 682063f to 3e3b4e9 Compare July 12, 2021 15:49
@rvolosatovs
Copy link
Copy Markdown
Author

rvolosatovs commented Jul 12, 2021

CI Failure:

[2021-07-12T15:59:32.921Z] === Failed

[2021-07-12T15:59:32.921Z] === FAIL: github.com/docker/docker/libnetwork/networkdb TestNetworkDBCRUDTableEntries (7.63s)

[2021-07-12T15:59:32.921Z]     networkdb_test.go:304: Entry existence verification test failed for node2(c079a6e0a803)

[2021-07-12T15:59:32.921Z] 2021/07/12 15:58:01 Closing DB instances...

Might be caused by the update , will investigate

I managed to consistently reproduce the error locally with updated dependency (i.e. each run, which effectively breaks CI)
Fixed it in 0e6706d

@rvolosatovs rvolosatovs force-pushed the fix_network_db_islands branch from 78e92be to 0e6706d Compare July 12, 2021 16:53
Copy link
Copy Markdown
Author

@rvolosatovs rvolosatovs Jul 12, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See commit message, perhaps someone more familiar with the codebase can confirm that we don't need to alter the existing networkdb implementation to account for any potential change in memberlist?
@thaJeztah ?

Note, that if this test was flaky before, then we should not have anything to worry about (that would mean that the update did not cause the issue, but perhaps just made it fail more often, due to e.g. improved performance)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not very familiar with this either, but seems to make sense.

Perhaps @arkodg (if he's still around) would know

@thaJeztah
Copy link
Copy Markdown
Member

TestNetworkDBCRUDTableEntries

There's another test that is known to be flaky, and showed a similar error, so it's possible that it may not be related (but good to check); #42484

@rvolosatovs
Copy link
Copy Markdown
Author

rvolosatovs commented Jul 12, 2021

Another failure:

[2021-07-12T17:04:41.212Z] === Failed

[2021-07-12T17:04:41.212Z] === FAIL: github.com/docker/docker/libnetwork/networkdb TestNetworkDBNodeJoinLeaveIteration (7.04s)

[2021-07-12T17:04:41.212Z]     networkdb_test.go:523: The networkNodes list has to have be 2 instead of 1 - [a2b0de7a54bf]

[2021-07-12T17:04:41.212Z] 

[2021-07-12T17:04:41.212Z] 

[2021-07-12T17:04:41.212Z] DONE 2027 tests, 69 skipped, 1 failure in 314.563s

Looks like another race condition, I have a strong feeling that something like 0e6706d would fix it as well.
The test is definitely flaky, since it usually passes locally, but I managed to reproduce it.

@rvolosatovs rvolosatovs force-pushed the fix_network_db_islands branch from 0e6706d to f250696 Compare July 12, 2021 17:20
@rvolosatovs rvolosatovs changed the title Fix flaky TestNetworkDBIslands Fix flaky libnetwork/networkdb tests Jul 12, 2021
Roman Volosatovs added 3 commits July 12, 2021 19:25
This allows the rejoin intervals to be chosen according to the context
within which the component is used, and, in particular, this allows
lower intervals to be used within TestNetworkDBIslands test.

Signed-off-by: Roman Volosatovs <roman.volosatovs@docker.com>
Upstream update fixes the issue where left node would be marked as
failed, which caused `TestNetworkDBIslands` to occasionally fail.

Signed-off-by: Roman Volosatovs <roman.volosatovs@docker.com>
`github.com/hashicorp/memberlist` update caused `TestNetworkDBCRUDTableEntries`
to occasionally fail, because the test would try to check whether an entry
write is propagated to all nodes, but it would not wait for all nodes to
be available before performing the write.
It could be that the failure is caused simply by improved performance of
the dependency - it could also be that some connectivity guarantee the
test depended on is not provided by the dependency anymore.
The same fix is applied to `TestNetworkDBNodeJoinLeaveIteration` due to
same issue.

Signed-off-by: Roman Volosatovs <roman.volosatovs@docker.com>
@rvolosatovs rvolosatovs force-pushed the fix_network_db_islands branch from f250696 to 2837fba Compare July 12, 2021 17:26
@rvolosatovs rvolosatovs marked this pull request as ready for review July 12, 2021 18:42
Copy link
Copy Markdown
Member

@thaJeztah thaJeztah left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SGTM (but not an expert on this 😅)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not very familiar with this either, but seems to make sense.

Perhaps @arkodg (if he's still around) would know

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants