Fix flaky libnetwork/networkdb tests#42625
Conversation
682063f to
3e3b4e9
Compare
|
CI Failure:
I managed to consistently reproduce the error locally with updated dependency (i.e. each run, which effectively breaks CI) |
78e92be to
0e6706d
Compare
There was a problem hiding this comment.
See commit message, perhaps someone more familiar with the codebase can confirm that we don't need to alter the existing networkdb implementation to account for any potential change in memberlist?
@thaJeztah ?
Note, that if this test was flaky before, then we should not have anything to worry about (that would mean that the update did not cause the issue, but perhaps just made it fail more often, due to e.g. improved performance)
There was a problem hiding this comment.
Not very familiar with this either, but seems to make sense.
Perhaps @arkodg (if he's still around) would know
There's another test that is known to be flaky, and showed a similar error, so it's possible that it may not be related (but good to check); #42484 |
|
Another failure: Looks like another race condition, I have a strong feeling that something like 0e6706d would fix it as well. |
0e6706d to
f250696
Compare
This allows the rejoin intervals to be chosen according to the context within which the component is used, and, in particular, this allows lower intervals to be used within TestNetworkDBIslands test. Signed-off-by: Roman Volosatovs <roman.volosatovs@docker.com>
Upstream update fixes the issue where left node would be marked as failed, which caused `TestNetworkDBIslands` to occasionally fail. Signed-off-by: Roman Volosatovs <roman.volosatovs@docker.com>
`github.com/hashicorp/memberlist` update caused `TestNetworkDBCRUDTableEntries` to occasionally fail, because the test would try to check whether an entry write is propagated to all nodes, but it would not wait for all nodes to be available before performing the write. It could be that the failure is caused simply by improved performance of the dependency - it could also be that some connectivity guarantee the test depended on is not provided by the dependency anymore. The same fix is applied to `TestNetworkDBNodeJoinLeaveIteration` due to same issue. Signed-off-by: Roman Volosatovs <roman.volosatovs@docker.com>
f250696 to
2837fba
Compare
thaJeztah
left a comment
There was a problem hiding this comment.
SGTM (but not an expert on this 😅)
There was a problem hiding this comment.
Not very familiar with this either, but seems to make sense.
Perhaps @arkodg (if he's still around) would know
- What I did
Make
TestNetworkDBIslandsrun faster and respect-timeoutgo testflag value(Partially?) address #42459
Fix
TestNetworkDBNodeJoinLeaveIterationandTestNetworkDBCRUDTableEntriesas well, since they now fail constantly and break CI. Potentially, that can be extracted into another PR, which this one would be blocked on, but I don't think it's worth it, since the changes are minimal, let me know if you think otherwise.- How I did it
test.Deadlinevaluegithub.com/hashicorp/memberlist. Note that currently used version dates back to 2017.There have been a few improvements and bugfixes made since, in particular, the logic handling left/failed nodes was improved.
- How to verify it
I believe this commit is of most interest hashicorp/memberlist@237d410, but I have not made a thorough analysis, whether this is indeed the "fix" we need.
Note, I have encountered the following failure only once in ~100 runs:
That might be just an unfortunate slowness of my system, but it may be another source of flakiness, which has to be investigated still. I propose not to close the original issue just yet, and wait for a few days/weeks to see if the test is still flaky and make the decision then.
- Description for the changelog
- A picture of a cute animal (not mandatory but encouraged)