Fix flaky cluster tests in 24-links.tcl #10157
set nodes [get_cluster_nodes $id]
set links [get_cluster_links $id]
Assuming the cluster is unstable, instead of getting nodes and links again, IMO we should get them while waiting for the condition to be met.
proc number_of_peers {id nodes} {
    # `nodes` is the NAME of the caller's variable to refresh, not its value.
    upvar $nodes n
    set n [get_cluster_nodes $id]
    return [expr {[llength $n] - 1}]
}
proc number_of_links {id links} {
    # `links` is the NAME of the caller's variable to refresh, not its value.
    upvar $links l
    set l [get_cluster_links $id]
    return [llength $l]
}
test "Each node has two links with each peer" {
    foreach_redis_id id {
        set nodes {}
        set links {}
        # Assert that from point of view of each node, there are two links for
        # each peer. It might take a while for cluster to stabilize so wait up
        # to 5 seconds.
        wait_for_condition 50 100 {
            [number_of_peers $id nodes]*2 == [number_of_links $id links]
        } else {
            assert_equal [expr {[number_of_peers $id nodes]*2}] [number_of_links $id links]
        }
        # Then check if there are two entries in `$links` for each entry in `$nodes`
Why would the cluster be unstable? I was under the impression that the cluster was still establishing connections as a part of the meeting process, which is why not all links were established. In steady state it shouldn't be unstable.
@ny0312 I noticed another sporadic failure in this test; maybe you can look into it.
Fix two flaky tests introduced in PR #9774
For test "Each node has two links with each peer"
This test seems to be flaky because the cluster is not yet stable, and sometimes a node doesn't have both inbound and outbound connections established with every peer.
This failure is rarer than the next one. The fix is to add retries.
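The retry idea can be sketched in plain Tcl. This is a minimal model, not the test suite's `wait_for_condition`: the helper `retry_until`, the simulated `establish_one_link` proc, and the peer count are all hypothetical names and values chosen for illustration.

```tcl
# Hypothetical helper (NOT the Redis test suite's wait_for_condition):
# re-evaluate an expression in the caller's scope up to maxtries times,
# sleeping delay_ms between attempts. Returns 1 on success, 0 on timeout.
proc retry_until {maxtries delay_ms cond} {
    for {set i 0} {$i < $maxtries} {incr i} {
        if {[uplevel 1 [list expr $cond]]} {
            return 1
        }
        after $delay_ms
    }
    return 0
}

# Simulated node (assumption): links come up one at a time, the way a
# cluster that is still meeting its peers gradually establishes them.
set peers 3
set links 0
proc establish_one_link {} {
    global links peers
    if {$links < $peers * 2} {
        incr links
    }
    return $links
}

# Poll until there are two links (inbound and outbound) per peer.
set ok [retry_until 50 1 {[establish_one_link] == $peers * 2}]
puts "stable=$ok links=$links"
```

Because the expression is re-evaluated on every attempt, the check always sees fresh state rather than a snapshot taken before the cluster settled.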
For test "Disconnect link when send buffer limit reached"
There were two sources of failure.
1. Redis was getting OOM-killed by the kernel after running out of swap. In the test, I was allowing cluster link buffers to grow up to 32MB, and there are 20 Redis nodes running in parallel in the cluster tests. That proved to be too much for the FreeBSD test environment used by the daily runs.
   Example failure link: https://github.com/redis/redis/runs/4733591841?check_suite_focus=true
   The fix is to use smaller cluster link buffer limits and fill them up by repeatedly sending smallish messages. This approach should adapt to different test environments.
2. I was assuming that as soon as I sent a large PUBLISH command to fill up a cluster link, the link would be freed. In reality the link is only freed on the next clusterCron run, whenever that happens. My test did not account for this race condition.
   Example failure link: https://github.com/redis/redis/runs/4829401183?check_suite_focus=true#step:9:630
   The fix is to wait for 0.5s before checking whether the link has been freed.
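Both fixes can be illustrated with a small self-contained Tcl model. It is only a sketch under stated assumptions: the 32KB limit, the 1KB message size, the 100ms cron period, and every proc and variable name below are hypothetical and do not come from the test suite or from Redis itself.

```tcl
# Hypothetical model; none of these names or values come from Redis.
set limit_bytes [expr {32 * 1024}]  ;# assumed small link buffer limit
set msg_bytes   1024                ;# assumed "smallish" message size
set cron_ms     100                 ;# assumed period of the cron job

set buffered    0
set link_exists 1

# The link is only freed when the periodic cron notices the overflow.
proc cron {} {
    global buffered limit_bytes link_exists cron_ms
    if {$buffered > $limit_bytes} {
        set link_exists 0
    }
    after $cron_ms cron
}

# Sleep while running the event loop so that the cron timer can fire.
proc sleep_ms {ms} {
    set ::sleep_done 0
    after $ms {set ::sleep_done 1}
    vwait ::sleep_done
}

after $cron_ms cron

# Fix 1: fill the buffer with many small messages until the limit is crossed.
set sent 0
while {$buffered <= $limit_bytes} {
    incr buffered $msg_bytes
    incr sent
}

# Fix 2: the link is still there right after the limit is crossed...
set seen_immediately $link_exists
# ...and is gone only once the cron has had a chance to run.
sleep_ms 500
set seen_after_wait $link_exists
after cancel cron

puts "sent=$sent immediately=$seen_immediately after_wait=$seen_after_wait"
```

Sending small messages in a loop keeps memory use proportional to the configured limit rather than to one large payload, and the wait covers several cron periods so the check does not race with the cleanup.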