Conversation


@ny0312 ny0312 commented Jan 20, 2022

Fix two flaky tests introduced in PR #9774

For test "Each node has two links with each peer"

00:39:21> Each node has two links with each peer: FAILED: Expected 19*2 eq 37 (context: type eval line 11 cmd {assert {$num_peers*2 eq $num_links}} proc ::foreach_instance_id)
(Jumping to next unit after error)

This test seems to be flaky because the cluster is not yet stable and sometimes a node doesn't have both inbound and outbound connections established with every peer.

This failure is rarer than the next one. The fix is to add retries.

For test "Disconnect link when send buffer limit reached"

There were two sources of failure.

 00:47:22> Disconnect link when send buffer limit reached: error writing "sock802fbc590": broken pipe
    while executing
"$primary1 publish channel [prepare_value [expr 30*1024*1024]]"

Redis was getting OOM-killed by the kernel after running out of swap. In the test, I'm allowing cluster link buffers to grow up to 32MB. There are 20 Redis nodes running in parallel in cluster tests, so link buffers alone could consume up to 20 * 32MB = 640MB. That proved to be too much for the FreeBSD test environment used by the daily runs.

Example failure link: https://github.com/redis/redis/runs/4733591841?check_suite_focus=true

The fix is to use smaller cluster link buffer limits and fill them up by repeatedly sending smallish messages. This approach should adapt to different test environments.
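The reworked approach can be sketched roughly like this (a sketch only, not the merged code: `$primary1` and the `cluster-link-sendbuf-limit` config come from the test and #9774, but the exact limit, payload size, and loop bounds here are illustrative assumptions):

```tcl
# Sketch: set a small per-link limit, then keep publishing smallish payloads
# until the exceeded counter ticks over, instead of relying on one huge
# allocation succeeding. The 256KB limit and 16KB payload are illustrative.
$primary1 config set cluster-link-sendbuf-limit [expr 256*1024]

set payload [string repeat x [expr 16*1024]]
for {set i 0} {$i < 100} {incr i} {
    $primary1 publish channel $payload
    if {[get_info_field [$primary1 cluster info] total_cluster_links_buffer_limit_exceeded] >= 1} break
    after 50
}
assert {[get_info_field [$primary1 cluster info] total_cluster_links_buffer_limit_exceeded] >= 1}
```

Because the loop stops as soon as the counter moves, the test self-tunes to however fast or slow the environment drains the link, rather than assuming a fixed amount of memory pressure.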

00:46:57> Disconnect link when send buffer limit reached: FAILED: Expected [get_info_field [::redis::redisHandle1876 cluster info] total_cluster_links_buffer_limit_exceeded] eq 1 (context: type eval line 36 cmd {assert {[get_info_field [$primary1 cluster info] total_cluster_links_buffer_limit_exceeded] eq 1}} proc ::test)

I was assuming that as soon as I send a PUBLISH command large enough to fill up a cluster link, the link would be freed. But in reality the link only gets freed on the next clusterCron run, whenever that happens. My test was not accounting for this race condition.

Example failure link: https://github.com/redis/redis/runs/4829401183?check_suite_focus=true#step:9:630

The fix is to wait 0.5s before checking whether the link has been freed.
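Concretely, the idea is to give clusterCron (which runs from serverCron, roughly every 100ms at the default hz of 10) a chance to free the link before asserting. A minimal sketch, reusing the `$primary1` handle from the test:

```tcl
# Sketch: the link is only freed on the next clusterCron iteration, so pause
# before asserting. 500ms covers several cron runs at the default hz.
after 500
assert {[get_info_field [$primary1 cluster info] total_cluster_links_buffer_limit_exceeded] eq 1}
```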

@ny0312 ny0312 changed the title Fix flaky cluster test "Disconnect link when send buffer limit reached" Fix flaky cluster tests in 24-links.tcl Jan 21, 2022
@madolson madolson merged commit b40a9ba into redis:unstable Jan 24, 2022
Comment on lines +30 to 31
set nodes [get_cluster_nodes $id]
set links [get_cluster_links $id]
Contributor Author
Assuming the cluster can be unstable, instead of getting the nodes and links once up front, IMO we should re-fetch them while waiting for the condition to be met.

proc number_of_peers {id nodes} {
    upvar $nodes n
    set n [get_cluster_nodes $id]
    return [expr [llength $n] - 1]
}

proc number_of_links {id links} {
    upvar $links l
    set l [get_cluster_links $id]
    return [llength $l]
}

test "Each node has two links with each peer" {
    foreach_redis_id id {
        set nodes {}
        set links {}
        # Assert that from point of view of each node, there are two links for
        # each peer. It might take a while for cluster to stabilize so wait up
        # to 5 seconds.
        wait_for_condition 50 100 {
            [number_of_peers $id nodes]*2 == [number_of_links $id links]
        } else {
            assert_equal [expr [number_of_peers $id nodes]*2] [number_of_links $id links]
        }
        }
    
        # Then check if there are two entries in `$links` for each entry in `$nodes`

Contributor

Why would the cluster be unstable? I was under the impression that the cluster was still establishing connections as a part of the meeting process, which is why not all links were established. In steady state it shouldn't be unstable.


oranagra commented May 24, 2022

@ny0312 I noticed another sporadic failure in this test:
https://github.com/redis/redis/runs/6564848175?check_suite_focus=true (happened with test-sanitizer-address (gcc), which is slow)

00:45:47> Each node has two links with each peer: FAILED: Expected 0 eq 1 (context: type eval line 30 cmd {assert {$to eq 1}} proc ::foreach_instance_id)

maybe you can look into it.
