Skip to content

[TEST-FAILURE] Make sure the replicas always get the different ranks #2683

Description

@sarthakaggarwal97

Summary

The failover test seems to be flakily failing over the past few days.

Failing test(s)

Error stack trace

Sharing part of the stack trace where the test failed, otherwise it's too huge.

===== Start of server log (pid 41461) =====

### Starting server for test 
41461:M 03 Oct 2025 00:17:42.401 # WARNING Memory overcommit must be enabled! Without it, a background save or replication may fail under low memory condition. To fix this issue add 'vm.overcommit_memory = 1' to /etc/sysctl.conf and then reboot or run the command 'sysctl vm.overcommit_memory=1' for this to take effect.
41461:M 03 Oct 2025 00:17:42.401 * oO0OoO0OoO0Oo Valkey is starting oO0OoO0OoO0Oo
41461:M 03 Oct 2025 00:17:42.401 * Valkey version=255.255.255, bits=64, commit=d5bb986f, modified=0, pid=41461, just started
41461:M 03 Oct 2025 00:17:42.401 * Configuration loaded
41461:M 03 Oct 2025 00:17:42.420 * monotonic clock: POSIX clock_gettime
41461:M 03 Oct 2025 00:17:42.422 # Failed to write PID file: Permission denied
                .+^+.                                                
            .+#########+.                                            
        .+########+########+.           Valkey 255.255.255 (d5bb986f/0) 64 bit
    .+########+'     '+########+.                                    
 .########+'     .+.     '+########.    Running in cluster mode
 |####+'     .+#######+.     '+####|    Port: 22677
 |###|   .+###############+.   |###|    PID: 41461                     
 |###|   |#####*'' ''*#####|   |###|                                 
 |###|   |####'  .-.  '####|   |###|                                 
 |###|   |###(  (@@@)  )###|   |###|          https://valkey.io/      
 |###|   |####.  '-'  .####|   |###|                                 
 |###|   |#####*.   .*#####|   |###|                                 
 |###|   '+#####|   |#####+'   |###|                                 
 |####+.     +##|   |#+'     .+####|                                 
 '#######+   |##|        .+########'                                 
    '+###|   |##|    .+########+'                                    
        '|   |####+########+'                                        
             +#########+'                                            
                '+v+'                                                

41461:M 03 Oct 2025 00:17:42.422 * No cluster configuration found, I'm 6a5f20df0b200d1b58ef5889449de3f17ba50c1d
41461:M 03 Oct 2025 00:17:42.428 * Server initialized
41461:M 03 Oct 2025 00:17:42.429 * Ready to accept connections tcp
41461:M 03 Oct 2025 00:17:42.429 * Ready to accept connections unix
41461:M 03 Oct 2025 00:17:42.486 - Accepted 127.0.0.1:37181
41461:M 03 Oct 2025 00:17:42.489 - Client closed connection id=2 addr=127.0.0.1:37181 laddr=127.0.0.1:22677 fd=12 name= age=0 idle=0 flags=N capa= db=0 sub=0 psub=0 ssub=0 multi=-1 watch=0 qbuf=0 qbuf-free=16384 argv-mem=0 multi-mem=0 rbs=16384 rbp=16384 obl=0 oll=0 omem=0 tot-mem=33350 events=r cmd=ping user=default redir=-1 resp=2 lib-name= lib-ver= tot-net-in=7 tot-net-out=7 tot-cmds=1
41461:M 03 Oct 2025 00:17:42.502 - Accepted 127.0.0.1:35635
41461:M 03 Oct 2025 00:17:44.004 * configEpoch set to 9 via CLUSTER SET-CONFIG-EPOCH
41461:M 03 Oct 2025 00:17:44.109 - Accepting cluster node connection from 127.0.0.1:43284
41461:M 03 Oct 2025 00:17:44.109 * IP address for this node updated to 127.0.0.1
41461:M 03 Oct 2025 00:17:44.153 * Mismatch in topology information for sender node 67ad6ccba4ffa61b55df01fd0d1355861a410752 () in shard 31af5c4799d9e9c2e5c7138a5cdb0c21401ce18f
41461:M 03 Oct 2025 00:17:44.229 - Accepting cluster node connection from 127.0.0.1:43292
41461:M 03 Oct 2025 00:17:44.251 * Successfully completed handshake with 35062b4ed705e54f43812eee9bd41c6e0c496627 ()
41461:M 03 Oct 2025 00:17:44.285 - Accepting cluster node connection from 127.0.0.1:43296
41461:M 03 Oct 2025 00:17:44.314 - Accepting cluster node connection from 127.0.0.1:43308
41461:M 03 Oct 2025 00:17:44.315 * Mismatch in topology information for sender node a2e2120e124856107e40731842c94842658192a4 () in shard cefe678411f184b2aa0e13adca3caed509b7905a
41461:M 03 Oct 2025 00:17:44.355 - Accepting cluster node connection from 127.0.0.1:43322
41461:M 03 Oct 2025 00:17:44.362 * Mismatch in topology information for sender node e53c62e73a1912b3a926e600a08bd50139ecba85 () in shard 21f69416a73a3b436e24af8d27f4a1c6b9c2804c
41461:M 03 Oct 2025 00:17:44.383 - Accepting cluster node connection from 127.0.0.1:43336
41461:M 03 Oct 2025 00:17:44.400 - Accepting cluster node connection from 127.0.0.1:43346
41461:M 03 Oct 2025 00:17:44.415 - Accepting cluster node connection from 127.0.0.1:43358
41461:M 03 Oct 2025 00:17:44.423 - Accepting cluster node connection from 127.0.0.1:43370
41461:M 03 Oct 2025 00:17:44.452 * Cluster state changed: ok
41461:M 03 Oct 2025 00:17:44.664 # Missing implement of connection type tls
41461:M 03 Oct 2025 00:17:44.692 * Node 7259d703afbedd95ac3e7061030e6ca18539ac51 () is no longer primary of shard 3fcfcd71f93a304db35b490f8b7dcaa16e380bc1; removed all 0 slot(s) it used to own
41461:M 03 Oct 2025 00:17:44.692 * Node 7259d703afbedd95ac3e7061030e6ca18539ac51 () is now part of shard c8987b2a50fbe392e5a95167e6e6c609a200206f
41461:M 03 Oct 2025 00:17:44.692 * Node 7259d703afbedd95ac3e7061030e6ca18539ac51 () is now a replica of node a2e2120e124856107e40731842c94842658192a4 () in shard c8987b2a50fbe392e5a95167e6e6c609a200206f
41461:M 03 Oct 2025 00:17:44.702 * Node dc713e69651e957f21e3e44af1c33cd06439f81f () is no longer primary of shard b04575e6a5090f971a99bed17ab8d8febda4323f; removed all 0 slot(s) it used to own
41461:M 03 Oct 2025 00:17:44.704 * Node dc713e69651e957f21e3e44af1c33cd06439f81f () is now part of shard c8987b2a50fbe392e5a95167e6e6c609a200206f
41461:M 03 Oct 2025 00:17:44.704 * Node dc713e69651e957f21e3e44af1c33cd06439f81f () is now a replica of node a2e2120e124856107e40731842c94842658192a4 () in shard c8987b2a50fbe392e5a95167e6e6c609a200206f
41461:M 03 Oct 2025 00:17:44.717 * Node c4d9ce4d28e32e5f7f74219c2cddc7a122c26295 () is no longer primary of shard 90b9dc5cf7567309984c2b554ba771bec762e104; removed all 0 slot(s) it used to own
41461:M 03 Oct 2025 00:17:44.717 * Node c4d9ce4d28e32e5f7f74219c2cddc7a122c26295 () is now part of shard dabdff46766cdd52ecc930db51940ca650998a10
41461:M 03 Oct 2025 00:17:44.717 * Node c4d9ce4d28e32e5f7f74219c2cddc7a122c26295 () is now a replica of node 67ad6ccba4ffa61b55df01fd0d1355861a410752 () in shard dabdff46766cdd52ecc930db51940ca650998a10
41461:M 03 Oct 2025 00:17:44.741 * Node 35062b4ed705e54f43812eee9bd41c6e0c496627 () is no longer primary of shard 17d4e90c0a0160a0aaea886cd03047591ca88497; removed all 0 slot(s) it used to own
41461:M 03 Oct 2025 00:17:44.741 * Node 35062b4ed705e54f43812eee9bd41c6e0c496627 () is now part of shard dabdff46766cdd52ecc930db51940ca650998a10
41461:M 03 Oct 2025 00:17:44.741 * Node 35062b4ed705e54f43812eee9bd41c6e0c496627 () is now a replica of node 67ad6ccba4ffa61b55df01fd0d1355861a410752 () in shard dabdff46766cdd52ecc930db51940ca650998a10
41461:M 03 Oct 2025 00:17:44.776 * Node c45d56ab32fa1e5489b5acfb5a34e83182de37ec () is no longer primary of shard bd50c82fce837dbf5dd9d3515ad26b3215991477; removed all 0 slot(s) it used to own
41461:M 03 Oct 2025 00:17:44.776 * Node c45d56ab32fa1e5489b5acfb5a34e83182de37ec () is now part of shard 0d7c55a266935f7d40a0dd9fea945dc3b7dd684a
41461:M 03 Oct 2025 00:17:44.776 * Node c45d56ab32fa1e5489b5acfb5a34e83182de37ec () is now a replica of node e53c62e73a1912b3a926e600a08bd50139ecba85 () in shard 0d7c55a266935f7d40a0dd9fea945dc3b7dd684a
41461:S 03 Oct 2025 00:17:44.797 * Connecting to PRIMARY 127.0.0.1:22683
41461:S 03 Oct 2025 00:17:44.797 * PRIMARY <-> REPLICA sync started
41461:S 03 Oct 2025 00:17:44.803 * Non blocking connect for SYNC fired the event.
41461:S 03 Oct 2025 00:17:44.818 * Primary replied to PING, replication can continue...
41461:S 03 Oct 2025 00:17:44.826 # DEBUG LOG: ========== I am replica 8 ==========
41461:S 03 Oct 2025 00:17:44.841 * Partial resynchronization not possible (no cached primary)
41461:S 03 Oct 2025 00:17:44.877 * Full resync from primary: a4895e2ff557b00203d7548d7cd3595e64458b22:0
41461:S 03 Oct 2025 00:17:44.881 * Replica main thread creating Bio thread to save RDB to disk
41461:S 03 Oct 2025 00:17:44.882 * Replica bio thread: PRIMARY <-> REPLICA sync: receiving streamed RDB from primary with EOF to disk
41461:S 03 Oct 2025 00:17:44.882 * Replica bio thread: Done downloading RDB
41461:S 03 Oct 2025 00:17:45.480 * Replica main thread detected RDB download completion in Bio thread
41461:S 03 Oct 2025 00:17:45.480 * Loading the RDB and finalizing primary-replica sync...
41461:S 03 Oct 2025 00:17:45.483 * PRIMARY <-> REPLICA sync: Flushing old data
41461:S 03 Oct 2025 00:17:45.483 * PRIMARY <-> REPLICA sync: Loading DB in memory
41461:S 03 Oct 2025 00:17:45.483 * Loading RDB produced by Valkey version 255.255.255
41461:S 03 Oct 2025 00:17:45.483 * RDB age 1 seconds
41461:S 03 Oct 2025 00:17:45.483 * RDB memory usage when created 2.61 Mb
41461:S 03 Oct 2025 00:17:45.483 * Done loading RDB, keys loaded: 0, keys expired: 0.
41461:S 03 Oct 2025 00:17:45.483 * PRIMARY <-> REPLICA sync: Finished with success
### Starting test Cluster is up in tests/unit/cluster/failover.tcl
### Starting test Cluster is writable in tests/unit/cluster/failover.tcl
41461:S 03 Oct 2025 00:17:55.743 - Accepted 127.0.0.1:40449
41461:S 03 Oct 2025 00:17:56.089 - Client closed connection id=7 addr=127.0.0.1:40449 laddr=127.0.0.1:22677 fd=30 name= age=1 idle=1 flags=N capa= db=0 sub=0 psub=0 ssub=0 multi=-1 watch=0 qbuf=0 qbuf-free=16384 argv-mem=0 multi-mem=0 rbs=1024 rbp=0 obl=0 oll=0 omem=0 tot-mem=17990 events=r cmd=NULL user=default redir=-1 resp=2 lib-name= lib-ver= tot-net-in=0 tot-net-out=0 tot-cmds=0
### Starting test Killing the first primary node in tests/unit/cluster/failover.tcl
### Starting test Wait for failover in tests/unit/cluster/failover.tcl
41461:S 03 Oct 2025 00:17:57.768 - DB 0: 30 keys (0 volatile) in 210 slots HT.
41461:S 03 Oct 2025 00:17:59.413 * NODE a2e2120e124856107e40731842c94842658192a4 () possibly failing.
41461:S 03 Oct 2025 00:17:59.517 - Node e53c62e73a1912b3a926e600a08bd50139ecba85 () reported node a2e2120e124856107e40731842c94842658192a4 () as not reachable.
41461:S 03 Oct 2025 00:17:59.517 - Node 67ad6ccba4ffa61b55df01fd0d1355861a410752 () reported node a2e2120e124856107e40731842c94842658192a4 () as not reachable.
41461:S 03 Oct 2025 00:17:59.517 * Marking node a2e2120e124856107e40731842c94842658192a4 () as failing (quorum reached).
41461:S 03 Oct 2025 00:17:59.517 # Cluster state changed: fail
41461:S 03 Oct 2025 00:17:59.517 # Cluster is currently down: At least one hash slot is not served by any available node. Please check the 'cluster-require-full-coverage' configuration.
### Starting test Cluster should eventually be up again in tests/unit/cluster/failover.tcl
41461:S 03 Oct 2025 00:17:59.857 * Reconfiguring node 7259d703afbedd95ac3e7061030e6ca18539ac51 () as primary for shard c8987b2a50fbe392e5a95167e6e6c609a200206f
41461:S 03 Oct 2025 00:17:59.857 * Mismatch in topology information for sender node 7259d703afbedd95ac3e7061030e6ca18539ac51 () in shard c8987b2a50fbe392e5a95167e6e6c609a200206f
41461:S 03 Oct 2025 00:17:59.858 * Cluster state changed: ok
41461:S 03 Oct 2025 00:17:59.887 * Reconfiguring node dc713e69651e957f21e3e44af1c33cd06439f81f () as primary for shard c8987b2a50fbe392e5a95167e6e6c609a200206f
41461:S 03 Oct 2025 00:17:59.887 * Mismatch in topology information for sender node dc713e69651e957f21e3e44af1c33cd06439f81f () in shard c8987b2a50fbe392e5a95167e6e6c609a200206f
41461:S 03 Oct 2025 00:17:59.906 * Node 7259d703afbedd95ac3e7061030e6ca18539ac51 () is now a replica of node dc713e69651e957f21e3e44af1c33cd06439f81f () in shard c8987b2a50fbe392e5a95167e6e6c609a200206f
### Starting test Restarting the previously killed primary node in tests/unit/cluster/failover.tcl
### Starting test Instance #0 gets converted into a replica in tests/unit/cluster/failover.tcl
41461:S 03 Oct 2025 00:18:00.701 * Mismatch in topology information for sender node a2e2120e124856107e40731842c94842658192a4 () in shard c8987b2a50fbe392e5a95167e6e6c609a200206f
41461:S 03 Oct 2025 00:18:00.701 - Node a2e2120e124856107e40731842c94842658192a4 () has old slots configuration, sending an UPDATE message about dc713e69651e957f21e3e44af1c33cd06439f81f ()
41461:S 03 Oct 2025 00:18:00.722 * Clear FAIL state for node a2e2120e124856107e40731842c94842658192a4 (): primary without slots is reachable again.
41461:S 03 Oct 2025 00:18:00.722 * Node a2e2120e124856107e40731842c94842658192a4 () is now a replica of node dc713e69651e957f21e3e44af1c33cd06439f81f () in shard c8987b2a50fbe392e5a95167e6e6c609a200206f
41461:S 03 Oct 2025 00:18:00.738 - Node e53c62e73a1912b3a926e600a08bd50139ecba85 () reported node a2e2120e124856107e40731842c94842658192a4 () is back online.
### Starting test Make sure the replicas always get the different ranks in tests/unit/cluster/failover.tcl
41461:S 03 Oct 2025 00:18:01.145 - Connection with Node a2e2120e124856107e40731842c94842658192a4 at 127.0.0.1:32685 failed: Connection refused
41461:S 03 Oct 2025 00:18:01.355 - Connection with Node a2e2120e124856107e40731842c94842658192a4 at 127.0.0.1:32685 failed: Connection refused
41461:S 03 Oct 2025 00:18:01.355 - Connection with Node 67ad6ccba4ffa61b55df01fd0d1355861a410752 at 127.0.0.1:32684 failed: Connection refused
41461:S 03 Oct 2025 00:18:01.450 - Client closed connection id=5 addr=127.0.0.1:22683 laddr=127.0.0.1:58238 fd=29 name= age=16 idle=5 flags=M capa= db=0 sub=0 psub=0 ssub=0 multi=-1 watch=0 qbuf=0 qbuf-free=32768 argv-mem=0 multi-mem=0 rbs=1024 rbp=37 obl=0 oll=0 omem=0 tot-mem=34374 events=r cmd=set user=(superuser) redir=-1 resp=2 lib-name= lib-ver= tot-net-in=1683 tot-net-out=561 tot-cmds=32
41461:S 03 Oct 2025 00:18:01.450 * Connection with primary lost.
41461:S 03 Oct 2025 00:18:01.450 * Caching the disconnected primary state.
41461:S 03 Oct 2025 00:18:01.450 * Reconnecting to PRIMARY 127.0.0.1:22683
41461:S 03 Oct 2025 00:18:01.451 * PRIMARY <-> REPLICA sync started
41461:S 03 Oct 2025 00:18:01.451 # Error condition on socket for SYNC: Connection refused
41461:S 03 Oct 2025 00:18:01.458 - Connection with Node e53c62e73a1912b3a926e600a08bd50139ecba85 at 127.0.0.1:32683 failed: Connection refused
41461:S 03 Oct 2025 00:18:01.562 - Connection with Node a2e2120e124856107e40731842c94842658192a4 at 127.0.0.1:32685 failed: Connection refused
41461:S 03 Oct 2025 00:18:01.562 - Connection with Node 67ad6ccba4ffa61b55df01fd0d1355861a410752 at 127.0.0.1:32684 failed: Connection refused
41461:S 03 Oct 2025 00:18:01.562 - Connection with Node e53c62e73a1912b3a926e600a08bd50139ecba85 at 127.0.0.1:32683 failed: Connection refused
41461:S 03 Oct 2025 00:18:01.663 - Connection with Node 7259d703afbedd95ac3e7061030e6ca18539ac51 at 127.0.0.1:32682 failed: Connection refused
41461:S 03 Oct 2025 00:18:01.663 - Connection with Node 67ad6ccba4ffa61b55df01fd0d1355861a410752 at 127.0.0.1:32684 failed: Connection refused
41461:S 03 Oct 2025 00:18:01.663 - Connection with Node e53c62e73a1912b3a926e600a08bd50139ecba85 at 127.0.0.1:32683 failed: Connection refused
41461:S 03 Oct 2025 00:18:01.765 - Connection with Node a2e2120e124856107e40731842c94842658192a4 at 127.0.0.1:32685 failed: Connection refused
41461:S 03 Oct 2025 00:18:01.765 - Connection with Node 7259d703afbedd95ac3e7061030e6ca18539ac51 at 127.0.0.1:32682 failed: Connection refused
41461:S 03 Oct 2025 00:18:01.765 - Connection with Node 67ad6ccba4ffa61b55df01fd0d1355861a410752 at 127.0.0.1:32684 failed: Connection refused
41461:S 03 Oct 2025 00:18:01.765 - Connection with Node e53c62e73a1912b3a926e600a08bd50139ecba85 at 127.0.0.1:32683 failed: Connection refused
41461:S 03 Oct 2025 00:18:01.866 * Connecting to PRIMARY 127.0.0.1:22683
41461:S 03 Oct 2025 00:18:01.871 * PRIMARY <-> REPLICA sync started
41461:S 03 Oct 2025 00:18:01.871 # Error condition on socket for SYNC: Connection refused
41461:S 03 Oct 2025 00:18:01.973 - Connection with Node a2e2120e124856107e40731842c94842658192a4 at 127.0.0.1:32685 failed: Connection refused
41461:S 03 Oct 2025 00:18:01.973 - Connection with Node c4d9ce4d28e32e5f7f74219c2cddc7a122c26295 at 127.0.0.1:32681 failed: Connection refused
41461:S 03 Oct 2025 00:18:01.973 - Connection with Node 7259d703afbedd95ac3e7061030e6ca18539ac51 at 127.0.0.1:32682 failed: Connection refused
41461:S 03 Oct 2025 00:18:01.973 - Connection with Node 67ad6ccba4ffa61b55df01fd0d1355861a410752 at 127.0.0.1:32684 failed: Connection refused
41461:S 03 Oct 2025 00:18:01.973 - Connection with Node e53c62e73a1912b3a926e600a08bd50139ecba85 at 127.0.0.1:32683 failed: Connection refused
41461:S 03 Oct 2025 00:18:02.178 - Connection with Node c45d56ab32fa1e5489b5acfb5a34e83182de37ec at 127.0.0.1:32680 failed: Connection refused
41461:S 03 Oct 2025 00:18:02.178 - Connection with Node a2e2120e124856107e40731842c94842658192a4 at 127.0.0.1:32685 failed: Connection refused
41461:S 03 Oct 2025 00:18:02.178 - Connection with Node c4d9ce4d28e32e5f7f74219c2cddc7a122c26295 at 127.0.0.1:32681 failed: Connection refused
41461:S 03 Oct 2025 00:18:02.178 - Connection with Node 7259d703afbedd95ac3e7061030e6ca18539ac51 at 127.0.0.1:32682 failed: Connection refused
41461:S 03 Oct 2025 00:18:02.178 - Connection with Node 67ad6ccba4ffa61b55df01fd0d1355861a410752 at 127.0.0.1:32684 failed: Connection refused
41461:S 03 Oct 2025 00:18:02.178 - Connection with Node e53c62e73a1912b3a926e600a08bd50139ecba85 at 127.0.0.1:32683 failed: Connection refused
41461:S 03 Oct 2025 00:18:02.380 - Connection with Node c45d56ab32fa1e5489b5acfb5a34e83182de37ec at 127.0.0.1:32680 failed: Connection refused
41461:S 03 Oct 2025 00:18:02.380 - Connection with Node a2e2120e124856107e40731842c94842658192a4 at 127.0.0.1:32685 failed: Connection refused
41461:S 03 Oct 2025 00:18:02.380 - Connection with Node dc713e69651e957f21e3e44af1c33cd06439f81f at 127.0.0.1:32679 failed: Connection refused
41461:S 03 Oct 2025 00:18:02.380 - Connection with Node c4d9ce4d28e32e5f7f74219c2cddc7a122c26295 at 127.0.0.1:32681 failed: Connection refused
41461:S 03 Oct 2025 00:18:02.380 - Connection with Node 7259d703afbedd95ac3e7061030e6ca18539ac51 at 127.0.0.1:32682 failed: Connection refused
41461:S 03 Oct 2025 00:18:02.380 - Connection with Node 67ad6ccba4ffa61b55df01fd0d1355861a410752 at 127.0.0.1:32684 failed: Connection refused
41461:S 03 Oct 2025 00:18:02.380 - Connection with Node e53c62e73a1912b3a926e600a08bd50139ecba85 at 127.0.0.1:32683 failed: Connection refused
===== End of server log (pid 41461) =====

Metadata

Metadata

Assignees

Labels

help wantedExternal contributions would be appreciatedtest-failureAn issue indicating a test failure

Type

No type

Fields

No fields configured for issues without a type.

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions