Preventing temporary circular replication and slot loss in Redis Cluster Failover #13018

Description

@MagicalLas

I've encountered an issue with some nodes in a Redis cluster during a manual failover, where the state seen by certain nodes becomes incorrect.

The normal scenario for a failover is as follows:

sequenceDiagram
actor user
user->>NodeA: redis-cli cluster failover
NodeA->>NodeB: manual failover start
activate NodeB
NodeB->>NodeA: ping with offset
NodeB->>NodeA: ping with offset
NodeA->>NodeC: auth failover
NodeC->>NodeA: vote
Note over NodeA: cluster nodes<br/>NodeA: master, 0-100<br/>NodeB: master, 0-0<br/>NodeC: master, 101-200
NodeA->>NodeB: PONG, i'm master, slot 0-100
Note over NodeB: cluster nodes<br/>NodeA: master, 0-100<br/>NodeB: replica of NodeA<br/>NodeC: master, 101-200
deactivate NodeB
NodeA->>NodeC: PONG, i'm master, slot 0-100
Note over NodeC: cluster nodes<br/>NodeA: master, 0-100<br/>NodeB: master, 0-0<br/>NodeC: master, 101-200
NodeB->>NodeA: PONG, i'm replica
Note over NodeA: cluster nodes<br/>NodeA: master, 0-100<br/>NodeB: replica of NodeA<br/>NodeC: master, 101-200
NodeB->>NodeC: PONG, i'm replica
Note over NodeC: cluster nodes<br/>NodeA: master, 0-100<br/>NodeB: replica of NodeA<br/>NodeC: master, 101-200

However, if NodeA's message is not delivered due to network latency or other reasons, the state viewed by NodeC becomes incorrect:

sequenceDiagram
actor user
user->>NodeA: redis-cli cluster failover
NodeA->>NodeB: manual failover start
activate NodeB
NodeB->>NodeA: ping with offset
NodeB->>NodeA: ping with offset
NodeA->>NodeC: auth failover
NodeC->>NodeA: vote
Note over NodeA: cluster nodes<br/>NodeA: master, 0-100<br/>NodeB: master, 0-0<br/>NodeC: master, 101-200
NodeA->>NodeB: PONG, i'm master, slot 0-100
Note over NodeB: cluster nodes<br/>NodeA: master, 0-100<br/>NodeB: replica of NodeA<br/>NodeC: master, 101-200
deactivate NodeB
NodeB->>NodeA: PONG, i'm replica
Note over NodeA: cluster nodes<br/>NodeA: master, 0-100<br/>NodeB: replica of NodeA<br/>NodeC: master, 101-200
NodeB->>NodeC: PONG, i'm replica
Note over NodeC: cluster nodes<br/>NodeA: replica of NodeB<br/>NodeB: replica of NodeA<br/>NodeC: master, 101-200
Note over NodeA: delayed for some reason...
NodeA->>NodeC: PONG, i'm master, slot 0-100
Note over NodeC: cluster nodes<br/>NodeA: master, 0-100<br/>NodeB: replica of NodeA<br/>NodeC: master, 101-200

In this case, NodeC recognizes NodeA and NodeB as being in a circular replication state, and some slots are lost. This state persists until NodeA sends a PONG to NodeC. This situation can be easily reproduced by dropping packets from NodeA to NodeC using iptables.
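The out-of-order delivery above can be modelled with a small sketch. This is a simplified toy model of NodeC's table, not Redis source code: `NodeView`, `apply_pong`, `has_replication_cycle`, and `covered_slots` are hypothetical names, and the real cluster bus also uses config epochs, which are ignored here.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, Optional, Set


@dataclass
class NodeView:
    """How an observer (here NodeC) currently sees one node. Hypothetical model."""
    master: Optional[str] = None          # None means "seen as a master"
    slots: Set[int] = field(default_factory=set)


def apply_pong(view: Dict[str, NodeView], sender: str,
               master: Optional[str], slots: Iterable[int] = ()) -> None:
    """Naively overwrite the observer's record of `sender` with the PONG's claims."""
    entry = view[sender]
    entry.master = master
    if master is not None:
        entry.slots = set()               # a replica owns no slots
        return
    entry.slots = set(slots)
    for name, other in view.items():      # claimed slots leave every other node
        if name != sender:
            other.slots -= entry.slots


def has_replication_cycle(view: Dict[str, NodeView]) -> bool:
    """True if two nodes are each recorded as the other's replica."""
    return any(e.master is not None and view[e.master].master == name
               for name, e in view.items())


def covered_slots(view: Dict[str, NodeView]) -> Set[int]:
    """All slots that the observer believes some node owns."""
    return set().union(*(e.slots for e in view.values()))


def initial_view_of_node_c() -> Dict[str, NodeView]:
    """NodeC's table before the failover: NodeB is the master of NodeA."""
    return {
        "NodeA": NodeView(master="NodeB"),
        "NodeB": NodeView(slots=set(range(0, 101))),
        "NodeC": NodeView(slots=set(range(101, 201))),
    }


if __name__ == "__main__":
    view = initial_view_of_node_c()
    apply_pong(view, "NodeB", master="NodeA")            # arrives first
    print(has_replication_cycle(view), len(covered_slots(view)))
    apply_pong(view, "NodeA", master=None, slots=range(0, 101))  # delayed PONG
    print(has_replication_cycle(view), len(covered_slots(view)))
```

In this model, NodeC sees a cycle and no owner for slots 0-100 between the two calls, and the view only heals once NodeA's delayed PONG is applied.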

I propose delaying the transition when applying a node's status change would produce an inconsistent state. Specifically, if a sender announces that it has become a replica, but the receiver still sees the sender as owning slots and sees the sender's new master as a replica of the sender, then the receiver should delay marking the sender as a replica. This can prevent temporary circular replication and slot loss, and may also avoid related problems (e.g. #10489 (comment), redis/lettuce#2578), though I'm not sure about that.
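The condition in this proposal can be written as a small predicate. This is a minimal sketch under the assumption that the receiver keeps a per-node record of role and slots; `NodeView` and `should_defer_demotion` are hypothetical names for illustration and do not correspond to functions in the Redis source.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set


@dataclass
class NodeView:
    """The receiver's current record of one node. Hypothetical model."""
    master: Optional[str] = None          # None means "seen as a master"
    slots: Set[int] = field(default_factory=set)


def should_defer_demotion(view: Dict[str, NodeView],
                          sender: str, claimed_master: str) -> bool:
    """Decide whether to postpone recording `sender` as a replica.

    Deferral applies only when both conditions hold in the receiver's view:
      1. the sender is still recorded as owning slots, and
      2. the claimed new master is still recorded as a replica of the sender.
    Applying the change in that state would create a replication cycle and
    leave the sender's slots without an owner.
    """
    new_master = view.get(claimed_master)
    if new_master is None:                # unknown node: nothing to protect
        return False
    sender_still_owns_slots = bool(view[sender].slots)
    new_master_is_our_replica = new_master.master == sender
    return sender_still_owns_slots and new_master_is_our_replica


if __name__ == "__main__":
    # NodeC's view before NodeA's promotion PONG has arrived.
    view = {
        "NodeA": NodeView(master="NodeB"),
        "NodeB": NodeView(slots=set(range(0, 101))),
        "NodeC": NodeView(slots=set(range(101, 201))),
    }
    print(should_defer_demotion(view, "NodeB", "NodeA"))
```

Once the promotion message from the new master has been applied (the new master is recorded as a master and the slots have moved), the predicate returns False and the demotion can be recorded normally.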

The cost of this behavior is that, during a network partition, the transition of a master to a replica is delayed. However, the scenario where the old master receives the message to become a replica before the message promoting the new master is rare and unlikely to occur in most situations. Additionally, experiencing one or two extra MOVED errors because of this delay is safer than not being able to find a node at all.

If you need any further explanation or details about the situation, please let me know.
