Fix data loss when replica do a failover with a old history repl offset#885
Merged
PingXie merged 5 commits intoAug 21, 2024
Merged
Conversation
Our current replica can initiate a failover without restriction when it detects that the primary node is offline. This is generally not a problem. However, consider the following scenarios: 1. In slot migration, a primary loses its last slot and then becomes a replica. When it is fully synchronized with the new primary, the new primary downs. 2. In CLUSTER REPLICATE command, a replica becomes a replica of another primary. When it is fully synchronized with the new primary, the new primary downs. In the above scenario, case 1 may cause the empty primary to be elected as the new primary, resulting in primary data loss. Case 2 may cause the non-empty replica to be elected as the new primary, resulting in data loss and confusion. The reason is that we have cached primary logic, which is used for psync. In the above scenario, when clusterSetPrimary is called, myself will cache server.primary in server.cached_primary for psync. In replicationGetReplicaOffset, we get server.cached_primary->reploff for offset, gossip it and rank it, which causes the replica to use the old historical offset to initiate failover, and it get a good rank, initiates election first, and then is elected as the new primary. The main problem here is that when the replica has not completed full sync, it may get the historical offset in replicationGetReplicaOffset. The fix is to clear cached_primary in these places where full sync is obviously needed, and let the replica use offset == 0 to participate in the election. In this way, this unhealthy replica has a worse rank and is not easy to be elected. Of course, it is possible that it will be elected with offset == 0. In the future, we may need to prohibit the replica with offset == 0 from having the right to initiate elections. Signed-off-by: Binbin <binloveplay1314@qq.com>
PingXie
reviewed
Aug 12, 2024
PingXie
left a comment
Member
There was a problem hiding this comment.
This is a great bug. The fix LGTM overall.
Thanks, Binbin!
Co-authored-by: Ping Xie <pingxie@google.com> Signed-off-by: Binbin <binloveplay1314@qq.com>
2a1707e to
f9759b4
Compare
Member
Author
|
force-push for the DCO. |
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## unstable #885 +/- ##
============================================
+ Coverage 70.39% 70.58% +0.18%
============================================
Files 112 112
Lines 61465 61509 +44
============================================
+ Hits 43271 43417 +146
+ Misses 18194 18092 -102
🚀 New features to boost your workflow:
|
Signed-off-by: Binbin <binloveplay1314@qq.com>
ranshid
approved these changes
Aug 20, 2024
ranshid
left a comment
Member
There was a problem hiding this comment.
Overall LGTM - minor test comment
Signed-off-by: Binbin <binloveplay1314@qq.com>
PingXie
approved these changes
Aug 20, 2024
PingXie
left a comment
Member
There was a problem hiding this comment.
LGTM. Nits only. Thanks Binbin.
Signed-off-by: Binbin <binloveplay1314@qq.com> Co-authored-by: Ping Xie <pingxie@google.com>
Member
Author
|
@PingXie thanks for the review! i think i took care of all the comments, please take another look. |
enjoy-binbin
added a commit
that referenced
this pull request
Aug 21, 2024
…et (#885) Our current replica can initiate a failover without restriction when it detects that the primary node is offline. This is generally not a problem. However, consider the following scenarios: 1. In slot migration, a primary loses its last slot and then becomes a replica. When it is fully synchronized with the new primary, the new primary downs. 2. In CLUSTER REPLICATE command, a replica becomes a replica of another primary. When it is fully synchronized with the new primary, the new primary downs. In the above scenario, case 1 may cause the empty primary to be elected as the new primary, resulting in primary data loss. Case 2 may cause the non-empty replica to be elected as the new primary, resulting in data loss and confusion. The reason is that we have cached primary logic, which is used for psync. In the above scenario, when clusterSetPrimary is called, myself will cache server.primary in server.cached_primary for psync. In replicationGetReplicaOffset, we get server.cached_primary->reploff for offset, gossip it and rank it, which causes the replica to use the old historical offset to initiate failover, and it get a good rank, initiates election first, and then is elected as the new primary. The main problem here is that when the replica has not completed full sync, it may get the historical offset in replicationGetReplicaOffset. The fix is to clear cached_primary in these places where full sync is obviously needed, and let the replica use offset == 0 to participate in the election. In this way, this unhealthy replica has a worse rank and is not easy to be elected. Of course, it is possible that it will be elected with offset == 0. In the future, we may need to prohibit the replica with offset == 0 from having the right to initiate elections. Another point worth mentioning, in above cases: 1. In the ROLE command, the replica status will be handshake, and the offset will be -1. 2. Before this PR, in the CLUSTER SHARD command, the replica status will be online, and the offset will be the old cached value (which is wrong). 3. After this PR, in the CLUSTER SHARD, the replica status will be loading, and the offset will be 0. Signed-off-by: Binbin <binloveplay1314@qq.com>
mapleFU
pushed a commit
to mapleFU/valkey
that referenced
this pull request
Aug 21, 2024
…et (valkey-io#885) Our current replica can initiate a failover without restriction when it detects that the primary node is offline. This is generally not a problem. However, consider the following scenarios: 1. In slot migration, a primary loses its last slot and then becomes a replica. When it is fully synchronized with the new primary, the new primary downs. 2. In CLUSTER REPLICATE command, a replica becomes a replica of another primary. When it is fully synchronized with the new primary, the new primary downs. In the above scenario, case 1 may cause the empty primary to be elected as the new primary, resulting in primary data loss. Case 2 may cause the non-empty replica to be elected as the new primary, resulting in data loss and confusion. The reason is that we have cached primary logic, which is used for psync. In the above scenario, when clusterSetPrimary is called, myself will cache server.primary in server.cached_primary for psync. In replicationGetReplicaOffset, we get server.cached_primary->reploff for offset, gossip it and rank it, which causes the replica to use the old historical offset to initiate failover, and it get a good rank, initiates election first, and then is elected as the new primary. The main problem here is that when the replica has not completed full sync, it may get the historical offset in replicationGetReplicaOffset. The fix is to clear cached_primary in these places where full sync is obviously needed, and let the replica use offset == 0 to participate in the election. In this way, this unhealthy replica has a worse rank and is not easy to be elected. Of course, it is possible that it will be elected with offset == 0. In the future, we may need to prohibit the replica with offset == 0 from having the right to initiate elections. Another point worth mentioning, in above cases: 1. In the ROLE command, the replica status will be handshake, and the offset will be -1. 2. Before this PR, in the CLUSTER SHARD command, the replica status will be online, and the offset will be the old cached value (which is wrong). 3. After this PR, in the CLUSTER SHARD, the replica status will be loading, and the offset will be 0. Signed-off-by: Binbin <binloveplay1314@qq.com> Signed-off-by: mwish <maplewish117@gmail.com>
enjoy-binbin
added a commit
that referenced
this pull request
Aug 22, 2024
…n CLUSTER REPLICATE (#884) If n is already myself primary, there is no need to re-establish the replication connection. In the past we allow a replica node to reconnect with its primary via this CLUSTER REPLICATE command, it will use psync. But since #885, we will assume that a full sync is needed in this case, so if we don't do this, the replica will always use full sync. Signed-off-by: Binbin <binloveplay1314@qq.com> Co-authored-by: Ping Xie <pingxie@google.com>
mapleFU
pushed a commit
to mapleFU/valkey
that referenced
this pull request
Aug 22, 2024
…et (valkey-io#885) Our current replica can initiate a failover without restriction when it detects that the primary node is offline. This is generally not a problem. However, consider the following scenarios: 1. In slot migration, a primary loses its last slot and then becomes a replica. When it is fully synchronized with the new primary, the new primary downs. 2. In CLUSTER REPLICATE command, a replica becomes a replica of another primary. When it is fully synchronized with the new primary, the new primary downs. In the above scenario, case 1 may cause the empty primary to be elected as the new primary, resulting in primary data loss. Case 2 may cause the non-empty replica to be elected as the new primary, resulting in data loss and confusion. The reason is that we have cached primary logic, which is used for psync. In the above scenario, when clusterSetPrimary is called, myself will cache server.primary in server.cached_primary for psync. In replicationGetReplicaOffset, we get server.cached_primary->reploff for offset, gossip it and rank it, which causes the replica to use the old historical offset to initiate failover, and it get a good rank, initiates election first, and then is elected as the new primary. The main problem here is that when the replica has not completed full sync, it may get the historical offset in replicationGetReplicaOffset. The fix is to clear cached_primary in these places where full sync is obviously needed, and let the replica use offset == 0 to participate in the election. In this way, this unhealthy replica has a worse rank and is not easy to be elected. Of course, it is possible that it will be elected with offset == 0. In the future, we may need to prohibit the replica with offset == 0 from having the right to initiate elections. Another point worth mentioning, in above cases: 1. In the ROLE command, the replica status will be handshake, and the offset will be -1. 2. Before this PR, in the CLUSTER SHARD command, the replica status will be online, and the offset will be the old cached value (which is wrong). 3. After this PR, in the CLUSTER SHARD, the replica status will be loading, and the offset will be 0. Signed-off-by: Binbin <binloveplay1314@qq.com> Signed-off-by: mwish <maplewish117@gmail.com>
enjoy-binbin
added a commit
to enjoy-binbin/valkey
that referenced
this pull request
Aug 23, 2024
In valkey-io#885, we only add a shutdown path, there is another path is that the server might got hang by slowlog. This PR added the pause path coverage to cover it. Signed-off-by: Binbin <binloveplay1314@qq.com>
enjoy-binbin
added a commit
to enjoy-binbin/valkey
that referenced
this pull request
Aug 26, 2024
…ard_id In this case, sender is myself's primary, when executing updateShardId, not only the sender's shard_id is updated, but also the shard_id of myself is updated, casuing the subsequent areInSameShard check, that is, the full_sync_required check to fail. This one follow valkey-io#885 and closes valkey-io#942. Signed-off-by: Binbin <binloveplay1314@qq.com>
enjoy-binbin
added a commit
that referenced
this pull request
Aug 28, 2024
In #885, we only add a shutdown path, there is another path is that the server might got hang by slowlog. This PR added the pause path coverage to cover it. Signed-off-by: Binbin <binloveplay1314@qq.com>
enjoy-binbin
added a commit
that referenced
this pull request
Aug 29, 2024
…ard_id (#944) When reconfiguring sub-replica, there may a case that the sub-replica will use the old offset and win the election and cause the data loss if the old primary went down. In this case, sender is myself's primary, when executing updateShardId, not only the sender's shard_id is updated, but also the shard_id of myself is updated, casuing the subsequent areInSameShard check, that is, the full_sync_required check to fail. As part of the recent fix of #885, the sub-replica needs to decide whether a full sync is required or not when switching shards. This shard membership check is supposed to be done against sub-replica's current shard_id, which however was lost in this code path. This then leads to sub-replica joining the other shard with a completely different and incorrect replication history. This is the only place where replicaof state can be updated on this path so the most natural fix would be to pull the chain replication reduction logic into this code block and before the updateShardId call. This one follow #885 and closes #942. Signed-off-by: Binbin <binloveplay1314@qq.com> Co-authored-by: Ping Xie <pingxie@outlook.com>
madolson
pushed a commit
that referenced
this pull request
Sep 2, 2024
…et (#885) Our current replica can initiate a failover without restriction when it detects that the primary node is offline. This is generally not a problem. However, consider the following scenarios: 1. In slot migration, a primary loses its last slot and then becomes a replica. When it is fully synchronized with the new primary, the new primary downs. 2. In CLUSTER REPLICATE command, a replica becomes a replica of another primary. When it is fully synchronized with the new primary, the new primary downs. In the above scenario, case 1 may cause the empty primary to be elected as the new primary, resulting in primary data loss. Case 2 may cause the non-empty replica to be elected as the new primary, resulting in data loss and confusion. The reason is that we have cached primary logic, which is used for psync. In the above scenario, when clusterSetPrimary is called, myself will cache server.primary in server.cached_primary for psync. In replicationGetReplicaOffset, we get server.cached_primary->reploff for offset, gossip it and rank it, which causes the replica to use the old historical offset to initiate failover, and it get a good rank, initiates election first, and then is elected as the new primary. The main problem here is that when the replica has not completed full sync, it may get the historical offset in replicationGetReplicaOffset. The fix is to clear cached_primary in these places where full sync is obviously needed, and let the replica use offset == 0 to participate in the election. In this way, this unhealthy replica has a worse rank and is not easy to be elected. Of course, it is possible that it will be elected with offset == 0. In the future, we may need to prohibit the replica with offset == 0 from having the right to initiate elections. Another point worth mentioning, in above cases: 1. In the ROLE command, the replica status will be handshake, and the offset will be -1. 2. Before this PR, in the CLUSTER SHARD command, the replica status will be online, and the offset will be the old cached value (which is wrong). 3. After this PR, in the CLUSTER SHARD, the replica status will be loading, and the offset will be 0. Signed-off-by: Binbin <binloveplay1314@qq.com>
madolson
pushed a commit
that referenced
this pull request
Sep 2, 2024
…n CLUSTER REPLICATE (#884) If n is already myself primary, there is no need to re-establish the replication connection. In the past we allow a replica node to reconnect with its primary via this CLUSTER REPLICATE command, it will use psync. But since #885, we will assume that a full sync is needed in this case, so if we don't do this, the replica will always use full sync. Signed-off-by: Binbin <binloveplay1314@qq.com> Co-authored-by: Ping Xie <pingxie@google.com>
madolson
pushed a commit
that referenced
this pull request
Sep 2, 2024
In #885, we only add a shutdown path, there is another path is that the server might got hang by slowlog. This PR added the pause path coverage to cover it. Signed-off-by: Binbin <binloveplay1314@qq.com>
madolson
pushed a commit
that referenced
this pull request
Sep 2, 2024
…ard_id (#944) When reconfiguring sub-replica, there may a case that the sub-replica will use the old offset and win the election and cause the data loss if the old primary went down. In this case, sender is myself's primary, when executing updateShardId, not only the sender's shard_id is updated, but also the shard_id of myself is updated, casuing the subsequent areInSameShard check, that is, the full_sync_required check to fail. As part of the recent fix of #885, the sub-replica needs to decide whether a full sync is required or not when switching shards. This shard membership check is supposed to be done against sub-replica's current shard_id, which however was lost in this code path. This then leads to sub-replica joining the other shard with a completely different and incorrect replication history. This is the only place where replicaof state can be updated on this path so the most natural fix would be to pull the chain replication reduction logic into this code block and before the updateShardId call. This one follow #885 and closes #942. Signed-off-by: Binbin <binloveplay1314@qq.com> Co-authored-by: Ping Xie <pingxie@outlook.com>
madolson
pushed a commit
that referenced
this pull request
Sep 3, 2024
…et (#885) Our current replica can initiate a failover without restriction when it detects that the primary node is offline. This is generally not a problem. However, consider the following scenarios: 1. In slot migration, a primary loses its last slot and then becomes a replica. When it is fully synchronized with the new primary, the new primary downs. 2. In CLUSTER REPLICATE command, a replica becomes a replica of another primary. When it is fully synchronized with the new primary, the new primary downs. In the above scenario, case 1 may cause the empty primary to be elected as the new primary, resulting in primary data loss. Case 2 may cause the non-empty replica to be elected as the new primary, resulting in data loss and confusion. The reason is that we have cached primary logic, which is used for psync. In the above scenario, when clusterSetPrimary is called, myself will cache server.primary in server.cached_primary for psync. In replicationGetReplicaOffset, we get server.cached_primary->reploff for offset, gossip it and rank it, which causes the replica to use the old historical offset to initiate failover, and it get a good rank, initiates election first, and then is elected as the new primary. The main problem here is that when the replica has not completed full sync, it may get the historical offset in replicationGetReplicaOffset. The fix is to clear cached_primary in these places where full sync is obviously needed, and let the replica use offset == 0 to participate in the election. In this way, this unhealthy replica has a worse rank and is not easy to be elected. Of course, it is possible that it will be elected with offset == 0. In the future, we may need to prohibit the replica with offset == 0 from having the right to initiate elections. Another point worth mentioning, in above cases: 1. In the ROLE command, the replica status will be handshake, and the offset will be -1. 2. Before this PR, in the CLUSTER SHARD command, the replica status will be online, and the offset will be the old cached value (which is wrong). 3. After this PR, in the CLUSTER SHARD, the replica status will be loading, and the offset will be 0. Signed-off-by: Binbin <binloveplay1314@qq.com>
madolson
pushed a commit
that referenced
this pull request
Sep 3, 2024
…n CLUSTER REPLICATE (#884) If n is already myself primary, there is no need to re-establish the replication connection. In the past we allow a replica node to reconnect with its primary via this CLUSTER REPLICATE command, it will use psync. But since #885, we will assume that a full sync is needed in this case, so if we don't do this, the replica will always use full sync. Signed-off-by: Binbin <binloveplay1314@qq.com> Co-authored-by: Ping Xie <pingxie@google.com>
madolson
pushed a commit
that referenced
this pull request
Sep 3, 2024
In #885, we only add a shutdown path, there is another path is that the server might got hang by slowlog. This PR added the pause path coverage to cover it. Signed-off-by: Binbin <binloveplay1314@qq.com>
madolson
pushed a commit
that referenced
this pull request
Sep 3, 2024
…ard_id (#944) When reconfiguring sub-replica, there may a case that the sub-replica will use the old offset and win the election and cause the data loss if the old primary went down. In this case, sender is myself's primary, when executing updateShardId, not only the sender's shard_id is updated, but also the shard_id of myself is updated, casuing the subsequent areInSameShard check, that is, the full_sync_required check to fail. As part of the recent fix of #885, the sub-replica needs to decide whether a full sync is required or not when switching shards. This shard membership check is supposed to be done against sub-replica's current shard_id, which however was lost in this code path. This then leads to sub-replica joining the other shard with a completely different and incorrect replication history. This is the only place where replicaof state can be updated on this path so the most natural fix would be to pull the chain replication reduction logic into this code block and before the updateShardId call. This one follow #885 and closes #942. Signed-off-by: Binbin <binloveplay1314@qq.com> Co-authored-by: Ping Xie <pingxie@outlook.com>
lilpoozie2005
approved these changes
Sep 11, 2024
PingXie
pushed a commit
to PingXie/valkey
that referenced
this pull request
Sep 14, 2024
…et (valkey-io#885) Our current replica can initiate a failover without restriction when it detects that the primary node is offline. This is generally not a problem. However, consider the following scenarios: 1. In slot migration, a primary loses its last slot and then becomes a replica. When it is fully synchronized with the new primary, the new primary downs. 2. In CLUSTER REPLICATE command, a replica becomes a replica of another primary. When it is fully synchronized with the new primary, the new primary downs. In the above scenario, case 1 may cause the empty primary to be elected as the new primary, resulting in primary data loss. Case 2 may cause the non-empty replica to be elected as the new primary, resulting in data loss and confusion. The reason is that we have cached primary logic, which is used for psync. In the above scenario, when clusterSetPrimary is called, myself will cache server.primary in server.cached_primary for psync. In replicationGetReplicaOffset, we get server.cached_primary->reploff for offset, gossip it and rank it, which causes the replica to use the old historical offset to initiate failover, and it get a good rank, initiates election first, and then is elected as the new primary. The main problem here is that when the replica has not completed full sync, it may get the historical offset in replicationGetReplicaOffset. The fix is to clear cached_primary in these places where full sync is obviously needed, and let the replica use offset == 0 to participate in the election. In this way, this unhealthy replica has a worse rank and is not easy to be elected. Of course, it is possible that it will be elected with offset == 0. In the future, we may need to prohibit the replica with offset == 0 from having the right to initiate elections. Another point worth mentioning, in above cases: 1. In the ROLE command, the replica status will be handshake, and the offset will be -1. 2. Before this PR, in the CLUSTER SHARD command, the replica status will be online, and the offset will be the old cached value (which is wrong). 3. After this PR, in the CLUSTER SHARD, the replica status will be loading, and the offset will be 0. Signed-off-by: Binbin <binloveplay1314@qq.com> Signed-off-by: Ping Xie <pingxie@google.com>
PingXie
pushed a commit
to PingXie/valkey
that referenced
this pull request
Sep 14, 2024
…n CLUSTER REPLICATE (valkey-io#884) If n is already myself primary, there is no need to re-establish the replication connection. In the past we allow a replica node to reconnect with its primary via this CLUSTER REPLICATE command, it will use psync. But since valkey-io#885, we will assume that a full sync is needed in this case, so if we don't do this, the replica will always use full sync. Signed-off-by: Binbin <binloveplay1314@qq.com> Co-authored-by: Ping Xie <pingxie@google.com>
PingXie
pushed a commit
to PingXie/valkey
that referenced
this pull request
Sep 14, 2024
In valkey-io#885, we only add a shutdown path, there is another path is that the server might got hang by slowlog. This PR added the pause path coverage to cover it. Signed-off-by: Binbin <binloveplay1314@qq.com> Signed-off-by: Ping Xie <pingxie@google.com>
PingXie
pushed a commit
to PingXie/valkey
that referenced
this pull request
Sep 14, 2024
…ard_id (valkey-io#944) When reconfiguring sub-replica, there may a case that the sub-replica will use the old offset and win the election and cause the data loss if the old primary went down. In this case, sender is myself's primary, when executing updateShardId, not only the sender's shard_id is updated, but also the shard_id of myself is updated, casuing the subsequent areInSameShard check, that is, the full_sync_required check to fail. As part of the recent fix of valkey-io#885, the sub-replica needs to decide whether a full sync is required or not when switching shards. This shard membership check is supposed to be done against sub-replica's current shard_id, which however was lost in this code path. This then leads to sub-replica joining the other shard with a completely different and incorrect replication history. This is the only place where replicaof state can be updated on this path so the most natural fix would be to pull the chain replication reduction logic into this code block and before the updateShardId call. This one follow valkey-io#885 and closes valkey-io#942. Signed-off-by: Binbin <binloveplay1314@qq.com> Co-authored-by: Ping Xie <pingxie@google.com>
PingXie
pushed a commit
to PingXie/valkey
that referenced
this pull request
Sep 14, 2024
…et (valkey-io#885) Our current replica can initiate a failover without restriction when it detects that the primary node is offline. This is generally not a problem. However, consider the following scenarios: 1. In slot migration, a primary loses its last slot and then becomes a replica. When it is fully synchronized with the new primary, the new primary downs. 2. In CLUSTER REPLICATE command, a replica becomes a replica of another primary. When it is fully synchronized with the new primary, the new primary downs. In the above scenario, case 1 may cause the empty primary to be elected as the new primary, resulting in primary data loss. Case 2 may cause the non-empty replica to be elected as the new primary, resulting in data loss and confusion. The reason is that we have cached primary logic, which is used for psync. In the above scenario, when clusterSetPrimary is called, myself will cache server.primary in server.cached_primary for psync. In replicationGetReplicaOffset, we get server.cached_primary->reploff for offset, gossip it and rank it, which causes the replica to use the old historical offset to initiate failover, and it get a good rank, initiates election first, and then is elected as the new primary. The main problem here is that when the replica has not completed full sync, it may get the historical offset in replicationGetReplicaOffset. The fix is to clear cached_primary in these places where full sync is obviously needed, and let the replica use offset == 0 to participate in the election. In this way, this unhealthy replica has a worse rank and is not easy to be elected. Of course, it is possible that it will be elected with offset == 0. In the future, we may need to prohibit the replica with offset == 0 from having the right to initiate elections. Another point worth mentioning, in above cases: 1. In the ROLE command, the replica status will be handshake, and the offset will be -1. 2. Before this PR, in the CLUSTER SHARD command, the replica status will be online, and the offset will be the old cached value (which is wrong). 3. After this PR, in the CLUSTER SHARD, the replica status will be loading, and the offset will be 0. Signed-off-by: Binbin <binloveplay1314@qq.com> Signed-off-by: Ping Xie <pingxie@google.com>
PingXie
pushed a commit
to PingXie/valkey
that referenced
this pull request
Sep 14, 2024
…n CLUSTER REPLICATE (valkey-io#884) If n is already myself primary, there is no need to re-establish the replication connection. In the past we allow a replica node to reconnect with its primary via this CLUSTER REPLICATE command, it will use psync. But since valkey-io#885, we will assume that a full sync is needed in this case, so if we don't do this, the replica will always use full sync. Signed-off-by: Binbin <binloveplay1314@qq.com> Co-authored-by: Ping Xie <pingxie@google.com>
PingXie
pushed a commit
to PingXie/valkey
that referenced
this pull request
Sep 14, 2024
In valkey-io#885, we only add a shutdown path, there is another path is that the server might got hang by slowlog. This PR added the pause path coverage to cover it. Signed-off-by: Binbin <binloveplay1314@qq.com> Signed-off-by: Ping Xie <pingxie@google.com>
PingXie
pushed a commit
to PingXie/valkey
that referenced
this pull request
Sep 14, 2024
…ard_id (valkey-io#944) When reconfiguring sub-replica, there may a case that the sub-replica will use the old offset and win the election and cause the data loss if the old primary went down. In this case, sender is myself's primary, when executing updateShardId, not only the sender's shard_id is updated, but also the shard_id of myself is updated, casuing the subsequent areInSameShard check, that is, the full_sync_required check to fail. As part of the recent fix of valkey-io#885, the sub-replica needs to decide whether a full sync is required or not when switching shards. This shard membership check is supposed to be done against sub-replica's current shard_id, which however was lost in this code path. This then leads to sub-replica joining the other shard with a completely different and incorrect replication history. This is the only place where replicaof state can be updated on this path so the most natural fix would be to pull the chain replication reduction logic into this code block and before the updateShardId call. This one follow valkey-io#885 and closes valkey-io#942. Signed-off-by: Binbin <binloveplay1314@qq.com> Co-authored-by: Ping Xie <pingxie@google.com>
PingXie
pushed a commit
to PingXie/valkey
that referenced
this pull request
Sep 14, 2024
…et (valkey-io#885) Our current replica can initiate a failover without restriction when it detects that the primary node is offline. This is generally not a problem. However, consider the following scenarios: 1. In slot migration, a primary loses its last slot and then becomes a replica. When it is fully synchronized with the new primary, the new primary downs. 2. In CLUSTER REPLICATE command, a replica becomes a replica of another primary. When it is fully synchronized with the new primary, the new primary downs. In the above scenario, case 1 may cause the empty primary to be elected as the new primary, resulting in primary data loss. Case 2 may cause the non-empty replica to be elected as the new primary, resulting in data loss and confusion. The reason is that we have cached primary logic, which is used for psync. In the above scenario, when clusterSetPrimary is called, myself will cache server.primary in server.cached_primary for psync. In replicationGetReplicaOffset, we get server.cached_primary->reploff for offset, gossip it and rank it, which causes the replica to use the old historical offset to initiate failover, and it get a good rank, initiates election first, and then is elected as the new primary. The main problem here is that when the replica has not completed full sync, it may get the historical offset in replicationGetReplicaOffset. The fix is to clear cached_primary in these places where full sync is obviously needed, and let the replica use offset == 0 to participate in the election. In this way, this unhealthy replica has a worse rank and is not easy to be elected. Of course, it is possible that it will be elected with offset == 0. In the future, we may need to prohibit the replica with offset == 0 from having the right to initiate elections. Another point worth mentioning, in above cases: 1. In the ROLE command, the replica status will be handshake, and the offset will be -1. 2. Before this PR, in the CLUSTER SHARD command, the replica status will be online, and the offset will be the old cached value (which is wrong). 3. After this PR, in the CLUSTER SHARD, the replica status will be loading, and the offset will be 0. Signed-off-by: Binbin <binloveplay1314@qq.com> Signed-off-by: Ping Xie <pingxie@google.com>
PingXie
pushed a commit
to PingXie/valkey
that referenced
this pull request
Sep 14, 2024
…n CLUSTER REPLICATE (valkey-io#884) If n is already myself primary, there is no need to re-establish the replication connection. In the past we allow a replica node to reconnect with its primary via this CLUSTER REPLICATE command, it will use psync. But since valkey-io#885, we will assume that a full sync is needed in this case, so if we don't do this, the replica will always use full sync. Signed-off-by: Binbin <binloveplay1314@qq.com> Co-authored-by: Ping Xie <pingxie@google.com>
PingXie
pushed a commit
to PingXie/valkey
that referenced
this pull request
Sep 14, 2024
In valkey-io#885, we only add a shutdown path, there is another path is that the server might got hang by slowlog. This PR added the pause path coverage to cover it. Signed-off-by: Binbin <binloveplay1314@qq.com> Signed-off-by: Ping Xie <pingxie@google.com>
PingXie
pushed a commit
to PingXie/valkey
that referenced
this pull request
Sep 14, 2024
…ard_id (valkey-io#944) When reconfiguring sub-replica, there may a case that the sub-replica will use the old offset and win the election and cause the data loss if the old primary went down. In this case, sender is myself's primary, when executing updateShardId, not only the sender's shard_id is updated, but also the shard_id of myself is updated, casuing the subsequent areInSameShard check, that is, the full_sync_required check to fail. As part of the recent fix of valkey-io#885, the sub-replica needs to decide whether a full sync is required or not when switching shards. This shard membership check is supposed to be done against sub-replica's current shard_id, which however was lost in this code path. This then leads to sub-replica joining the other shard with a completely different and incorrect replication history. This is the only place where replicaof state can be updated on this path so the most natural fix would be to pull the chain replication reduction logic into this code block and before the updateShardId call. This one follow valkey-io#885 and closes valkey-io#942. Signed-off-by: Binbin <binloveplay1314@qq.com> Co-authored-by: Ping Xie <pingxie@google.com>
PingXie
pushed a commit
to PingXie/valkey
that referenced
this pull request
Sep 14, 2024
…ard_id (valkey-io#944) When reconfiguring sub-replica, there may a case that the sub-replica will use the old offset and win the election and cause the data loss if the old primary went down. In this case, sender is myself's primary, when executing updateShardId, not only the sender's shard_id is updated, but also the shard_id of myself is updated, casuing the subsequent areInSameShard check, that is, the full_sync_required check to fail. As part of the recent fix of valkey-io#885, the sub-replica needs to decide whether a full sync is required or not when switching shards. This shard membership check is supposed to be done against sub-replica's current shard_id, which however was lost in this code path. This then leads to sub-replica joining the other shard with a completely different and incorrect replication history. This is the only place where replicaof state can be updated on this path so the most natural fix would be to pull the chain replication reduction logic into this code block and before the updateShardId call. This one follow valkey-io#885 and closes valkey-io#942. Signed-off-by: Binbin <binloveplay1314@qq.com> Co-authored-by: Ping Xie <pingxie@google.com> Signed-off-by: Ping Xie <pingxie@google.com>
zuiderkwast
added a commit
that referenced
this pull request
Mar 26, 2026
…#2227) In #2023 (#2209, etc.), we are exploring ways to make failover faster, that is, to minimize the delay. When a node is marked as FAIL and before the failover starts, there is a delay of 500-1000ms. The original purpose of this delay: 1. Allow FAIL to propagate to at least a majority of the primaries. This makes sure they will vote when a replica sends failover auth request. 2. Allow replicas to exchange their offsets, so they will have a correct view of their own rank. We want to minimize this delay while ensuring safety. It is useful for example in these cases: 1. If there is only one replica, then we don't need any delay, or 2. If there are more replicas, with a fast network, the replicas can exchange the offsets very quickly and start the failover within a few milliseconds instead of 500-1000ms. In this PR, when we can be sure that the replica is the best ranked replica, we let it initiate a failover immediately and completely remove the delay. ### How to ensure safety? 1. To make sure this replica has the best rank, it only skips the delay if it is sure that it have the best rank and that all replicas in the same shard agree that the primary is failing. A new flag `CLUSTER_NODE_MY_PRIMARY_FAIL` is introduced to indicate that each replica has marked its primary as FAIL. If all replicas say that the primary is failing, we also know that the offset is not updated, because the offset is not incrementing when the primary is failing. We can skip the delay only if we have received a message from all replicas and they all have set this flag. 2. To make sure the primaries will vote even if they didn't receive the FAIL yet, we use the `CLUSTERMSG_FLAG0_FORCEACK` to make sure they will vote. This is equivalent to broadcasting a FAIL message to all primaries before we broadcast the failover auth request (but cheaper). The race between FAIL (broadcast by A) and AUTH REQUEST (broadcast by R) is illustrated in the following sequence diagram: ``` A R B C | | | | | FAIL | | | |----->| AUTH R.| | | |------->| | | FAIL | | | |-------------->| | | | AUTH R.| | | |-------------->| | FAIL | | | |--------------------->| ``` ### Details This is the how the failover is initiated, with new steps marked with **(new)**: 1. A majority of primaries have marked another primary as PFAIL. 2. Some nodes counts failure reports and marks the failing primary as FAIL. The node that detects FAIL broadcasts it to all nodes in the cluster. 3. When a replica receives FAIL (or detects FAIL itself by counting PFAIL reports) it schedules a failover: a. It sets a timeout (500ms + random 0-500ms). b. It broadcasts pong to the other replicas in the same shard. c. **(new)** The pong (actually the clusterMsg header) has a new flag `CLUSTER_NODE_MY_PRIMARY_FAIL`. When the replicas broadcast pong to each other here, this flag is set. 4. **(new)** When the following conditions are met, skip the remaining delay and start the failover using AUTH REQUEST with the FORCE ACK flag set, that is if a. a PONG is received from every other replica in the same shard (broadcast within the shard) and b. all replicas have marked that its primary is FAIL in their last message (the new `CLUSTER_NODE_MY_PRIMARY_FAIL` flag is set) and c. this is the best replica (rank = 0) and d. my replication offset != 0. 5. When the delay has passed and no other replica has initiated failover, then initiate failover. Notes: * With 3(c), we don't need to wait for FAIL to propagate to all voting primaries. At this point, a FAIL has already been broadcast by some node, but there is a race so our auth request may arrive to some node before the FAIL. Using the FORCE ACK flag ensures the primaries will vote for us. (It is equivalent to broacasting another FAIL just before broadcasting auth request.) * 4(b) ensures that we have received the replication offset from all other replicas and that it's up to date. If a replica says that it's primary is failing, it also means that the replication from the primary to that replica has stopped. * 4(c) is to avoid a special bad case. It can happen that not all replicas know about each other. In this case, two replicas can think they are both the best replica and start the failover at the same time. This can already happen without this PR. When it happens, it usually means that a new replica has just joined and it has no data (offset = 0) and if it wins the election, there is a problem of dataloss (discussed and partially mitigated for the replica migration case in #885). To avoid this case, skip this fast failover path if the replica has offset = 0. --------- Signed-off-by: Binbin <binloveplay1314@qq.com> Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
charsyam
pushed a commit
to charsyam/valkey
that referenced
this pull request
Mar 28, 2026
…valkey-io#2227) In valkey-io#2023 (valkey-io#2209, etc.), we are exploring ways to make failover faster, that is, to minimize the delay. When a node is marked as FAIL and before the failover starts, there is a delay of 500-1000ms. The original purpose of this delay: 1. Allow FAIL to propagate to at least a majority of the primaries. This makes sure they will vote when a replica sends failover auth request. 2. Allow replicas to exchange their offsets, so they will have a correct view of their own rank. We want to minimize this delay while ensuring safety. It is useful for example in these cases: 1. If there is only one replica, then we don't need any delay, or 2. If there are more replicas, with a fast network, the replicas can exchange the offsets very quickly and start the failover within a few milliseconds instead of 500-1000ms. In this PR, when we can be sure that the replica is the best ranked replica, we let it initiate a failover immediately and completely remove the delay. ### How to ensure safety? 1. To make sure this replica has the best rank, it only skips the delay if it is sure that it have the best rank and that all replicas in the same shard agree that the primary is failing. A new flag `CLUSTER_NODE_MY_PRIMARY_FAIL` is introduced to indicate that each replica has marked its primary as FAIL. If all replicas say that the primary is failing, we also know that the offset is not updated, because the offset is not incrementing when the primary is failing. We can skip the delay only if we have received a message from all replicas and they all have set this flag. 2. To make sure the primaries will vote even if they didn't receive the FAIL yet, we use the `CLUSTERMSG_FLAG0_FORCEACK` to make sure they will vote. This is equivalent to broadcasting a FAIL message to all primaries before we broadcast the failover auth request (but cheaper). The race between FAIL (broadcast by A) and AUTH REQUEST (broadcast by R) is illustrated in the following sequence diagram: ``` A R B C | | | | | FAIL | | | |----->| AUTH R.| | | |------->| | | FAIL | | | |-------------->| | | | AUTH R.| | | |-------------->| | FAIL | | | |--------------------->| ``` ### Details This is the how the failover is initiated, with new steps marked with **(new)**: 1. A majority of primaries have marked another primary as PFAIL. 2. Some nodes counts failure reports and marks the failing primary as FAIL. The node that detects FAIL broadcasts it to all nodes in the cluster. 3. When a replica receives FAIL (or detects FAIL itself by counting PFAIL reports) it schedules a failover: a. It sets a timeout (500ms + random 0-500ms). b. It broadcasts pong to the other replicas in the same shard. c. **(new)** The pong (actually the clusterMsg header) has a new flag `CLUSTER_NODE_MY_PRIMARY_FAIL`. When the replicas broadcast pong to each other here, this flag is set. 4. **(new)** When the following conditions are met, skip the remaining delay and start the failover using AUTH REQUEST with the FORCE ACK flag set, that is if a. a PONG is received from every other replica in the same shard (broadcast within the shard) and b. all replicas have marked that its primary is FAIL in their last message (the new `CLUSTER_NODE_MY_PRIMARY_FAIL` flag is set) and c. this is the best replica (rank = 0) and d. my replication offset != 0. 5. When the delay has passed and no other replica has initiated failover, then initiate failover. Notes: * With 3(c), we don't need to wait for FAIL to propagate to all voting primaries. At this point, a FAIL has already been broadcast by some node, but there is a race so our auth request may arrive to some node before the FAIL. Using the FORCE ACK flag ensures the primaries will vote for us. (It is equivalent to broacasting another FAIL just before broadcasting auth request.) * 4(b) ensures that we have received the replication offset from all other replicas and that it's up to date. If a replica says that it's primary is failing, it also means that the replication from the primary to that replica has stopped. * 4(c) is to avoid a special bad case. It can happen that not all replicas know about each other. In this case, two replicas can think they are both the best replica and start the failover at the same time. This can already happen without this PR. When it happens, it usually means that a new replica has just joined and it has no data (offset = 0) and if it wins the election, there is a problem of dataloss (discussed and partially mitigated for the replica migration case in valkey-io#885). To avoid this case, skip this fast failover path if the replica has offset = 0. --------- Signed-off-by: Binbin <binloveplay1314@qq.com> Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
sarthakaggarwal97
pushed a commit
to sarthakaggarwal97/valkey
that referenced
this pull request
Apr 16, 2026
…valkey-io#2227) In valkey-io#2023 (valkey-io#2209, etc.), we are exploring ways to make failover faster, that is, to minimize the delay. When a node is marked as FAIL and before the failover starts, there is a delay of 500-1000ms. The original purpose of this delay: 1. Allow FAIL to propagate to at least a majority of the primaries. This makes sure they will vote when a replica sends failover auth request. 2. Allow replicas to exchange their offsets, so they will have a correct view of their own rank. We want to minimize this delay while ensuring safety. It is useful for example in these cases: 1. If there is only one replica, then we don't need any delay, or 2. If there are more replicas, with a fast network, the replicas can exchange the offsets very quickly and start the failover within a few milliseconds instead of 500-1000ms. In this PR, when we can be sure that the replica is the best ranked replica, we let it initiate a failover immediately and completely remove the delay. ### How to ensure safety? 1. To make sure this replica has the best rank, it only skips the delay if it is sure that it have the best rank and that all replicas in the same shard agree that the primary is failing. A new flag `CLUSTER_NODE_MY_PRIMARY_FAIL` is introduced to indicate that each replica has marked its primary as FAIL. If all replicas say that the primary is failing, we also know that the offset is not updated, because the offset is not incrementing when the primary is failing. We can skip the delay only if we have received a message from all replicas and they all have set this flag. 2. To make sure the primaries will vote even if they didn't receive the FAIL yet, we use the `CLUSTERMSG_FLAG0_FORCEACK` to make sure they will vote. This is equivalent to broadcasting a FAIL message to all primaries before we broadcast the failover auth request (but cheaper). The race between FAIL (broadcast by A) and AUTH REQUEST (broadcast by R) is illustrated in the following sequence diagram: ``` A R B C | | | | | FAIL | | | |----->| AUTH R.| | | |------->| | | FAIL | | | |-------------->| | | | AUTH R.| | | |-------------->| | FAIL | | | |--------------------->| ``` ### Details This is the how the failover is initiated, with new steps marked with **(new)**: 1. A majority of primaries have marked another primary as PFAIL. 2. Some nodes counts failure reports and marks the failing primary as FAIL. The node that detects FAIL broadcasts it to all nodes in the cluster. 3. When a replica receives FAIL (or detects FAIL itself by counting PFAIL reports) it schedules a failover: a. It sets a timeout (500ms + random 0-500ms). b. It broadcasts pong to the other replicas in the same shard. c. **(new)** The pong (actually the clusterMsg header) has a new flag `CLUSTER_NODE_MY_PRIMARY_FAIL`. When the replicas broadcast pong to each other here, this flag is set. 4. **(new)** When the following conditions are met, skip the remaining delay and start the failover using AUTH REQUEST with the FORCE ACK flag set, that is if a. a PONG is received from every other replica in the same shard (broadcast within the shard) and b. all replicas have marked that its primary is FAIL in their last message (the new `CLUSTER_NODE_MY_PRIMARY_FAIL` flag is set) and c. this is the best replica (rank = 0) and d. my replication offset != 0. 5. When the delay has passed and no other replica has initiated failover, then initiate failover. Notes: * With 3(c), we don't need to wait for FAIL to propagate to all voting primaries. At this point, a FAIL has already been broadcast by some node, but there is a race so our auth request may arrive to some node before the FAIL. Using the FORCE ACK flag ensures the primaries will vote for us. (It is equivalent to broacasting another FAIL just before broadcasting auth request.) * 4(b) ensures that we have received the replication offset from all other replicas and that it's up to date. If a replica says that it's primary is failing, it also means that the replication from the primary to that replica has stopped. * 4(c) is to avoid a special bad case. It can happen that not all replicas know about each other. In this case, two replicas can think they are both the best replica and start the failover at the same time. This can already happen without this PR. When it happens, it usually means that a new replica has just joined and it has no data (offset = 0) and if it wins the election, there is a problem of dataloss (discussed and partially mitigated for the replica migration case in valkey-io#885). To avoid this case, skip this fast failover path if the replica has offset = 0. --------- Signed-off-by: Binbin <binloveplay1314@qq.com> Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
madolson
pushed a commit
that referenced
this pull request
Apr 27, 2026
…#2227) In #2023 (#2209, etc.), we are exploring ways to make failover faster, that is, to minimize the delay. When a node is marked as FAIL and before the failover starts, there is a delay of 500-1000ms. The original purpose of this delay: 1. Allow FAIL to propagate to at least a majority of the primaries. This makes sure they will vote when a replica sends failover auth request. 2. Allow replicas to exchange their offsets, so they will have a correct view of their own rank. We want to minimize this delay while ensuring safety. It is useful for example in these cases: 1. If there is only one replica, then we don't need any delay, or 2. If there are more replicas, with a fast network, the replicas can exchange the offsets very quickly and start the failover within a few milliseconds instead of 500-1000ms. In this PR, when we can be sure that the replica is the best ranked replica, we let it initiate a failover immediately and completely remove the delay. ### How to ensure safety? 1. To make sure this replica has the best rank, it only skips the delay if it is sure that it have the best rank and that all replicas in the same shard agree that the primary is failing. A new flag `CLUSTER_NODE_MY_PRIMARY_FAIL` is introduced to indicate that each replica has marked its primary as FAIL. If all replicas say that the primary is failing, we also know that the offset is not updated, because the offset is not incrementing when the primary is failing. We can skip the delay only if we have received a message from all replicas and they all have set this flag. 2. To make sure the primaries will vote even if they didn't receive the FAIL yet, we use the `CLUSTERMSG_FLAG0_FORCEACK` to make sure they will vote. This is equivalent to broadcasting a FAIL message to all primaries before we broadcast the failover auth request (but cheaper). The race between FAIL (broadcast by A) and AUTH REQUEST (broadcast by R) is illustrated in the following sequence diagram: ``` A R B C | | | | | FAIL | | | |----->| AUTH R.| | | |------->| | | FAIL | | | |-------------->| | | | AUTH R.| | | |-------------->| | FAIL | | | |--------------------->| ``` ### Details This is the how the failover is initiated, with new steps marked with **(new)**: 1. A majority of primaries have marked another primary as PFAIL. 2. Some nodes counts failure reports and marks the failing primary as FAIL. The node that detects FAIL broadcasts it to all nodes in the cluster. 3. When a replica receives FAIL (or detects FAIL itself by counting PFAIL reports) it schedules a failover: a. It sets a timeout (500ms + random 0-500ms). b. It broadcasts pong to the other replicas in the same shard. c. **(new)** The pong (actually the clusterMsg header) has a new flag `CLUSTER_NODE_MY_PRIMARY_FAIL`. When the replicas broadcast pong to each other here, this flag is set. 4. **(new)** When the following conditions are met, skip the remaining delay and start the failover using AUTH REQUEST with the FORCE ACK flag set, that is if a. a PONG is received from every other replica in the same shard (broadcast within the shard) and b. all replicas have marked that its primary is FAIL in their last message (the new `CLUSTER_NODE_MY_PRIMARY_FAIL` flag is set) and c. this is the best replica (rank = 0) and d. my replication offset != 0. 5. When the delay has passed and no other replica has initiated failover, then initiate failover. Notes: * With 3(c), we don't need to wait for FAIL to propagate to all voting primaries. At this point, a FAIL has already been broadcast by some node, but there is a race so our auth request may arrive to some node before the FAIL. Using the FORCE ACK flag ensures the primaries will vote for us. (It is equivalent to broacasting another FAIL just before broadcasting auth request.) * 4(b) ensures that we have received the replication offset from all other replicas and that it's up to date. If a replica says that it's primary is failing, it also means that the replication from the primary to that replica has stopped. * 4(c) is to avoid a special bad case. It can happen that not all replicas know about each other. In this case, two replicas can think they are both the best replica and start the failover at the same time. This can already happen without this PR. When it happens, it usually means that a new replica has just joined and it has no data (offset = 0) and if it wins the election, there is a problem of dataloss (discussed and partially mitigated for the replica migration case in #885). To avoid this case, skip this fast failover path if the replica has offset = 0. --------- Signed-off-by: Binbin <binloveplay1314@qq.com> Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Our current replica can initiate a failover without restriction when
it detects that the primary node is offline. This is generally not a
problem. However, consider the following scenarios:
In slot migration, a primary loses its last slot and then becomes
a replica. When it is fully synchronized with the new primary, the new
primary downs.
In CLUSTER REPLICATE command, a replica becomes a replica of another
primary. When it is fully synchronized with the new primary, the new
primary downs.
In the above scenario, case 1 may cause the empty primary to be elected
as the new primary, resulting in primary data loss. Case 2 may cause the
non-empty replica to be elected as the new primary, resulting in data
loss and confusion.
The reason is that we have cached primary logic, which is used for psync.
In the above scenario, when clusterSetPrimary is called, myself will cache
server.primary in server.cached_primary for psync. In replicationGetReplicaOffset,
we get server.cached_primary->reploff for offset, gossip it and rank it,
which causes the replica to use the old historical offset to initiate
failover, and it get a good rank, initiates election first, and then is
elected as the new primary.
The main problem here is that when the replica has not completed full
sync, it may get the historical offset in replicationGetReplicaOffset.
The fix is to clear cached_primary in these places where full sync is
obviously needed, and let the replica use offset == 0 to participate
in the election. In this way, this unhealthy replica has a worse rank
and is not easy to be elected.
Of course, it is possible that it will be elected with offset == 0.
In the future, we may need to prohibit the replica with offset == 0
from having the right to initiate elections.
Another point worth mentioning, in above cases:
offset will be -1.
be online, and the offset will be the old cached value (which is wrong).
and the offset will be 0.