Skip to content

Fix data loss when replica do a failover with a old history repl offset#885

Merged
PingXie merged 5 commits into
valkey-io:unstablefrom
enjoy-binbin:fix_replica_migration_data_loss
Aug 21, 2024
Merged

Fix data loss when replica do a failover with a old history repl offset#885
PingXie merged 5 commits into
valkey-io:unstablefrom
enjoy-binbin:fix_replica_migration_data_loss

Conversation

@enjoy-binbin

@enjoy-binbin enjoy-binbin commented Aug 11, 2024

Copy link
Copy Markdown
Member

Our current replica can initiate a failover without restriction when
it detects that the primary node is offline. This is generally not a
problem. However, consider the following scenarios:

  1. In slot migration, a primary loses its last slot and then becomes
    a replica. When it is fully synchronized with the new primary, the new
    primary downs.

  2. In CLUSTER REPLICATE command, a replica becomes a replica of another
    primary. When it is fully synchronized with the new primary, the new
    primary downs.

In the above scenario, case 1 may cause the empty primary to be elected
as the new primary, resulting in primary data loss. Case 2 may cause the
non-empty replica to be elected as the new primary, resulting in data
loss and confusion.

The reason is that we have cached primary logic, which is used for psync.
In the above scenario, when clusterSetPrimary is called, myself will cache
server.primary in server.cached_primary for psync. In replicationGetReplicaOffset,
we get server.cached_primary->reploff for offset, gossip it and rank it,
which causes the replica to use the old historical offset to initiate
failover, and it get a good rank, initiates election first, and then is
elected as the new primary.

The main problem here is that when the replica has not completed full
sync, it may get the historical offset in replicationGetReplicaOffset.

The fix is to clear cached_primary in these places where full sync is
obviously needed, and let the replica use offset == 0 to participate
in the election. In this way, this unhealthy replica has a worse rank
and is not easy to be elected.

Of course, it is possible that it will be elected with offset == 0.
In the future, we may need to prohibit the replica with offset == 0
from having the right to initiate elections.

Another point worth mentioning, in above cases:

  1. In the ROLE command, the replica status will be handshake, and the
    offset will be -1.
  2. Before this PR, in the CLUSTER SHARD command, the replica status will
    be online, and the offset will be the old cached value (which is wrong).
  3. After this PR, in the CLUSTER SHARD, the replica status will be loading,
    and the offset will be 0.

Our current replica can initiate a failover without restriction when
it detects that the primary node is offline. This is generally not a
problem. However, consider the following scenarios:

1. In slot migration, a primary loses its last slot and then becomes
a replica. When it is fully synchronized with the new primary, the new
primary downs.

2. In CLUSTER REPLICATE command, a replica becomes a replica of another
primary. When it is fully synchronized with the new primary, the new
primary downs.

In the above scenario, case 1 may cause the empty primary to be elected
as the new primary, resulting in primary data loss. Case 2 may cause the
non-empty replica to be elected as the new primary, resulting in data
loss and confusion.

The reason is that we have cached primary logic, which is used for psync.
In the above scenario, when clusterSetPrimary is called, myself will cache
server.primary in server.cached_primary for psync. In replicationGetReplicaOffset,
we get server.cached_primary->reploff for offset, gossip it and rank it,
which causes the replica to use the old historical offset to initiate
failover, and it get a good rank, initiates election first, and then is
elected as the new primary.

The main problem here is that when the replica has not completed full
sync, it may get the historical offset in replicationGetReplicaOffset.

The fix is to clear cached_primary in these places where full sync is
obviously needed, and let the replica use offset == 0 to participate
in the election. In this way, this unhealthy replica has a worse rank
and is not easy to be elected.

Of course, it is possible that it will be elected with offset == 0.
In the future, we may need to prohibit the replica with offset == 0
from having the right to initiate elections.

Signed-off-by: Binbin <binloveplay1314@qq.com>

@PingXie PingXie left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a great bug. The fix LGTM overall.

Thanks, Binbin!

Comment thread src/cluster_legacy.c Outdated
Comment thread src/cluster_legacy.c Outdated
Comment thread src/cluster_legacy.c
Comment thread src/replication.c Outdated
Comment thread src/replication.c Outdated
Comment thread tests/unit/cluster/replica_migration.tcl Outdated
Comment thread tests/unit/cluster/replica_migration.tcl Outdated
Comment thread tests/unit/cluster/replica_migration.tcl Outdated
Comment thread tests/unit/cluster/replica_migration.tcl Outdated
Comment thread tests/unit/cluster/replica_migration.tcl
Co-authored-by: Ping Xie <pingxie@google.com>
Signed-off-by: Binbin <binloveplay1314@qq.com>
@enjoy-binbin enjoy-binbin force-pushed the fix_replica_migration_data_loss branch from 2a1707e to f9759b4 Compare August 12, 2024 03:55
@enjoy-binbin

Copy link
Copy Markdown
Member Author

force-push for the DCO.

@codecov

codecov Bot commented Aug 12, 2024

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 83.33333% with 3 lines in your changes missing coverage. Please review.
✅ Project coverage is 70.58%. Comparing base (7424620) to head (b47148d).
⚠️ Report is 848 commits behind head on unstable.

Files with missing lines Patch % Lines
src/cluster_legacy.c 78.57% 3 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##           unstable     #885      +/-   ##
============================================
+ Coverage     70.39%   70.58%   +0.18%     
============================================
  Files           112      112              
  Lines         61465    61509      +44     
============================================
+ Hits          43271    43417     +146     
+ Misses        18194    18092     -102     
Files with missing lines Coverage Δ
src/replication.c 87.02% <100.00%> (-0.11%) ⬇️
src/server.h 100.00% <ø> (ø)
src/cluster_legacy.c 85.58% <78.57%> (+0.06%) ⬆️

... and 22 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Signed-off-by: Binbin <binloveplay1314@qq.com>
@madolson madolson requested a review from ranshid August 19, 2024 15:06

@ranshid ranshid left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall LGTM - minor test comment

Comment thread tests/unit/cluster/replica_migration.tcl Outdated
Comment thread tests/unit/cluster/replica_migration.tcl Outdated
Signed-off-by: Binbin <binloveplay1314@qq.com>

@PingXie PingXie left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Nits only. Thanks Binbin.

Comment thread tests/unit/cluster/replica_migration.tcl Outdated
Comment thread tests/unit/cluster/replica_migration.tcl
Comment thread tests/unit/cluster/replica_migration.tcl Outdated
Comment thread tests/unit/cluster/replica_migration.tcl Outdated
Comment thread tests/unit/cluster/replica_migration.tcl Outdated
Comment thread tests/unit/cluster/replica_migration.tcl Outdated
Comment thread tests/unit/cluster/replica_migration.tcl Outdated
Comment thread tests/unit/cluster/replica_migration.tcl Outdated
Comment thread tests/unit/cluster/replica_migration.tcl Outdated
Comment thread tests/unit/cluster/replica_migration.tcl Outdated
Signed-off-by: Binbin <binloveplay1314@qq.com>
Co-authored-by: Ping Xie <pingxie@google.com>
@enjoy-binbin

Copy link
Copy Markdown
Member Author

@PingXie thanks for the review! i think i took care of all the comments, please take another look.

@PingXie PingXie merged commit 910fd54 into valkey-io:unstable Aug 21, 2024
@enjoy-binbin enjoy-binbin deleted the fix_replica_migration_data_loss branch August 21, 2024 05:13
enjoy-binbin added a commit that referenced this pull request Aug 21, 2024
…et (#885)

Our current replica can initiate a failover without restriction when
it detects that the primary node is offline. This is generally not a
problem. However, consider the following scenarios:

1. In slot migration, a primary loses its last slot and then becomes
a replica. When it is fully synchronized with the new primary, the new
primary downs.

2. In CLUSTER REPLICATE command, a replica becomes a replica of another
primary. When it is fully synchronized with the new primary, the new
primary downs.

In the above scenario, case 1 may cause the empty primary to be elected
as the new primary, resulting in primary data loss. Case 2 may cause the
non-empty replica to be elected as the new primary, resulting in data
loss and confusion.

The reason is that we have cached primary logic, which is used for psync.
In the above scenario, when clusterSetPrimary is called, myself will cache
server.primary in server.cached_primary for psync. In replicationGetReplicaOffset,
we get server.cached_primary->reploff for offset, gossip it and rank it,
which causes the replica to use the old historical offset to initiate
failover, and it get a good rank, initiates election first, and then is
elected as the new primary.

The main problem here is that when the replica has not completed full
sync, it may get the historical offset in replicationGetReplicaOffset.

The fix is to clear cached_primary in these places where full sync is
obviously needed, and let the replica use offset == 0 to participate
in the election. In this way, this unhealthy replica has a worse rank
and is not easy to be elected.

Of course, it is possible that it will be elected with offset == 0.
In the future, we may need to prohibit the replica with offset == 0
from having the right to initiate elections.

Another point worth mentioning, in above cases:
1. In the ROLE command, the replica status will be handshake, and the
offset will be -1.
2. Before this PR, in the CLUSTER SHARD command, the replica status will
be online, and the offset will be the old cached value (which is wrong).
3. After this PR, in the CLUSTER SHARD, the replica status will be loading,
and the offset will be 0.

Signed-off-by: Binbin <binloveplay1314@qq.com>
@enjoy-binbin enjoy-binbin added the release-notes This issue should get a line item in the release notes label Aug 21, 2024
mapleFU pushed a commit to mapleFU/valkey that referenced this pull request Aug 21, 2024
…et (valkey-io#885)

Our current replica can initiate a failover without restriction when
it detects that the primary node is offline. This is generally not a
problem. However, consider the following scenarios:

1. In slot migration, a primary loses its last slot and then becomes
a replica. When it is fully synchronized with the new primary, the new
primary downs.

2. In CLUSTER REPLICATE command, a replica becomes a replica of another
primary. When it is fully synchronized with the new primary, the new
primary downs.

In the above scenario, case 1 may cause the empty primary to be elected
as the new primary, resulting in primary data loss. Case 2 may cause the
non-empty replica to be elected as the new primary, resulting in data
loss and confusion.

The reason is that we have cached primary logic, which is used for psync.
In the above scenario, when clusterSetPrimary is called, myself will cache
server.primary in server.cached_primary for psync. In replicationGetReplicaOffset,
we get server.cached_primary->reploff for offset, gossip it and rank it,
which causes the replica to use the old historical offset to initiate
failover, and it get a good rank, initiates election first, and then is
elected as the new primary.

The main problem here is that when the replica has not completed full
sync, it may get the historical offset in replicationGetReplicaOffset.

The fix is to clear cached_primary in these places where full sync is
obviously needed, and let the replica use offset == 0 to participate
in the election. In this way, this unhealthy replica has a worse rank
and is not easy to be elected.

Of course, it is possible that it will be elected with offset == 0.
In the future, we may need to prohibit the replica with offset == 0
from having the right to initiate elections.

Another point worth mentioning, in above cases:
1. In the ROLE command, the replica status will be handshake, and the
offset will be -1.
2. Before this PR, in the CLUSTER SHARD command, the replica status will
be online, and the offset will be the old cached value (which is wrong).
3. After this PR, in the CLUSTER SHARD, the replica status will be loading,
and the offset will be 0.

Signed-off-by: Binbin <binloveplay1314@qq.com>
Signed-off-by: mwish <maplewish117@gmail.com>
enjoy-binbin added a commit that referenced this pull request Aug 22, 2024
…n CLUSTER REPLICATE (#884)

If n is already myself primary, there is no need to re-establish the
replication connection.

In the past we allow a replica node to reconnect with its primary via
this CLUSTER REPLICATE command, it will use psync. But since #885, we
will assume that a full sync is needed in this case, so if we don't do
this, the replica will always use full sync.

Signed-off-by: Binbin <binloveplay1314@qq.com>
Co-authored-by: Ping Xie <pingxie@google.com>
mapleFU pushed a commit to mapleFU/valkey that referenced this pull request Aug 22, 2024
…et (valkey-io#885)

Our current replica can initiate a failover without restriction when
it detects that the primary node is offline. This is generally not a
problem. However, consider the following scenarios:

1. In slot migration, a primary loses its last slot and then becomes
a replica. When it is fully synchronized with the new primary, the new
primary downs.

2. In CLUSTER REPLICATE command, a replica becomes a replica of another
primary. When it is fully synchronized with the new primary, the new
primary downs.

In the above scenario, case 1 may cause the empty primary to be elected
as the new primary, resulting in primary data loss. Case 2 may cause the
non-empty replica to be elected as the new primary, resulting in data
loss and confusion.

The reason is that we have cached primary logic, which is used for psync.
In the above scenario, when clusterSetPrimary is called, myself will cache
server.primary in server.cached_primary for psync. In replicationGetReplicaOffset,
we get server.cached_primary->reploff for offset, gossip it and rank it,
which causes the replica to use the old historical offset to initiate
failover, and it get a good rank, initiates election first, and then is
elected as the new primary.

The main problem here is that when the replica has not completed full
sync, it may get the historical offset in replicationGetReplicaOffset.

The fix is to clear cached_primary in these places where full sync is
obviously needed, and let the replica use offset == 0 to participate
in the election. In this way, this unhealthy replica has a worse rank
and is not easy to be elected.

Of course, it is possible that it will be elected with offset == 0.
In the future, we may need to prohibit the replica with offset == 0
from having the right to initiate elections.

Another point worth mentioning, in above cases:
1. In the ROLE command, the replica status will be handshake, and the
offset will be -1.
2. Before this PR, in the CLUSTER SHARD command, the replica status will
be online, and the offset will be the old cached value (which is wrong).
3. After this PR, in the CLUSTER SHARD, the replica status will be loading,
and the offset will be 0.

Signed-off-by: Binbin <binloveplay1314@qq.com>
Signed-off-by: mwish <maplewish117@gmail.com>
enjoy-binbin added a commit to enjoy-binbin/valkey that referenced this pull request Aug 23, 2024
In valkey-io#885, we only add a shutdown path, there is another path
is that the server might got hang by slowlog. This PR added
the pause path coverage to cover it.

Signed-off-by: Binbin <binloveplay1314@qq.com>
enjoy-binbin added a commit to enjoy-binbin/valkey that referenced this pull request Aug 26, 2024
…ard_id

In this case, sender is myself's primary, when executing updateShardId,
not only the sender's shard_id is updated, but also the shard_id of
myself is updated, casuing the subsequent areInSameShard check, that is,
the full_sync_required check to fail.

This one follow valkey-io#885 and closes valkey-io#942.

Signed-off-by: Binbin <binloveplay1314@qq.com>
enjoy-binbin added a commit that referenced this pull request Aug 28, 2024
In #885, we only add a shutdown path, there is another path
is that the server might got hang by slowlog. This PR added
the pause path coverage to cover it.

Signed-off-by: Binbin <binloveplay1314@qq.com>
enjoy-binbin added a commit that referenced this pull request Aug 29, 2024
…ard_id (#944)

When reconfiguring sub-replica, there may a case that the sub-replica will
use the old offset and win the election and cause the data loss if the old
primary went down.

In this case, sender is myself's primary, when executing updateShardId,
not only the sender's shard_id is updated, but also the shard_id of
myself is updated, casuing the subsequent areInSameShard check, that is,
the full_sync_required check to fail.

As part of the recent fix of #885, the sub-replica needs to decide whether
a full sync is required or not when switching shards. This shard membership
check is supposed to be done against sub-replica's current shard_id, which
however was lost in this code path. This then leads to sub-replica joining
the other shard with a completely different and incorrect replication history.

This is the only place where replicaof state can be updated on this path
so the most natural fix would be to pull the chain replication reduction
logic into this code block and before the updateShardId call.

This one follow #885 and closes #942.

Signed-off-by: Binbin <binloveplay1314@qq.com>
Co-authored-by: Ping Xie <pingxie@outlook.com>
madolson pushed a commit that referenced this pull request Sep 2, 2024
…et (#885)

Our current replica can initiate a failover without restriction when
it detects that the primary node is offline. This is generally not a
problem. However, consider the following scenarios:

1. In slot migration, a primary loses its last slot and then becomes
a replica. When it is fully synchronized with the new primary, the new
primary downs.

2. In CLUSTER REPLICATE command, a replica becomes a replica of another
primary. When it is fully synchronized with the new primary, the new
primary downs.

In the above scenario, case 1 may cause the empty primary to be elected
as the new primary, resulting in primary data loss. Case 2 may cause the
non-empty replica to be elected as the new primary, resulting in data
loss and confusion.

The reason is that we have cached primary logic, which is used for psync.
In the above scenario, when clusterSetPrimary is called, myself will cache
server.primary in server.cached_primary for psync. In replicationGetReplicaOffset,
we get server.cached_primary->reploff for offset, gossip it and rank it,
which causes the replica to use the old historical offset to initiate
failover, and it get a good rank, initiates election first, and then is
elected as the new primary.

The main problem here is that when the replica has not completed full
sync, it may get the historical offset in replicationGetReplicaOffset.

The fix is to clear cached_primary in these places where full sync is
obviously needed, and let the replica use offset == 0 to participate
in the election. In this way, this unhealthy replica has a worse rank
and is not easy to be elected.

Of course, it is possible that it will be elected with offset == 0.
In the future, we may need to prohibit the replica with offset == 0
from having the right to initiate elections.

Another point worth mentioning, in above cases:
1. In the ROLE command, the replica status will be handshake, and the
offset will be -1.
2. Before this PR, in the CLUSTER SHARD command, the replica status will
be online, and the offset will be the old cached value (which is wrong).
3. After this PR, in the CLUSTER SHARD, the replica status will be loading,
and the offset will be 0.

Signed-off-by: Binbin <binloveplay1314@qq.com>
madolson pushed a commit that referenced this pull request Sep 2, 2024
…n CLUSTER REPLICATE (#884)

If n is already myself primary, there is no need to re-establish the
replication connection.

In the past we allow a replica node to reconnect with its primary via
this CLUSTER REPLICATE command, it will use psync. But since #885, we
will assume that a full sync is needed in this case, so if we don't do
this, the replica will always use full sync.

Signed-off-by: Binbin <binloveplay1314@qq.com>
Co-authored-by: Ping Xie <pingxie@google.com>
madolson pushed a commit that referenced this pull request Sep 2, 2024
In #885, we only add a shutdown path, there is another path
is that the server might got hang by slowlog. This PR added
the pause path coverage to cover it.

Signed-off-by: Binbin <binloveplay1314@qq.com>
madolson pushed a commit that referenced this pull request Sep 2, 2024
…ard_id (#944)

When reconfiguring sub-replica, there may a case that the sub-replica will
use the old offset and win the election and cause the data loss if the old
primary went down.

In this case, sender is myself's primary, when executing updateShardId,
not only the sender's shard_id is updated, but also the shard_id of
myself is updated, casuing the subsequent areInSameShard check, that is,
the full_sync_required check to fail.

As part of the recent fix of #885, the sub-replica needs to decide whether
a full sync is required or not when switching shards. This shard membership
check is supposed to be done against sub-replica's current shard_id, which
however was lost in this code path. This then leads to sub-replica joining
the other shard with a completely different and incorrect replication history.

This is the only place where replicaof state can be updated on this path
so the most natural fix would be to pull the chain replication reduction
logic into this code block and before the updateShardId call.

This one follow #885 and closes #942.

Signed-off-by: Binbin <binloveplay1314@qq.com>
Co-authored-by: Ping Xie <pingxie@outlook.com>
madolson pushed a commit that referenced this pull request Sep 3, 2024
…et (#885)

Our current replica can initiate a failover without restriction when
it detects that the primary node is offline. This is generally not a
problem. However, consider the following scenarios:

1. In slot migration, a primary loses its last slot and then becomes
a replica. When it is fully synchronized with the new primary, the new
primary downs.

2. In CLUSTER REPLICATE command, a replica becomes a replica of another
primary. When it is fully synchronized with the new primary, the new
primary downs.

In the above scenario, case 1 may cause the empty primary to be elected
as the new primary, resulting in primary data loss. Case 2 may cause the
non-empty replica to be elected as the new primary, resulting in data
loss and confusion.

The reason is that we have cached primary logic, which is used for psync.
In the above scenario, when clusterSetPrimary is called, myself will cache
server.primary in server.cached_primary for psync. In replicationGetReplicaOffset,
we get server.cached_primary->reploff for offset, gossip it and rank it,
which causes the replica to use the old historical offset to initiate
failover, and it get a good rank, initiates election first, and then is
elected as the new primary.

The main problem here is that when the replica has not completed full
sync, it may get the historical offset in replicationGetReplicaOffset.

The fix is to clear cached_primary in these places where full sync is
obviously needed, and let the replica use offset == 0 to participate
in the election. In this way, this unhealthy replica has a worse rank
and is not easy to be elected.

Of course, it is possible that it will be elected with offset == 0.
In the future, we may need to prohibit the replica with offset == 0
from having the right to initiate elections.

Another point worth mentioning, in above cases:
1. In the ROLE command, the replica status will be handshake, and the
offset will be -1.
2. Before this PR, in the CLUSTER SHARD command, the replica status will
be online, and the offset will be the old cached value (which is wrong).
3. After this PR, in the CLUSTER SHARD, the replica status will be loading,
and the offset will be 0.

Signed-off-by: Binbin <binloveplay1314@qq.com>
madolson pushed a commit that referenced this pull request Sep 3, 2024
…n CLUSTER REPLICATE (#884)

If n is already myself primary, there is no need to re-establish the
replication connection.

In the past we allow a replica node to reconnect with its primary via
this CLUSTER REPLICATE command, it will use psync. But since #885, we
will assume that a full sync is needed in this case, so if we don't do
this, the replica will always use full sync.

Signed-off-by: Binbin <binloveplay1314@qq.com>
Co-authored-by: Ping Xie <pingxie@google.com>
madolson pushed a commit that referenced this pull request Sep 3, 2024
In #885, we only add a shutdown path, there is another path
is that the server might got hang by slowlog. This PR added
the pause path coverage to cover it.

Signed-off-by: Binbin <binloveplay1314@qq.com>
madolson pushed a commit that referenced this pull request Sep 3, 2024
…ard_id (#944)

When reconfiguring sub-replica, there may a case that the sub-replica will
use the old offset and win the election and cause the data loss if the old
primary went down.

In this case, sender is myself's primary, when executing updateShardId,
not only the sender's shard_id is updated, but also the shard_id of
myself is updated, casuing the subsequent areInSameShard check, that is,
the full_sync_required check to fail.

As part of the recent fix of #885, the sub-replica needs to decide whether
a full sync is required or not when switching shards. This shard membership
check is supposed to be done against sub-replica's current shard_id, which
however was lost in this code path. This then leads to sub-replica joining
the other shard with a completely different and incorrect replication history.

This is the only place where replicaof state can be updated on this path
so the most natural fix would be to pull the chain replication reduction
logic into this code block and before the updateShardId call.

This one follow #885 and closes #942.

Signed-off-by: Binbin <binloveplay1314@qq.com>
Co-authored-by: Ping Xie <pingxie@outlook.com>
PingXie pushed a commit to PingXie/valkey that referenced this pull request Sep 14, 2024
…et (valkey-io#885)

Our current replica can initiate a failover without restriction when
it detects that the primary node is offline. This is generally not a
problem. However, consider the following scenarios:

1. In slot migration, a primary loses its last slot and then becomes
a replica. When it is fully synchronized with the new primary, the new
primary downs.

2. In CLUSTER REPLICATE command, a replica becomes a replica of another
primary. When it is fully synchronized with the new primary, the new
primary downs.

In the above scenario, case 1 may cause the empty primary to be elected
as the new primary, resulting in primary data loss. Case 2 may cause the
non-empty replica to be elected as the new primary, resulting in data
loss and confusion.

The reason is that we have cached primary logic, which is used for psync.
In the above scenario, when clusterSetPrimary is called, myself will cache
server.primary in server.cached_primary for psync. In replicationGetReplicaOffset,
we get server.cached_primary->reploff for offset, gossip it and rank it,
which causes the replica to use the old historical offset to initiate
failover, and it get a good rank, initiates election first, and then is
elected as the new primary.

The main problem here is that when the replica has not completed full
sync, it may get the historical offset in replicationGetReplicaOffset.

The fix is to clear cached_primary in these places where full sync is
obviously needed, and let the replica use offset == 0 to participate
in the election. In this way, this unhealthy replica has a worse rank
and is not easy to be elected.

Of course, it is possible that it will be elected with offset == 0.
In the future, we may need to prohibit the replica with offset == 0
from having the right to initiate elections.

Another point worth mentioning, in above cases:
1. In the ROLE command, the replica status will be handshake, and the
offset will be -1.
2. Before this PR, in the CLUSTER SHARD command, the replica status will
be online, and the offset will be the old cached value (which is wrong).
3. After this PR, in the CLUSTER SHARD, the replica status will be loading,
and the offset will be 0.

Signed-off-by: Binbin <binloveplay1314@qq.com>
Signed-off-by: Ping Xie <pingxie@google.com>
PingXie pushed a commit to PingXie/valkey that referenced this pull request Sep 14, 2024
…n CLUSTER REPLICATE (valkey-io#884)

If n is already myself primary, there is no need to re-establish the
replication connection.

In the past we allow a replica node to reconnect with its primary via
this CLUSTER REPLICATE command, it will use psync. But since valkey-io#885, we
will assume that a full sync is needed in this case, so if we don't do
this, the replica will always use full sync.

Signed-off-by: Binbin <binloveplay1314@qq.com>
Co-authored-by: Ping Xie <pingxie@google.com>
PingXie pushed a commit to PingXie/valkey that referenced this pull request Sep 14, 2024
In valkey-io#885, we only add a shutdown path, there is another path
is that the server might got hang by slowlog. This PR added
the pause path coverage to cover it.

Signed-off-by: Binbin <binloveplay1314@qq.com>
Signed-off-by: Ping Xie <pingxie@google.com>
PingXie pushed a commit to PingXie/valkey that referenced this pull request Sep 14, 2024
…ard_id (valkey-io#944)

When reconfiguring sub-replica, there may a case that the sub-replica will
use the old offset and win the election and cause the data loss if the old
primary went down.

In this case, sender is myself's primary, when executing updateShardId,
not only the sender's shard_id is updated, but also the shard_id of
myself is updated, casuing the subsequent areInSameShard check, that is,
the full_sync_required check to fail.

As part of the recent fix of valkey-io#885, the sub-replica needs to decide whether
a full sync is required or not when switching shards. This shard membership
check is supposed to be done against sub-replica's current shard_id, which
however was lost in this code path. This then leads to sub-replica joining
the other shard with a completely different and incorrect replication history.

This is the only place where replicaof state can be updated on this path
so the most natural fix would be to pull the chain replication reduction
logic into this code block and before the updateShardId call.

This one follow valkey-io#885 and closes valkey-io#942.

Signed-off-by: Binbin <binloveplay1314@qq.com>
Co-authored-by: Ping Xie <pingxie@google.com>
PingXie pushed a commit to PingXie/valkey that referenced this pull request Sep 14, 2024
…et (valkey-io#885)

Our current replica can initiate a failover without restriction when
it detects that the primary node is offline. This is generally not a
problem. However, consider the following scenarios:

1. In slot migration, a primary loses its last slot and then becomes
a replica. When it is fully synchronized with the new primary, the new
primary downs.

2. In CLUSTER REPLICATE command, a replica becomes a replica of another
primary. When it is fully synchronized with the new primary, the new
primary downs.

In the above scenario, case 1 may cause the empty primary to be elected
as the new primary, resulting in primary data loss. Case 2 may cause the
non-empty replica to be elected as the new primary, resulting in data
loss and confusion.

The reason is that we have cached primary logic, which is used for psync.
In the above scenario, when clusterSetPrimary is called, myself will cache
server.primary in server.cached_primary for psync. In replicationGetReplicaOffset,
we get server.cached_primary->reploff for offset, gossip it and rank it,
which causes the replica to use the old historical offset to initiate
failover, and it get a good rank, initiates election first, and then is
elected as the new primary.

The main problem here is that when the replica has not completed full
sync, it may get the historical offset in replicationGetReplicaOffset.

The fix is to clear cached_primary in these places where full sync is
obviously needed, and let the replica use offset == 0 to participate
in the election. In this way, this unhealthy replica has a worse rank
and is not easy to be elected.

Of course, it is possible that it will be elected with offset == 0.
In the future, we may need to prohibit the replica with offset == 0
from having the right to initiate elections.

Another point worth mentioning, in above cases:
1. In the ROLE command, the replica status will be handshake, and the
offset will be -1.
2. Before this PR, in the CLUSTER SHARD command, the replica status will
be online, and the offset will be the old cached value (which is wrong).
3. After this PR, in the CLUSTER SHARD, the replica status will be loading,
and the offset will be 0.

Signed-off-by: Binbin <binloveplay1314@qq.com>
Signed-off-by: Ping Xie <pingxie@google.com>
PingXie pushed a commit to PingXie/valkey that referenced this pull request Sep 14, 2024
…n CLUSTER REPLICATE (valkey-io#884)

If n is already myself primary, there is no need to re-establish the
replication connection.

In the past we allow a replica node to reconnect with its primary via
this CLUSTER REPLICATE command, it will use psync. But since valkey-io#885, we
will assume that a full sync is needed in this case, so if we don't do
this, the replica will always use full sync.

Signed-off-by: Binbin <binloveplay1314@qq.com>
Co-authored-by: Ping Xie <pingxie@google.com>
PingXie pushed a commit to PingXie/valkey that referenced this pull request Sep 14, 2024
In valkey-io#885, we only add a shutdown path, there is another path
is that the server might got hang by slowlog. This PR added
the pause path coverage to cover it.

Signed-off-by: Binbin <binloveplay1314@qq.com>
Signed-off-by: Ping Xie <pingxie@google.com>
PingXie pushed a commit to PingXie/valkey that referenced this pull request Sep 14, 2024
…ard_id (valkey-io#944)

When reconfiguring sub-replica, there may a case that the sub-replica will
use the old offset and win the election and cause the data loss if the old
primary went down.

In this case, sender is myself's primary, when executing updateShardId,
not only the sender's shard_id is updated, but also the shard_id of
myself is updated, casuing the subsequent areInSameShard check, that is,
the full_sync_required check to fail.

As part of the recent fix of valkey-io#885, the sub-replica needs to decide whether
a full sync is required or not when switching shards. This shard membership
check is supposed to be done against sub-replica's current shard_id, which
however was lost in this code path. This then leads to sub-replica joining
the other shard with a completely different and incorrect replication history.

This is the only place where replicaof state can be updated on this path
so the most natural fix would be to pull the chain replication reduction
logic into this code block and before the updateShardId call.

This one follow valkey-io#885 and closes valkey-io#942.

Signed-off-by: Binbin <binloveplay1314@qq.com>
Co-authored-by: Ping Xie <pingxie@google.com>
PingXie pushed a commit to PingXie/valkey that referenced this pull request Sep 14, 2024
…et (valkey-io#885)

Our current replica can initiate a failover without restriction when
it detects that the primary node is offline. This is generally not a
problem. However, consider the following scenarios:

1. In slot migration, a primary loses its last slot and then becomes
a replica. When it is fully synchronized with the new primary, the new
primary downs.

2. In CLUSTER REPLICATE command, a replica becomes a replica of another
primary. When it is fully synchronized with the new primary, the new
primary downs.

In the above scenario, case 1 may cause the empty primary to be elected
as the new primary, resulting in primary data loss. Case 2 may cause the
non-empty replica to be elected as the new primary, resulting in data
loss and confusion.

The reason is that we have cached primary logic, which is used for psync.
In the above scenario, when clusterSetPrimary is called, myself will cache
server.primary in server.cached_primary for psync. In replicationGetReplicaOffset,
we get server.cached_primary->reploff for offset, gossip it and rank it,
which causes the replica to use the old historical offset to initiate
failover, and it get a good rank, initiates election first, and then is
elected as the new primary.

The main problem here is that when the replica has not completed full
sync, it may get the historical offset in replicationGetReplicaOffset.

The fix is to clear cached_primary in these places where full sync is
obviously needed, and let the replica use offset == 0 to participate
in the election. In this way, this unhealthy replica has a worse rank
and is not easy to be elected.

Of course, it is possible that it will be elected with offset == 0.
In the future, we may need to prohibit the replica with offset == 0
from having the right to initiate elections.

Another point worth mentioning, in above cases:
1. In the ROLE command, the replica status will be handshake, and the
offset will be -1.
2. Before this PR, in the CLUSTER SHARD command, the replica status will
be online, and the offset will be the old cached value (which is wrong).
3. After this PR, in the CLUSTER SHARD, the replica status will be loading,
and the offset will be 0.

Signed-off-by: Binbin <binloveplay1314@qq.com>
Signed-off-by: Ping Xie <pingxie@google.com>
PingXie pushed a commit to PingXie/valkey that referenced this pull request Sep 14, 2024
…n CLUSTER REPLICATE (valkey-io#884)

If n is already myself primary, there is no need to re-establish the
replication connection.

In the past we allow a replica node to reconnect with its primary via
this CLUSTER REPLICATE command, it will use psync. But since valkey-io#885, we
will assume that a full sync is needed in this case, so if we don't do
this, the replica will always use full sync.

Signed-off-by: Binbin <binloveplay1314@qq.com>
Co-authored-by: Ping Xie <pingxie@google.com>
PingXie pushed a commit to PingXie/valkey that referenced this pull request Sep 14, 2024
In valkey-io#885, we only add a shutdown path, there is another path
is that the server might got hang by slowlog. This PR added
the pause path coverage to cover it.

Signed-off-by: Binbin <binloveplay1314@qq.com>
Signed-off-by: Ping Xie <pingxie@google.com>
PingXie pushed a commit to PingXie/valkey that referenced this pull request Sep 14, 2024
…ard_id (valkey-io#944)

When reconfiguring sub-replica, there may a case that the sub-replica will
use the old offset and win the election and cause the data loss if the old
primary went down.

In this case, sender is myself's primary, when executing updateShardId,
not only the sender's shard_id is updated, but also the shard_id of
myself is updated, casuing the subsequent areInSameShard check, that is,
the full_sync_required check to fail.

As part of the recent fix of valkey-io#885, the sub-replica needs to decide whether
a full sync is required or not when switching shards. This shard membership
check is supposed to be done against sub-replica's current shard_id, which
however was lost in this code path. This then leads to sub-replica joining
the other shard with a completely different and incorrect replication history.

This is the only place where replicaof state can be updated on this path
so the most natural fix would be to pull the chain replication reduction
logic into this code block and before the updateShardId call.

This one follow valkey-io#885 and closes valkey-io#942.

Signed-off-by: Binbin <binloveplay1314@qq.com>
Co-authored-by: Ping Xie <pingxie@google.com>
PingXie pushed a commit to PingXie/valkey that referenced this pull request Sep 14, 2024
…ard_id (valkey-io#944)

When reconfiguring sub-replica, there may a case that the sub-replica will
use the old offset and win the election and cause the data loss if the old
primary went down.

In this case, sender is myself's primary, when executing updateShardId,
not only the sender's shard_id is updated, but also the shard_id of
myself is updated, casuing the subsequent areInSameShard check, that is,
the full_sync_required check to fail.

As part of the recent fix of valkey-io#885, the sub-replica needs to decide whether
a full sync is required or not when switching shards. This shard membership
check is supposed to be done against sub-replica's current shard_id, which
however was lost in this code path. This then leads to sub-replica joining
the other shard with a completely different and incorrect replication history.

This is the only place where replicaof state can be updated on this path
so the most natural fix would be to pull the chain replication reduction
logic into this code block and before the updateShardId call.

This one follow valkey-io#885 and closes valkey-io#942.

Signed-off-by: Binbin <binloveplay1314@qq.com>
Co-authored-by: Ping Xie <pingxie@google.com>
Signed-off-by: Ping Xie <pingxie@google.com>
zuiderkwast added a commit that referenced this pull request Mar 26, 2026
…#2227)

In #2023 (#2209, etc.), we are exploring ways to make failover faster,
that is, to minimize the delay.

When a node is marked as FAIL and before the failover starts, there is a
delay of 500-1000ms. The original purpose of this delay:

1. Allow FAIL to propagate to at least a majority of the primaries. This
   makes sure they will vote when a replica sends failover auth request.
2. Allow replicas to exchange their offsets, so they will have a correct
   view of their own rank.

We want to minimize this delay while ensuring safety. It is useful for
example in these cases:

1. If there is only one replica, then we don't need any delay, or
2. If there are more replicas, with a fast network, the replicas can
   exchange the offsets very quickly and start the failover within a few
   milliseconds instead of 500-1000ms.

In this PR, when we can be sure that the replica is the best ranked
replica, we let
it initiate a failover immediately and completely remove the delay.

### How to ensure safety?

1. To make sure this replica has the best rank, it only skips the delay
   if it is sure that it have the best rank and that all replicas in the
   same shard agree that the primary is failing. A new flag
   `CLUSTER_NODE_MY_PRIMARY_FAIL` is introduced to indicate that each
   replica has marked its primary as FAIL. If all replicas say that the
   primary is failing, we also know that the offset is not updated,
   because the offset is not incrementing when the primary is failing.
   We can skip the delay only if we have received a message from all
   replicas and they all have set this flag.

2. To make sure the primaries will vote even if they didn't receive the
   FAIL yet, we use the `CLUSTERMSG_FLAG0_FORCEACK` to make sure they
   will vote. This is equivalent to broadcasting a FAIL message to all
   primaries before we broadcast the failover auth request (but
   cheaper).

The race between FAIL (broadcast by A) and AUTH REQUEST (broadcast by R)
is illustrated in the following sequence diagram:

```
A      R        B      C
|      |        |      |
| FAIL |        |      |
|----->| AUTH R.|      |
|      |------->|      |
| FAIL |        |      |
|-------------->|      |
|      | AUTH R.|      |
|      |-------------->|
| FAIL |        |      |
|--------------------->|
```

### Details

This is the how the failover is initiated, with new steps marked with
**(new)**:

1. A majority of primaries have marked another primary as PFAIL.

2. Some nodes counts failure reports and marks the failing primary as
   FAIL. The node
   that detects FAIL broadcasts it to all nodes in the cluster.

3. When a replica receives FAIL (or detects FAIL itself by counting
   PFAIL reports) it schedules a failover:
   
   a. It sets a timeout (500ms + random 0-500ms).
   b. It broadcasts pong to the other replicas in the same shard.   
   c. **(new)** The pong (actually the clusterMsg header) has a new flag
      `CLUSTER_NODE_MY_PRIMARY_FAIL`. When the replicas broadcast pong
      to each other here, this flag is set.

4. **(new)** When the following conditions are met, skip the remaining
   delay and start the failover using AUTH REQUEST with the FORCE ACK
   flag set, that is if
   
   a. a PONG is received from every other replica in the same shard
      (broadcast within the shard) and
   b. all replicas have marked that its primary is FAIL in their last
      message (the new `CLUSTER_NODE_MY_PRIMARY_FAIL` flag is set) and
   c. this is the best replica (rank = 0) and
   d. my replication offset != 0.

5. When the delay has passed and no other replica has initiated
   failover, then initiate failover.

Notes:

* With 3(c), we don't need to wait for FAIL to propagate to all voting
  primaries. At this point, a FAIL has already been broadcast by some
  node, but there is a race so our auth request may arrive to some node
  before the FAIL. Using the FORCE ACK flag ensures the primaries will
  vote for us. (It is equivalent to broacasting another FAIL just before
  broadcasting auth request.)
  
* 4(b) ensures that we have received the replication offset from all
  other replicas and that it's up to date. If a replica says that it's
  primary is failing, it also means that the replication from the
  primary to that replica has stopped.

* 4(c) is to avoid a special bad case. It can happen that not all
  replicas know about each other. In this case, two replicas can think
  they are both the best replica and start the failover at the same
  time. This can already happen without this PR. When it happens, it
  usually means that a new replica has just joined and it has no data
  (offset = 0) and if it wins the election, there is a problem of
  dataloss (discussed and partially mitigated for the replica migration
  case in #885). To avoid this case, skip this fast failover path if the
  replica has offset = 0.

---------

Signed-off-by: Binbin <binloveplay1314@qq.com>
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
charsyam pushed a commit to charsyam/valkey that referenced this pull request Mar 28, 2026
…valkey-io#2227)

In valkey-io#2023 (valkey-io#2209, etc.), we are exploring ways to make failover faster,
that is, to minimize the delay.

When a node is marked as FAIL and before the failover starts, there is a
delay of 500-1000ms. The original purpose of this delay:

1. Allow FAIL to propagate to at least a majority of the primaries. This
   makes sure they will vote when a replica sends failover auth request.
2. Allow replicas to exchange their offsets, so they will have a correct
   view of their own rank.

We want to minimize this delay while ensuring safety. It is useful for
example in these cases:

1. If there is only one replica, then we don't need any delay, or
2. If there are more replicas, with a fast network, the replicas can
   exchange the offsets very quickly and start the failover within a few
   milliseconds instead of 500-1000ms.

In this PR, when we can be sure that the replica is the best ranked
replica, we let
it initiate a failover immediately and completely remove the delay.

### How to ensure safety?

1. To make sure this replica has the best rank, it only skips the delay
   if it is sure that it have the best rank and that all replicas in the
   same shard agree that the primary is failing. A new flag
   `CLUSTER_NODE_MY_PRIMARY_FAIL` is introduced to indicate that each
   replica has marked its primary as FAIL. If all replicas say that the
   primary is failing, we also know that the offset is not updated,
   because the offset is not incrementing when the primary is failing.
   We can skip the delay only if we have received a message from all
   replicas and they all have set this flag.

2. To make sure the primaries will vote even if they didn't receive the
   FAIL yet, we use the `CLUSTERMSG_FLAG0_FORCEACK` to make sure they
   will vote. This is equivalent to broadcasting a FAIL message to all
   primaries before we broadcast the failover auth request (but
   cheaper).

The race between FAIL (broadcast by A) and AUTH REQUEST (broadcast by R)
is illustrated in the following sequence diagram:

```
A      R        B      C
|      |        |      |
| FAIL |        |      |
|----->| AUTH R.|      |
|      |------->|      |
| FAIL |        |      |
|-------------->|      |
|      | AUTH R.|      |
|      |-------------->|
| FAIL |        |      |
|--------------------->|
```

### Details

This is the how the failover is initiated, with new steps marked with
**(new)**:

1. A majority of primaries have marked another primary as PFAIL.

2. Some nodes counts failure reports and marks the failing primary as
   FAIL. The node
   that detects FAIL broadcasts it to all nodes in the cluster.

3. When a replica receives FAIL (or detects FAIL itself by counting
   PFAIL reports) it schedules a failover:
   
   a. It sets a timeout (500ms + random 0-500ms).
   b. It broadcasts pong to the other replicas in the same shard.   
   c. **(new)** The pong (actually the clusterMsg header) has a new flag
      `CLUSTER_NODE_MY_PRIMARY_FAIL`. When the replicas broadcast pong
      to each other here, this flag is set.

4. **(new)** When the following conditions are met, skip the remaining
   delay and start the failover using AUTH REQUEST with the FORCE ACK
   flag set, that is if
   
   a. a PONG is received from every other replica in the same shard
      (broadcast within the shard) and
   b. all replicas have marked that its primary is FAIL in their last
      message (the new `CLUSTER_NODE_MY_PRIMARY_FAIL` flag is set) and
   c. this is the best replica (rank = 0) and
   d. my replication offset != 0.

5. When the delay has passed and no other replica has initiated
   failover, then initiate failover.

Notes:

* With 3(c), we don't need to wait for FAIL to propagate to all voting
  primaries. At this point, a FAIL has already been broadcast by some
  node, but there is a race so our auth request may arrive to some node
  before the FAIL. Using the FORCE ACK flag ensures the primaries will
  vote for us. (It is equivalent to broacasting another FAIL just before
  broadcasting auth request.)
  
* 4(b) ensures that we have received the replication offset from all
  other replicas and that it's up to date. If a replica says that it's
  primary is failing, it also means that the replication from the
  primary to that replica has stopped.

* 4(c) is to avoid a special bad case. It can happen that not all
  replicas know about each other. In this case, two replicas can think
  they are both the best replica and start the failover at the same
  time. This can already happen without this PR. When it happens, it
  usually means that a new replica has just joined and it has no data
  (offset = 0) and if it wins the election, there is a problem of
  dataloss (discussed and partially mitigated for the replica migration
  case in valkey-io#885). To avoid this case, skip this fast failover path if the
  replica has offset = 0.

---------

Signed-off-by: Binbin <binloveplay1314@qq.com>
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
sarthakaggarwal97 pushed a commit to sarthakaggarwal97/valkey that referenced this pull request Apr 16, 2026
…valkey-io#2227)

In valkey-io#2023 (valkey-io#2209, etc.), we are exploring ways to make failover faster,
that is, to minimize the delay.

When a node is marked as FAIL and before the failover starts, there is a
delay of 500-1000ms. The original purpose of this delay:

1. Allow FAIL to propagate to at least a majority of the primaries. This
   makes sure they will vote when a replica sends failover auth request.
2. Allow replicas to exchange their offsets, so they will have a correct
   view of their own rank.

We want to minimize this delay while ensuring safety. It is useful for
example in these cases:

1. If there is only one replica, then we don't need any delay, or
2. If there are more replicas, with a fast network, the replicas can
   exchange the offsets very quickly and start the failover within a few
   milliseconds instead of 500-1000ms.

In this PR, when we can be sure that the replica is the best ranked
replica, we let
it initiate a failover immediately and completely remove the delay.

### How to ensure safety?

1. To make sure this replica has the best rank, it only skips the delay
   if it is sure that it have the best rank and that all replicas in the
   same shard agree that the primary is failing. A new flag
   `CLUSTER_NODE_MY_PRIMARY_FAIL` is introduced to indicate that each
   replica has marked its primary as FAIL. If all replicas say that the
   primary is failing, we also know that the offset is not updated,
   because the offset is not incrementing when the primary is failing.
   We can skip the delay only if we have received a message from all
   replicas and they all have set this flag.

2. To make sure the primaries will vote even if they didn't receive the
   FAIL yet, we use the `CLUSTERMSG_FLAG0_FORCEACK` to make sure they
   will vote. This is equivalent to broadcasting a FAIL message to all
   primaries before we broadcast the failover auth request (but
   cheaper).

The race between FAIL (broadcast by A) and AUTH REQUEST (broadcast by R)
is illustrated in the following sequence diagram:

```
A      R        B      C
|      |        |      |
| FAIL |        |      |
|----->| AUTH R.|      |
|      |------->|      |
| FAIL |        |      |
|-------------->|      |
|      | AUTH R.|      |
|      |-------------->|
| FAIL |        |      |
|--------------------->|
```

### Details

This is the how the failover is initiated, with new steps marked with
**(new)**:

1. A majority of primaries have marked another primary as PFAIL.

2. Some nodes counts failure reports and marks the failing primary as
   FAIL. The node
   that detects FAIL broadcasts it to all nodes in the cluster.

3. When a replica receives FAIL (or detects FAIL itself by counting
   PFAIL reports) it schedules a failover:
   
   a. It sets a timeout (500ms + random 0-500ms).
   b. It broadcasts pong to the other replicas in the same shard.   
   c. **(new)** The pong (actually the clusterMsg header) has a new flag
      `CLUSTER_NODE_MY_PRIMARY_FAIL`. When the replicas broadcast pong
      to each other here, this flag is set.

4. **(new)** When the following conditions are met, skip the remaining
   delay and start the failover using AUTH REQUEST with the FORCE ACK
   flag set, that is if
   
   a. a PONG is received from every other replica in the same shard
      (broadcast within the shard) and
   b. all replicas have marked that its primary is FAIL in their last
      message (the new `CLUSTER_NODE_MY_PRIMARY_FAIL` flag is set) and
   c. this is the best replica (rank = 0) and
   d. my replication offset != 0.

5. When the delay has passed and no other replica has initiated
   failover, then initiate failover.

Notes:

* With 3(c), we don't need to wait for FAIL to propagate to all voting
  primaries. At this point, a FAIL has already been broadcast by some
  node, but there is a race so our auth request may arrive to some node
  before the FAIL. Using the FORCE ACK flag ensures the primaries will
  vote for us. (It is equivalent to broacasting another FAIL just before
  broadcasting auth request.)
  
* 4(b) ensures that we have received the replication offset from all
  other replicas and that it's up to date. If a replica says that it's
  primary is failing, it also means that the replication from the
  primary to that replica has stopped.

* 4(c) is to avoid a special bad case. It can happen that not all
  replicas know about each other. In this case, two replicas can think
  they are both the best replica and start the failover at the same
  time. This can already happen without this PR. When it happens, it
  usually means that a new replica has just joined and it has no data
  (offset = 0) and if it wins the election, there is a problem of
  dataloss (discussed and partially mitigated for the replica migration
  case in valkey-io#885). To avoid this case, skip this fast failover path if the
  replica has offset = 0.

---------

Signed-off-by: Binbin <binloveplay1314@qq.com>
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
madolson pushed a commit that referenced this pull request Apr 27, 2026
…#2227)

In #2023 (#2209, etc.), we are exploring ways to make failover faster,
that is, to minimize the delay.

When a node is marked as FAIL and before the failover starts, there is a
delay of 500-1000ms. The original purpose of this delay:

1. Allow FAIL to propagate to at least a majority of the primaries. This
   makes sure they will vote when a replica sends failover auth request.
2. Allow replicas to exchange their offsets, so they will have a correct
   view of their own rank.

We want to minimize this delay while ensuring safety. It is useful for
example in these cases:

1. If there is only one replica, then we don't need any delay, or
2. If there are more replicas, with a fast network, the replicas can
   exchange the offsets very quickly and start the failover within a few
   milliseconds instead of 500-1000ms.

In this PR, when we can be sure that the replica is the best ranked
replica, we let
it initiate a failover immediately and completely remove the delay.

### How to ensure safety?

1. To make sure this replica has the best rank, it only skips the delay
   if it is sure that it have the best rank and that all replicas in the
   same shard agree that the primary is failing. A new flag
   `CLUSTER_NODE_MY_PRIMARY_FAIL` is introduced to indicate that each
   replica has marked its primary as FAIL. If all replicas say that the
   primary is failing, we also know that the offset is not updated,
   because the offset is not incrementing when the primary is failing.
   We can skip the delay only if we have received a message from all
   replicas and they all have set this flag.

2. To make sure the primaries will vote even if they didn't receive the
   FAIL yet, we use the `CLUSTERMSG_FLAG0_FORCEACK` to make sure they
   will vote. This is equivalent to broadcasting a FAIL message to all
   primaries before we broadcast the failover auth request (but
   cheaper).

The race between FAIL (broadcast by A) and AUTH REQUEST (broadcast by R)
is illustrated in the following sequence diagram:

```
A      R        B      C
|      |        |      |
| FAIL |        |      |
|----->| AUTH R.|      |
|      |------->|      |
| FAIL |        |      |
|-------------->|      |
|      | AUTH R.|      |
|      |-------------->|
| FAIL |        |      |
|--------------------->|
```

### Details

This is the how the failover is initiated, with new steps marked with
**(new)**:

1. A majority of primaries have marked another primary as PFAIL.

2. Some nodes counts failure reports and marks the failing primary as
   FAIL. The node
   that detects FAIL broadcasts it to all nodes in the cluster.

3. When a replica receives FAIL (or detects FAIL itself by counting
   PFAIL reports) it schedules a failover:
   
   a. It sets a timeout (500ms + random 0-500ms).
   b. It broadcasts pong to the other replicas in the same shard.   
   c. **(new)** The pong (actually the clusterMsg header) has a new flag
      `CLUSTER_NODE_MY_PRIMARY_FAIL`. When the replicas broadcast pong
      to each other here, this flag is set.

4. **(new)** When the following conditions are met, skip the remaining
   delay and start the failover using AUTH REQUEST with the FORCE ACK
   flag set, that is if
   
   a. a PONG is received from every other replica in the same shard
      (broadcast within the shard) and
   b. all replicas have marked that its primary is FAIL in their last
      message (the new `CLUSTER_NODE_MY_PRIMARY_FAIL` flag is set) and
   c. this is the best replica (rank = 0) and
   d. my replication offset != 0.

5. When the delay has passed and no other replica has initiated
   failover, then initiate failover.

Notes:

* With 3(c), we don't need to wait for FAIL to propagate to all voting
  primaries. At this point, a FAIL has already been broadcast by some
  node, but there is a race so our auth request may arrive to some node
  before the FAIL. Using the FORCE ACK flag ensures the primaries will
  vote for us. (It is equivalent to broacasting another FAIL just before
  broadcasting auth request.)
  
* 4(b) ensures that we have received the replication offset from all
  other replicas and that it's up to date. If a replica says that it's
  primary is failing, it also means that the replication from the
  primary to that replica has stopped.

* 4(c) is to avoid a special bad case. It can happen that not all
  replicas know about each other. In this case, two replicas can think
  they are both the best replica and start the failover at the same
  time. This can already happen without this PR. When it happens, it
  usually means that a new replica has just joined and it has no data
  (offset = 0) and if it wins the election, there is a problem of
  dataloss (discussed and partially mitigated for the replica migration
  case in #885). To avoid this case, skip this fast failover path if the
  replica has offset = 0.

---------

Signed-off-by: Binbin <binloveplay1314@qq.com>
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cluster release-notes This issue should get a line item in the release notes

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

4 participants