Make cluster failover delay relative to node timeout by zuiderkwast · Pull Request #2449 · valkey-io/valkey

zuiderkwast · 2025-08-07T14:36:06Z

In clusters with a very short node timeout such as 2-3 seconds, the extra failover delay of 500-1000 milliseconds (500 + random value 0-500; total 750 on average) before initiating a failover is a significant extra downtime to the cluster. This PR makes this delay relative to node timeout, using a shorter failover delay for a smaller configured node timeout. The formula is fixed_delay = min(500, node_timeout / 30).

Node timeout	Fixed failover delay
15000 or more	500 (same as before)
7500	250
3000	100
1500	50

Additional change: Add an extra 500ms delay to new replicas that may not yet know about the other replicas. This avoids the scenario where a new replica with no data wins the failover. This change turned out to be needed to for the stability of some test cases.

The purposes of the failover delay are

Allow FAIL to propagate to the voting primaries in the cluster
Allow replicas to exchange their offsets, so they will have a correct view of their own rank.

A third (undocumented) purpose of this delay is to allow newly added replicas to discover other replicas in the cluster via gossip and to compute their rank, to realize it's are not the best replica. This case is mitigated by adding another 500ms delay to new replicas, i.e. if it has replication offset 0.

A low node timeout only makes sense in fast networks, so we can assume that the above needs less time than in a cluster with a higher node timeout.

These delays don't affect the correctness of the algorithm. They are just there to increase the probability that a failover will succeed by making sure that the FAIL message has enough time to propagate in the cluster and to the random part is to reduce the probability that two replicas initiates the failover at the same time.

The typical use case is when data consistency matters and writes can't be skipped. For example, in some application, we buffer writes in the application during node failures to be able to apply them when the failover is completed. The application can't buffer them for a very long time, so we need the cluster to be up again within e.g. 5 seconds from the time a node starts to fail.

I hope this PR can be considered safer than #2227, although the two changes are orthogonal.

Part of issue #2023.

Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>

hpatro

This looks safe to me.

Do you think we need the minimum set to 500 milliseconds or just keep everything relative to node timeout?

zuiderkwast · 2025-08-08T21:28:44Z

Do you think we need the minimum set to 500 milliseconds or just keep everything relative to node timeout?

I did use 500 as a maximum, not a minimum. This is for any cluster using node timeout higher than 15 seconds, to not make failover slower than now for them. I don't know if there are any such clustes though and whether it matters.

hpatro · 2025-08-09T00:44:19Z

Do you think we need the minimum set to 500 milliseconds or just keep everything relative to node timeout?

I did use 500 as a maximum, not a minimum. This is for any cluster using node timeout higher than 15 seconds, to not make failover slower than now for them. I don't know if there are any such clustes though and whether it matters.

Yes, I wrote it as verbatim as the code is. We use 15 seconds as the timeout so this change doesn't have any impact on our setup.

I think keeping it relative keeps the logic simpler and the additional delay shouldn't bother them as the time it takes to fail a node is already set to a high value.

enjoy-binbin

We've discussed this a lot before. I am ok with this change.

codecov · 2025-08-11T11:12:02Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 72.00%. Comparing base (7a5c0d0) to head (55cdfed).
⚠️ Report is 1 commits behind head on unstable.

Additional details and impacted files

@@             Coverage Diff              @@
##           unstable    #2449      +/-   ##
============================================
- Coverage     72.01%   72.00%   -0.01%     
============================================
  Files           126      126              
  Lines         70473    70476       +3     
============================================
- Hits          50752    50748       -4     
- Misses        19721    19728       +7

Files with missing lines	Coverage Δ
src/cluster_legacy.c	`87.76% <100.00%> (+0.03%)`	⬆️

... and 8 files with indirect coverage changes

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

…t other replicas yet Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>

Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>

…lative-node-timeout

In clusters with a very short node timeout such as 2-3 seconds, the extra failover delay of 500-1000 milliseconds (500 + random value 0-500; total 750 on average) before initiating a failover is a significant extra downtime to the cluster. This PR makes this delay relative to node timeout, using a shorter failover delay for a smaller configured node timeout. The formula is `fixed_delay = min(500, node_timeout / 30)`. | Node timeout | Fixed failover delay | |---------------|----------------------| | 15000 or more | 500 (same as before) | | 7500 | 250 | | 3000 | 100 | | 1500 | 50 | Additional change: Add an extra 500ms delay to new replicas that may not yet know about the other replicas. This avoids the scenario where a new replica with no data wins the failover. This change turned out to be needed to for the stability of some test cases. The purposes of the failover delay are 1. Allow FAIL to propagate to the voting primaries in the cluster 2. Allow replicas to exchange their offsets, so they will have a correct view of their own rank. A third (undocumented) purpose of this delay is to allow newly added replicas to discover other replicas in the cluster via gossip and to compute their rank, to realize it's are not the best replica. This case is mitigated by adding another 500ms delay to new replicas, i.e. if it has replication offset 0. A low node timeout only makes sense in fast networks, so we can assume that the above needs less time than in a cluster with a higher node timeout. These delays don't affect the correctness of the algorithm. They are just there to increase the probability that a failover will succeed by making sure that the FAIL message has enough time to propagate in the cluster and to the random part is to reduce the probability that two replicas initiates the failover at the same time. The typical use case is when data consistency matters and writes can't be skipped. For example, in some application, we buffer writes in the application during node failures to be able to apply them when the failover is completed. The application can't buffer them for a very long time, so we need the cluster to be up again within e.g. 5 seconds from the time a node starts to fail. I hope this PR can be considered safer than valkey-io#2227, although the two changes are orthogonal. Part of issue valkey-io#2023. --------- Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>

In clusters with a very short node timeout such as 2-3 seconds, the extra failover delay of 500-1000 milliseconds (500 + random value 0-500; total 750 on average) before initiating a failover is a significant extra downtime to the cluster. This PR makes this delay relative to node timeout, using a shorter failover delay for a smaller configured node timeout. The formula is `fixed_delay = min(500, node_timeout / 30)`. | Node timeout | Fixed failover delay | |---------------|----------------------| | 15000 or more | 500 (same as before) | | 7500 | 250 | | 3000 | 100 | | 1500 | 50 | Additional change: Add an extra 500ms delay to new replicas that may not yet know about the other replicas. This avoids the scenario where a new replica with no data wins the failover. This change turned out to be needed to for the stability of some test cases. The purposes of the failover delay are 1. Allow FAIL to propagate to the voting primaries in the cluster 2. Allow replicas to exchange their offsets, so they will have a correct view of their own rank. A third (undocumented) purpose of this delay is to allow newly added replicas to discover other replicas in the cluster via gossip and to compute their rank, to realize it's are not the best replica. This case is mitigated by adding another 500ms delay to new replicas, i.e. if it has replication offset 0. A low node timeout only makes sense in fast networks, so we can assume that the above needs less time than in a cluster with a higher node timeout. These delays don't affect the correctness of the algorithm. They are just there to increase the probability that a failover will succeed by making sure that the FAIL message has enough time to propagate in the cluster and to the random part is to reduce the probability that two replicas initiates the failover at the same time. The typical use case is when data consistency matters and writes can't be skipped. For example, in some application, we buffer writes in the application during node failures to be able to apply them when the failover is completed. The application can't buffer them for a very long time, so we need the cluster to be up again within e.g. 5 seconds from the time a node starts to fail. I hope this PR can be considered safer than #2227, although the two changes are orthogonal. Part of issue #2023. --------- Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>

In clusters with a very short node timeout such as 2-3 seconds, the extra failover delay of 500-1000 milliseconds (500 + random value 0-500; total 750 on average) before initiating a failover is a significant extra downtime to the cluster. This PR makes this delay relative to node timeout, using a shorter failover delay for a smaller configured node timeout. The formula is `fixed_delay = min(500, node_timeout / 30)`. | Node timeout | Fixed failover delay | |---------------|----------------------| | 15000 or more | 500 (same as before) | | 7500 | 250 | | 3000 | 100 | | 1500 | 50 | Additional change: Add an extra 500ms delay to new replicas that may not yet know about the other replicas. This avoids the scenario where a new replica with no data wins the failover. This change turned out to be needed to for the stability of some test cases. The purposes of the failover delay are 1. Allow FAIL to propagate to the voting primaries in the cluster 2. Allow replicas to exchange their offsets, so they will have a correct view of their own rank. A third (undocumented) purpose of this delay is to allow newly added replicas to discover other replicas in the cluster via gossip and to compute their rank, to realize it's are not the best replica. This case is mitigated by adding another 500ms delay to new replicas, i.e. if it has replication offset 0. A low node timeout only makes sense in fast networks, so we can assume that the above needs less time than in a cluster with a higher node timeout. These delays don't affect the correctness of the algorithm. They are just there to increase the probability that a failover will succeed by making sure that the FAIL message has enough time to propagate in the cluster and to the random part is to reduce the probability that two replicas initiates the failover at the same time. The typical use case is when data consistency matters and writes can't be skipped. For example, in some application, we buffer writes in the application during node failures to be able to apply them when the failover is completed. The application can't buffer them for a very long time, so we need the cluster to be up again within e.g. 5 seconds from the time a node starts to fail. I hope this PR can be considered safer than valkey-io#2227, although the two changes are orthogonal. Part of issue valkey-io#2023. --------- Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech> Signed-off-by: Harkrishn Patro <harkrisp@amazon.com>

…#3941) Since #2449 made the failover delay relative to cluster-node-timeout. Now delay = min(cluster-node-timeout / 30, 500), any cluster-node-timeout below 30, including the legal minimum 0 will collapses delay to zero, and `x % 0` is undefined behaviour. Signed-off-by: Binbin <binloveplay1314@qq.com>

Make cluster failover delay relative to node timeout

e8b102e

Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>

zuiderkwast added the run-extra-tests Run extra tests on this PR (Runs all tests from daily except valgrind and RESP) label Aug 7, 2025

zuiderkwast requested review from enjoy-binbin, hpatro and madolson August 7, 2025 14:37

github-advanced-security AI found potential problems Aug 7, 2025

View reviewed changes

Comment thread src/cluster_legacy.c Fixed

Comment thread src/cluster_legacy.c Fixed

Comment thread src/cluster_legacy.c Fixed

Comment thread src/cluster_legacy.c Fixed

Address CodeQL warning

1fb1bc2

Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>

hpatro reviewed Aug 8, 2025

View reviewed changes

enjoy-binbin approved these changes Aug 11, 2025

View reviewed changes

Merge branch 'unstable' into failover-delay-relative-node-timeout

9046967

zuiderkwast added 4 commits August 11, 2025 14:23

Add extra delay for new replicas (repl offset 0) that don't know abou…

a534c9d

…t other replicas yet Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>

Try again (more delay for new replica)

1e522d0

Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>

Another try: delay replica with offset=0 regardless of rank

3e27696

Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>

Merge remote-tracking branch 'valkey/unstable' into failover-delay-re…

caf5f19

…lative-node-timeout

zuiderkwast marked this pull request as ready for review August 14, 2025 13:19

Merge remote-tracking branch 'valkey/unstable' into failover-delay-re…

55cdfed

…lative-node-timeout

zuiderkwast added the release-notes This issue should get a line item in the release notes label Aug 18, 2025

zuiderkwast merged commit 2ca0dd8 into valkey-io:unstable Aug 18, 2025
61 checks passed

zuiderkwast deleted the failover-delay-relative-node-timeout branch August 18, 2025 13:35

enjoy-binbin mentioned this pull request Feb 5, 2026

[TEST-FAILURE] Make sure the replicas always get the different ranks #2683

Closed

enjoy-binbin mentioned this pull request Jun 9, 2026

Avoid random() % 0 undefined behaviour when cluster-node-timeout < 30 #3941

Merged

enjoy-binbin added this to Valkey 9.0 Jun 9, 2026

enjoy-binbin moved this to Done in Valkey 9.0 Jun 9, 2026

github-project-automation Bot moved this from Done to To be backported in Valkey 9.0 Jun 9, 2026

enjoy-binbin moved this from To be backported to Done in Valkey 9.0 Jun 9, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Make cluster failover delay relative to node timeout#2449

Make cluster failover delay relative to node timeout#2449
zuiderkwast merged 8 commits into
valkey-io:unstablefrom
zuiderkwast:failover-delay-relative-node-timeout

zuiderkwast commented Aug 7, 2025 •

edited

Loading

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

hpatro left a comment

Uh oh!

zuiderkwast commented Aug 8, 2025

Uh oh!

hpatro commented Aug 9, 2025

Uh oh!

enjoy-binbin left a comment

Uh oh!

codecov Bot commented Aug 11, 2025 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Uh oh!

Conversation

zuiderkwast commented Aug 7, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

hpatro left a comment

Choose a reason for hiding this comment

Uh oh!

zuiderkwast commented Aug 8, 2025

Uh oh!

hpatro commented Aug 9, 2025

Uh oh!

enjoy-binbin left a comment

Choose a reason for hiding this comment

Uh oh!

codecov Bot commented Aug 11, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

zuiderkwast commented Aug 7, 2025 •

edited

Loading

codecov Bot commented Aug 11, 2025 •

edited

Loading