Skip to content

Make cluster failover delay relative to node timeout#2449

Merged
zuiderkwast merged 8 commits into
valkey-io:unstablefrom
zuiderkwast:failover-delay-relative-node-timeout
Aug 18, 2025
Merged

Make cluster failover delay relative to node timeout#2449
zuiderkwast merged 8 commits into
valkey-io:unstablefrom
zuiderkwast:failover-delay-relative-node-timeout

Conversation

@zuiderkwast

@zuiderkwast zuiderkwast commented Aug 7, 2025

Copy link
Copy Markdown
Contributor

In clusters with a very short node timeout such as 2-3 seconds, the extra failover delay of 500-1000 milliseconds (500 + random value 0-500; total 750 on average) before initiating a failover is a significant extra downtime to the cluster. This PR makes this delay relative to node timeout, using a shorter failover delay for a smaller configured node timeout. The formula is fixed_delay = min(500, node_timeout / 30).

Node timeout Fixed failover delay
15000 or more 500 (same as before)
7500 250
3000 100
1500 50

Additional change: Add an extra 500ms delay to new replicas that may not yet know about the other replicas. This avoids the scenario where a new replica with no data wins the failover. This change turned out to be needed to for the stability of some test cases.

The purposes of the failover delay are

  1. Allow FAIL to propagate to the voting primaries in the cluster
  2. Allow replicas to exchange their offsets, so they will have a correct view of their own rank.

A third (undocumented) purpose of this delay is to allow newly added replicas to discover other replicas in the cluster via gossip and to compute their rank, to realize it's are not the best replica. This case is mitigated by adding another 500ms delay to new replicas, i.e. if it has replication offset 0.

A low node timeout only makes sense in fast networks, so we can assume that the above needs less time than in a cluster with a higher node timeout.

These delays don't affect the correctness of the algorithm. They are just there to increase the probability that a failover will succeed by making sure that the FAIL message has enough time to propagate in the cluster and to the random part is to reduce the probability that two replicas initiates the failover at the same time.

The typical use case is when data consistency matters and writes can't be skipped. For example, in some application, we buffer writes in the application during node failures to be able to apply them when the failover is completed. The application can't buffer them for a very long time, so we need the cluster to be up again within e.g. 5 seconds from the time a node starts to fail.

I hope this PR can be considered safer than #2227, although the two changes are orthogonal.

Part of issue #2023.

Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
@zuiderkwast zuiderkwast added the run-extra-tests Run extra tests on this PR (Runs all tests from daily except valgrind and RESP) label Aug 7, 2025
Comment thread src/cluster_legacy.c Fixed
Comment thread src/cluster_legacy.c Fixed
Comment thread src/cluster_legacy.c Fixed
Comment thread src/cluster_legacy.c Fixed
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>

@hpatro hpatro left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks safe to me.

Do you think we need the minimum set to 500 milliseconds or just keep everything relative to node timeout?

@zuiderkwast

Copy link
Copy Markdown
Contributor Author

Do you think we need the minimum set to 500 milliseconds or just keep everything relative to node timeout?

I did use 500 as a maximum, not a minimum. This is for any cluster using node timeout higher than 15 seconds, to not make failover slower than now for them. I don't know if there are any such clustes though and whether it matters.

@hpatro

hpatro commented Aug 9, 2025

Copy link
Copy Markdown
Contributor

Do you think we need the minimum set to 500 milliseconds or just keep everything relative to node timeout?

I did use 500 as a maximum, not a minimum. This is for any cluster using node timeout higher than 15 seconds, to not make failover slower than now for them. I don't know if there are any such clustes though and whether it matters.

Yes, I wrote it as verbatim as the code is. We use 15 seconds as the timeout so this change doesn't have any impact on our setup.

I think keeping it relative keeps the logic simpler and the additional delay shouldn't bother them as the time it takes to fail a node is already set to a high value.

@enjoy-binbin enjoy-binbin left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We've discussed this a lot before. I am ok with this change.

@codecov

codecov Bot commented Aug 11, 2025

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 72.00%. Comparing base (7a5c0d0) to head (55cdfed).
⚠️ Report is 1 commits behind head on unstable.

Additional details and impacted files
@@             Coverage Diff              @@
##           unstable    #2449      +/-   ##
============================================
- Coverage     72.01%   72.00%   -0.01%     
============================================
  Files           126      126              
  Lines         70473    70476       +3     
============================================
- Hits          50752    50748       -4     
- Misses        19721    19728       +7     
Files with missing lines Coverage Δ
src/cluster_legacy.c 87.76% <100.00%> (+0.03%) ⬆️

... and 8 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

…t other replicas yet

Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
@zuiderkwast zuiderkwast marked this pull request as ready for review August 14, 2025 13:19
@zuiderkwast zuiderkwast added the release-notes This issue should get a line item in the release notes label Aug 18, 2025
@zuiderkwast zuiderkwast merged commit 2ca0dd8 into valkey-io:unstable Aug 18, 2025
61 checks passed
@zuiderkwast zuiderkwast deleted the failover-delay-relative-node-timeout branch August 18, 2025 13:35
allenss-amazon pushed a commit to allenss-amazon/valkey-core that referenced this pull request Aug 19, 2025
In clusters with a very short node timeout such as 2-3 seconds, the
extra failover delay of 500-1000 milliseconds (500 + random value 0-500;
total 750 on average) before initiating a failover is a significant
extra downtime to the cluster. This PR makes this delay relative to node
timeout, using a shorter failover delay for a smaller configured node
timeout. The formula is `fixed_delay = min(500, node_timeout / 30)`.

| Node timeout  | Fixed failover delay |
|---------------|----------------------|
| 15000 or more | 500 (same as before) |
| 7500          | 250                  |
| 3000          | 100                  |
| 1500          | 50                   |

Additional change: Add an extra 500ms delay to new replicas that may not
yet know about the other replicas. This avoids the scenario where a new
replica with no data wins the failover. This change turned out to be
needed to for the stability of some test cases.

The purposes of the failover delay are

1. Allow FAIL to propagate to the voting primaries in the cluster
2. Allow replicas to exchange their offsets, so they will have a correct
view of their own rank.

A third (undocumented) purpose of this delay is to allow newly added
replicas to discover other replicas in the cluster via gossip and to
compute their rank, to realize it's are not the best replica. This case
is mitigated by adding another 500ms delay to new replicas, i.e. if it
has replication offset 0.

A low node timeout only makes sense in fast networks, so we can assume
that the above needs less time than in a cluster with a higher node
timeout.

These delays don't affect the correctness of the algorithm. They are
just there to increase the probability that a failover will succeed by
making sure that the FAIL message has enough time to propagate in the
cluster and to the random part is to reduce the probability that two
replicas initiates the failover at the same time.

The typical use case is when data consistency matters and writes can't
be skipped. For example, in some application, we buffer writes in the
application during node failures to be able to apply them when the
failover is completed. The application can't buffer them for a very long
time, so we need the cluster to be up again within e.g. 5 seconds from
the time a node starts to fail.

I hope this PR can be considered safer than valkey-io#2227, although the two
changes are orthogonal.

Part of issue valkey-io#2023.

---------

Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
asagege pushed a commit to asagege/valkey that referenced this pull request Aug 19, 2025
In clusters with a very short node timeout such as 2-3 seconds, the
extra failover delay of 500-1000 milliseconds (500 + random value 0-500;
total 750 on average) before initiating a failover is a significant
extra downtime to the cluster. This PR makes this delay relative to node
timeout, using a shorter failover delay for a smaller configured node
timeout. The formula is `fixed_delay = min(500, node_timeout / 30)`.

| Node timeout  | Fixed failover delay |
|---------------|----------------------|
| 15000 or more | 500 (same as before) |
| 7500          | 250                  |
| 3000          | 100                  |
| 1500          | 50                   |

Additional change: Add an extra 500ms delay to new replicas that may not
yet know about the other replicas. This avoids the scenario where a new
replica with no data wins the failover. This change turned out to be
needed to for the stability of some test cases.

The purposes of the failover delay are

1. Allow FAIL to propagate to the voting primaries in the cluster
2. Allow replicas to exchange their offsets, so they will have a correct
view of their own rank.

A third (undocumented) purpose of this delay is to allow newly added
replicas to discover other replicas in the cluster via gossip and to
compute their rank, to realize it's are not the best replica. This case
is mitigated by adding another 500ms delay to new replicas, i.e. if it
has replication offset 0.

A low node timeout only makes sense in fast networks, so we can assume
that the above needs less time than in a cluster with a higher node
timeout.

These delays don't affect the correctness of the algorithm. They are
just there to increase the probability that a failover will succeed by
making sure that the FAIL message has enough time to propagate in the
cluster and to the random part is to reduce the probability that two
replicas initiates the failover at the same time.

The typical use case is when data consistency matters and writes can't
be skipped. For example, in some application, we buffer writes in the
application during node failures to be able to apply them when the
failover is completed. The application can't buffer them for a very long
time, so we need the cluster to be up again within e.g. 5 seconds from
the time a node starts to fail.

I hope this PR can be considered safer than valkey-io#2227, although the two
changes are orthogonal.

Part of issue valkey-io#2023.

---------

Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
rjd15372 pushed a commit to rjd15372/valkey that referenced this pull request Sep 19, 2025
In clusters with a very short node timeout such as 2-3 seconds, the
extra failover delay of 500-1000 milliseconds (500 + random value 0-500;
total 750 on average) before initiating a failover is a significant
extra downtime to the cluster. This PR makes this delay relative to node
timeout, using a shorter failover delay for a smaller configured node
timeout. The formula is `fixed_delay = min(500, node_timeout / 30)`.

| Node timeout  | Fixed failover delay |
|---------------|----------------------|
| 15000 or more | 500 (same as before) |
| 7500          | 250                  |
| 3000          | 100                  |
| 1500          | 50                   |

Additional change: Add an extra 500ms delay to new replicas that may not
yet know about the other replicas. This avoids the scenario where a new
replica with no data wins the failover. This change turned out to be
needed to for the stability of some test cases.

The purposes of the failover delay are

1. Allow FAIL to propagate to the voting primaries in the cluster
2. Allow replicas to exchange their offsets, so they will have a correct
view of their own rank.

A third (undocumented) purpose of this delay is to allow newly added
replicas to discover other replicas in the cluster via gossip and to
compute their rank, to realize it's are not the best replica. This case
is mitigated by adding another 500ms delay to new replicas, i.e. if it
has replication offset 0.

A low node timeout only makes sense in fast networks, so we can assume
that the above needs less time than in a cluster with a higher node
timeout.

These delays don't affect the correctness of the algorithm. They are
just there to increase the probability that a failover will succeed by
making sure that the FAIL message has enough time to propagate in the
cluster and to the random part is to reduce the probability that two
replicas initiates the failover at the same time.

The typical use case is when data consistency matters and writes can't
be skipped. For example, in some application, we buffer writes in the
application during node failures to be able to apply them when the
failover is completed. The application can't buffer them for a very long
time, so we need the cluster to be up again within e.g. 5 seconds from
the time a node starts to fail.

I hope this PR can be considered safer than valkey-io#2227, although the two
changes are orthogonal.

Part of issue valkey-io#2023.

---------

Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
rjd15372 pushed a commit that referenced this pull request Sep 23, 2025
In clusters with a very short node timeout such as 2-3 seconds, the
extra failover delay of 500-1000 milliseconds (500 + random value 0-500;
total 750 on average) before initiating a failover is a significant
extra downtime to the cluster. This PR makes this delay relative to node
timeout, using a shorter failover delay for a smaller configured node
timeout. The formula is `fixed_delay = min(500, node_timeout / 30)`.

| Node timeout  | Fixed failover delay |
|---------------|----------------------|
| 15000 or more | 500 (same as before) |
| 7500          | 250                  |
| 3000          | 100                  |
| 1500          | 50                   |

Additional change: Add an extra 500ms delay to new replicas that may not
yet know about the other replicas. This avoids the scenario where a new
replica with no data wins the failover. This change turned out to be
needed to for the stability of some test cases.

The purposes of the failover delay are

1. Allow FAIL to propagate to the voting primaries in the cluster
2. Allow replicas to exchange their offsets, so they will have a correct
view of their own rank.

A third (undocumented) purpose of this delay is to allow newly added
replicas to discover other replicas in the cluster via gossip and to
compute their rank, to realize it's are not the best replica. This case
is mitigated by adding another 500ms delay to new replicas, i.e. if it
has replication offset 0.

A low node timeout only makes sense in fast networks, so we can assume
that the above needs less time than in a cluster with a higher node
timeout.

These delays don't affect the correctness of the algorithm. They are
just there to increase the probability that a failover will succeed by
making sure that the FAIL message has enough time to propagate in the
cluster and to the random part is to reduce the probability that two
replicas initiates the failover at the same time.

The typical use case is when data consistency matters and writes can't
be skipped. For example, in some application, we buffer writes in the
application during node failures to be able to apply them when the
failover is completed. The application can't buffer them for a very long
time, so we need the cluster to be up again within e.g. 5 seconds from
the time a node starts to fail.

I hope this PR can be considered safer than #2227, although the two
changes are orthogonal.

Part of issue #2023.

---------

Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
hpatro pushed a commit to hpatro/valkey that referenced this pull request Oct 3, 2025
In clusters with a very short node timeout such as 2-3 seconds, the
extra failover delay of 500-1000 milliseconds (500 + random value 0-500;
total 750 on average) before initiating a failover is a significant
extra downtime to the cluster. This PR makes this delay relative to node
timeout, using a shorter failover delay for a smaller configured node
timeout. The formula is `fixed_delay = min(500, node_timeout / 30)`.

| Node timeout  | Fixed failover delay |
|---------------|----------------------|
| 15000 or more | 500 (same as before) |
| 7500          | 250                  |
| 3000          | 100                  |
| 1500          | 50                   |

Additional change: Add an extra 500ms delay to new replicas that may not
yet know about the other replicas. This avoids the scenario where a new
replica with no data wins the failover. This change turned out to be
needed to for the stability of some test cases.

The purposes of the failover delay are

1. Allow FAIL to propagate to the voting primaries in the cluster
2. Allow replicas to exchange their offsets, so they will have a correct
view of their own rank.

A third (undocumented) purpose of this delay is to allow newly added
replicas to discover other replicas in the cluster via gossip and to
compute their rank, to realize it's are not the best replica. This case
is mitigated by adding another 500ms delay to new replicas, i.e. if it
has replication offset 0.

A low node timeout only makes sense in fast networks, so we can assume
that the above needs less time than in a cluster with a higher node
timeout.

These delays don't affect the correctness of the algorithm. They are
just there to increase the probability that a failover will succeed by
making sure that the FAIL message has enough time to propagate in the
cluster and to the random part is to reduce the probability that two
replicas initiates the failover at the same time.

The typical use case is when data consistency matters and writes can't
be skipped. For example, in some application, we buffer writes in the
application during node failures to be able to apply them when the
failover is completed. The application can't buffer them for a very long
time, so we need the cluster to be up again within e.g. 5 seconds from
the time a node starts to fail.

I hope this PR can be considered safer than valkey-io#2227, although the two
changes are orthogonal.

Part of issue valkey-io#2023.

---------

Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Signed-off-by: Harkrishn Patro <harkrisp@amazon.com>
@enjoy-binbin enjoy-binbin moved this to Done in Valkey 9.0 Jun 9, 2026
@github-project-automation github-project-automation Bot moved this from Done to To be backported in Valkey 9.0 Jun 9, 2026
@enjoy-binbin enjoy-binbin moved this from To be backported to Done in Valkey 9.0 Jun 9, 2026
enjoy-binbin added a commit that referenced this pull request Jun 9, 2026
…#3941)

Since #2449 made the failover delay relative to cluster-node-timeout.
Now delay = min(cluster-node-timeout / 30, 500), any cluster-node-timeout
below 30, including the legal minimum 0 will collapses delay to zero,
and `x % 0` is undefined behaviour.

Signed-off-by: Binbin <binloveplay1314@qq.com>
valkeyrie-ops Bot pushed a commit that referenced this pull request Jun 10, 2026
…#3941)

Since #2449 made the failover delay relative to cluster-node-timeout.
Now delay = min(cluster-node-timeout / 30, 500), any cluster-node-timeout
below 30, including the legal minimum 0 will collapses delay to zero,
and `x % 0` is undefined behaviour.

Signed-off-by: Binbin <binloveplay1314@qq.com>
valkeyrie-ops Bot pushed a commit that referenced this pull request Jun 22, 2026
…#3941)

Since #2449 made the failover delay relative to cluster-node-timeout.
Now delay = min(cluster-node-timeout / 30, 500), any cluster-node-timeout
below 30, including the legal minimum 0 will collapses delay to zero,
and `x % 0` is undefined behaviour.

Signed-off-by: Binbin <binloveplay1314@qq.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

release-notes This issue should get a line item in the release notes run-extra-tests Run extra tests on this PR (Runs all tests from daily except valgrind and RESP)

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

4 participants