Optimize failover time when the new primary node is down again by enjoy-binbin · Pull Request #782 · valkey-io/valkey

enjoy-binbin · 2024-07-13T16:33:14Z

Currently failover_auth_time is never reset once it has been set,
because it is used to check auth_timeout and auth_retry_time. We
should at least reset it after a successful failover.

Let's assume the following scenario:

1. Two replicas initiate an election.
2. Replica 1 is elected as the primary node, and replica 2 does not get
   enough votes.
3. Replica 1 goes down, i.e. the new primary node goes down again
   shortly after being elected.
4. Replica 2 learns that the new primary node is down and wants to
   initiate a failover, but because the failover_auth_time of the
   previous round has not been reset, it has to wait for it to time out
   and then wait for the next retry time, which takes
   cluster-node-timeout * 4. This adds a lot of delay.
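
The size of that delay can be worked out with a small sketch. It assumes, consistent with the "cluster-node-timeout * 4" figure above, that auth_timeout is twice the node timeout (with a 2000ms floor) and that auth_retry_time is twice auth_timeout. The helper name `stale_election_wait_ms` is hypothetical and is not part of Valkey.

```c
/* Hypothetical helper, not Valkey code: how long a replica still has to
 * wait before it may schedule a new election, given the failover_auth_time
 * left over from a previous round. All values are in milliseconds. */
long long stale_election_wait_ms(long long now, long long failover_auth_time, long long node_timeout) {
    /* Assumed constants: auth_timeout = 2 * node timeout (at least 2000ms),
     * auth_retry_time = 2 * auth_timeout, i.e. node timeout * 4. */
    long long auth_timeout = node_timeout * 2;
    if (auth_timeout < 2000) auth_timeout = 2000;
    long long auth_retry_time = auth_timeout * 2;

    /* A new election can only be scheduled once the previous one is
     * older than auth_retry_time. */
    long long remaining = failover_auth_time + auth_retry_time - now;
    return remaining > 0 ? remaining : 0;
}
```

Under these assumptions, with a 15s node timeout a failover_auth_time that was set 10s ago still blocks a new election for about 50s (60s minus 10s), while a failover_auth_time of 0 does not block it at all.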

There is another problem. We add extra delay to failover_auth_time,
such as a random 500ms and a 1s delay based on replica rank. If
replica 2 receives a PONG from the new primary node before sending the
FAILOVER_AUTH_REQUEST, that is, before failover_auth_time is reached,
it will change itself to a replica of the new primary. If the new
primary node goes down again at this point, replica 2 will use the
previous failover_auth_time to initiate an election instead of going
through the random 500ms and replica rank 1s logic again, which may
lead to unexpected consequences (for example, a low-ranking replica
initiates an election and becomes the new primary node).
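
A minimal sketch of how the scheduled delay depends on rank. It assumes the delay is a fixed 500ms, plus a random 0 to 499ms, plus 1000ms per rank; the fixed 500ms part and the exact ranges are assumptions, since the text above only mentions the random 500ms and the 1s rank delay. `election_delay_ms` is a hypothetical name, and the random value is passed in (as a non-negative number) so the result is deterministic.

```c
/* Hypothetical helper, not Valkey code. Returns the delay in milliseconds
 * between deciding to fail over and sending FAILOVER_AUTH_REQUEST. */
long long election_delay_ms(long long random_value, int rank) {
    long long fixed = 500;                         /* assumed fixed delay */
    long long jitter = random_value % 500;         /* random part, 0..499ms */
    long long rank_delay = (long long)rank * 1000; /* assumed 1s per rank */
    return fixed + jitter + rank_delay;
}
```

Under these assumptions a rank 0 replica waits at most 999ms while a rank 1 replica waits at least 1500ms, so the best replica normally asks for votes first. A stale failover_auth_time skips this computation entirely, which is how a low-ranking replica can end up starting the election.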

That is, we need to reset failover_auth_time at the appropriate time.
When the replica switches to a new primary, we reset it, because the
existing failover_auth_time is already out of date in this case.
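
The idea can be sketched as follows. This is a simplified model, not the actual change in src/cluster_legacy.c: the struct and both function names are hypothetical, and the timeout constants are the same assumption as "cluster-node-timeout * 4" (auth_timeout = 2 * node timeout, auth_retry_time = 2 * auth_timeout).

```c
/* Hypothetical, simplified replica state; not the real Valkey structs. */
typedef struct {
    int primary_id;               /* node this replica replicates from */
    long long failover_auth_time; /* scheduled election time in ms, 0 = none */
} replica_state;

/* Called when the replica starts following a new primary. */
void switch_primary(replica_state *r, int new_primary_id) {
    r->primary_id = new_primary_id;
    /* The election this timestamp belonged to is over, so drop it. */
    r->failover_auth_time = 0;
}

/* Nonzero if a fresh election (with new random and rank delays) may be
 * scheduled at time `now`. */
int can_schedule_election(const replica_state *r, long long now, long long node_timeout) {
    long long auth_timeout = node_timeout * 2;
    if (auth_timeout < 2000) auth_timeout = 2000;
    long long auth_retry_time = auth_timeout * 2;
    return now - r->failover_auth_time > auth_retry_time;
}
```

With a failover_auth_time left over from 10s ago and a 15s node timeout, `can_schedule_election` returns 0 in this model; after `switch_primary` clears the timestamp, it returns nonzero and the replica can schedule a new election right away.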

enjoy-binbin · 2024-07-13T16:35:41Z

I ran the failover2 test case in a loop on the old and new code, with the default node-timeout of 15s.

unstable (the 90s+ runs are case 1, and the 40s+ runs are case 2):

Execution time of different units:
  46 seconds - unit/cluster/failover2
  46 seconds - unit/cluster/failover2
  90 seconds - unit/cluster/failover2
  46 seconds - unit/cluster/failover2
  91 seconds - unit/cluster/failover2
  47 seconds - unit/cluster/failover2
  46 seconds - unit/cluster/failover2
  90 seconds - unit/cluster/failover2
  45 seconds - unit/cluster/failover2
  47 seconds - unit/cluster/failover2
  46 seconds - unit/cluster/failover2

this branch:

Execution time of different units:
  48 seconds - unit/cluster/failover2
  48 seconds - unit/cluster/failover2
  47 seconds - unit/cluster/failover2
  47 seconds - unit/cluster/failover2
  47 seconds - unit/cluster/failover2
  47 seconds - unit/cluster/failover2
  46 seconds - unit/cluster/failover2
  46 seconds - unit/cluster/failover2
  47 seconds - unit/cluster/failover2
  47 seconds - unit/cluster/failover2
  47 seconds - unit/cluster/failover2

With node-timeout 5000:

unstable:
Execution time of different units:
  25 seconds - unit/cluster/failover2
  26 seconds - unit/cluster/failover2
  25 seconds - unit/cluster/failover2
  40 seconds - unit/cluster/failover2
  25 seconds - unit/cluster/failover2
  25 seconds - unit/cluster/failover2
  24 seconds - unit/cluster/failover2
  26 seconds - unit/cluster/failover2
  26 seconds - unit/cluster/failover2
  42 seconds - unit/cluster/failover2
  40 seconds - unit/cluster/failover2

this branch:

Execution time of different units:
  27 seconds - unit/cluster/failover2
  26 seconds - unit/cluster/failover2
  26 seconds - unit/cluster/failover2
  25 seconds - unit/cluster/failover2
  25 seconds - unit/cluster/failover2
  26 seconds - unit/cluster/failover2
  26 seconds - unit/cluster/failover2
  26 seconds - unit/cluster/failover2
  27 seconds - unit/cluster/failover2
  26 seconds - unit/cluster/failover2
  27 seconds - unit/cluster/failover2

codecov · 2024-07-13T16:43:48Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 70.25%. Comparing base (b4ac2c4) to head (1355c06).
⚠️ Report is 890 commits behind head on unstable.

Additional details and impacted files

@@             Coverage Diff              @@
##           unstable     #782      +/-   ##
============================================
+ Coverage     70.23%   70.25%   +0.02%     
============================================
  Files           112      112              
  Lines         60602    60592      -10     
============================================
+ Hits          42563    42571       +8     
+ Misses        18039    18021      -18

Files with missing lines	Coverage Δ
src/cluster_legacy.c	`85.86% <100.00%> (+0.07%)`	⬆️

... and 15 files with indirect coverage changes

madolson

Nice find! Some minor comments on the test.

Signed-off-by: Binbin <binloveplay1314@qq.com>

madolson

Nice! LGTM.

PingXie

LGTM.

Signed-off-by: Binbin <binloveplay1314@qq.com>

…' (merge request !21) Backport cluster optimiztions valkey-io/valkey#633 valkey-io/valkey#780 valkey-io/valkey#782 valkey-io/valkey#816

enjoy-binbin requested review from PingXie, madolson and zuiderkwast July 13, 2024 16:36

madolson reviewed Jul 13, 2024

Comment thread tests/unit/cluster/failover2.tcl Outdated

Comment thread tests/unit/cluster/failover2.tcl Outdated

Comment thread tests/unit/cluster/failover2.tcl Outdated

update the tests

0f34526

Signed-off-by: Binbin <binloveplay1314@qq.com>

madolson approved these changes Jul 15, 2024

madolson added the release-notes This issue should get a line item in the release notes label Jul 15, 2024

PingXie approved these changes Jul 16, 2024

Comment thread tests/support/util.tcl Outdated

Comment thread tests/unit/cluster/failover2.tcl Outdated

code review from Ping

1355c06

Signed-off-by: Binbin <binloveplay1314@qq.com>

hwware approved these changes Jul 19, 2024

hwware merged commit 15a8290 into valkey-io:unstable Jul 19, 2024

enjoy-binbin deleted the optimize_failover branch July 20, 2024 03:59

enjoy-binbin added the cluster label Sep 19, 2025

enjoy-binbin added this to Valkey 8.0 Dec 16, 2025

github-project-automation Bot moved this to To be backported in Valkey 8.0 Dec 16, 2025

enjoy-binbin moved this from To be backported to Done in Valkey 8.0 Dec 16, 2025

hwware merged 3 commits into valkey-io:unstable from enjoy-binbin:optimize_failover
