Skip to content

Optimize failover time when the new primary node is down again#782

Merged
hwware merged 3 commits into
valkey-io:unstablefrom
enjoy-binbin:optimize_failover
Jul 19, 2024
Merged

Optimize failover time when the new primary node is down again#782
hwware merged 3 commits into
valkey-io:unstablefrom
enjoy-binbin:optimize_failover

Conversation

@enjoy-binbin

Copy link
Copy Markdown
Member

We will not reset failover_auth_time after setting it, this is used
to check auth_timeout and auth_retry_time, but we should at least
reset it after a successful failover.

Let's assume the following scenario:

  1. Two replicas initiate an election.
  2. Replica 1 is elected as the primary node, and replica 2 does not have
    enough votes.
  3. Replica 1 is down, ie the new primary node down again in a short time.
  4. Replica 2 know that the new primary node is down and wants to initiate
    a failover, but because the failover_auth_time of the previous round
    has not been reset, it needs to wait for it to time out and then wait
    for the next retry time, which will take cluster-node-timeout * 4 times,
    this adds a lot of delay.

There is another problem. Like we will set additional random time for
failover_auth_time, such as random 500ms and replicas ranking 1s. If
replica 2 receives PONG from the new primary node before sending the
FAILOVER_AUTH_REQUEST, that is, before the failover_auth_time, it will
change itself to a replica. If the new primary node goes down again at
this time, replica 2 will use the previous failover_auth_time to initiate
an election instead of going through the logic of random 500ms and
replicas ranking 1s again, which may lead to unexpected consequences
(for example, a low-ranking replica initiates an election and becomes
the new primary node).

That is, we need to reset failover_auth_time at the appropriate time.
When the replica switches to a new primary, we reset it, because the
existing failover_auth_time is already out of date in this case.

We will not reset failover_auth_time after setting it, this is used
to check auth_timeout and auth_retry_time, but we should at least
reset it after a successful failover.

Let's assume the following scenario:
1. Two replicas initiate an election.
2. Replica 1 is elected as the primary node, and replica 2 does not have
   enough votes.
3. Replica 1 is down, ie the new primary node down again in a short time.
4. Replica 2 know that the new primary node is down and wants to initiate
   a failover, but because the failover_auth_time of the previous round
   has not been reset, it needs to wait for it to time out and then wait
   for the next retry time, which will take cluster-node-timeout * 4 times,
   this adds a lot of delay.

There is another problem. Like we will set additional random time for
failover_auth_time, such as random 500ms and replicas ranking 1s. If
replica 2 receives PONG from the new primary node before sending the
FAILOVER_AUTH_REQUEST, that is, before the failover_auth_time, it will
change itself to a replica. If the new primary node goes down again at
this time, replica 2 will use the previous failover_auth_time to initiate
an election instead of going through the logic of random 500ms and
replicas ranking 1s again, which may lead to unexpected consequences
(for example, a low-ranking replica initiates an election and becomes
the new primary node).

That is, we need to reset failover_auth_time at the appropriate time.
When the replica switches to a new primary, we reset it, because the
existing failover_auth_time is already out of date in this case.

Signed-off-by: Binbin <binloveplay1314@qq.com>
@enjoy-binbin

enjoy-binbin commented Jul 13, 2024

Copy link
Copy Markdown
Member Author

For the failover2 test case, with the default node-timeout 15s, and it runs a loop on the new and old code.

unstable (the 90s+ part is the case 1, and the 40s+ part is the case 2):

Execution time of different units:
  46 seconds - unit/cluster/failover2
  46 seconds - unit/cluster/failover2
  90 seconds - unit/cluster/failover2
  46 seconds - unit/cluster/failover2
  91 seconds - unit/cluster/failover2
  47 seconds - unit/cluster/failover2
  46 seconds - unit/cluster/failover2
  90 seconds - unit/cluster/failover2
  45 seconds - unit/cluster/failover2
  47 seconds - unit/cluster/failover2
  46 seconds - unit/cluster/failover2

this branch:

Execution time of different units:
  48 seconds - unit/cluster/failover2
  48 seconds - unit/cluster/failover2
  47 seconds - unit/cluster/failover2
  47 seconds - unit/cluster/failover2
  47 seconds - unit/cluster/failover2
  47 seconds - unit/cluster/failover2
  46 seconds - unit/cluster/failover2
  46 seconds - unit/cluster/failover2
  47 seconds - unit/cluster/failover2
  47 seconds - unit/cluster/failover2
  47 seconds - unit/cluster/failover2

using the node-timeout 5000:

unstable
Execution time of different units:
  25 seconds - unit/cluster/failover2
  26 seconds - unit/cluster/failover2
  25 seconds - unit/cluster/failover2
  40 seconds - unit/cluster/failover2
  25 seconds - unit/cluster/failover2
  25 seconds - unit/cluster/failover2
  24 seconds - unit/cluster/failover2
  26 seconds - unit/cluster/failover2
  26 seconds - unit/cluster/failover2
  42 seconds - unit/cluster/failover2
  40 seconds - unit/cluster/failover2

this branch

Execution time of different units:
  27 seconds - unit/cluster/failover2
  26 seconds - unit/cluster/failover2
  26 seconds - unit/cluster/failover2
  25 seconds - unit/cluster/failover2
  25 seconds - unit/cluster/failover2
  26 seconds - unit/cluster/failover2
  26 seconds - unit/cluster/failover2
  26 seconds - unit/cluster/failover2
  27 seconds - unit/cluster/failover2
  26 seconds - unit/cluster/failover2
  27 seconds - unit/cluster/failover2

@codecov

codecov Bot commented Jul 13, 2024

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 70.25%. Comparing base (b4ac2c4) to head (1355c06).
⚠️ Report is 890 commits behind head on unstable.

Additional details and impacted files
@@             Coverage Diff              @@
##           unstable     #782      +/-   ##
============================================
+ Coverage     70.23%   70.25%   +0.02%     
============================================
  Files           112      112              
  Lines         60602    60592      -10     
============================================
+ Hits          42563    42571       +8     
+ Misses        18039    18021      -18     
Files with missing lines Coverage Δ
src/cluster_legacy.c 85.86% <100.00%> (+0.07%) ⬆️

... and 15 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@madolson madolson left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice find! some minor comments on the test.

Comment thread tests/unit/cluster/failover2.tcl Outdated
Comment thread tests/unit/cluster/failover2.tcl Outdated
Comment thread tests/unit/cluster/failover2.tcl Outdated
Signed-off-by: Binbin <binloveplay1314@qq.com>

@madolson madolson left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! LGTM.

@madolson madolson added the release-notes This issue should get a line item in the release notes label Jul 15, 2024

@PingXie PingXie left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

Comment thread tests/support/util.tcl Outdated
Comment thread tests/unit/cluster/failover2.tcl Outdated
Signed-off-by: Binbin <binloveplay1314@qq.com>
@hwware hwware merged commit 15a8290 into valkey-io:unstable Jul 19, 2024
@enjoy-binbin enjoy-binbin deleted the optimize_failover branch July 20, 2024 03:59
hwware pushed a commit to hwware/valkey that referenced this pull request Jul 25, 2024
…y-io#782)

We will not reset failover_auth_time after setting it, this is used
to check auth_timeout and auth_retry_time, but we should at least
reset it after a successful failover.

Let's assume the following scenario:
1. Two replicas initiate an election.
2. Replica 1 is elected as the primary node, and replica 2 does not have
   enough votes.
3. Replica 1 is down, ie the new primary node down again in a short
time.
4. Replica 2 know that the new primary node is down and wants to
initiate
   a failover, but because the failover_auth_time of the previous round
   has not been reset, it needs to wait for it to time out and then wait
for the next retry time, which will take cluster-node-timeout * 4 times,
   this adds a lot of delay.

There is another problem. Like we will set additional random time for
failover_auth_time, such as random 500ms and replicas ranking 1s. If
replica 2 receives PONG from the new primary node before sending the
FAILOVER_AUTH_REQUEST, that is, before the failover_auth_time, it will
change itself to a replica. If the new primary node goes down again at
this time, replica 2 will use the previous failover_auth_time to
initiate
an election instead of going through the logic of random 500ms and
replicas ranking 1s again, which may lead to unexpected consequences
(for example, a low-ranking replica initiates an election and becomes
the new primary node).

That is, we need to reset failover_auth_time at the appropriate time.
When the replica switches to a new primary, we reset it, because the
existing failover_auth_time is already out of date in this case.

---------

Signed-off-by: Binbin <binloveplay1314@qq.com>
@github-project-automation github-project-automation Bot moved this to To be backported in Valkey 8.0 Dec 16, 2025
@enjoy-binbin enjoy-binbin moved this from To be backported to Done in Valkey 8.0 Dec 16, 2025
enjoy-binbin pushed a commit to enjoy-binbin/redis that referenced this pull request May 25, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cluster release-notes This issue should get a line item in the release notes

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

4 participants