Make manual failover reset the on-going election to promote failover #1274

Merged
enjoy-binbin merged 7 commits into valkey-io:unstable from enjoy-binbin:manual_failover_reset on Nov 22, 2024

Conversation

@enjoy-binbin

Member

If a manual failover times out, for example because the election does not
get enough votes, a new manual failover cannot proceed on the replica side,
because the replica is still bound by auth_timeout and auth_retry_time.

For example, if we initiate a new manual failover after an election has
timed out, the primary is paused, but on the replica side, because of
auth_retry_time, the replica does not trigger a new election, so the new
manual failover eventually times out as well.

With this change, if a manual failover is initiated again while there is
an ongoing election, we reset that election so that the replica can start
a new one at the manual failover's request.
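
To illustrate the description above, here is a minimal C sketch of the replica-side gating. It is a simplified model, not the actual patch: the `replica_state` struct, the function names and the `reset_election` flag are hypothetical, and the way `auth_timeout` and `auth_retry_time` are derived (twice the node timeout with a 2000 ms floor, then twice that) is an assumption based on my reading of src/cluster_legacy.c.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef int64_t mstime_t;

/* Hypothetical, simplified replica state (not the real clusterState). */
typedef struct {
    mstime_t node_timeout;       /* cluster node timeout, in ms */
    mstime_t failover_auth_time; /* start of the last election, 0 = none */
    bool manual_failover_ready;  /* replica caught up with the paused primary */
} replica_state;

/* Assumed derivation: how long one election may wait for votes. */
static mstime_t auth_timeout(const replica_state *r) {
    mstime_t t = r->node_timeout * 2;
    return t < 2000 ? 2000 : t;
}

/* Assumed derivation: minimum gap between two election attempts. */
static mstime_t auth_retry_time(const replica_state *r) {
    return auth_timeout(r) * 2;
}

/* Returns true if the replica starts a new election at time `now`.
 * While a previous election is younger than auth_retry_time, no new
 * election is started, even if a manual failover is waiting for one. */
static bool replica_try_start_election(replica_state *r, mstime_t now) {
    assert(r != NULL);
    if (!r->manual_failover_ready) return false;
    if (r->failover_auth_time != 0 &&
        now - r->failover_auth_time <= auth_retry_time(r)) {
        return false; /* still inside the retry window */
    }
    r->failover_auth_time = now;
    return true;
}

/* Models a manual failover request reaching the replica.
 * `reset_election` stands in for the behavior this PR describes:
 * forget the ongoing election so that a new one can start right away. */
static void manual_failover_request(replica_state *r, bool reset_election) {
    assert(r != NULL);
    if (reset_election) r->failover_auth_time = 0;
    r->manual_failover_ready = true;
}
```

In this model, with an illustrative `node_timeout` of 15000 ms the retry window is 60000 ms, so a second request arriving 5 seconds after the first election is refused by `replica_try_start_election` unless the election is reset first.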

Signed-off-by: Binbin <binloveplay1314@qq.com>
@enjoy-binbin enjoy-binbin requested a review from PingXie November 8, 2024 06:02
Signed-off-by: Binbin <binloveplay1314@qq.com>
@enjoy-binbin enjoy-binbin added the run-extra-tests Run extra tests on this PR (Runs all tests from daily except valgrind and RESP) label Nov 8, 2024
@codecov

codecov Bot commented Nov 8, 2024


Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 70.70%. Comparing base (4986310) to head (ad212eb).
⚠️ Report is 618 commits behind head on unstable.

Additional details and impacted files
@@             Coverage Diff              @@
##           unstable    #1274      +/-   ##
============================================
+ Coverage     70.68%   70.70%   +0.02%     
============================================
  Files           115      115              
  Lines         63177    63178       +1     
============================================
+ Hits          44657    44673      +16     
+ Misses        18520    18505      -15     
Files with missing lines Coverage Δ
src/cluster_legacy.c 86.48% <100.00%> (-0.01%) ⬇️

... and 10 files with indirect coverage changes

@enjoy-binbin

Member Author

A log demo from the test case (before the fix).

replica:

28295:S 08 Nov 2024 14:37:20.208 * Manual failover user request accepted (user request from 'id=4 addr=127.0.0.1:59705 laddr=127.0.0.1:21111 fd=16 name= age=11 idle=0 flags=N db=0 sub=0 psub=0 ssub=0 multi=-1 watch=0 qbuf=0 qbuf-free=0 argv-mem=15 multi-mem=0 rbs=1024 rbp=518 obl=0 oll=0 omem=0 tot-mem=1951 events=r cmd=cluster|failover user=default redir=-1 resp=2 lib-name= lib-ver= tot-net-in=318 tot-net-out=3163 tot-cmds=7').
28295:S 08 Nov 2024 14:37:20.209 * Received replication offset for paused primary manual failover: 14
28295:S 08 Nov 2024 14:37:20.209 * All primary replication stream processed, manual failover can start.
28295:S 08 Nov 2024 14:37:20.209 * Start of election delayed for 0 milliseconds (rank #0, offset 14).
28295:S 08 Nov 2024 14:37:20.209 * Starting a failover election for epoch 4.
28295:S 08 Nov 2024 14:37:25.096 * Currently unable to failover: Waiting for votes, but majority still not reached.
28295:S 08 Nov 2024 14:37:25.096 * Needed quorum: 2. Number of votes received so far: 1
28295:S 08 Nov 2024 14:37:25.298 # Manual failover timed out.

# The second CLUSTER FAILOVER also timed out: after the first election hit auth_timeout, the replica has to wait for auth_retry_time before it can start a new election
28295:S 08 Nov 2024 14:37:25.345 * Manual failover user request accepted (user request from 'id=4 addr=127.0.0.1:59705 laddr=127.0.0.1:21111 fd=16 name= age=16 idle=0 flags=N db=0 sub=0 psub=0 ssub=0 multi=-1 watch=0 qbuf=0 qbuf-free=0 argv-mem=15 multi-mem=0 rbs=1024 rbp=0 obl=0 oll=0 omem=0 tot-mem=1951 events=r cmd=cluster|failover user=default redir=-1 resp=2 lib-name= lib-ver= tot-net-in=349 tot-net-out=3168 tot-cmds=8').
28295:S 08 Nov 2024 14:37:25.346 * Received replication offset for paused primary manual failover: 14
28295:S 08 Nov 2024 14:37:25.346 * All primary replication stream processed, manual failover can start.
28295:S 08 Nov 2024 14:37:30.046 * Currently unable to failover: Waiting for votes, but majority still not reached.
28295:S 08 Nov 2024 14:37:30.046 * Needed quorum: 2. Number of votes received so far: 1
28295:S 08 Nov 2024 14:37:30.349 # Manual failover timed out.

the primary:

28385:M 08 Nov 2024 14:37:20.208 * Manual failover requested by replica a31915be22368c4df57d2f17d58cc03f578e3149 ().
28385:M 08 Nov 2024 14:37:20.209 * Failover auth granted to a31915be22368c4df57d2f17d58cc03f578e3149 () for epoch 4
28385:M 08 Nov 2024 14:37:25.221 # Manual failover timed out.
28385:M 08 Nov 2024 14:37:25.346 * Manual failover requested by replica a31915be22368c4df57d2f17d58cc03f578e3149 ().
28385:M 08 Nov 2024 14:37:30.376 # Manual failover timed out.

Comment thread src/cluster_legacy.c
…_reset

Signed-off-by: Binbin <binloveplay1314@qq.com>
Signed-off-by: Binbin <binloveplay1314@qq.com>
Comment thread tests/unit/cluster/manual-failover.tcl

@madolson madolson left a comment

Member

Seems reasonable to me.

Comment thread tests/unit/cluster/manual-failover.tcl
Comment thread tests/unit/cluster/manual-failover.tcl
Signed-off-by: Binbin <binloveplay1314@qq.com>

@zuiderkwast zuiderkwast left a comment

Contributor

Not a full review. The idea looks good.

…_reset

Signed-off-by: Binbin <binloveplay1314@qq.com>

@zuiderkwast zuiderkwast left a comment

Contributor

21K added lines? Lots of temp files?

Signed-off-by: Binbin <binloveplay1314@qq.com>
@enjoy-binbin

Member Author

Oops, sorry, that was bad conflict handling on my side.

@enjoy-binbin enjoy-binbin added the release-notes This issue should get a line item in the release notes label Nov 21, 2024

@zuiderkwast zuiderkwast left a comment

Contributor

OK, we should probably add .cmake (etc.) to .gitignore to prevent those files from being added by mistake.

@enjoy-binbin enjoy-binbin merged commit c4be326 into valkey-io:unstable Nov 22, 2024
@enjoy-binbin enjoy-binbin deleted the manual_failover_reset branch November 22, 2024 02:29
@enjoy-binbin enjoy-binbin moved this to 8.0.4 in Valkey 8.0 Jun 17, 2025
enjoy-binbin added a commit to vitarb/valkey that referenced this pull request Jun 17, 2025
zuiderkwast pushed a commit to vitarb/valkey that referenced this pull request Aug 15, 2025
@zuiderkwast zuiderkwast moved this from 8.0.4 to 8.0.5 in Valkey 8.0 Aug 18, 2025
zuiderkwast pushed a commit to vitarb/valkey that referenced this pull request Aug 21, 2025
zuiderkwast pushed a commit that referenced this pull request Aug 22, 2025
sarthakaggarwal97 pushed a commit to sarthakaggarwal97/valkey that referenced this pull request Sep 16, 2025
Labels

cluster release-notes This issue should get a line item in the release notes run-extra-tests Run extra tests on this PR (Runs all tests from daily except valgrind and RESP)

Projects

Status: 8.0.5
Status: Done

4 participants