
raft: fix election deadlock when nodes have election_mode off#11981

Merged
sergepetrenko merged 1 commit into tarantool:master from
philippeboyd:bugfix/raft-disabled-nodes-should-not-report-leader-seen
Nov 7, 2025

Conversation

@philippeboyd
Contributor

@philippeboyd philippeboyd commented Oct 24, 2025

Closes #12018

When instances with election_mode=off exist in a replicaset, they continue to broadcast is_leader_seen=true even after the leader dies (their death-detection timers never start, since Raft is disabled for them). This causes the leader_witness_map bits for these hosts to remain set indefinitely on candidate nodes, blocking elections, since the pre-vote protection check requires leader_witness_map == 0.

The root cause is that election_mode=off nodes cannot be distinguished from active voters in Raft messages. Both report the follower state with is_leader_seen based on local state, but election_mode=off nodes never update their view, since heartbeat processing exits early when Raft is disabled.

This fix forces nodes with election_mode=off to always broadcast is_leader_seen=false. This allows candidate nodes to immediately clear witness map bits for non-participating nodes, enabling elections to proceed with only active participants.
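The behavior change can be sketched as a tiny model (this is not the actual Tarantool C code; the `Node` class and method names here are hypothetical illustrations of the idea):

```python
# Hypothetical model of the fix: a node with election_mode='off' must
# broadcast is_leader_seen=False regardless of its stale local view.

class Node:
    def __init__(self, election_mode, leader_seen_locally):
        self.election_mode = election_mode
        # Potentially stale flag: off-nodes never run the leader
        # death-detection timer, so this can stay True forever
        # after the leader actually dies.
        self.leader_seen_locally = leader_seen_locally

    def broadcast_is_leader_seen(self):
        # The fix: Raft-disabled nodes never claim to see a leader.
        if self.election_mode == "off":
            return False
        return self.leader_seen_locally

# An off-node whose local view is stale after the leader died:
stale_off_node = Node("off", leader_seen_locally=True)
print(stale_off_node.broadcast_is_leader_seen())  # False after the fix
```

With this, candidates clear the corresponding witness-map bit as soon as they receive a message from an off-node, instead of waiting forever.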

Is this the right approach or have I missed anything?

@philippeboyd philippeboyd force-pushed the bugfix/raft-disabled-nodes-should-not-report-leader-seen branch from 2a96245 to 6843f0e on October 24, 2025 17:01
@philippeboyd philippeboyd requested a review from a team as a code owner October 24, 2025 17:01
@sergepetrenko sergepetrenko requested review from Gerold103, Serpentian and sergepetrenko and removed request for Gerold103 October 28, 2025 06:18
@coveralls

coveralls commented Oct 28, 2025

Coverage Status

coverage: 87.678% (+0.03%) from 87.649% when pulling a9e7820 on philippeboyd:bugfix/raft-disabled-nodes-should-not-report-leader-seen into ec05cb1 on tarantool:master.

@philippeboyd philippeboyd force-pushed the bugfix/raft-disabled-nodes-should-not-report-leader-seen branch 2 times, most recently from 943747e to eaa4dbd, on October 31, 2025 19:38
Collaborator

@sergepetrenko sergepetrenko left a comment

Hi, Philippe!

Sorry for the long delay in review and thank you for your patch!

Your approach looks good to me, I have only a couple of comments regarding the changelog wording and commit style.

It's good that you've found and fixed this issue. Could you tell me how you stumbled upon it?

Contributor

@Serpentian Serpentian left a comment

Thank you for finding and fixing such a critical bug! This could have caused a cluster downtime if it had been found in production. The solution is nice and elegant, I have no significant comments regarding it

@philippeboyd
Contributor Author

Hi @sergepetrenko thanks for reviewing. To answer your question:

Could you tell me how you stumbled upon it?

We were testing a setup with one replicaset spread in two datacenters with one datacenter being active and the other being passive.

We used synchro_quorum: 2 to keep writes fast in the active datacenter while still replicating data to the passive datacenter, protecting us from a split-brain situation.

Having a storage replicaset with instances:

dc1-storage-1 (raft candidate)
dc1-storage-2 (raft candidate)
dc1-storage-3 (raft candidate)
dc2-storage-1 (raft off)
dc2-storage-2 (raft off)
dc2-storage-3 (raft off)

Say dc1-storage-1 was the leader and it died: no election was triggered.
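The deadlock in this topology can be sketched from a candidate's point of view (a hypothetical simplification, not Tarantool's actual pre-vote code: here the witness map is just the set of peers whose last broadcast said they still see a leader):

```python
# Hypothetical sketch of the pre-vote deadlock: a candidate may start
# elections only when no peer still reports is_leader_seen=True,
# i.e. when leader_witness_map is empty.

def may_start_election(is_leader_seen_reports):
    """is_leader_seen_reports: node name -> last broadcast flag."""
    witness_map = {n for n, seen in is_leader_seen_reports.items() if seen}
    return len(witness_map) == 0

# After dc1-storage-1 (the leader) dies, without the fix the
# election_mode='off' nodes keep broadcasting their stale True flag,
# so the surviving candidates never start an election:
reports = {
    "dc1-storage-2": False,  # candidate, noticed the leader's death
    "dc1-storage-3": False,  # candidate, noticed the leader's death
    "dc2-storage-1": True,   # election_mode=off, stale view
    "dc2-storage-2": True,   # election_mode=off, stale view
    "dc2-storage-3": True,   # election_mode=off, stale view
}
print(may_start_election(reports))  # False -> deadlock
```

Once the off-nodes are forced to broadcast False, the witness map drains and the pre-vote check passes.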

@philippeboyd philippeboyd force-pushed the bugfix/raft-disabled-nodes-should-not-report-leader-seen branch from eaa4dbd to 60ae034 on November 6, 2025 13:56
@philippeboyd philippeboyd changed the title from "fix(raft): fix election deadlock when nodes have election_mode off" to "raft: fix election deadlock when nodes have election_mode off" Nov 6, 2025
Collaborator

@sergepetrenko sergepetrenko left a comment

Philippe, thanks for the fixes!
LGTM.

@sergepetrenko sergepetrenko added backport/3.2 Automatically create a 3.2 backport PR backport/3.3 Automatically create a 3.3 backport PR backport/3.4 Automatically create a 3.4 backport PR backport/3.5 Automatically create a 3.5 backport PR labels Nov 7, 2025
@sergepetrenko
Collaborator

sergepetrenko commented Nov 7, 2025

@philippeboyd, thanks for the answer, got it.

Just be aware that a replication conflict might still happen with such a setup (although it's rather unlikely).

With a synchro_quorum: 2 to keep the writes fast in the active datacenter while still having data replication in the passive datacenter and protecting us from a split-brain situation.

Having a storage replicaset with instances:

dc1-storage-1 (raft candidate)
dc1-storage-2 (raft candidate)
dc1-storage-3 (raft candidate)
dc2-storage-1 (raft off)
dc2-storage-2 (raft off)
dc2-storage-3 (raft off)

While you won't get 2 leaders in the same term (obviously only 3 nodes participate in elections, and 2 votes out of 3 give you a single leader), it's possible that the elected leader won't have all the committed transactions of the previous leader, because the nodes with election_mode = 'off' are still counted in the quorum for synchronous transaction commits.

So, imagine dc1-storage-1 is the leader: it writes some transaction A and replicates it only to dc2-storage-1. The leader commits A, as it has gathered a quorum. Then dc1-storage-1 dies before replicating A to anyone else, and elections are triggered. Neither dc1-storage-2 nor dc1-storage-3 has the transaction, but one of them will be elected the next leader, which will cause a replication conflict once the already-committed transaction A reaches it (in Raft, only a node having all the previously committed transactions may be elected leader).

That's unlikely, because all the candidates are in the same datacenter, so replication between them should be much faster than to the nodes of the other DC. But still possible.
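The scenario above can be checked with a small worked example (a hypothetical sketch of the quorum arithmetic, not Tarantool's commit logic): with synchro_quorum = 2, the leader plus one election_mode='off' node satisfies the quorum, so a committed transaction can be absent from every surviving candidate.

```python
# Hypothetical sketch: with synchro_quorum=2, a transaction is committed
# once any 2 nodes (the leader included) have it, even if the second
# copy lives only on a dc2 node with election_mode='off'.

SYNCHRO_QUORUM = 2
CANDIDATES = {"dc1-storage-1", "dc1-storage-2", "dc1-storage-3"}

def is_committed(replicated_to):
    # replicated_to is the set of nodes holding the transaction,
    # including the leader itself.
    return len(replicated_to) >= SYNCHRO_QUORUM

# Leader dc1-storage-1 writes transaction A and replicates it only to
# dc2-storage-1 before dying:
has_txn_a = {"dc1-storage-1", "dc2-storage-1"}
print(is_committed(has_txn_a))  # True: A is committed

# Yet no surviving candidate holds A, so the next leader is elected
# without it -- a replication conflict once A reaches it later:
surviving_candidates = CANDIDATES - {"dc1-storage-1"}
print(surviving_candidates & has_txn_a)  # set(): no candidate has A
```

This is exactly why the conflict is possible despite single-leader-per-term safety: election quorum and synchro quorum are drawn from different node sets here.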

Force nodes with `is_enabled=false` to always broadcast
`is_leader_seen=false`. This allows candidate nodes to immediately clear
witness map bits for non-participating nodes, enabling elections to
proceed with only active participants.

Closes tarantool#12018

NO_DOC=bugfix
@sergepetrenko sergepetrenko force-pushed the bugfix/raft-disabled-nodes-should-not-report-leader-seen branch from 60ae034 to a9e7820 on November 7, 2025 11:48
@sergepetrenko sergepetrenko added full-ci Enables all tests for a pull request and removed full-ci Enables all tests for a pull request labels Nov 7, 2025
@sergepetrenko sergepetrenko merged commit 214b54c into tarantool:master Nov 7, 2025
59 checks passed
@TarantoolBot
Collaborator

Successfully created backport PR for release/3.2:

@TarantoolBot
Collaborator

Successfully created backport PR for release/3.3:

@TarantoolBot
Collaborator

Successfully created backport PR for release/3.4:

@TarantoolBot
Collaborator

Successfully created backport PR for release/3.5:

@TarantoolBot
Collaborator

Backport summary

@philippeboyd philippeboyd deleted the bugfix/raft-disabled-nodes-should-not-report-leader-seen branch November 24, 2025 21:31


Development

Successfully merging this pull request may close these issues:

Leader elections never start after a leader is lost if there is a member with election_mode = 'off' in the replica set

7 participants