[backport 3.4] raft: fix box.ctl.promote hang/crash#11692

Merged
sergepetrenko merged 4 commits into release/3.4 from backport/release/3.4/11560 on Jul 22, 2025

Conversation


@TarantoolBot TarantoolBot commented Jul 22, 2025

(This PR is a backport of #11560 to release/3.4, targeting the future 3.4.1 release.)


Before this patch box.ctl.promote could hang. This happened when a candidate server, during the promote, received a message from a follower saying that a leader had already been seen.

This particular scenario reproduces the box.ctl.promote hang:

  1. We have a replicaset with 3 servers.
  2. server1 loses its upstream connection to server3.
  3. We call box.ctl.promote on server1. The term of server1 is now
    higher than the terms of the other servers.
  4. While server1 is sending broadcast messages to the other servers,
    server3 starts an election due to an election timeout and becomes the
    leader. Now the terms of all servers are equal.
  5. During the promote, server1 gets a message from the follower node
    server2 that a leader (server3) has already been seen.

The hang happened because the raft state machine did not handle
the scenario where a candidate server, during the election, gets a
message from a follower that a leader has already been seen. Note
that the candidate server has no connection to the newly elected
leader. As a result, the candidate server continues to believe that
there is no leader and keeps sending election broadcast messages to
the other servers.
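The gap described above can be shown with a toy simulation (illustrative only, not Tarantool's actual raft code; class and field names are hypothetical): a candidate that ignores a follower's "leader already seen" notification never leaves the candidate state, so a promote fiber waiting for the election to finish spins forever.

```python
# Toy model of the buggy state machine; names are illustrative only.
class ToyRaft:
    def __init__(self):
        self.state = "follower"
        self.term = 1
        self.broadcasts = 0

    def promote(self):
        # box.ctl.promote bumps the term and starts campaigning.
        self.term += 1
        self.state = "candidate"
        self.broadcasts += 1

    def on_message(self, msg):
        if self.state != "candidate":
            return
        if msg.get("vote_granted"):
            self.state = "leader"
        elif msg.get("leader_seen"):
            # Bug: this case was effectively unhandled. The candidate has
            # no direct connection to the new leader, so it just keeps
            # broadcasting election requests.
            self.broadcasts += 1

node = ToyRaft()
node.promote()
node.on_message({"leader_seen": True})
print(node.state)  # still "candidate" -> the promote never completes
```

The candidate stays in the "candidate" state indefinitely, which is exactly the condition the user-visible hang reduces to.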

Since this kind of promote does not disrupt the cluster, we decided
not to change the low-level raft implementation. To avoid hanging
the user's fiber, we introduce a timeout for box.ctl.promote, after
which control is returned to the user. The promote timeout is set to
election_timeout; when it expires, a TimedOut error is raised.

We also reorder the error checks in box_raft_try_promote to keep
them in priority order. This lets us drop an extra constant from the
promote timeout and make it equal to election_timeout.
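A minimal sketch of the timeout mechanism, assuming a cooperative wait loop; the names promote_with_timeout and TimedOutError are hypothetical stand-ins for Tarantool's fiber wait and its TimedOut error, not the real symbols:

```python
import time

class TimedOutError(Exception):
    pass

def promote_with_timeout(leader_elected, election_timeout):
    """Wait until leader_elected() is true or election_timeout expires."""
    deadline = time.monotonic() + election_timeout
    while not leader_elected():
        if time.monotonic() >= deadline:
            # Return control to the user's fiber instead of hanging.
            raise TimedOutError("promote timed out")
        time.sleep(0.001)

# A promote that can never finish now fails after ~election_timeout:
try:
    promote_with_timeout(lambda: False, election_timeout=0.05)
    outcome = "elected"
except TimedOutError:
    outcome = "timed out"
print(outcome)  # timed out
```

The point is only the bounded wait: a stuck election no longer blocks the caller forever.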

Closes #10836

NO_DOC=bugfix

mrForza added 4 commits July 22, 2025 07:58
Before this patch, errors that occurred in `applier_f` were not removed
from the diag container after the function completed. These errors
surfaced in other callbacks (e.g. `box_raft_try_promote_f`) that used the
`box_raft_on_update` trigger, which was invoked during the execution of
functions nested in `applier_f`. This led to memory leaks, which were
detected in patch #11560. To fix this, let's clear the diag container at
the end of `applier_f`.

Needed for #10836

NO_TEST=<leak fix>
NO_DOC=<leak fix>
NO_CHANGELOG=<leak fix>

(cherry picked from commit 482d3cc)
Before this patch `box.ctl.promote` could hang. This happened when a
candidate server, during the promote, received a message from a follower
saying that a leader had already been seen.

This particular scenario reproduces the `box.ctl.promote` hang:
1) We have a replicaset with 3 servers.
2) `server1` loses its upstream connection to `server3`.
3) We call `box.ctl.promote` on `server1`. The term of `server1` is now
higher than the terms of the other servers.
4) While `server1` is sending broadcast messages to the other servers,
`server3` starts an election due to an election timeout and becomes the
leader. Now the terms of all servers are equal.
5) During the promote, `server1` gets a message from the follower node
`server2` that a leader (`server3`) has already been seen.

The hang happened because the raft state machine did not handle
the scenario where a candidate server, during the election, gets a
message from a follower that a leader has already been seen. Note
that the candidate server has no connection to the newly elected
leader. As a result, the candidate server continues to believe that
there is no leader and keeps sending election broadcast messages to
the other servers.

Since this kind of promote does not disrupt the cluster, we decided
not to change the low-level raft implementation. To avoid hanging
the user's fiber, we introduce a timeout for `box.ctl.promote`, after
which control is returned to the user. The promote timeout is set to
`election_timeout`; when it expires, a `TimedOut` error is raised.

We also reorder the error checks in `box_raft_try_promote` to keep
them in priority order. This lets us drop an extra constant from the
promote timeout and make it equal to `election_timeout`.

Part of #10836

NO_DOC=bugfix

(cherry picked from commit 02920c0)
Before this patch `box.ctl.promote` could crash. This happened when a
candidate server lost quorum before the promote and regained it during
the promote. The crash was caused by an inconsistent `raft->is_candidate`
flag: when the node lost quorum, the flag was set to false and the
trigger function `box_raft_try_promote_f` returned without setting
`ctx->is_done`. After the node regained quorum by reconnecting to the
other nodes, `raft_restore` set `raft->is_candidate` back to true
according to `election_mode`. As a result, all error checks in
`box_raft_try_promote` failed and the assertion `!raft->is_candidate`
fired.

To fix this crash, we change the error-checking logic. Now most errors
are set inside `box_raft_try_promote_f` instead of
`box_raft_try_promote`. This lets us track every critical event
(e.g. quorum loss or reconfiguration) and exit the callback immediately
when one happens. As a result, if quorum loss is detected during the
promote, `ER_NO_ELECTION_QUORUM` is raised instantly.
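A hedged sketch of that restructuring, as a Python analogue of `box_raft_try_promote_f` (all names, fields, and the dict-based raft state here are illustrative, not Tarantool's real API): the callback itself detects the critical event and finishes the waiting context with the right error.

```python
# Hypothetical promote context, mirroring the ctx->is_done idea.
class PromoteCtx:
    def __init__(self):
        self.is_done = False
        self.error = None

def try_promote_cb(ctx, raft):
    # Critical events are detected inside the callback itself, so the
    # waiting promote fiber is completed immediately with the right
    # error instead of re-deriving it later from possibly-changed state.
    if not raft["has_quorum"]:
        ctx.error = "ER_NO_ELECTION_QUORUM"
        ctx.is_done = True
    elif raft["state"] == "leader":
        ctx.is_done = True

ctx = PromoteCtx()
try_promote_cb(ctx, {"has_quorum": False, "state": "candidate"})
print(ctx.error)  # ER_NO_ELECTION_QUORUM: reported instantly on quorum loss
```

Because `is_done` is now always set on quorum loss, the later assertion on a restored `is_candidate` flag can no longer fire.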

Since we change the core error-checking logic, some replication tests
started to fail:
1. The `election_split_vote` test fails for 2 reasons:
  - The first call to box.ctl.promote fails because the replicaset
    does not have enough time to establish a connection between server1
    and server2. As a result, `ER_NO_ELECTION_QUORUM` is raised.
  - An incorrect error is raised in
    `test_election_off_demote_other_no_leader` because after calling
    `box.ctl.demote` we do not wait until the downstream connection
    changes its status to `follow`.
2. The `gh_6860_election_off_demote` test fails because it hangs while
grepping the log for a discovered split vote. This happens because when
the replication between node1 and node2 breaks, `ER_NO_ELECTION_QUORUM`
is raised and `raft_check_split_vote` never runs.

We fix these tests by:
1. adding `wait_for_fullmesh` before all tests and waiting until the
downstream connection of server1 reaches the `follow` status.
2. introducing proxies between node1 and node2, which lets us avoid the
quorum-loss error and force the replicaset into a split-vote state.

We also change the `gh-3055-promote-wakeup-crash` test because after our
patch `ER_NO_ELECTION_QUORUM` is always raised in this scenario due to
quorum loss. The crash described in gh-3055 can no longer happen, because
`box_raft_try_promote_f` exits with an error on its first iteration.

Closes #10836

NO_DOC=bugfix

(cherry picked from commit d4f9c9c)
After patch #11560 the `gh-3055-promote-wakeup-crash` test became
unnecessary. It no longer checks the assertion after a spurious wakeup,
because `ER_NO_ELECTION_QUORUM` is raised earlier (the number of nodes in
the replicaset is below `quorum`). Let's drop this test.

Part of #10836

NO_DOC=test
NO_CHANGELOG=test

(cherry picked from commit d0242af)
@TarantoolBot TarantoolBot requested a review from a team as a code owner July 22, 2025 07:58
@TarantoolBot TarantoolBot changed the title [Backport release/3.4] raft: fix box.ctl.promote hang/crash [backport 3.4] raft: fix box.ctl.promote hang/crash Jul 22, 2025
@coveralls

Coverage Status

coverage: 87.562% (+0.02%) from 87.545% when pulling 5e6acf6 on backport/release/3.4/11560 into 4981e59 on release/3.4.

@sergepetrenko sergepetrenko merged commit 2fb7b84 into release/3.4 Jul 22, 2025
25 checks passed
@sergepetrenko sergepetrenko deleted the backport/release/3.4/11560 branch July 22, 2025 12:54