[backport 3.4] raft: fix box.ctl.promote hang/crash#11692

Merged
sergepetrenko merged 4 commits into release/3.4 from backport/release/3.4/11560 on Jul 22, 2025

Conversation


@TarantoolBot TarantoolBot commented Jul 22, 2025

(This PR is a backport of #11560 to release/3.4, targeting the future 3.4.1 release.)


Before this patch box.ctl.promote could hang. This happened when a candidate server, during the promote, received a message from a follower saying that a leader had already been seen.

This particular scenario reproduces the box.ctl.promote hang:

  1. We have a replicaset with 3 servers.
  2. server1 loses its upstream connection to server3.
  3. We call box.ctl.promote on server1. The term of server1 is now
    higher than the terms of the other servers.
  4. While server1 is sending broadcast messages to the other servers,
    server3 starts an election due to an election timeout and becomes the
    leader. Now the terms of all servers are equal.
  5. During the promote, server1 gets a message from the follower node
    server2 that a leader (server3) has already been seen.

The hang happened because the raft state machine did not handle
the scenario where a candidate server, during the election, gets a
message from a follower that a leader has already been seen. Note
that the candidate server has no connection to the newly elected
leader. As a result, the candidate server continues to believe that
there is no leader and keeps sending election broadcast messages to
the other servers.
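The gap described above can be shown with a toy simulation (illustrative only, not Tarantool's actual raft code; class and field names are hypothetical): a candidate that ignores a follower's "leader already seen" notification never leaves the candidate state, so a promote fiber waiting for the election to finish spins forever.

```python
# Toy model of the buggy state machine; names are illustrative only.
class ToyRaft:
    def __init__(self):
        self.state = "follower"
        self.term = 1
        self.broadcasts = 0

    def promote(self):
        # box.ctl.promote bumps the term and starts campaigning.
        self.term += 1
        self.state = "candidate"
        self.broadcasts += 1

    def on_message(self, msg):
        if self.state != "candidate":
            return
        if msg.get("vote_granted"):
            self.state = "leader"
        elif msg.get("leader_seen"):
            # Bug: this case was effectively unhandled. The candidate has
            # no direct connection to the new leader, so it just keeps
            # broadcasting election requests.
            self.broadcasts += 1

node = ToyRaft()
node.promote()
node.on_message({"leader_seen": True})
print(node.state)  # still "candidate" -> the promote never completes
```

The candidate stays in the "candidate" state indefinitely, which is exactly the condition the user-visible hang reduces to.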

Since this kind of promote does not disrupt the cluster, we decided
not to change the low-level raft implementation. To avoid hanging
the user's fiber, we introduce a timeout for box.ctl.promote, after
which control is returned to the user. The promote timeout is set to
election_timeout; when it expires, a TimedOut error is raised.

We also reorder the error checks in box_raft_try_promote to keep
them in priority order. This lets us drop an extra constant from the
promote timeout and make it equal to election_timeout.
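A minimal sketch of the timeout mechanism, assuming a cooperative wait loop; the names promote_with_timeout and TimedOutError are hypothetical stand-ins for Tarantool's fiber wait and its TimedOut error, not the real symbols:

```python
import time

class TimedOutError(Exception):
    pass

def promote_with_timeout(leader_elected, election_timeout):
    """Wait until leader_elected() is true or election_timeout expires."""
    deadline = time.monotonic() + election_timeout
    while not leader_elected():
        if time.monotonic() >= deadline:
            # Return control to the user's fiber instead of hanging.
            raise TimedOutError("promote timed out")
        time.sleep(0.001)

# A promote that can never finish now fails after ~election_timeout:
try:
    promote_with_timeout(lambda: False, election_timeout=0.05)
    outcome = "elected"
except TimedOutError:
    outcome = "timed out"
print(outcome)  # timed out
```

The point is only the bounded wait: a stuck election no longer blocks the caller forever.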

Closes #10836

NO_DOC=bugfix

mrForza added 4 commits July 22, 2025 07:58
Before this patch, errors that occurred in `applier_f` were not removed
from the diag container after the function completed. These errors
surfaced in other callbacks (e.g. `box_raft_try_promote_f`) that used the
`box_raft_on_update` trigger, which was invoked during the execution of
functions nested in `applier_f`. This led to memory leaks, which were
detected in patch #11560. To fix this, let's clear the diag container at
the end of `applier_f`.

Needed for #10836

NO_TEST=<leak fix>
NO_DOC=<leak fix>
NO_CHANGELOG=<leak fix>

(cherry picked from commit 482d3cc)
Before this patch `box.ctl.promote` could hang. This happened when a
candidate server, during the promote, received a message from a follower
saying that a leader had already been seen.

This particular scenario reproduces the `box.ctl.promote` hang:
1) We have a replicaset with 3 servers.
2) `server1` loses its upstream connection to `server3`.
3) We call `box.ctl.promote` on `server1`. The term of `server1` is now
higher than the terms of the other servers.
4) While `server1` is sending broadcast messages to the other servers,
`server3` starts an election due to an election timeout and becomes the
leader. Now the terms of all servers are equal.
5) During the promote, `server1` gets a message from the follower node
`server2` that a leader (`server3`) has already been seen.

The hang happened because the raft state machine did not handle
the scenario where a candidate server, during the election, gets a
message from a follower that a leader has already been seen. Note
that the candidate server has no connection to the newly elected
leader. As a result, the candidate server continues to believe that
there is no leader and keeps sending election broadcast messages to
the other servers.

Since this kind of promote does not disrupt the cluster, we decided
not to change the low-level raft implementation. To avoid hanging
the user's fiber, we introduce a timeout for `box.ctl.promote`, after
which control is returned to the user. The promote timeout is set to
`election_timeout`; when it expires, a `TimedOut` error is raised.

We also reorder the error checks in `box_raft_try_promote` to keep
them in priority order. This lets us drop an extra constant from the
promote timeout and make it equal to `election_timeout`.

Part of #10836

NO_DOC=bugfix

(cherry picked from commit 02920c0)
Before this patch `box.ctl.promote` could crash. This happened when a
candidate server lost quorum before the promote and regained it during
the promote. The crash was caused by an inconsistent `raft->is_candidate`
flag: when the node lost quorum, the flag was set to false and the
trigger function `box_raft_try_promote_f` returned without setting
`ctx->is_done`. After the node regained quorum by reconnecting to the
other nodes, `raft_restore` set `raft->is_candidate` back to true
according to `election_mode`. As a result, all error checks in
`box_raft_try_promote` failed and the assertion `!raft->is_candidate`
fired.

To fix this crash, we change the error-checking logic. Now most errors
are set inside `box_raft_try_promote_f` instead of
`box_raft_try_promote`. This lets us track every critical event
(e.g. quorum loss or reconfiguration) and exit the callback immediately
when one happens. As a result, if quorum loss is detected during the
promote, `ER_NO_ELECTION_QUORUM` is raised instantly.
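A hedged sketch of that restructuring, as a Python analogue of `box_raft_try_promote_f` (all names, fields, and the dict-based raft state here are illustrative, not Tarantool's real API): the callback itself detects the critical event and finishes the waiting context with the right error.

```python
# Hypothetical promote context, mirroring the ctx->is_done idea.
class PromoteCtx:
    def __init__(self):
        self.is_done = False
        self.error = None

def try_promote_cb(ctx, raft):
    # Critical events are detected inside the callback itself, so the
    # waiting promote fiber is completed immediately with the right
    # error instead of re-deriving it later from possibly-changed state.
    if not raft["has_quorum"]:
        ctx.error = "ER_NO_ELECTION_QUORUM"
        ctx.is_done = True
    elif raft["state"] == "leader":
        ctx.is_done = True

ctx = PromoteCtx()
try_promote_cb(ctx, {"has_quorum": False, "state": "candidate"})
print(ctx.error)  # ER_NO_ELECTION_QUORUM: reported instantly on quorum loss
```

Because `is_done` is now always set on quorum loss, the later assertion on a restored `is_candidate` flag can no longer fire.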

Since we change the core error-checking logic, some replication tests
started to fail:
1. The `election_split_vote` test fails for 2 reasons:
  - The first call to box.ctl.promote fails because the replicaset
    does not have enough time to establish a connection between server1
    and server2. As a result, `ER_NO_ELECTION_QUORUM` is raised.
  - An incorrect error is raised in
    `test_election_off_demote_other_no_leader` because after calling
    `box.ctl.demote` we do not wait until the downstream connection
    changes its status to `follow`.
2. The `gh_6860_election_off_demote` test fails because it hangs while
grepping the log for a discovered split vote. This happens because when
the replication between node1 and node2 breaks, `ER_NO_ELECTION_QUORUM`
is raised and `raft_check_split_vote` never runs.

We fix these tests by:
1. adding `wait_for_fullmesh` before all tests and waiting until the
downstream connection of server1 reaches the `follow` status.
2. introducing proxies between node1 and node2, which lets us avoid the
quorum-loss error and force the replicaset into a split-vote state.

We also change the `gh-3055-promote-wakeup-crash` test because after our
patch `ER_NO_ELECTION_QUORUM` is always raised in this scenario due to
quorum loss. The crash described in gh-3055 can no longer happen, because
`box_raft_try_promote_f` exits with an error on its first iteration.

Closes #10836

NO_DOC=bugfix

(cherry picked from commit d4f9c9c)
After patch #11560 the `gh-3055-promote-wakeup-crash` test became
unnecessary. It no longer checks the assertion after a spurious wakeup,
because `ER_NO_ELECTION_QUORUM` is raised earlier (the number of nodes in
the replicaset is below `quorum`). Let's drop this test.

Part of #10836

NO_DOC=test
NO_CHANGELOG=test

(cherry picked from commit d0242af)
@TarantoolBot TarantoolBot requested a review from a team as a code owner July 22, 2025 07:58
@TarantoolBot TarantoolBot changed the title [Backport release/3.4] raft: fix box.ctl.promote hang/crash [backport 3.4] raft: fix box.ctl.promote hang/crash Jul 22, 2025
@coveralls

Coverage Status

coverage: 87.562% (+0.02%) from 87.545% when pulling 5e6acf6 on backport/release/3.4/11560 into 4981e59 on release/3.4.

@sergepetrenko sergepetrenko merged commit 2fb7b84 into release/3.4 Jul 22, 2025
25 checks passed
@sergepetrenko sergepetrenko deleted the backport/release/3.4/11560 branch July 22, 2025 12:54