[backport 3.4] raft: fix box.ctl.promote hang/crash#11692
Merged
sergepetrenko merged 4 commits intorelease/3.4from Jul 22, 2025
Merged
[backport 3.4] raft: fix box.ctl.promote hang/crash#11692sergepetrenko merged 4 commits intorelease/3.4from
sergepetrenko merged 4 commits intorelease/3.4from
Conversation
Before this patch errors occurred in the `applier_f` weren't removed from diag container after completing this function. These errors appeared in other callbacks (e.g. `box_raft_try_promote_f`) which used the trigger `box_raft_on_update` that was invoked during the execution of `applier_f` nested functions. As a result, it led to memory leaks which were detected in the patch #11560. To fix this let's clear diag container in the end of the `applier_f`. Needed for #10836 NO_TEST=<leak fix> NO_DOC=<leak fix> NO_CHANGELOG=<leak fix> (cherry picked from commit 482d3cc)
Before this patch `box.ctl.promote` could lead to hang. It was happening in the situation when one candidate server during the promote got a message from follower that leader was already seen. This particular scenario can reproduce `box.ctl.promote` hang: 1) We have a replicaset with 3 servers. 2) `server1` loses an upstream connection with `server3`. 3) Then we call `box.ctl.promote` on `server1`. The term of `server1` is higher than term of other servers. 4) When `server1` sends broadcast messages to other servers, `server3` starts election process due to election timeout and becomes a leader. Now, terms of all servers are equal. 5) During the promote process, `server1` gets a message from follower node - `server2` that leader - `server3` was already seen. The hang happened because the raft state machine did not handle a scenario when the candidate server during election process got a message from follower that leader was already seen. It is important to note that candidate server should not have a connection to newly elected leader. As a result the candidate server continues to believe that there is no leader and sends election broadcast messages to other servers. Since this kind of promote does not disrupt the cluster it is decided not to change a low level raft implementation. In order to avoid a hang in user's fiber we introduce a timeout for `box.ctl.promote`, after which the control is transferred back to the user. The timeout for promote is set to `election_timeout`. At the end of this time the `TimedOut` error will be raised. We also change the order of error's checking in `box_raft_try_promote` to keep them in order of priority. It can help us to get rid of an excess constant in promote timeout and make it equals to `election_timeout`. Part of #10836 NO_DOC=bugfix (cherry picked from commit 02920c0)
Before this patch `box.ctl.promote` could lead to crash. It was happening
in the situation when one candidate server lost a quorum before promote
and gained it during promote. The reason of the crash was an opposite
state of `raft->is_candidate` flag. When node lost its quorum this flag
was set to false and trigger's function `box_raft_try_promote_f` ended
without `ctx->is_done`. After the node has gained its quorum by
reconnecting to other nodes, `raft_restore` has set `raft->is_candidate`
to true according to `election_mode`. As a result, all error checks in
`box_raft_try_promote` have failed and the assertion `!raft->is_candidate`
has fired.
To fix this crash we change the logic of errors' checking. Now, the
majority of errors are set inside `box_raft_try_promote_f` instead of
`box_raft_try_promote`. It can help us to track every critical events
(e.g. loss of quorum or reconfiguring) and exit immediately from callback
when it happened. As a result, if quorum loss is detected during promote,
the `ER_NO_ELECTION_QUORUM` will be raised instantly.
Since we change the core logic of error's checking, some replication tests
begin to fail:
1. The `election_split_vote` test fails for 2 reasons:
- The first calling of box.ctl.promote failed because the replicaset
doesn't have enough time to establish a connection between server1 and
server2. As a result, the `ER_NO_ELECTION_QUORUM` is raised.
- Incorrect error is raised in
`test_election_off_demote_other_no_leader` because after
`box.ctl.demote` calling we don't wait until the downstream connection
changes its `follow` status.
2. The `gh_6860_election_off_demote` test fails because it hangs while
trying to grep log that split vote has been discovered. The reason of this
behavior is that when the replication between node1 and node2 breaks, the
`ER_NO_ELECTION_QUORUM` is raised and the `raft_check_split_vote` doesn't
work.
Now, we fix these tests by:
1. adding `wait_for_fullmesh` before all tests and waiting until the
status of downstream connection of server1 will not be `follow`.
2. introducing proxies between node1 and node2. It can help us not to get
the error about quorum loss and force replicaset into split vote state.
We also change the `gh-3055-promote-wakeup-crash` test because after our
patch in this scenario the `ER_NO_ELECTION_QUORUM` will be raised always
due to quorum loss. There is no way the crash described in gh-3055 can
happen because in the first iteration of `box_raft_try_promote_f` we exit
with error.
Closes #10836
NO_DOC=bugfix
(cherry picked from commit d4f9c9c)
After the patch #11560 the `gh-3055-promote-wakeup-crash` test becomes unnecessary. It does not check anymore the assertion after spurious wakeup because the `ER_NO_ELECTION_QUORUM will` be raised earlier (number of nodes in replicaset < `quorum`). Let's drop this test. Part of #10836 NO_DOC=test NO_CHANGELOG=test (cherry picked from commit d0242af)
xuniq
approved these changes
Jul 22, 2025
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
(This PR is a backport of #11560 to
release/3.4to a future3.4.1release.)Before this patch
box.ctl.promotecould lead to hang. It was happening inthe situation when one candidate server during the promote got a message
from follower that leader was already seen.
This particular scenario can reproduce
box.ctl.promotehang:server1loses an upstream connection withserver3.box.ctl.promoteonserver1. The term ofserver1is higher than term of other servers.
server1sends broadcast messages to other servers,server3starts election process due to election timeout and becomes a leader. Now,
terms of all servers are equal.
server1gets a message from followernode -
server2that leader -server3was already seen.The hang happened because the raft state machine did not handle
a scenario when the candidate server during election process got a
message from follower that leader was already seen. It is important
to note that candidate server should not have a connection to newly
elected leader. As a result the candidate server continues to
believe that there is no leader and sends election broadcast
messages to other servers.
Since this kind of promote does not disrupt the cluster it is
decided not to change a low level raft implementation. In order to
avoid a hang in user's fiber we introduce a timeout for
box.ctl.promote,after which the control is transferred back to the user. The timeout for
promote is set to
election_timeout. At the end of this time theTimedOuterror will be raised.We also change the order of error's checking in
box_raft_try_promoteto keep them in order of priority. It can help us to get rid of an excess
constant in promote timeout and make it equals to
election_timeout.Closes #10836
NO_DOC=bugfix