raft: fix box.ctl.promote hang/crash #11560
Conversation
Serpentian
left a comment
Thank you for figuring this out! Great work)
Gerold103
left a comment
Thanks for the fixes! Nice, I like that it didn't require changing the basic lib raft. And the commit messages are very clear.
Serpentian
left a comment
We're almost done here) Let's discuss the final comments
sergepetrenko
left a comment
Thanks for the patch!
Let's fix a couple of nits in the tests and we'll be good to go.
After the patch tarantool#11560 the `gh-3055-promote-wakeup-crash` test becomes unnecessary: it no longer checks the assertion after a spurious wakeup, because `ER_NO_ELECTION_QUORUM` will be raised earlier (the number of nodes in the replicaset is less than `quorum`). Let's drop this test.

Part of tarantool#10836

NO_DOC=test
NO_CHANGELOG=test
sergepetrenko
left a comment
Unfortunately, the test's still flaky:
[009] not ok 1 replication-luatest.gh_10836_hang_in_box_ctl_promote.test_promote_not_hangs_during_non_leader_message_about_leader
[009] # ...cation-luatest/gh_10836_hang_in_box_ctl_promote_test.lua:114: expected: 4, actual: 5
[009] # stack traceback:
[009] # ...cation-luatest/gh_10836_hang_in_box_ctl_promote_test.lua:114: in function 'replication-luatest.gh_10836_hang_in_box_ctl_promote.test_promote_not_hangs_during_non_leader_message_about_leader'
[009] # artifacts:
[009] # server1 -> /private/tmp/t/009_replication-luatest/artifacts/rs-pkmBfpHeUWnn/server1-HOcSOMoG9klr
[009] # server2 -> /private/tmp/t/009_replication-luatest/artifacts/rs-pkmBfpHeUWnn/server2-2X_ZTCWn12B-
[009] # server3 -> /private/tmp/t/009_replication-luatest/artifacts/rs-pkmBfpHeUWnn/server3-VkYs0lK7kixt
[009] ok 2 replication-luatest.gh_10836_hang_in_box_ctl_promote.test_node_not_wait_promote_timeout_after_fiber_death
Let's fix it like this:
diff --git a/test/replication-luatest/gh_10836_hang_in_box_ctl_promote_test.lua b/test/replication-luatest/gh_10836_hang_in_box_ctl_promote_test.lua
index b5351ae490..659038ac18 100644
--- a/test/replication-luatest/gh_10836_hang_in_box_ctl_promote_test.lua
+++ b/test/replication-luatest/gh_10836_hang_in_box_ctl_promote_test.lua
@@ -111,8 +111,10 @@ g.test_promote_not_hangs_during_non_leader_message_about_leader = function()
_G.server1_promote:set_joinable(true)
end)
- t.assert_equals(g.server1:get_election_term(),
- g.server2:get_election_term())
+ t.helpers.retrying({}, function()
+ t.assert_equals(g.server1:get_election_term(),
+ g.server2:get_election_term())
+ end)
g.proxy_2_to_1:resume()
g.proxy_1_to_2:resume()
97fabcf to
ed99fc5
Compare
sergepetrenko
left a comment
Sorry, Roman, but the tests are still flaky. We should fix them before merging; otherwise they'll irritate other developers and mask real problems in CI, especially on weaker machines, like our aarch64 runners.
Here's how I test for flakiness (the -j flag might be different for you. I have 16 cores, so 32 jobs in parallel works best for me):
./test-run.py $(yes 10836_hang | head -n 512) -j32
I've seen at least 4 different issues, but have saved only 3 of them, so please fix them and make sure there are no others.
1:
[021] replication-luatest/gh_10836_hang_in_box_ctl_p> [ fail ]
[021] Test failed! Output from reject file /tmp/t/rejects/replication-luatest/gh_10836_hang_in_box_ctl_promote.reject:
[021] Tarantool version is 3.5.0-entrypoint-72-ged99fc5c73
[021] TAP version 13
[021] 1..2
[021] # Started on Tue Jul 15 15:12:19 2025
[021] # Starting group: replication-luatest.gh_10836_hang_in_box_ctl_promote
[021] not ok 1 replication-luatest.gh_10836_hang_in_box_ctl_promote.test_promote_not_hangs_during_non_leader_message_about_leader
[021] # ...cation-luatest/gh_10836_hang_in_box_ctl_promote_test.lua:81: expected: 3, actual: 2
[021] # stack traceback:
[021] # ...cation-luatest/gh_10836_hang_in_box_ctl_promote_test.lua:81: in function 'replication-luatest.gh_10836_hang_in_box_ctl_promote.test_promote_not_hangs_during_non_leader_message_about_leader'
[021] # artifacts:
[021] # server3 -> /tmp/t/021_replication-luatest/artifacts/rs-qKGpcKrU_ARD/server3-sbI5o0sXh_S-
[021] # server1 -> /tmp/t/021_replication-luatest/artifacts/rs-qKGpcKrU_ARD/server1-xAf3ZnsVmnqi
[021] ok 2 replication-luatest.gh_10836_hang_in_box_ctl_promote.test_node_not_wait_promote_timeout_after_fiber_death
[021] # Ran 2 tests in 2.241 seconds, 1 succeeded, 1 failed
[021]
2:
[011] Test failed! Output from reject file /tmp/t/rejects/replication-luatest/gh_10836_hang_in_box_ctl_promote.reject:
[011] Tarantool version is 3.5.0-entrypoint-72-ged99fc5c73
[011] TAP version 13
[011] 1..2
[011] # Started on Tue Jul 15 15:18:20 2025
[011] # Starting group: replication-luatest.gh_10836_hang_in_box_ctl_promote
[011] ok 1 replication-luatest.gh_10836_hang_in_box_ctl_promote.test_promote_not_hangs_during_non_leader_message_about_leader
[011] not ok 2 replication-luatest.gh_10836_hang_in_box_ctl_promote.test_node_not_wait_promote_timeout_after_fiber_death
[011] # ...cation-luatest/gh_10836_hang_in_box_ctl_promote_test.lua:147: expected: false or nil, actual: true
[011] # stack traceback:
[011] # ...cation-luatest/gh_10836_hang_in_box_ctl_promote_test.lua:147: in function <...cation-luatest/gh_10836_hang_in_box_ctl_promote_test.lua:134>
[011] # ...cation-luatest/gh_10836_hang_in_box_ctl_promote_test.lua:134: in function 'replication-luatest.gh_10836_hang_in_box_ctl_promote.test_node_not_wait_promote_timeout_after_fiber_death'
[011] # artifacts:
[011] # server3 -> /tmp/t/011_replication-luatest/artifacts/rs-aD-kD_FYjfVj/server3-TlitABbHamJ7
[011] # server2 -> /tmp/t/011_replication-luatest/artifacts/rs-aD-kD_FYjfVj/server2--oz3A9kqKvJB
[011] # Ran 2 tests in 3.635 seconds, 1 succeeded, 1 failed
[011]
3:
[027] replication-luatest/gh_10836_crash_in_box_ctl_> [ fail ]
[027] Test failed! Output from reject file /tmp/t/rejects/replication-luatest/gh_10836_crash_in_box_ctl_promote.reject:
[027] Tarantool version is 3.5.0-entrypoint-72-ged99fc5c73
[027] TAP version 13
[027] 1..1
[027] # Started on Tue Jul 15 15:20:52 2025
[027] # Starting group: replication-luatest.gh_10836_crash_in_box_ctl_promote
[027] not ok 1 replication-luatest.gh_10836_crash_in_box_ctl_promote.test_node_not_crashes_while_gaining_quorum_during_promote
[027] # ...ation-luatest/gh_10836_crash_in_box_ctl_promote_test.lua:78: expected: nil, actual: timed out
[027] # stack traceback:
[027] # ...ation-luatest/gh_10836_crash_in_box_ctl_promote_test.lua:78: in function <...ation-luatest/gh_10836_crash_in_box_ctl_promote_test.lua:76>
[027] # ...ation-luatest/gh_10836_crash_in_box_ctl_promote_test.lua:76: in function 'replication-luatest.gh_10836_crash_in_box_ctl_promote.test_node_not_crashes_while_gaining_quorum_during_promote'
[027] # artifacts:
[027] # server1 -> /tmp/t/027_replication-luatest/artifacts/rs-k-_vDCsG_3oY/server1-PvL9_z8OLwlN
[027] # server3 -> /tmp/t/027_replication-luatest/artifacts/rs-k-_vDCsG_3oY/server3-EA3KaqFntxb8
[027] # Ran 1 tests in 7.385 seconds, 0 succeeded, 1 failed
[027]
Since my laptop had only 22 available threads, I ran these two tests (…). To fix the flakies I …
I hope these issues are fixed!
sergepetrenko
left a comment
Roman, thanks for the fixes!
The tests are stable on my machine now.
@mrForza, the ASAN tests fail, PTAL.
Memory leaks occurred because we didn't clean the diag container if the … In this common scenario we could get a memory leak: …
The evidence of the external error in `box_raft_try_promote_f`'s diag: … I added my custom logs in … The diff:
diff --git a/src/box/raft.c b/src/box/raft.c
index e0b0b65a0..0871b9348 100644
--- a/src/box/raft.c
+++ b/src/box/raft.c
@@ -465,8 +465,10 @@ box_raft_try_promote_f(struct trigger *trig, void *event)
assert(raft == box_raft());
struct box_raft_watch_ctx *ctx = trig->data;
- if (raft->state == RAFT_STATE_LEADER)
+ if (raft->state == RAFT_STATE_LEADER) {
+ say_info("!\tRAFT_STATE_LEADER");
goto done;
+ }
/*
* The checking for quorum loss is not placed inside !is_candidate
* check because the quorum may be restored when raft->is_candidate
@@ -513,6 +515,7 @@ box_raft_try_promote_f(struct trigger *trig, void *event)
return 0;
done:
ctx->is_done = true;
+ say_info("!\terror in fiber's diag: %d", diag_get()->last != NULL);
diag_move(diag_get(), &ctx->diag);
fiber_wakeup(ctx->owner);
return 0;
@@ -521,6 +524,7 @@ box_raft_try_promote_f(struct trigger *trig, void *event)
int
box_raft_try_promote(void)
{
+ say_info("!\t");
struct raft *raft = box_raft();
assert(raft->is_enabled);
assert(box_election_mode == ELECTION_MODE_MANUAL ||
@@ -547,6 +551,9 @@ box_raft_try_promote(void)
trigger_clear(&trig);
if (raft->state == RAFT_STATE_LEADER) {
+ say_info("!\terror in ctx: %d", ctx.diag.last != NULL);
+ if (ctx.diag.last != NULL)
+ say_info("!\tend of promote (success); ctx.diag.last->code: %d", ctx.diag.last->code);
/*
* Since some errors from other functions may be set in
* box_raft_try_promote_f's diag, we should always clear
Before this patch, errors that occurred in `applier_f` weren't removed from the diag container after the function completed. These errors then appeared in other callbacks (e.g. `box_raft_try_promote_f`) that used the `box_raft_on_update` trigger, which was invoked during the execution of `applier_f`'s nested functions. As a result, it led to memory leaks, which were detected in the patch tarantool#11560. To fix this, let's clear the diag container at the end of `applier_f`.

Needed for tarantool#10836

NO_TEST=&lt;leak fix&gt;
NO_DOC=&lt;leak fix&gt;
NO_CHANGELOG=&lt;leak fix&gt;
Before this patch `box.ctl.promote` could lead to a hang. It was happening when one candidate server, during the promote, got a message from a follower that a leader had already been seen.

This particular scenario can reproduce the `box.ctl.promote` hang:
1) We have a replicaset with 3 servers.
2) `server1` loses its upstream connection with `server3`.
3) Then we call `box.ctl.promote` on `server1`. The term of `server1` is higher than the terms of the other servers.
4) When `server1` sends broadcast messages to the other servers, `server3` starts the election process due to the election timeout and becomes the leader. Now the terms of all servers are equal.
5) During the promote process, `server1` gets a message from a follower node, `server2`, that the leader, `server3`, was already seen.

The hang happened because the raft state machine did not handle the scenario when a candidate server, during the election process, gets a message from a follower that a leader was already seen. It is important to note that the candidate server should have no connection to the newly elected leader. As a result, the candidate server continues to believe that there is no leader and keeps sending election broadcast messages to the other servers.

Since this kind of promote does not disrupt the cluster, it was decided not to change the low-level raft implementation. In order to avoid a hang in the user's fiber, we introduce a timeout for `box.ctl.promote`, after which control is transferred back to the user. The promote timeout is set to `election_timeout`; when it expires, a `TimedOut` error is raised.

We also change the order of error checks in `box_raft_try_promote` to keep them in priority order. This lets us get rid of an excess constant in the promote timeout and make it equal to `election_timeout`.

Part of tarantool#10836

NO_DOC=bugfix
Before this patch `box.ctl.promote` could lead to a crash. It was happening
when one candidate server lost quorum before a promote and regained it
during the promote. The reason for the crash was an inconsistent state of
the `raft->is_candidate` flag. When the node lost its quorum, this flag
was set to false and the trigger function `box_raft_try_promote_f` ended
without setting `ctx->is_done`. After the node regained its quorum by
reconnecting to the other nodes, `raft_restore` set `raft->is_candidate`
back to true according to `election_mode`. As a result, all error checks in
`box_raft_try_promote` failed and the assertion `!raft->is_candidate`
fired.
To fix this crash we change the error-checking logic. Now most errors
are set inside `box_raft_try_promote_f` instead of
`box_raft_try_promote`. This lets us track every critical event
(e.g. loss of quorum or reconfiguration) and exit the callback immediately
when one happens. As a result, if quorum loss is detected during a promote,
`ER_NO_ELECTION_QUORUM` is raised instantly.
Since we change the core error-checking logic, some replication tests
start to fail:
1. The `election_split_vote` test fails for 2 reasons:
- The first call of `box.ctl.promote` fails because the replicaset
doesn't have enough time to establish a connection between server1 and
server2. As a result, `ER_NO_ELECTION_QUORUM` is raised.
- An incorrect error is raised in
`test_election_off_demote_other_no_leader` because after calling
`box.ctl.demote` we don't wait until the downstream connection
changes its status to `follow`.
2. The `gh_6860_election_off_demote` test fails because it hangs while
trying to grep the log for a discovered split vote. The reason for this
behavior is that when the replication between node1 and node2 breaks,
`ER_NO_ELECTION_QUORUM` is raised and `raft_check_split_vote` doesn't
run.
Now, we fix these tests by:
1. adding `wait_for_fullmesh` before all tests and waiting until the
downstream connection of server1 reaches the `follow` status.
2. introducing proxies between node1 and node2. This helps us avoid
the quorum-loss error and force the replicaset into a split-vote state.
We also change the `gh-3055-promote-wakeup-crash` test because after our
patch `ER_NO_ELECTION_QUORUM` is always raised in this scenario due to
quorum loss. The crash described in gh-3055 can no longer happen because
we exit with an error in the first iteration of
`box_raft_try_promote_f`.
Closes tarantool#10836
NO_DOC=bugfix
After the patch tarantool#11560 the `gh-3055-promote-wakeup-crash` test becomes unnecessary: it no longer checks the assertion after a spurious wakeup, because `ER_NO_ELECTION_QUORUM` will be raised earlier (the number of nodes in the replicaset is less than `quorum`). Let's drop this test.

Part of tarantool#10836

NO_DOC=test
NO_CHANGELOG=test
Backport failed for release/3.2. Please cherry-pick the changes locally and resolve any conflicts:

git fetch origin release/3.2
git worktree add -d .worktree/backport/release/3.2/11560 origin/release/3.2
cd .worktree/backport/release/3.2/11560
git switch --create backport/release/3.2/11560
git cherry-pick -x 482d3ccc153fba0b736bfd316d49befa43dbc7c6 02920c04155bffab0aab29d6dc520bdf3f220e16 d4f9c9c992bc145a3d0f0e8fea2e701ff71f126a d0242af36253b279f0fddd6887dfd89e679c175b
Backport failed for release/3.3. Please cherry-pick the changes locally and resolve any conflicts:

git fetch origin release/3.3
git worktree add -d .worktree/backport/release/3.3/11560 origin/release/3.3
cd .worktree/backport/release/3.3/11560
git switch --create backport/release/3.3/11560
git cherry-pick -x 482d3ccc153fba0b736bfd316d49befa43dbc7c6 02920c04155bffab0aab29d6dc520bdf3f220e16 d4f9c9c992bc145a3d0f0e8fea2e701ff71f126a d0242af36253b279f0fddd6887dfd89e679c175b
Successfully created backport PR for
Backport summary
Before this patch `box.ctl.promote` could lead to a hang. It was happening when one candidate server, during the promote, got a message from a follower that a leader had already been seen.

This particular scenario can reproduce the `box.ctl.promote` hang:
1) We have a replicaset with 3 servers.
2) `server1` loses its upstream connection with `server3`.
3) Then we call `box.ctl.promote` on `server1`. The term of `server1` is higher than the terms of the other servers.
4) When `server1` sends broadcast messages to the other servers, `server3` starts the election process due to the election timeout and becomes the leader. Now the terms of all servers are equal.
5) During the promote process, `server1` gets a message from a follower node, `server2`, that the leader, `server3`, was already seen.

The hang happened because the raft state machine did not handle the scenario when a candidate server, during the election process, gets a message from a follower that a leader was already seen. It is important to note that the candidate server should have no connection to the newly elected leader. As a result, the candidate server continues to believe that there is no leader and keeps sending election broadcast messages to the other servers.

Since this kind of promote does not disrupt the cluster, it was decided not to change the low-level raft implementation. In order to avoid a hang in the user's fiber, we introduce a timeout for `box.ctl.promote`, after which control is transferred back to the user. The promote timeout is set to `election_timeout`; when it expires, a `TimedOut` error is raised.

We also change the order of error checks in `box_raft_try_promote` to keep them in priority order. This lets us get rid of an excess constant in the promote timeout and make it equal to `election_timeout`.

Closes #10836

NO_DOC=bugfix