raft: fix box.ctl.promote hang/crash #11560
Conversation
Serpentian
left a comment
Thank you for figuring this out! Great work)
Gerold103
left a comment
Thanks for the fixes! Nice, I like that it didn't require changing the basic lib raft. And the commit messages are very clear.
Serpentian
left a comment
We're almost done here) Let's discuss the final comments
sergepetrenko
left a comment
Thanks for the patch!
Let's fix a couple of nits in the tests and we'll be good to go.
After the patch tarantool#11560 the `gh-3055-promote-wakeup-crash` test becomes unnecessary: it no longer checks the assertion after a spurious wakeup, because `ER_NO_ELECTION_QUORUM` will be raised earlier (the number of nodes in the replicaset is less than `quorum`). Let's drop this test.

Part of tarantool#10836

NO_DOC=test
NO_CHANGELOG=test
sergepetrenko
left a comment
Unfortunately, the test's still flaky:
[009] not ok 1 replication-luatest.gh_10836_hang_in_box_ctl_promote.test_promote_not_hangs_during_non_leader_message_about_leader
[009] # ...cation-luatest/gh_10836_hang_in_box_ctl_promote_test.lua:114: expected: 4, actual: 5
[009] # stack traceback:
[009] # ...cation-luatest/gh_10836_hang_in_box_ctl_promote_test.lua:114: in function 'replication-luatest.gh_10836_hang_in_box_ctl_promote.test_promote_not_hangs_during_non_leader_message_about_leader'
[009] # artifacts:
[009] # server1 -> /private/tmp/t/009_replication-luatest/artifacts/rs-pkmBfpHeUWnn/server1-HOcSOMoG9klr
[009] # server2 -> /private/tmp/t/009_replication-luatest/artifacts/rs-pkmBfpHeUWnn/server2-2X_ZTCWn12B-
[009] # server3 -> /private/tmp/t/009_replication-luatest/artifacts/rs-pkmBfpHeUWnn/server3-VkYs0lK7kixt
[009] ok 2 replication-luatest.gh_10836_hang_in_box_ctl_promote.test_node_not_wait_promote_timeout_after_fiber_death
Let's fix it like this:
diff --git a/test/replication-luatest/gh_10836_hang_in_box_ctl_promote_test.lua b/test/replication-luatest/gh_10836_hang_in_box_ctl_promote_test.lua
index b5351ae490..659038ac18 100644
--- a/test/replication-luatest/gh_10836_hang_in_box_ctl_promote_test.lua
+++ b/test/replication-luatest/gh_10836_hang_in_box_ctl_promote_test.lua
@@ -111,8 +111,10 @@ g.test_promote_not_hangs_during_non_leader_message_about_leader = function()
_G.server1_promote:set_joinable(true)
end)
- t.assert_equals(g.server1:get_election_term(),
- g.server2:get_election_term())
+ t.helpers.retrying({}, function()
+ t.assert_equals(g.server1:get_election_term(),
+ g.server2:get_election_term())
+ end)
g.proxy_2_to_1:resume()
g.proxy_1_to_2:resume()
97fabcf to
ed99fc5
Compare
sergepetrenko
left a comment
Sorry, Roman, but the tests are still flaky. We should fix them before merging; otherwise they'll irritate other developers and mask real problems in CI, especially on weaker machines, like our aarch64 runners.
Here's how I test for flakiness (the -j flag might be different for you. I have 16 cores, so 32 jobs in parallel works best for me):
./test-run.py $(yes 10836_hang | head -n 512) -j32
I've seen at least 4 different issues, but have saved only 3 of them, so please fix them and make sure there are no others.
1:
[021] replication-luatest/gh_10836_hang_in_box_ctl_p> [ fail ]
[021] Test failed! Output from reject file /tmp/t/rejects/replication-luatest/gh_10836_hang_in_box_ctl_promote.reject:
[021] Tarantool version is 3.5.0-entrypoint-72-ged99fc5c73
[021] TAP version 13
[021] 1..2
[021] # Started on Tue Jul 15 15:12:19 2025
[021] # Starting group: replication-luatest.gh_10836_hang_in_box_ctl_promote
[021] not ok 1 replication-luatest.gh_10836_hang_in_box_ctl_promote.test_promote_not_hangs_during_non_leader_message_about_leader
[021] # ...cation-luatest/gh_10836_hang_in_box_ctl_promote_test.lua:81: expected: 3, actual: 2
[021] # stack traceback:
[021] # ...cation-luatest/gh_10836_hang_in_box_ctl_promote_test.lua:81: in function 'replication-luatest.gh_10836_hang_in_box_ctl_promote.test_promote_not_hangs_during_non_leader_message_about_leader'
[021] # artifacts:
[021] # server3 -> /tmp/t/021_replication-luatest/artifacts/rs-qKGpcKrU_ARD/server3-sbI5o0sXh_S-
[021] # server1 -> /tmp/t/021_replication-luatest/artifacts/rs-qKGpcKrU_ARD/server1-xAf3ZnsVmnqi
[021] ok 2 replication-luatest.gh_10836_hang_in_box_ctl_promote.test_node_not_wait_promote_timeout_after_fiber_death
[021] # Ran 2 tests in 2.241 seconds, 1 succeeded, 1 failed
[021]
2:
[011] Test failed! Output from reject file /tmp/t/rejects/replication-luatest/gh_10836_hang_in_box_ctl_promote.reject:
[011] Tarantool version is 3.5.0-entrypoint-72-ged99fc5c73
[011] TAP version 13
[011] 1..2
[011] # Started on Tue Jul 15 15:18:20 2025
[011] # Starting group: replication-luatest.gh_10836_hang_in_box_ctl_promote
[011] ok 1 replication-luatest.gh_10836_hang_in_box_ctl_promote.test_promote_not_hangs_during_non_leader_message_about_leader
[011] not ok 2 replication-luatest.gh_10836_hang_in_box_ctl_promote.test_node_not_wait_promote_timeout_after_fiber_death
[011] # ...cation-luatest/gh_10836_hang_in_box_ctl_promote_test.lua:147: expected: false or nil, actual: true
[011] # stack traceback:
[011] # ...cation-luatest/gh_10836_hang_in_box_ctl_promote_test.lua:147: in function <...cation-luatest/gh_10836_hang_in_box_ctl_promote_test.lua:134>
[011] # ...cation-luatest/gh_10836_hang_in_box_ctl_promote_test.lua:134: in function 'replication-luatest.gh_10836_hang_in_box_ctl_promote.test_node_not_wait_promote_timeout_after_fiber_death'
[011] # artifacts:
[011] # server3 -> /tmp/t/011_replication-luatest/artifacts/rs-aD-kD_FYjfVj/server3-TlitABbHamJ7
[011] # server2 -> /tmp/t/011_replication-luatest/artifacts/rs-aD-kD_FYjfVj/server2--oz3A9kqKvJB
[011] # Ran 2 tests in 3.635 seconds, 1 succeeded, 1 failed
[011]
3:
[027] replication-luatest/gh_10836_crash_in_box_ctl_> [ fail ]
[027] Test failed! Output from reject file /tmp/t/rejects/replication-luatest/gh_10836_crash_in_box_ctl_promote.reject:
[027] Tarantool version is 3.5.0-entrypoint-72-ged99fc5c73
[027] TAP version 13
[027] 1..1
[027] # Started on Tue Jul 15 15:20:52 2025
[027] # Starting group: replication-luatest.gh_10836_crash_in_box_ctl_promote
[027] not ok 1 replication-luatest.gh_10836_crash_in_box_ctl_promote.test_node_not_crashes_while_gaining_quorum_during_promote
[027] # ...ation-luatest/gh_10836_crash_in_box_ctl_promote_test.lua:78: expected: nil, actual: timed out
[027] # stack traceback:
[027] # ...ation-luatest/gh_10836_crash_in_box_ctl_promote_test.lua:78: in function <...ation-luatest/gh_10836_crash_in_box_ctl_promote_test.lua:76>
[027] # ...ation-luatest/gh_10836_crash_in_box_ctl_promote_test.lua:76: in function 'replication-luatest.gh_10836_crash_in_box_ctl_promote.test_node_not_crashes_while_gaining_quorum_during_promote'
[027] # artifacts:
[027] # server1 -> /tmp/t/027_replication-luatest/artifacts/rs-k-_vDCsG_3oY/server1-PvL9_z8OLwlN
[027] # server3 -> /tmp/t/027_replication-luatest/artifacts/rs-k-_vDCsG_3oY/server3-EA3KaqFntxb8
[027] # Ran 1 tests in 7.385 seconds, 0 succeeded, 1 failed
[027]
Since my laptop had only 22 available threads, I ran these two tests (…). To fix the flakies I …
I hope these issues are fixed!
sergepetrenko
left a comment
Roman, thanks for the fixes!
The tests are stable on my machine now.
@mrForza, the ASAN tests fail, PTAL.
Memory leaks occurred because we didn't clean the diag container if the … In this common scenario we could get a memory leak: …
The evidence of the external error in `box_raft_try_promote_f`'s diag: … I added my custom logs in … The diff:
diff --git a/src/box/raft.c b/src/box/raft.c
index e0b0b65a0..0871b9348 100644
--- a/src/box/raft.c
+++ b/src/box/raft.c
@@ -465,8 +465,10 @@ box_raft_try_promote_f(struct trigger *trig, void *event)
assert(raft == box_raft());
struct box_raft_watch_ctx *ctx = trig->data;
- if (raft->state == RAFT_STATE_LEADER)
+ if (raft->state == RAFT_STATE_LEADER) {
+ say_info("!\tRAFT_STATE_LEADER");
goto done;
+ }
/*
* The checking for quorum loss is not placed inside !is_candidate
* check because the quorum may be restored when raft->is_candidate
@@ -513,6 +515,7 @@ box_raft_try_promote_f(struct trigger *trig, void *event)
return 0;
done:
ctx->is_done = true;
+ say_info("!\terror in fiber's diag: %d", diag_get()->last != NULL);
diag_move(diag_get(), &ctx->diag);
fiber_wakeup(ctx->owner);
return 0;
@@ -521,6 +524,7 @@ box_raft_try_promote_f(struct trigger *trig, void *event)
int
box_raft_try_promote(void)
{
+ say_info("!\t");
struct raft *raft = box_raft();
assert(raft->is_enabled);
assert(box_election_mode == ELECTION_MODE_MANUAL ||
@@ -547,6 +551,9 @@ box_raft_try_promote(void)
trigger_clear(&trig);
if (raft->state == RAFT_STATE_LEADER) {
+ say_info("!\terror in ctx: %d", ctx.diag.last != NULL);
+ if (ctx.diag.last != NULL)
+ say_info("!\tend of promote (success); ctx.diag.last->code: %d", ctx.diag.last->code);
/*
* Since some errors from other functions may be set in
* box_raft_try_promote_f's diag, we should always clear
Before this patch, errors that occurred in `applier_f` weren't removed from the diag container after the function completed. These errors then appeared in other callbacks (e.g. `box_raft_try_promote_f`) that used the `box_raft_on_update` trigger, which was invoked during the execution of `applier_f`'s nested functions. As a result, it led to memory leaks, which were detected in the patch tarantool#11560. To fix this, let's clear the diag container at the end of `applier_f`.

Needed for tarantool#10836

NO_TEST=&lt;leak fix&gt;
NO_DOC=&lt;leak fix&gt;
NO_CHANGELOG=&lt;leak fix&gt;
Before this patch `box.ctl.promote` could lead to a hang. It was happening when one candidate server, during the promote, got a message from a follower that a leader had already been seen.

This particular scenario can reproduce the `box.ctl.promote` hang:
1) We have a replicaset with 3 servers.
2) `server1` loses its upstream connection with `server3`.
3) Then we call `box.ctl.promote` on `server1`. The term of `server1` is higher than the terms of the other servers.
4) When `server1` sends broadcast messages to the other servers, `server3` starts the election process due to the election timeout and becomes the leader. Now the terms of all servers are equal.
5) During the promote process, `server1` gets a message from a follower node, `server2`, that the leader, `server3`, was already seen.

The hang happened because the raft state machine did not handle the scenario when a candidate server, during the election process, gets a message from a follower that a leader was already seen. It is important to note that the candidate server should have no connection to the newly elected leader. As a result, the candidate server continues to believe that there is no leader and keeps sending election broadcast messages to the other servers.

Since this kind of promote does not disrupt the cluster, it was decided not to change the low-level raft implementation. In order to avoid a hang in the user's fiber, we introduce a timeout for `box.ctl.promote`, after which control is transferred back to the user. The promote timeout is set to `election_timeout`; when it expires, a `TimedOut` error is raised.

We also change the order of error checks in `box_raft_try_promote` to keep them in priority order. This lets us get rid of an excess constant in the promote timeout and make it equal to `election_timeout`.

Part of tarantool#10836

NO_DOC=bugfix
Before this patch `box.ctl.promote` could lead to a crash. It was happening
when one candidate server lost quorum before a promote and regained it
during the promote. The reason for the crash was an inconsistent state of
the `raft->is_candidate` flag. When the node lost its quorum, this flag
was set to false and the trigger function `box_raft_try_promote_f` ended
without setting `ctx->is_done`. After the node regained its quorum by
reconnecting to the other nodes, `raft_restore` set `raft->is_candidate`
back to true according to `election_mode`. As a result, all error checks in
`box_raft_try_promote` failed and the assertion `!raft->is_candidate`
fired.
To fix this crash we change the error-checking logic. Now most errors
are set inside `box_raft_try_promote_f` instead of
`box_raft_try_promote`. This lets us track every critical event
(e.g. loss of quorum or reconfiguration) and exit the callback immediately
when one happens. As a result, if quorum loss is detected during a promote,
`ER_NO_ELECTION_QUORUM` is raised instantly.
Since we change the core error-checking logic, some replication tests
start to fail:
1. The `election_split_vote` test fails for 2 reasons:
- The first call of `box.ctl.promote` fails because the replicaset
doesn't have enough time to establish a connection between server1 and
server2. As a result, `ER_NO_ELECTION_QUORUM` is raised.
- An incorrect error is raised in
`test_election_off_demote_other_no_leader` because after calling
`box.ctl.demote` we don't wait until the downstream connection
changes its status to `follow`.
2. The `gh_6860_election_off_demote` test fails because it hangs while
trying to grep the log for a discovered split vote. The reason for this
behavior is that when the replication between node1 and node2 breaks,
`ER_NO_ELECTION_QUORUM` is raised and `raft_check_split_vote` doesn't
run.
Now, we fix these tests by:
1. adding `wait_for_fullmesh` before all tests and waiting until the
downstream connection of server1 reaches the `follow` status.
2. introducing proxies between node1 and node2. This helps us avoid
the quorum-loss error and force the replicaset into a split-vote state.
We also change the `gh-3055-promote-wakeup-crash` test because after our
patch `ER_NO_ELECTION_QUORUM` is always raised in this scenario due to
quorum loss. The crash described in gh-3055 can no longer happen because
we exit with an error in the first iteration of
`box_raft_try_promote_f`.
Closes tarantool#10836
NO_DOC=bugfix
After the patch tarantool#11560 the `gh-3055-promote-wakeup-crash` test becomes unnecessary: it no longer checks the assertion after a spurious wakeup, because `ER_NO_ELECTION_QUORUM` will be raised earlier (the number of nodes in the replicaset is less than `quorum`). Let's drop this test.

Part of tarantool#10836

NO_DOC=test
NO_CHANGELOG=test
Backport failed for release/3.2. Please cherry-pick the changes locally and resolve any conflicts:

git fetch origin release/3.2
git worktree add -d .worktree/backport/release/3.2/11560 origin/release/3.2
cd .worktree/backport/release/3.2/11560
git switch --create backport/release/3.2/11560
git cherry-pick -x 482d3ccc153fba0b736bfd316d49befa43dbc7c6 02920c04155bffab0aab29d6dc520bdf3f220e16 d4f9c9c992bc145a3d0f0e8fea2e701ff71f126a d0242af36253b279f0fddd6887dfd89e679c175b
Backport failed for release/3.3. Please cherry-pick the changes locally and resolve any conflicts:

git fetch origin release/3.3
git worktree add -d .worktree/backport/release/3.3/11560 origin/release/3.3
cd .worktree/backport/release/3.3/11560
git switch --create backport/release/3.3/11560
git cherry-pick -x 482d3ccc153fba0b736bfd316d49befa43dbc7c6 02920c04155bffab0aab29d6dc520bdf3f220e16 d4f9c9c992bc145a3d0f0e8fea2e701ff71f126a d0242af36253b279f0fddd6887dfd89e679c175b
Successfully created backport PR for
Backport summary
Before this patch `box.ctl.promote` could lead to a hang. It was happening when one candidate server, during the promote, got a message from a follower that a leader had already been seen.

This particular scenario can reproduce the `box.ctl.promote` hang:
1) We have a replicaset with 3 servers.
2) `server1` loses its upstream connection with `server3`.
3) Then we call `box.ctl.promote` on `server1`. The term of `server1` is higher than the terms of the other servers.
4) When `server1` sends broadcast messages to the other servers, `server3` starts the election process due to the election timeout and becomes the leader. Now the terms of all servers are equal.
5) During the promote process, `server1` gets a message from a follower node, `server2`, that the leader, `server3`, was already seen.

The hang happened because the raft state machine did not handle the scenario when a candidate server, during the election process, gets a message from a follower that a leader was already seen. It is important to note that the candidate server should have no connection to the newly elected leader. As a result, the candidate server continues to believe that there is no leader and keeps sending election broadcast messages to the other servers.

Since this kind of promote does not disrupt the cluster, it was decided not to change the low-level raft implementation. In order to avoid a hang in the user's fiber, we introduce a timeout for `box.ctl.promote`, after which control is transferred back to the user. The promote timeout is set to `election_timeout`; when it expires, a `TimedOut` error is raised.

We also change the order of error checks in `box_raft_try_promote` to keep them in priority order. This lets us get rid of an excess constant in the promote timeout and make it equal to `election_timeout`.

Closes #10836

NO_DOC=bugfix