Skip to content

box: forbid concurrent invocation of box_raft_try_promote#11882

Merged
sergepetrenko merged 1 commit intotarantool:masterfrom
CuriousGeorgiy:box-raft-try-promote-race
Nov 20, 2025
Merged

box: forbid concurrent invocation of box_raft_try_promote#11882
sergepetrenko merged 1 commit intotarantool:masterfrom
CuriousGeorgiy:box-raft-try-promote-race

Conversation

@CuriousGeorgiy
Copy link
Member

@CuriousGeorgiy CuriousGeorgiy commented Oct 6, 2025

This patch fixes two bugs related to concurrent invocation of box_raft_try_promote.

Closes #11703
Closes #11708

@CuriousGeorgiy CuriousGeorgiy requested a review from a team as a code owner October 6, 2025 13:49
@CuriousGeorgiy CuriousGeorgiy force-pushed the box-raft-try-promote-race branch 2 times, most recently from 9157d89 to 37dbb0f Compare October 6, 2025 15:33
@coveralls
Copy link

coveralls commented Oct 6, 2025

Coverage Status

coverage: 87.644% (+0.02%) from 87.621%
when pulling 6f3eefa on CuriousGeorgiy:box-raft-try-promote-race
into 5528ca2
on tarantool:master
.

@CuriousGeorgiy CuriousGeorgiy changed the title box: disallow concurrent invocation of box_raft_try_promote box: forbid concurrent invocation of box_raft_try_promote Oct 6, 2025
@CuriousGeorgiy CuriousGeorgiy force-pushed the box-raft-try-promote-race branch from 37dbb0f to a807569 Compare October 6, 2025 16:01
Copy link
Collaborator

@Gerold103 Gerold103 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the patch! Hm, tricky bug.

@CuriousGeorgiy CuriousGeorgiy force-pushed the box-raft-try-promote-race branch 3 times, most recently from b89bdf3 to 6727c59 Compare October 13, 2025 13:29
@CuriousGeorgiy CuriousGeorgiy force-pushed the box-raft-try-promote-race branch from 6727c59 to 9519932 Compare October 20, 2025 21:29
@Gerold103 Gerold103 requested a review from Serpentian October 21, 2025 21:56
@Serpentian
Copy link
Contributor

We must be sure, that none of the raft_on_update triggers may cause the overwrite of the fiber's diag. We should either move raft_restore higher, as I proposed, or at least add assertions to the raft_schedule_broadcast, that the fiber's diag cannot be changed, when executing triggers (but IMHO, that's normal behavior and the first solution is better, though I don't know, will it properly work)

I'd like to note, that we should somehow cover that place, so that we don't introduce new triggers tomorrow, that'll clear the diag. Assertions in the raft_schedule_broadcast should be enough.

@CuriousGeorgiy CuriousGeorgiy force-pushed the box-raft-try-promote-race branch from 9519932 to 334118e Compare November 16, 2025 15:37
Copy link
Contributor

@Serpentian Serpentian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the patch! Looks like we're safe here now

Copy link
Collaborator

@sergepetrenko sergepetrenko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the patch!

Only one question from me.

@sergepetrenko sergepetrenko added backport/3.2 Automatically create a 3.2 backport PR backport/3.3 Automatically create a 3.3 backport PR backport/3.4 Automatically create a 3.4 backport PR backport/3.5 Automatically create a 3.5 backport PR labels Nov 19, 2025
Currently, we allow concurrent invocation of `box_raft_try_promote`, since
we either disable the `is_in_promote` guard in `box_promote` or omit it in
`box_cfg_xc`, while `box_raft_try_promote` yields to write the raft state.

This leads to a race over the `diag` of the fiber executing the
`box_raft_try_promote_f` trigger, and can lead to more bugs.

Let's forbid concurrent execution of `box_raft_try_promote` by enabling the
`is_in_promote` guard while `box_raft_try_promote` is called.

One potential caveat is the concurrent execution of `box_promote_qsync`
which is also guarded by the `is_in_promote`. However, it is executed in
the raft worker fiber and can be retried until `box_raft_try_promote`
execution finishes.

To be on the safer side, let's:
1. Call `raft_restore` before setting the fiber's `diag` to prevent any
    potential tampering with it from `raft->on_update` triggers.
2. Set `is_box_configured` right before calling `box_raft_try_promote` to
   maintain the invariant that it is always called by `box.cfg` first. Also
   move `box_broadcast_ballot` for consistency.

The `raft_leader_promote` test from the tarantoolgh-6033 test group should now test
that a concurrent promote fails rather than succeeds.

Closes tarantool#11703
Closes tarantool#11708

NO_DOC=<bugfix>
@CuriousGeorgiy CuriousGeorgiy force-pushed the box-raft-try-promote-race branch from 334118e to 6f3eefa Compare November 19, 2025 15:46
@CuriousGeorgiy CuriousGeorgiy added the full-ci Enables all tests for a pull request label Nov 19, 2025
Copy link
Collaborator

@sergepetrenko sergepetrenko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@sergepetrenko sergepetrenko removed the full-ci Enables all tests for a pull request label Nov 20, 2025
@sergepetrenko sergepetrenko merged commit b592845 into tarantool:master Nov 20, 2025
77 of 79 checks passed
@TarantoolBot
Copy link
Collaborator

Successfully created backport PR for release/3.2:

@TarantoolBot
Copy link
Collaborator

Successfully created backport PR for release/3.3:

@TarantoolBot
Copy link
Collaborator

Successfully created backport PR for release/3.4:

@TarantoolBot
Copy link
Collaborator

Successfully created backport PR for release/3.5:

@TarantoolBot
Copy link
Collaborator

Backport summary

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

backport/3.2 Automatically create a 3.2 backport PR backport/3.3 Automatically create a 3.3 backport PR backport/3.4 Automatically create a 3.4 backport PR backport/3.5 Automatically create a 3.5 backport PR

Projects

None yet

Development

Successfully merging this pull request may close these issues.

replication: box.ctl.promote() fails with assertion replication: initial promote attempt after bootstrap leads to assertion failure

7 participants