limbo: do not limit its size on replicas#12017
Conversation
There was a problem hiding this comment.
Pull Request Overview
This PR fixes a deadlock scenario where a replica with a smaller replication_synchro_queue_max_size than the master would become stuck when receiving synchronous transactions. The fix allows replicas to bypass the limbo size limit when applying transactions from the master, preventing the applier fiber from blocking while waiting for CONFIRM messages.
Key Changes:
- Modified
txn_limbo_would_block()to skip size checks for replicas applying remote transactions - Added comprehensive test coverage for the deadlock scenario
- Documented the bugfix in the changelog
Reviewed Changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| src/box/txn_limbo.c | Added logic to bypass limbo size limit on replicas to prevent deadlock when applying remote transactions |
| test/replication-luatest/gh_11836_qsync_replica_small_max_syncho_size_test.lua | Added test case that reproduces the deadlock scenario with a replica having smaller max_size than master |
| changelogs/unreleased/gh-11836-replica-synchro-max-size-deadlock.md | Added changelog entry documenting the bugfix |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
test/replication-luatest/gh_11836_qsync_replica_small_max_syncho_size_test.lua
Outdated
Show resolved
Hide resolved
Serpentian
left a comment
There was a problem hiding this comment.
Great patch! Clean and simple, I've left some nits, but in general I'm ok
changelogs/unreleased/gh-11836-replica-synchro-max-size-deadlock.md
Outdated
Show resolved
Hide resolved
test/replication-luatest/gh_11836_qsync_replica_small_max_syncho_size_test.lua
Show resolved
Hide resolved
sergepetrenko
left a comment
There was a problem hiding this comment.
Thanks for the patch!
This commit looks good to me, but I have one related question.
06a081c to
5190e04
Compare
|
@Serpentian , I think your approval got revoked somehow. I am gonna need it again. |
changelogs/unreleased/gh-11836-replica-synchro-max-size-deadlock.md
Outdated
Show resolved
Hide resolved
There was a possible deadlock when a replica had box.cfg.replication_synchro_queue_max_size smaller than the master. The scenario was that the replica would receive some transactions, they would all enter the limbo and wait for CONFIRM in "submitted" state. But the master sends more transactions instead of CONFIRM. Those transactions block the applier fiber in txn_commit_submit(), because the fiber can't exceed the limbo max size and is waiting for free space. The free space however will never appear, because those "submitted" transactions aren't going anywhere until CONFIRM receipt. Which in turn will never happen, because the applier fiber is blocked on waiting for limbo space. The only way is to let the replica apply these transactions bypassing the limbo max size limitation. It makes no sense to block them. Otherwise their CONFIRM can't be received. This was probably working until commit 20aad15 ("limbo: handle spurious wakeups on space waiting") (not counting that before that it was broken in many other ways), but seems like wasn't covered by the tests. Closes #11836 NO_DOC=bugfix
5190e04 to
0fa1b3c
Compare
|
Backport failed for Please cherry-pick the changes locally and resolve any conflicts. git fetch origin release/3.2
git worktree add -d .worktree/backport/release/3.2/12017 origin/release/3.2
cd .worktree/backport/release/3.2/12017
git switch --create backport/release/3.2/12017
git cherry-pick -x cc77a6ead00b9a07233f302dabae994fc79dde21 |
|
Successfully created backport PR for |
|
Successfully created backport PR for |
|
Successfully created backport PR for |
Backport summary
|
|
@sergepetrenko , the 3.2 backport isn't needed. On 3.2 the limbo queue is unlimited. There are no volatile entries at all, which means there is no way to get blocked on the limbo not having space. |
|
@Gerold103, yep, you're right. Thanks! |
There was a possible deadlock when a replica had
box.cfg.replication_synchro_queue_max_size smaller than the master.
The scenario was that the replica would receive some transactions, they would all enter the limbo and wait for CONFIRM in "submitted" state.
But the master sends more transactions instead of CONFIRM. Those transactions block the applier fiber in txn_commit_submit(), because the fiber can't exceed the limbo max size and is waiting for free space.
The free space however will never appear, because those "submitted" transactions aren't going anywhere until CONFIRM receipt. Which in turn will never happen, because the applier fiber is blocked on waiting for limbo space.
The only way is to let the replica apply these transactions bypassing the limbo max size limitation. It makes no sense to block them. Otherwise their CONFIRM can't be received.
This was probably working until
commit 20aad15 ("limbo: handle spurious wakeups on space waiting") (not counting that before that it was broken in many other ways), but seems like wasn't covered by the tests.
Closes #11836
NO_DOC=bugfix