limbo: do not limit its size on replicas#12017

Merged
Gerold103 merged 1 commit into master from gerold103/gh-11836-applier-volatile-deadlock on Nov 11, 2025

Conversation

@Gerold103
Collaborator

There was a possible deadlock when a replica had
box.cfg.replication_synchro_queue_max_size smaller than the master.

The scenario was that the replica would receive some transactions, they would all enter the limbo and wait for CONFIRM in "submitted" state.

But the master sends more transactions instead of CONFIRM. Those transactions block the applier fiber in txn_commit_submit(), because the fiber can't exceed the limbo max size and is waiting for free space.

The free space, however, will never appear, because those "submitted" transactions aren't going anywhere until a CONFIRM is received. That CONFIRM, in turn, will never arrive, because the applier fiber is blocked waiting for limbo space.

The only way out is to let the replica apply these transactions bypassing the limbo max size limitation. It makes no sense to block them; otherwise their CONFIRM can never be received.

This was probably working until commit 20aad15 ("limbo: handle spurious wakeups on space waiting") (not counting that before that commit it was broken in many other ways), but it seems the case wasn't covered by the tests.

Closes #11836

NO_DOC=bugfix

@Gerold103 Gerold103 self-assigned this Nov 4, 2025
@coveralls

coveralls commented Nov 4, 2025

Coverage Status

coverage: 87.648% (-0.002%) from 87.65% when pulling 0fa1b3c on gerold103/gh-11836-applier-volatile-deadlock into 214b54c on master.

@Gerold103 Gerold103 marked this pull request as ready for review November 4, 2025 23:42
@Gerold103 Gerold103 requested a review from a team as a code owner November 4, 2025 23:42
@Gerold103 Gerold103 requested a review from Serpentian November 4, 2025 23:43
@Gerold103 Gerold103 requested a review from Copilot November 5, 2025 00:07

Copilot AI left a comment


Pull Request Overview

This PR fixes a deadlock scenario where a replica with a smaller replication_synchro_queue_max_size than the master would become stuck when receiving synchronous transactions. The fix allows replicas to bypass the limbo size limit when applying transactions from the master, preventing the applier fiber from blocking while waiting for CONFIRM messages.

Key Changes:

  • Modified txn_limbo_would_block() to skip size checks for replicas applying remote transactions
  • Added comprehensive test coverage for the deadlock scenario
  • Documented the bugfix in the changelog

Reviewed Changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

Reviewed files:

  • src/box/txn_limbo.c: Added logic to bypass the limbo size limit on replicas to prevent a deadlock when applying remote transactions
  • test/replication-luatest/gh_11836_qsync_replica_small_max_syncho_size_test.lua: Added a test case that reproduces the deadlock scenario with a replica whose max_size is smaller than the master's
  • changelogs/unreleased/gh-11836-replica-synchro-max-size-deadlock.md: Added a changelog entry documenting the bugfix


Contributor

@Serpentian Serpentian left a comment


Great patch! Clean and simple. I've left some nits, but in general I'm OK with it.

@Serpentian Serpentian removed their assignment Nov 7, 2025
Collaborator

@sergepetrenko sergepetrenko left a comment


Thanks for the patch!

This commit looks good to me, but I have one related question.

@Gerold103 Gerold103 force-pushed the gerold103/gh-11836-applier-volatile-deadlock branch from 06a081c to 5190e04 Compare November 10, 2025 21:01
@Gerold103
Collaborator Author

@Serpentian , I think your approval got revoked somehow. I am gonna need it again.

Collaborator

@sergepetrenko sergepetrenko left a comment


LGTM!

@Serpentian Serpentian removed their assignment Nov 11, 2025
@sergepetrenko sergepetrenko added the backport/3.2, backport/3.3, and backport/3.4 labels (Automatically create a backport PR for the corresponding release) Nov 11, 2025
@sergepetrenko sergepetrenko added the backport/3.5 label (Automatically create a 3.5 backport PR) Nov 11, 2025
@Gerold103 Gerold103 force-pushed the gerold103/gh-11836-applier-volatile-deadlock branch from 5190e04 to 0fa1b3c Compare November 11, 2025 21:31
@Gerold103 Gerold103 merged commit cc77a6e into master Nov 11, 2025
27 checks passed
@Gerold103 Gerold103 deleted the gerold103/gh-11836-applier-volatile-deadlock branch November 11, 2025 21:56
@TarantoolBot
Collaborator

Backport failed for release/3.2, because it was unable to cherry-pick the commit(s).

Please cherry-pick the changes locally and resolve any conflicts.

git fetch origin release/3.2
git worktree add -d .worktree/backport/release/3.2/12017 origin/release/3.2
cd .worktree/backport/release/3.2/12017
git switch --create backport/release/3.2/12017
git cherry-pick -x cc77a6ead00b9a07233f302dabae994fc79dde21

@TarantoolBot
Collaborator

Successfully created backport PR for release/3.3:

@TarantoolBot
Collaborator

Successfully created backport PR for release/3.4:

@TarantoolBot
Collaborator

Successfully created backport PR for release/3.5:

@TarantoolBot
Collaborator

Backport summary

@Gerold103
Collaborator Author

@sergepetrenko , the 3.2 backport isn't needed. On 3.2 the limbo queue is unlimited. There are no volatile entries at all, which means there is no way to get blocked waiting for free space in the limbo.

@sergepetrenko sergepetrenko removed the backport/3.2 Automatically create a 3.2 backport PR label Nov 12, 2025
@sergepetrenko
Collaborator

@Gerold103, yep, you're right. Thanks!


Labels

  • backport/3.3: Automatically create a 3.3 backport PR
  • backport/3.4: Automatically create a 3.4 backport PR
  • backport/3.5: Automatically create a 3.5 backport PR

Projects

None yet

Development

Successfully merging this pull request may close these issues.

qsync: replication can get stuck when replication_synchro_queue_max_size is reached

7 participants