Skip to content

fixes: deadlock when swap waitForFollowersToCatchUp#788

Merged
mattisonchao merged 12 commits intomainfrom
fix/deadlock
Oct 25, 2025
Merged

fixes: deadlock when swap waitForFollowersToCatchUp#788
mattisonchao merged 12 commits intomainfrom
fix/deadlock

Conversation

@mattisonchao
Copy link
Copy Markdown
Member

@mattisonchao mattisonchao commented Oct 24, 2025

Motivation

fixes: #845

In the current implementation. The shard controller might be stuck in the swap method due to waitForFollowersToCatchUp.

The root cause is the swap waiting for waitForFollowersToCatchUp, but the new follower needs to wait for newTermAndAddFollowerOp to be fenced and added as a follower.

Modification

  • Move waitForFollowersToCatchUp operation out of shardController main loop.

@mattisonchao mattisonchao self-assigned this Oct 24, 2025
@mattisonchao mattisonchao changed the title fixes: deadlock when swap waitForFollowersToCatchUp fixes: deadlock when swap waitForFollowersToCatchUp Oct 24, 2025
"oxia": "election-fencing-failed-followers",
"shard": fmt.Sprintf("%d", s.shard),
}, func() {
s.keepFencingFailedFollowers(s.shardMetadata.Term, s.shardMetadata.Ensemble, s.shardMetadata.Leader, followers)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are reading the metadata from the go-routine, outside of its mutex. We can make a clone before or acquire the mutex here

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed.

for _, sa := range s.shardMetadata.Ensemble {
if sa == *s.shardMetadata.Leader {
group := sync.WaitGroup{}
defer group.Wait()
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the reason for waiting at the end of the method, if we are not going to return any error?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's useless so far. Let me remove it. :)

@mattisonchao
Copy link
Copy Markdown
Member Author

rely on the #790

@mattisonchao mattisonchao merged commit 64bf1d0 into main Oct 25, 2025
8 checks passed
@mattisonchao mattisonchao deleted the fix/deadlock branch October 25, 2025 05:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

bug: data lost when expanding nodes from 3 to 6

2 participants