
Coordinator: swapNode() should keep retrying leader election until it succeeds #781

Merged
merlimat merged 2 commits into oxia-db:main from merlimat:fix-handling-swap-node-failure
Oct 24, 2025

Conversation

@merlimat
Collaborator

@merlimat merlimat commented Oct 23, 2025

#845
During the SwapNode() operation for rebalancing, we trigger a new leader election. If that leader election fails, the coordinator does not retry it (unlike the case where the leader election is triggered by a node failure).

SwapNode() should actually block and wait until the leader election finally succeeds, so that we don't move on to swapping other shards.
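
The blocking behavior described above can be sketched roughly as follows. This is a minimal illustration and not the code from this PR: `electFunc`, `electUntilSuccess`, and `runDemo` are hypothetical names, and the capped exponential backoff is an assumption about how the retries might be paced.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// electFunc is a hypothetical stand-in for a single leader election attempt.
type electFunc func(ctx context.Context) error

// electUntilSuccess blocks until elect succeeds or ctx is cancelled.
// Between failed attempts it sleeps with a capped exponential backoff.
// It returns the number of attempts that were made.
func electUntilSuccess(ctx context.Context, elect electFunc, initial, max time.Duration) (int, error) {
	backoff := initial
	for attempts := 1; ; attempts++ {
		err := elect(ctx)
		if err == nil {
			return attempts, nil
		}
		select {
		case <-ctx.Done():
			// Give up only when the caller cancels; report the last failure too.
			return attempts, fmt.Errorf("%w (last election error: %v)", ctx.Err(), err)
		case <-time.After(backoff):
		}
		backoff *= 2
		if backoff > max {
			backoff = max
		}
	}
}

// runDemo simulates an election that fails twice before succeeding.
func runDemo() (int, error) {
	calls := 0
	flaky := func(ctx context.Context) error {
		calls++
		if calls < 3 {
			return errors.New("election failed")
		}
		return nil
	}
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	return electUntilSuccess(ctx, flaky, time.Millisecond, 10*time.Millisecond)
}

func main() {
	attempts, err := runDemo()
	fmt.Println(attempts, err) // prints "3 <nil>"
}
```

Because the caller does not return until the election has succeeded (or the context is cancelled), a loop that swaps several shards in sequence would not start on the next shard while the current one is still without a leader.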

… succeeds

Signed-off-by: Matteo Merli <mmerli@apache.org>
@mattisonchao
Member

Hi, @merlimat
There are some other cases that also need to be improved. I opened a new PR to refine the whole shard state machine (shardController).

@merlimat
Collaborator Author

> Hi, @merlimat There are some other cases which need to be improved also. I opened a new PR to refine the whole shard state machine (shardController)

Yes, though that looks like too big of a change. It's not easy to understand all the implications.

I'd prefer to fix the current issues with targeted changes. If a refactor is needed, I'd do that separately, outside the scope of fixing these issues.

@merlimat
Collaborator Author

Merging for now as I have more changes depending on this

@merlimat merlimat merged commit 8758e8f into oxia-db:main Oct 24, 2025
5 checks passed
@merlimat merlimat deleted the fix-handling-swap-node-failure branch October 24, 2025 18:10
