
fix: data lost by change ensemble#798

Merged
merlimat merged 12 commits into main from fix/data_lost
Nov 18, 2025

Conversation

@mattisonchao
Member

@mattisonchao mattisonchao commented Oct 29, 2025

Motivation

fixes: #845

In the current implementation of change ensemble (swap node), we update the in-memory metadata and then trigger a leader election, which persists the new ensemble to the metadata store without any precondition. If updating the elected leader fails, the ensemble has already been changed and may be changed again on the next attempt.

Continuing to move shards without validating that the followers have caught up can cause data loss.

Modification

  1. Introduce a follower-caught-up status in the election.
  2. Move the caught-up validation into election ownership and verify that the followers have caught up after the new leader is elected; this wait can also be cancelled by election#stop.
  3. Check IsReadyForChangeEnsemble before changing the ensemble, and fail fast if the precondition is not satisfied.
  4. Implement copy-on-write for UpdateShardMetadata to avoid data races.
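The copy-on-write idea in item 4 can be illustrated with a minimal sketch. The ShardMetadata fields and ShardController type below are simplified stand-ins for illustration, not the actual oxia types:

```go
package main

import (
	"fmt"
	"sync"
)

// Hypothetical, simplified shard metadata; the field names are
// illustrative, not the real oxia proto types.
type ShardMetadata struct {
	Term     int64
	Ensemble []string
}

type ShardController struct {
	mu   sync.RWMutex
	meta *ShardMetadata // readers take this pointer; writers replace it
}

// UpdateShardMetadata applies a mutation copy-on-write style: the current
// metadata is deep-copied, mutated, and swapped in under the lock, so a
// concurrent reader holding the old pointer never sees a partial update.
func (c *ShardController) UpdateShardMetadata(mutate func(*ShardMetadata)) {
	c.mu.Lock()
	defer c.mu.Unlock()
	cp := &ShardMetadata{
		Term:     c.meta.Term,
		Ensemble: append([]string(nil), c.meta.Ensemble...),
	}
	mutate(cp)
	c.meta = cp
}

func (c *ShardController) Metadata() *ShardMetadata {
	c.mu.RLock()
	defer c.mu.RUnlock()
	return c.meta
}

func main() {
	c := &ShardController{meta: &ShardMetadata{Term: 1, Ensemble: []string{"a", "b", "c"}}}
	old := c.Metadata()
	c.UpdateShardMetadata(func(m *ShardMetadata) {
		m.Term++
		m.Ensemble[0] = "d"
	})
	fmt.Println(old.Ensemble[0], c.Metadata().Ensemble[0]) // prints "a d": the old snapshot is untouched
}
```

Readers pay only a pointer load plus a read-lock, and a snapshot taken before an update stays internally consistent for as long as the caller holds it.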

@mattisonchao mattisonchao self-assigned this Oct 29, 2025
@mattisonchao mattisonchao requested a review from Copilot October 29, 2025 16:48
Contributor

Copilot AI left a comment


Pull Request Overview

This PR refactors the node swap functionality to use a more robust ChangeEnsemble action pattern. The key purpose is to ensure followers are caught up with the leader before completing ensemble changes to prevent data loss.

  • Replaced synchronous SwapNode with asynchronous ChangeEnsemble that waits for followers to catch up
  • Introduced follower catch-up monitoring to prevent premature ensemble changes
  • Removed unused helper functions and cleaned up election context management

Reviewed Changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 6 comments.

Summary per file:

  • coordinator/coordinator.go: Renamed handler from handleActionSwap to handleActionChangeEnsemble to reflect the new action type
  • coordinator/controllers/shard_controller_test.go: Updated tests to use the new ChangeEnsembleAction API and added a comprehensive test for data loss prevention
  • coordinator/controllers/shard_controller_election.go: Added follower catch-up monitoring logic and integrated ChangeEnsembleAction into the election process
  • coordinator/controllers/shard_controller.go: Refactored from synchronous SwapNode to asynchronous ChangeEnsemble and removed unused helper functions
  • coordinator/balancer/scheduler.go: Updated to create ChangeEnsembleAction with a callback instead of SwapNodeAction
  • coordinator/balancer/action_test.go: Removed test file for the deprecated SwapNodeAction
  • coordinator/actions/swap.go: Replaced SwapNodeAction with ChangeEnsembleAction, including proper error handling and wait mechanisms


@mattisonchao mattisonchao marked this pull request as ready for review October 31, 2025 07:14
@dao-jun
Contributor

dao-jun commented Nov 3, 2025

Can we just check if shardMeta.RemovedNodes is empty in shardController.swapNode? If it is, proceed with the swap; otherwise, skip it.

@mattisonchao
Member Author

Can we just check if shardMeta.RemovedNodes is empty in shardController.swapNode? If it is, proceed with the swap; otherwise, skip it.

Hi @dao-jun

Good point. Unfortunately, the answer is no, so far.

The current RemovedNodes mechanism is only used to record node deletions, not to protect the quorum, and its value is cleared after the election. To avoid data loss from load-balancer shard movement, however, we need to prevent further ensemble changes until a new quorum has been formed.

Plus, I said "so far" because we still have another issue related to failure recovery. Once that issue is fixed, we can rely on RemovedNodes.

}, func() {
    defer waitGroup.Done()
    err := backoff.RetryNotify(func() error {
        fs, err := e.provider.GetStatus(e.Context, server, &proto.GetStatusRequest{Shard: e.shard})
Contributor


Wait until all followers are caught up with the leader?

Member Author


yes.

term := mutShardMeta.Term
ensemble := mutShardMeta.Ensemble
leader := mutShardMeta.Leader
leaderEntry := candidatesStatus[*leader]
Contributor


Did we ensure the coordinator receives the newTermResponse from the current leader in fenceNewTermQuorum?

"oxia": "election-monitor-followers-caught-up",
"shard": fmt.Sprintf("%d", e.shard),
}, func() {
e.ensureFollowerCaught(ensemble, leader, leaderEntry)
Contributor


As we fenced the leader, can the followers still catch up with the leader?

@merlimat merlimat merged commit 7c38e1c into main Nov 18, 2025
8 checks passed
@merlimat merlimat deleted the fix/data_lost branch November 18, 2025 21:10


Development

Successfully merging this pull request may close these issues.

bug: data lost when expanding nodes from 3 to 6

4 participants