
fix: data lost by change ensemble#798

Merged
merlimat merged 12 commits into main from fix/data_lost
Nov 18, 2025

Conversation

@mattisonchao
Member

@mattisonchao mattisonchao commented Oct 29, 2025

Motivation

fixes: #845

In the current implementation of change ensemble (swap node), we update the in-memory metadata and then trigger a leader election, which persists the new ensemble to the metadata store without any precondition. If updating the elected leader fails, the ensemble has already been changed and may be changed again on the next attempt.

Continuing to move shards without validating that the followers have caught up can cause data loss.

Modification

  1. Introduce a follower-caught-up status in the election.
  2. Move the caught-up validation into election ownership and verify that the followers have caught up after the new leader is elected; this wait can also be cancelled by election#stop.
  3. Check IsReadyForChangeEnsemble before changing the ensemble, and fail fast if the precondition is not satisfied.
  4. Implement copy-on-write for UpdateShardMetadata to avoid data races.
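The copy-on-write idea in item 4 can be illustrated with a minimal sketch. The ShardMetadata fields and ShardController type below are simplified stand-ins for illustration, not the actual oxia types:

```go
package main

import (
	"fmt"
	"sync"
)

// Hypothetical, simplified shard metadata; the field names are
// illustrative, not the real oxia proto types.
type ShardMetadata struct {
	Term     int64
	Ensemble []string
}

type ShardController struct {
	mu   sync.RWMutex
	meta *ShardMetadata // readers take this pointer; writers replace it
}

// UpdateShardMetadata applies a mutation copy-on-write style: the current
// metadata is deep-copied, mutated, and swapped in under the lock, so a
// concurrent reader holding the old pointer never sees a partial update.
func (c *ShardController) UpdateShardMetadata(mutate func(*ShardMetadata)) {
	c.mu.Lock()
	defer c.mu.Unlock()
	cp := &ShardMetadata{
		Term:     c.meta.Term,
		Ensemble: append([]string(nil), c.meta.Ensemble...),
	}
	mutate(cp)
	c.meta = cp
}

func (c *ShardController) Metadata() *ShardMetadata {
	c.mu.RLock()
	defer c.mu.RUnlock()
	return c.meta
}

func main() {
	c := &ShardController{meta: &ShardMetadata{Term: 1, Ensemble: []string{"a", "b", "c"}}}
	old := c.Metadata()
	c.UpdateShardMetadata(func(m *ShardMetadata) {
		m.Term++
		m.Ensemble[0] = "d"
	})
	fmt.Println(old.Ensemble[0], c.Metadata().Ensemble[0]) // prints "a d": the old snapshot is untouched
}
```

Readers pay only a pointer load plus a read-lock, and a snapshot taken before an update stays internally consistent for as long as the caller holds it.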

@mattisonchao mattisonchao self-assigned this Oct 29, 2025
@mattisonchao mattisonchao requested a review from Copilot October 29, 2025 16:48
Contributor

Copilot AI left a comment


Pull Request Overview

This PR refactors the node swap functionality to use a more robust ChangeEnsemble action pattern. The key purpose is to ensure followers are caught up with the leader before completing ensemble changes to prevent data loss.

  • Replaced synchronous SwapNode with asynchronous ChangeEnsemble that waits for followers to catch up
  • Introduced follower catch-up monitoring to prevent premature ensemble changes
  • Removed unused helper functions and cleaned up election context management

Reviewed Changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 6 comments.

Summary per file:

  • coordinator/coordinator.go: Renamed handler from handleActionSwap to handleActionChangeEnsemble to reflect the new action type
  • coordinator/controllers/shard_controller_test.go: Updated tests to use the new ChangeEnsembleAction API and added a comprehensive test for data loss prevention
  • coordinator/controllers/shard_controller_election.go: Added follower catch-up monitoring logic and integrated ChangeEnsembleAction into the election process
  • coordinator/controllers/shard_controller.go: Refactored from synchronous SwapNode to asynchronous ChangeEnsemble and removed unused helper functions
  • coordinator/balancer/scheduler.go: Updated to create ChangeEnsembleAction with a callback instead of SwapNodeAction
  • coordinator/balancer/action_test.go: Removed test file for the deprecated SwapNodeAction
  • coordinator/actions/swap.go: Replaced SwapNodeAction with ChangeEnsembleAction, including proper error handling and wait mechanisms


@mattisonchao mattisonchao marked this pull request as ready for review October 31, 2025 07:14
@dao-jun
Contributor

dao-jun commented Nov 3, 2025

Can we just check if shardMeta.RemovedNodes is empty in shardController.swapNode? If it is, proceed with the swap; otherwise, skip it.

@mattisonchao
Member Author

Can we just check if shardMeta.RemovedNodes is empty in shardController.swapNode? If it is, proceed with the swap; otherwise, skip it.

Hi @dao-jun

Good point. Unfortunately, the answer is no, so far.

The current RemovedNodes mechanism is only used to record node deletions, not to protect the quorum, and its value is cleared after the election. To avoid data loss from load-balancer shard movement, however, we need to prevent further ensemble changes until a new quorum has been formed.

Plus, I said "so far" because we still have another issue related to failure recovery. Once that issue is fixed, we can rely on RemovedNodes.

}, func() {
    defer waitGroup.Done()
    err := backoff.RetryNotify(func() error {
        fs, err := e.provider.GetStatus(e.Context, server, &proto.GetStatusRequest{Shard: e.shard})
Contributor


Wait until all followers are caught up with the leader?

Member Author


yes.

term := mutShardMeta.Term
ensemble := mutShardMeta.Ensemble
leader := mutShardMeta.Leader
leaderEntry := candidatesStatus[*leader]
Contributor


Did we ensure the coordinator receives the newTermResponse from the current leader in fenceNewTermQuorum?

"oxia": "election-monitor-followers-caught-up",
"shard": fmt.Sprintf("%d", e.shard),
}, func() {
e.ensureFollowerCaught(ensemble, leader, leaderEntry)
Contributor


As we fenced the leader, can the followers still catch up with the leader?

@merlimat merlimat merged commit 7c38e1c into main Nov 18, 2025
8 checks passed
@merlimat merlimat deleted the fix/data_lost branch November 18, 2025 21:10


Development

Successfully merging this pull request may close these issues.

bug: data lost when expanding nodes from 3 to 6

4 participants