Skip to content

Fixed intermitted test failure in TestShardController_SwapNodeWithLeaderElectionFailure#927

Merged
merlimat merged 2 commits intooxia-db:mainfrom
merlimat:fix-test
Mar 4, 2026
Merged

Fixed intermitted test failure in TestShardController_SwapNodeWithLeaderElectionFailure#927
merlimat merged 2 commits intooxia-db:mainfrom
merlimat:fix-test

Conversation

@merlimat
Copy link
Copy Markdown
Collaborator

@merlimat merlimat commented Mar 4, 2026

Fix flaky TestShardController_SwapNodeWithLeaderElectionFailure

After an election, followerCaughtUp is set asynchronously in a background goroutine (ensureFollowerCaught at shard_controller_election.go:513-515), but SteadyState is set synchronously when start() returns. The test waited for SteadyState and then called ChangeEnsemble, which could be rejected with ErrNotReadyForChangeEnsemble if the background goroutine hadn't finished yet.

The error was silently swallowed because wg.Wait() was never called in the test, and the test hung for 10 seconds waiting for NewTerm requests that were never sent.

Proof by elimination:

  • The test fails because expectNewTermRequest times out — no NewTerm RPC was sent
  • onChangeEnsemble has only two code paths:
    1. IsReadyForChangeEnsemble() returns false → return ErrNotReadyForChangeEnsemble (no NewTerm sent)
    2. Call onElectLeaderNewTerm is sent
  • Since no NewTerm was sent, path 1 was taken
  • IsReadyForChangeEnsemble() returns e.followerCaughtUp.Load(), which is false until the background goroutine completes

Fix:

  • Test: Retry ChangeEnsemble in a loop until it's accepted (not rejected with ErrNotReadyForChangeEnsemble). Applied to both TestShardController_SwapNodeWithLeaderElectionFailure and TestShardController_LeaderElectionShouldNotFailIfRemoveFails.
  • Production code: Added a warning log when the rejection happens, for observability.
  • Mock helpers: Added return after assert.Fail() in all 6 expect*Request functions to prevent nil pointer panics when the underlying timeout fires.

…derElectionFailure

Signed-off-by: Matteo Merli <mmerli@apache.org>
@merlimat merlimat merged commit b4819ad into oxia-db:main Mar 4, 2026
9 of 11 checks passed
@merlimat merlimat deleted the fix-test branch March 4, 2026 04:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant