feat: refactor follower controller for better lifecycle management by mattisonchao · Pull Request #887 · oxia-db/oxia

mattisonchao · 2026-02-20T15:00:31Z

Motivation

The follower controller had fragile lifecycle management with race conditions during shutdown and tightly coupled synchronization logic.

Modification

Refactor followerController to use atomic state and sync.RWMutex with double-check locking for safe concurrent access.
Extract LogSynchronizer into its own file for cleaner replication stream handling.
Introduce dserror package for domain errors; convert to gRPC status errors at the server boundary.
Make NewTerm idempotent during leader negotiations.
Fix InstallSnapshot to recover the database on failure and validate term before destructive operations.
Make MockServerReplicateStream context-aware to prevent goroutine leaks in tests.

…ement

…liability - Add context-aware Send/Recv to MockServerReplicateStream with Cancel() to prevent goroutine leaks in tests - Add double-check pattern on closed state after acquiring locks in follower controller to prevent race conditions during shutdown - Separate LogSynchronizer.Sync() and Close() into distinct operations, adding SyncAndClose() for the AppendEntries call path - Migrate error returns from constant.ErrAlreadyClosed/ErrInvalidTerm/ ErrInvalidStatus to dserror equivalents for consistent error handling - Allow idempotent NewTerm calls during negotiations (term == current is no longer rejected) - Fix InstallSnapshot to assign db before SendAndClose to avoid using a nil db reference on success path Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…hronizer

mattisonchao · 2026-02-20T15:06:06Z

Code Review Analysis

Critical

1. Potential deadlock between NewTerm and becomeFollower callback
follower_controller.go — The becomeFollower callback (line 275-279) acquires fc.rwMutex.Lock() inside the bgAppender goroutine. Meanwhile, NewTerm() acquires fc.rwMutex.Lock() (line 297), then calls fc.logSynchronizer.Close() which calls ls.waitGroup.Wait(). If bgAppender is inside becomeFollower() waiting for the lock, and NewTerm holds the lock waiting for bgAppender to finish — deadlock.

2. Race condition: term check before lock in NewTerm allows term regression
follower_controller.go:293-297 — The term comparison newTerm < fc.term.Load() happens before acquiring rwMutex. Between the check and the lock, another concurrent NewTerm could set a higher term. The stale check then allows overwriting with a lower term, violating the monotonically-increasing term invariant.

High

3. Delete() uses WAL and DB after Close() already closed them
follower_controller.go:643-652 — Close() on line 643 closes fc.wal and fc.db, then lines 651-652 call fc.wal.Delete() and fc.db.Delete() on the already-closed resources. This could panic or silently fail to delete data.

4. initDatabase silently ignores ReadTerm error
follower_controller.go:125 — The error from db.ReadTerm() is never checked. The := shadows the named return err, so execution continues with potentially undefined term/options values.

Medium

5. Unbuffered signal channels cause lost notifications
follower_controller.go:192 — stateApplierCond is created with zero buffer. Signals sent via PushNoBlock are dropped when the receiver is busy. Same issue with syncCond in log_synchronizer.go:166. This can delay entry application or WAL sync indefinitely if no new appends arrive.

6. InstallSnapshot holds write lock during entire stream receive
follower_controller.go:483-484 — The lock is held for the entire snapshot transfer including all stream.Recv() calls, blocking GetStatus, NewTerm, Truncate, and AppendEntries for potentially minutes.

Low

7. InstallSnapshot does not update fc.status on success — Status remains whatever it was before (e.g. FENCED), which may prevent accepting append entries without an additional NewTerm cycle.

8. Test TestFollower_DisconnectLeader reuses an already-cancelled stream (line 851-854) — The test passes for the wrong reason since the stream context is already cancelled before AppendEntries is called.

oxiad/dataserver/controller/follow/follower_controller.go

…m and onAppend - Change fc.status from proto.ServingStatus to atomic.Int32 to allow lock-free status updates from the LogSynchronizer's onAppend callback - Rename becomeFollower to onAppend in LogSynchronizer - Simplify Status() and GetStatus() to read directly from atomic

oxiad/dataserver/controller/follow/follower_controller.go

…revent term regression

oxiad/dataserver/controller/follow/follower_controller.go

- Protect fc.db access in processCommitRequest with RLock to prevent race with InstallSnapshot swapping the database - Recover database on failed snapshot install instead of leaving follower in a broken state with closed db - Check ReadTerm error in initDatabase (was silently ignored due to variable shadowing) - Reorganize struct fields with clear documentation of locking requirements: invariants, atomic state, and guarded resources

oxiad/dataserver/controller/follow/follower_controller.go

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

The follower controller now returns domain errors (dserror package) instead of gRPC status errors. Add toGRPCError conversion at the gRPC handler boundary so remote callers (e.g. coordinator) can still match on expected gRPC error codes. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

- Make stateApplierCond a buffered channel (size 1) to prevent missed signals - Add rwMutex.RLock in test helper to avoid data race on logSynchronizer access - Remove stale comment in log_synchronizer append path Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

The syncCond channel between bgAppender and bgSyncer was unbuffered. Combined with PushNoBlock (non-blocking send), if bgSyncer hadn't reached its select when bgAppender pushed, the signal was silently dropped, causing bgSyncer to never wake up and the test to hang. This is the same class of bug fixed in the previous commit for stateApplierCond. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Move UpdateTerm and atomic offset stores before SendAndClose to prevent term loss if the node crashes between sending the response and persisting the term. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

The snapshot DB already contains the leader's persisted term and TermOptions. Reading them via initDatabase is sufficient. The explicit UpdateTerm call was overwriting the leader's TermOptions (including notification settings) with empty defaults. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

mattisonchao · 2026-02-22T13:25:15Z

Updated review after detailed analysis against main:

Items from the original review — status:

Deadlock between NewTerm and onAppend — Fixed in 4883d79 (status is now atomic, no lock in callback).
Term regression race in NewTerm — Fixed in 06c2cb8 (double-check after lock acquisition).
Delete() uses WAL/DB after Close() — Same pattern as main (original also closed before deleting via saved references). Acceptable.
initDatabase ignores ReadTerm error — Not an issue, error is checked on line 128.
Unbuffered signal channels — Fixed in 224bfc1 and c8d6299 (channels are now buffered).
InstallSnapshot holds lock during stream — By design; same as main which held sync.Mutex throughout handleSnapshot.
InstallSnapshot doesn't update status — Acceptable; coordinator always calls NewTerm after snapshot.
Test reuses cancelled stream — Fixed in test changes (new stream created after NewTerm).

Two regressions found and fixed:

InstallSnapshot was persisting term after sending ack — crash could lose term (037bd6d).
InstallSnapshot was overwriting TermOptions with empty defaults — removed redundant UpdateTerm since snapshot DB already contains the leader's term and options (54ea495).

Update prepareTestDb to store the term in the snapshot database, reflecting that real leader snapshots already contain the persisted term. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…se (#893) ### Motivation `TestCoordinatorE2E` is flaky due to a data race between the WAL `runSync` goroutine calling `segment.Flush()` (which reads the mmap) and `wal.Close()` calling `segment.Close()` → `MMap.Unmap()` (which writes/invalidates the mmap). The race was latent in the WAL code and surfaced after the follower controller lifecycle refactor in #887 changed shutdown timing. CI failure: https://github.com/oxia-db/oxia/actions/runs/22312993235/job/64549930899 ### Modification - Add `RLock`/`RUnlock` to `readWriteSegment.Flush()` to synchronize against the write lock already held by `Close()`. This ensures either `Flush()` completes before `Unmap()` begins, or `Unmap()` completes first and `Flush()` sees `txnMappedFile == nil` and returns safely.

mattisonchao and others added 4 commits February 12, 2026 15:45

feat: refactor the follower controller to have better lifecycle manag…

b35e216

…ement

step commit

2f77db6

fix: treat io.EOF from stream.Recv() as normal termination in LogSync…

5dfe857

…hronizer

mattisonchao requested review from RobertIndie, coderzc and merlimat as code owners February 20, 2026 15:00

Merge branch 'main' into feat.refactor.lifecycle

6528389

github-code-quality bot found potential problems Feb 20, 2026

View reviewed changes

oxiad/dataserver/controller/follow/follower_controller.go Fixed Show fixed Hide fixed

oxiad/dataserver/controller/follow/follower_controller.go Fixed Show fixed Hide fixed

github-code-quality bot found potential problems Feb 20, 2026

View reviewed changes

oxiad/dataserver/controller/follow/follower_controller.go Fixed Show fixed Hide fixed

fix: add double-check for term after lock acquisition in NewTerm to p…

06c2cb8

…revent term regression

github-code-quality bot found potential problems Feb 20, 2026

View reviewed changes

oxiad/dataserver/controller/follow/follower_controller.go Fixed Show fixed Hide fixed

mattisonchao self-assigned this Feb 22, 2026

github-code-quality bot found potential problems Feb 22, 2026

View reviewed changes

oxiad/dataserver/controller/follow/follower_controller.go Fixed Show fixed Hide fixed

mattisonchao and others added 8 commits February 22, 2026 15:12

fix: resolve lint and license issues

9693ad6

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

fix: add license header to errors.go

c360238

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

fix: log state applier error and remove superfluous io.Closer embedding

34eb5fc

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

fix: persist snapshot term before sending ack to coordinator

037bd6d

Move UpdateTerm and atomic offset stores before SendAndClose to prevent term loss if the node crashes between sending the response and persisting the term. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

fix: persist term in test snapshot DB to match production behavior

9a72ae2

Update prepareTestDb to store the term in the snapshot database, reflecting that real leader snapshots already contain the persisted term. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

mattisonchao merged commit 1724010 into main Feb 22, 2026
9 checks passed

mattisonchao deleted the feat.refactor.lifecycle branch February 22, 2026 14:12

mattisonchao mentioned this pull request Feb 23, 2026

fix: add read lock to WAL segment Flush to prevent data race with Close #893

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: refactor follower controller for better lifecycle management#887

feat: refactor follower controller for better lifecycle management#887
mattisonchao merged 17 commits intomainfrom
feat.refactor.lifecycle

mattisonchao commented Feb 20, 2026 •

edited

Loading

Uh oh!

mattisonchao commented Feb 20, 2026

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

mattisonchao commented Feb 22, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

mattisonchao commented Feb 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Motivation

Modification

Uh oh!

mattisonchao commented Feb 20, 2026

Code Review Analysis

Critical

High

Medium

Low

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

mattisonchao commented Feb 22, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

mattisonchao commented Feb 20, 2026 •

edited

Loading