feat: refactor follower controller for better lifecycle management#887
feat: refactor follower controller for better lifecycle management#887mattisonchao merged 17 commits intomainfrom
Conversation
…liability - Add context-aware Send/Recv to MockServerReplicateStream with Cancel() to prevent goroutine leaks in tests - Add double-check pattern on closed state after acquiring locks in follower controller to prevent race conditions during shutdown - Separate LogSynchronizer.Sync() and Close() into distinct operations, adding SyncAndClose() for the AppendEntries call path - Migrate error returns from constant.ErrAlreadyClosed/ErrInvalidTerm/ ErrInvalidStatus to dserror equivalents for consistent error handling - Allow idempotent NewTerm calls during negotiations (term == current is no longer rejected) - Fix InstallSnapshot to assign db before SendAndClose to avoid using a nil db reference on success path Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Code Review AnalysisCritical1. Potential deadlock between 2. Race condition: term check before lock in High3. 4. Medium5. Unbuffered signal channels cause lost notifications 6. Low7. 8. Test |
…m and onAppend - Change fc.status from proto.ServingStatus to atomic.Int32 to allow lock-free status updates from the LogSynchronizer's onAppend callback - Rename becomeFollower to onAppend in LogSynchronizer - Simplify Status() and GetStatus() to read directly from atomic
…revent term regression
- Protect fc.db access in processCommitRequest with RLock to prevent race with InstallSnapshot swapping the database - Recover database on failed snapshot install instead of leaving follower in a broken state with closed db - Check ReadTerm error in initDatabase (was silently ignored due to variable shadowing) - Reorganize struct fields with clear documentation of locking requirements: invariants, atomic state, and guarded resources
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The follower controller now returns domain errors (dserror package) instead of gRPC status errors. Add toGRPCError conversion at the gRPC handler boundary so remote callers (e.g. coordinator) can still match on expected gRPC error codes. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Make stateApplierCond a buffered channel (size 1) to prevent missed signals - Add rwMutex.RLock in test helper to avoid data race on logSynchronizer access - Remove stale comment in log_synchronizer append path Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The syncCond channel between bgAppender and bgSyncer was unbuffered. Combined with PushNoBlock (non-blocking send), if bgSyncer hadn't reached its select when bgAppender pushed, the signal was silently dropped, causing bgSyncer to never wake up and the test to hang. This is the same class of bug fixed in the previous commit for stateApplierCond. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Move UpdateTerm and atomic offset stores before SendAndClose to prevent term loss if the node crashes between sending the response and persisting the term. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The snapshot DB already contains the leader's persisted term and TermOptions. Reading them via initDatabase is sufficient. The explicit UpdateTerm call was overwriting the leader's TermOptions (including notification settings) with empty defaults. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
Updated review after detailed analysis against Items from the original review — status:
Two regressions found and fixed:
|
Update prepareTestDb to store the term in the snapshot database, reflecting that real leader snapshots already contain the persisted term. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…se (#893) ### Motivation `TestCoordinatorE2E` is flaky due to a data race between the WAL `runSync` goroutine calling `segment.Flush()` (which reads the mmap) and `wal.Close()` calling `segment.Close()` → `MMap.Unmap()` (which writes/invalidates the mmap). The race was latent in the WAL code and surfaced after the follower controller lifecycle refactor in #887 changed shutdown timing. CI failure: https://github.com/oxia-db/oxia/actions/runs/22312993235/job/64549930899 ### Modification - Add `RLock`/`RUnlock` to `readWriteSegment.Flush()` to synchronize against the write lock already held by `Close()`. This ensures either `Flush()` completes before `Unmap()` begins, or `Unmap()` completes first and `Flush()` sees `txnMappedFile == nil` and returns safely.
Motivation
The follower controller had fragile lifecycle management with race conditions during shutdown and tightly coupled synchronization logic.
Modification
followerControllerto use atomic state andsync.RWMutexwith double-check locking for safe concurrent access.LogSynchronizerinto its own file for cleaner replication stream handling.dserrorpackage for domain errors; convert to gRPC status errors at the server boundary.NewTermidempotent during leader negotiations.InstallSnapshotto recover the database on failure and validate term before destructive operations.MockServerReplicateStreamcontext-aware to prevent goroutine leaks in tests.