fix(#4754): implement LifeCycle transitions so StateMachineUpdater.reload() proceeds after snapshot install#4755
Conversation
…load() can proceed after snapshot install StateMachineUpdater.reload() (Ratis 3.2.2:230) asserts getLifeCycleState()==PAUSED before calling reinitialize(). BaseStateMachine.pause() is a no-op so the lifecycle never left NEW; the assertion threw IllegalStateException, killing the updater thread and permanently closing the Raft division so the follower rejected all AppendEntries as CLOSED. Three changes to ArcadeStateMachine: - initialize(): transition lifecycle NEW->STARTING->RUNNING after super.initialize() - pause() (new override): RUNNING->PAUSING->PAUSED; idempotent if already PAUSED - reinitialize(): after restoring lastAppliedIndex, if lifecycle is PAUSED transition back PAUSED->STARTING->RUNNING so the updater resumes log apply after reload() Also fixes notifyInstallSnapshotFromLeader: it was returning firstTermIndexInLog (the first log entry AFTER the snapshot) instead of the snapshot's own TermIndex (getIndex()-1), and never updated SimpleStateMachineStorage with the installed snapshot metadata. StateMachineUpdater.reload() calls Objects.requireNonNull(getLatestSnapshot()) immediately after reinitialize(); without the storage update this would NPE.
|
Tick the box to add this pull request to the merge queue (same as
|
Up to standards ✅🟢 Issues
|
| Metric | Results |
|---|---|
| Complexity | 9 |
🟢 Coverage 35.48% diff coverage · -7.32% coverage variation
Metric Results Coverage variation ✅ -7.32% coverage variation Diff coverage ✅ 35.48% diff coverage Coverage variation details
Coverable lines Covered lines Coverage Common ancestor commit (23b1350) 132613 98512 74.29% Head commit (e2c957c) 164445 (+31832) 110113 (+11601) 66.96% (-7.32%) Coverage variation is the difference between the coverage for the head and common ancestor commits of the pull request branch:
<coverage of head commit> - <coverage of common ancestor commit>Diff coverage details
Coverable lines Covered lines Diff coverage Pull request (#4755) 31 11 35.48% Diff coverage is the percentage of lines that are covered by tests out of the coverable lines that the pull request added or modified:
<covered lines added or modified>/<coverable lines added or modified> * 100%
NEW Get contextual insights on your PRs based on Codacy's metrics, along with PR and Jira context, without leaving GitHub. Enable AI reviewer
TIP This summary will be updated as you push new changes.
There was a problem hiding this comment.
Code Review
This pull request resolves issue #4754 where a follower fails to rejoin the Raft quorum after a snapshot installation. It fixes the lifecycle transitions in ArcadeStateMachine by properly transitioning states during initialization, pausing, and reinitialization, and corrects the snapshot TermIndex calculation and metadata registration in notifyInstallSnapshotFromLeader. The review feedback highlights a potential race condition where snapshot installation could run concurrently with background recovery mechanisms, and suggests adding a defensive null check on the parent directory of the snapshot file to prevent a potential NullPointerException.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
| // Compute the installed snapshot TermIndex. firstTermIndexInLog is the first log entry | ||
| // AFTER the snapshot, so the snapshot covers all entries up to getIndex()-1. | ||
| // Returning firstTermIndexInLog itself (as the old code did) caused two bugs: | ||
| // 1. SnapshotInstallationHandler called state.reloadStateMachine(firstTermIndexInLog) which | ||
| // purged log entries up to firstTermIndexInLog.getIndex() instead of firstTermIndexInLog.getIndex()-1. | ||
| // 2. StateMachineUpdater.reload() calls getLatestSnapshot().getIndex() and expects it to match | ||
| // the TermIndex we return; returning firstTermIndexInLog while storage was never updated | ||
| // caused NullPointerException (and before that, IllegalStateException from the PAUSED check). | ||
| final long snapshotIndex = Math.max(0L, firstTermIndexInLog.getIndex() - 1); | ||
| final long snapshotTerm = firstTermIndexInLog.getTerm(); | ||
| final TermIndex installedTermIndex = TermIndex.valueOf(snapshotTerm, snapshotIndex); | ||
|
|
||
| // Register the snapshot in SimpleStateMachineStorage. StateMachineUpdater.reload() calls | ||
| // getLatestSnapshot() immediately after reinitialize() and requires a non-null result. | ||
| // We write an empty marker file (ArcadeDB's snapshot is the database files themselves, | ||
| // not a Ratis-managed snapshot file) and update the in-memory latestSnapshot reference. | ||
| final File snapshotFile = storage.getSnapshotFile(snapshotTerm, snapshotIndex); | ||
| snapshotFile.getParentFile().mkdirs(); | ||
| if (!snapshotFile.exists()) | ||
| snapshotFile.createNewFile(); | ||
| storage.updateLatestSnapshot(new SingleFileSnapshotInfo( | ||
| new FileInfo(snapshotFile.toPath(), null), snapshotTerm, snapshotIndex)); | ||
|
|
||
| // Advance the local applied-index to the snapshot point so that the StateMachineUpdater | ||
| // knows which log entries have been consumed by this install. | ||
| lastAppliedIndex.set(snapshotIndex); | ||
| updateLastAppliedTermIndex(snapshotTerm, snapshotIndex); | ||
| writePersistedAppliedIndex(snapshotIndex); | ||
|
|
||
| LogManager.instance().log(this, Level.INFO, "Full resync from leader completed (snapshotIndex=%d)", snapshotIndex); | ||
| clearDivergedState(); | ||
| return firstTermIndexInLog; | ||
| return installedTermIndex; |
There was a problem hiding this comment.
Race Condition / Concurrent Snapshot Installation Risk
There is a potential race condition between notifyInstallSnapshotFromLeader and the background watchdog/lag recovery mechanisms.
notifyInstallSnapshotFromLeader runs asynchronously via CompletableFuture.supplyAsync. However, it does not set the snapshotDownloadInProgress flag or needsSnapshotDownload flag while it is running.
If the snapshot installation takes some time, the HealthMonitor (which runs periodically) might detect that the follower is persistently lagging and call recoverFromPersistentLag(). Since isSnapshotDownloadPending() checks both needsSnapshotDownload and snapshotDownloadInProgress (both of which will be false during notifyInstallSnapshotFromLeader), it will proceed to submit triggerSnapshotDownload() to the lifecycleExecutor.
This will result in SnapshotInstaller.install(...) running concurrently with reconciler.reconcileDatabasesFromLeader(...), leading to file locking conflicts, database corruption, or installation failures.
Recommended Fix
Set snapshotDownloadInProgress to true at the start of the asynchronous task in notifyInstallSnapshotFromLeader and clear it in a finally block. This ensures that isSnapshotDownloadPending() correctly returns true and prevents any concurrent snapshot downloads.
Since the start of the supplyAsync block is outside the current diff hunk, you can implement this by wrapping the body of the lambda in notifyInstallSnapshotFromLeader like so:
return CompletableFuture.supplyAsync(() -> {
if (!snapshotDownloadInProgress.compareAndSet(false, true)) {
throw new RuntimeException("Another snapshot download or installation is already in progress");
}
try {
// ... existing installation logic ...
} finally {
snapshotDownloadInProgress.set(false);
}
});| final File snapshotFile = storage.getSnapshotFile(snapshotTerm, snapshotIndex); | ||
| snapshotFile.getParentFile().mkdirs(); | ||
| if (!snapshotFile.exists()) | ||
| snapshotFile.createNewFile(); |
There was a problem hiding this comment.
Defensive programming: snapshotFile.getParentFile() can theoretically return null if the path does not contain a parent directory component. To prevent a potential NullPointerException, we should perform a null check before calling mkdirs().
| final File snapshotFile = storage.getSnapshotFile(snapshotTerm, snapshotIndex); | |
| snapshotFile.getParentFile().mkdirs(); | |
| if (!snapshotFile.exists()) | |
| snapshotFile.createNewFile(); | |
| final File snapshotFile = storage.getSnapshotFile(snapshotTerm, snapshotIndex); | |
| final File parentDir = snapshotFile.getParentFile(); | |
| if (parentDir != null) { | |
| parentDir.mkdirs(); | |
| } | |
| if (!snapshotFile.exists()) | |
| snapshotFile.createNewFile(); |
|
Review: fix(#4754) StateMachineUpdater.reload() after snapshot install Thanks for the thorough write-up and precise root-cause analysis. The lifecycle reasoning is sound: the NEW->STARTING->RUNNING start in initialize(), the RUNNING->PAUSING->PAUSED in pause(), and the PAUSED->STARTING->RUNNING resume in reinitialize() all match Ratis's LifeCycle valid-transition table, and the idempotent pause() guard is a nice touch. Wiring updateLatestSnapshot(...) plus returning getIndex()-1 as the snapshot TermIndex also looks correct against the Ratis SnapshotInstallationHandler/reload() contract. A few things worth addressing: 1. Docs claim a test change that isn't in this PR (medium). docs/4754-...md says: 'Updated test: RaftFullSnapshotResyncIT.java - Now enabled - tests the full 3-node path'. But this PR does NOT touch RaftFullSnapshotResyncIT.java, and that test was already enabled (not @disabled) on main before this PR. The PR body's test plan instead cites Issue4749SnapshotInstallClosesRegisteredDatabaseIT. The docs and the PR body disagree. Please either include the IT change you intended, or correct the docs so they describe what actually shipped. 2. Unit test doesn't actually cover initialize() (medium). ArcadeStateMachineLifecycleTest manually drives getLifeCycle().transition(STARTING)/transition(RUNNING) to simulate initialize(), rather than invoking initialize(). So the core fix (that initialize() starts the lifecycle) is never asserted; a regression deleting those two transition lines from initialize() would leave all three tests green. The real end-to-end coverage rests on RaftFullSnapshotResyncIT, which is an IT (not run in the unit phase) and is unchanged here. Worth either asserting initialize()'s post-condition directly, or confirming the IT genuinely exercises this path in CI. 3. Snapshot term may be paired with the wrong index (minor). snapshotIndex = firstTermIndexInLog.getIndex() - 1 but snapshotTerm = firstTermIndexInLog.getTerm(). firstTermIndexInLog is the first entry AFTER the snapshot, so its term is that entry's term, yet it's attached to firstAvailable-1, which may belong to an earlier term. The resulting (term, index) pair is propagated to updateLatestSnapshot, updateLastAppliedTermIndex, and writePersistedAppliedIndex. Likely benign since the real data is reconciled from the leader, but the pairing is technically inconsistent; a short justifying comment (or deriving the correct term) would help. 4. new FileInfo(snapshotFile.toPath(), null) passes a null digest (minor). Confirm no Ratis path (e.g. getLatestSnapshot() consumers, future snapshot verification) ever tries to validate the checksum of this empty marker file; a null FileDigest could surface later as an NPE/verification failure on a path not exercised by the current tests. 5. Ignored return values (minor/style). snapshotFile.getParentFile().mkdirs() and snapshotFile.createNewFile() ignore their boolean results. The createNewFile() IOException still propagates and fails the install loudly (good), but a failed mkdirs() is silent; consider checking/logging. 6. Check-then-act on the lifecycle (minor). pause() and reinitialize() read getLifeCycleState() then call transition(...) non-atomically. Safe today because Ratis serializes install/reload on a single thread, but if that ever changes, transition() would throw an uncaught IllegalStateException. The LifeCycle atomic helpers (e.g. compareAndTransition) would be more robust. Nit: pause() doesn't call super.pause(). Fine while it's a no-op, but a future Ratis version could add behavior there. Overall a well-targeted fix with good inline documentation. Items 1 and 2 are the ones I'd most like to see resolved before merge, since they affect whether the change is actually verified end-to-end. |
…shot downloads - docs: remove incorrect claim that RaftFullSnapshotResyncIT was modified in this PR - test: add initializeStartsLifecycleToRunning using real RaftStorage + JDK proxy stub to directly assert that initialize() transitions lifecycle to RUNNING (closes review gap where manually-started lifecycle tests wouldn't catch a regression in initialize() itself) - fix: set snapshotDownloadInProgress flag during notifyInstallSnapshotFromLeader so HealthMonitor's recoverFromPersistentLag() does not start a concurrent triggerSnapshotDownload() that would race on the same database files - fix: null-check snapshotFile.getParentFile() before calling mkdirs() - docs: add inline comments explaining term/index pairing choice, null FileDigest safety, and lifecycle thread-safety invariant
PR Review: LifeCycle transitions for
|
… file safety - fix: use compareAndSet(false,true) in notifyInstallSnapshotFromLeader instead of unconditional set(true), and only clear the flag in finally if we won the CAS; prevents clearing a flag owned by a concurrently running triggerSnapshotDownload() (the previous unconditional set+clear would let a third path start a third install) - docs: explain that the empty snapshot marker file is safe across restarts because SimpleStateMachineStorage.loadLatestSnapshot() tolerates a null MD5 hash, and ArcadeDB never validates snapshot content via Ratis's chunk-verification path
Code Review - PR #4755: LifeCycle transitions for
|
…m-dashes, log mkdirs failure - pause(): add Javadoc documenting the Ratis 3.2.2 invariant that all 3 call sites of StateMachine.pause() are paired with reinitialize(); verified against source of SnapshotInstallationHandler, ServerState, and RaftServerImpl - notifyInstallSnapshotFromLeader: soften CAS comment to not overstate ordering guarantee when the CAS fails (the two concurrent installs are not serialized in that case) - notifyInstallSnapshotFromLeader: check and log if mkdirs() fails instead of silently relying on a later createNewFile() IOException - remove em-dash characters from ArcadeStateMachine.java, test, and tracking doc per CLAUDE.md
|
Code Review - PR 4755 Reviewed the lifecycle fix and the notifyInstallSnapshotFromLeader rework. Overall this is a well-targeted, carefully documented fix for a real and nasty failure mode (zombie follower after snapshot install). The root-cause analysis is convincing and the inline comments are excellent. A few points worth considering before merge. Correctness / potential issues
Test coverage
Minor / nits
Style / conventions Adheres to CLAUDE.md: final used throughout, classes imported (no FQNs), no em-dashes. The HA concurrency rules note notifyInstallSnapshotFromLeader already runs on the JDK common pool with a NOTE (concurrency) tag; this PR does not change that, which is consistent. Nice work on the analysis and documentation density. The lifecycle fix itself looks correct; my main asks are (2) the term-boundary question and (3)/(4) hardening the concurrent-install and monotonicity edges, plus a unit test around the snapshot-metadata path. Generated with Claude Code |
PR Review: fix(#4754) LifeCycle transitions for
|
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #4755 +/- ##
============================================
- Coverage 65.26% 65.25% -0.01%
Complexity 635 635
============================================
Files 1675 1675
Lines 132613 132642 +29
Branches 28366 28372 +6
============================================
+ Hits 86549 86556 +7
- Misses 34138 34160 +22
Partials 11926 11926 ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
Summary
Closes #4754
After #4749 fixed the earlier crash in
notifyInstallSnapshotFromLeader, the snapshot install now completes successfully - but thenStateMachineUpdater.reload()(Ratis 3.2.2 line 230) immediately throwsIllegalStateExceptionbecausegetLifeCycleState() != PAUSED. This kills theStateMachineUpdaterthread, closes the Raft division, and the follower permanently rejects allAppendEntriesasCLOSED- making it an unrecoverable zombie.Root cause:
BaseStateMachine.pause()is a no-op. TheLifeCycleobject inBaseStateMachinenever transitions, sogetLifeCycleState()always returnedNEW. Ratis'sSnapshotInstallationHandlercallsstateMachine.pause()before signallingRELOADmode, expecting the lifecycle to becomePAUSED- but with a no-oppause(), the check at line 230 always fails.Three changes to
ArcadeStateMachine:initialize(): transition lifecycleNEW→STARTING→RUNNINGaftersuper.initialize()so the lifecycle is in a valid state before any snapshot install can be triggered.pause()(new override): transitionRUNNING→PAUSING→PAUSED; idempotent if alreadyPAUSED, logs a WARNING for unexpected states.reinitialize(): after restoringlastAppliedIndexfrom storage, if lifecycle isPAUSED(called fromStateMachineUpdater.reload()), transitionPAUSED→STARTING→RUNNINGso the updater resumes log apply.Also fixes
notifyInstallSnapshotFromLeader:firstTermIndexInLog(the first log entry AFTER the snapshot) instead of the snapshot's ownTermIndex(getIndex()-1).SimpleStateMachineStoragewith the installed snapshot metadata.StateMachineUpdater.reload()callsObjects.requireNonNull(getLatestSnapshot())immediately afterreinitialize(); without this update it would NPE after the lifecycle fix unblocked the PAUSED check.storage.updateLatestSnapshot(...), advanceslastAppliedIndex/updateLastAppliedTermIndex/writePersistedAppliedIndexto the snapshot point, and returns the correctTermIndex.Test plan
ArcadeStateMachineLifecycleTest(new, 3 tests): verifies the exact lifecycle contract thatStateMachineUpdater.reload()relies on -pause()transitions toPAUSED,reinitialize()transitions back toRUNNING, idempotentpause().Issue4749SnapshotInstallClosesRegisteredDatabaseIT: still passes (regression for the prior fix).🤖 Generated with Claude Code