Skip to content

fix(#4754): implement LifeCycle transitions so StateMachineUpdater.reload() proceeds after snapshot install#4755

Merged
robfrank merged 5 commits into
mainfrom
fix/4754-statemachineupdater-reload-after-snapshot
Jun 27, 2026
Merged

fix(#4754): implement LifeCycle transitions so StateMachineUpdater.reload() proceeds after snapshot install#4755
robfrank merged 5 commits into
mainfrom
fix/4754-statemachineupdater-reload-after-snapshot

Conversation

@robfrank

Copy link
Copy Markdown
Collaborator

Summary

Closes #4754

After #4749 fixed the earlier crash in notifyInstallSnapshotFromLeader, the snapshot install now completes successfully - but then StateMachineUpdater.reload() (Ratis 3.2.2 line 230) immediately throws IllegalStateException because getLifeCycleState() != PAUSED. This kills the StateMachineUpdater thread, closes the Raft division, and the follower permanently rejects all AppendEntries as CLOSED - making it an unrecoverable zombie.

Root cause: BaseStateMachine.pause() is a no-op. The LifeCycle object in BaseStateMachine never transitions, so getLifeCycleState() always returned NEW. Ratis's SnapshotInstallationHandler calls stateMachine.pause() before signalling RELOAD mode, expecting the lifecycle to become PAUSED - but with a no-op pause(), the check at line 230 always fails.

Three changes to ArcadeStateMachine:

  • initialize(): transition lifecycle NEW→STARTING→RUNNING after super.initialize() so the lifecycle is in a valid state before any snapshot install can be triggered.
  • pause() (new override): transition RUNNING→PAUSING→PAUSED; idempotent if already PAUSED, logs a WARNING for unexpected states.
  • reinitialize(): after restoring lastAppliedIndex from storage, if lifecycle is PAUSED (called from StateMachineUpdater.reload()), transition PAUSED→STARTING→RUNNING so the updater resumes log apply.

Also fixes notifyInstallSnapshotFromLeader:

  • Was returning firstTermIndexInLog (the first log entry AFTER the snapshot) instead of the snapshot's own TermIndex (getIndex()-1).
  • Never updated SimpleStateMachineStorage with the installed snapshot metadata. StateMachineUpdater.reload() calls Objects.requireNonNull(getLatestSnapshot()) immediately after reinitialize(); without this update it would NPE after the lifecycle fix unblocked the PAUSED check.
  • Now writes an empty marker file and calls storage.updateLatestSnapshot(...), advances lastAppliedIndex/updateLastAppliedTermIndex/writePersistedAppliedIndex to the snapshot point, and returns the correct TermIndex.

Test plan

  • ArcadeStateMachineLifecycleTest (new, 3 tests): verifies the exact lifecycle contract that StateMachineUpdater.reload() relies on - pause() transitions to PAUSED, reinitialize() transitions back to RUNNING, idempotent pause().
  • Full ha-raft unit suite (151 tests): all pass, no regressions.
  • Issue4749SnapshotInstallClosesRegisteredDatabaseIT: still passes (regression for the prior fix).

🤖 Generated with Claude Code

…load() can proceed after snapshot install

StateMachineUpdater.reload() (Ratis 3.2.2:230) asserts getLifeCycleState()==PAUSED before
calling reinitialize(). BaseStateMachine.pause() is a no-op so the lifecycle never left NEW;
the assertion threw IllegalStateException, killing the updater thread and permanently closing
the Raft division so the follower rejected all AppendEntries as CLOSED.

Three changes to ArcadeStateMachine:
- initialize(): transition lifecycle NEW->STARTING->RUNNING after super.initialize()
- pause() (new override): RUNNING->PAUSING->PAUSED; idempotent if already PAUSED
- reinitialize(): after restoring lastAppliedIndex, if lifecycle is PAUSED transition
  back PAUSED->STARTING->RUNNING so the updater resumes log apply after reload()

Also fixes notifyInstallSnapshotFromLeader: it was returning firstTermIndexInLog
(the first log entry AFTER the snapshot) instead of the snapshot's own TermIndex
(getIndex()-1), and never updated SimpleStateMachineStorage with the installed snapshot
metadata. StateMachineUpdater.reload() calls Objects.requireNonNull(getLatestSnapshot())
immediately after reinitialize(); without the storage update this would NPE.
@mergify

mergify Bot commented Jun 27, 2026

Copy link
Copy Markdown
Contributor

Tick the box to add this pull request to the merge queue (same as @mergifyio queue).

  • Queue this pull request

@codacy-production

codacy-production Bot commented Jun 27, 2026

Copy link
Copy Markdown

Up to standards ✅

🟢 Issues 0 issues

Results:
0 new issues

View in Codacy

🟢 Metrics 9 complexity

Metric Results
Complexity 9

View in Codacy

🟢 Coverage 35.48% diff coverage · -7.32% coverage variation

Metric Results
Coverage variation -7.32% coverage variation
Diff coverage 35.48% diff coverage

View coverage diff in Codacy

Coverage variation details
Coverable lines Covered lines Coverage
Common ancestor commit (23b1350) 132613 98512 74.29%
Head commit (e2c957c) 164445 (+31832) 110113 (+11601) 66.96% (-7.32%)

Coverage variation is the difference between the coverage for the head and common ancestor commits of the pull request branch: <coverage of head commit> - <coverage of common ancestor commit>

Diff coverage details
Coverable lines Covered lines Diff coverage
Pull request (#4755) 31 11 35.48%

Diff coverage is the percentage of lines that are covered by tests out of the coverable lines that the pull request added or modified: <covered lines added or modified>/<coverable lines added or modified> * 100%

NEW Get contextual insights on your PRs based on Codacy's metrics, along with PR and Jira context, without leaving GitHub. Enable AI reviewer
TIP This summary will be updated as you push new changes.

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request resolves issue #4754 where a follower fails to rejoin the Raft quorum after a snapshot installation. It fixes the lifecycle transitions in ArcadeStateMachine by properly transitioning states during initialization, pausing, and reinitialization, and corrects the snapshot TermIndex calculation and metadata registration in notifyInstallSnapshotFromLeader. The review feedback highlights a potential race condition where snapshot installation could run concurrently with background recovery mechanisms, and suggests adding a defensive null check on the parent directory of the snapshot file to prevent a potential NullPointerException.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment on lines +701 to +732
// Compute the installed snapshot TermIndex. firstTermIndexInLog is the first log entry
// AFTER the snapshot, so the snapshot covers all entries up to getIndex()-1.
// Returning firstTermIndexInLog itself (as the old code did) caused two bugs:
// 1. SnapshotInstallationHandler called state.reloadStateMachine(firstTermIndexInLog) which
// purged log entries up to firstTermIndexInLog.getIndex() instead of firstTermIndexInLog.getIndex()-1.
// 2. StateMachineUpdater.reload() calls getLatestSnapshot().getIndex() and expects it to match
// the TermIndex we return; returning firstTermIndexInLog while storage was never updated
// caused NullPointerException (and before that, IllegalStateException from the PAUSED check).
final long snapshotIndex = Math.max(0L, firstTermIndexInLog.getIndex() - 1);
final long snapshotTerm = firstTermIndexInLog.getTerm();
final TermIndex installedTermIndex = TermIndex.valueOf(snapshotTerm, snapshotIndex);

// Register the snapshot in SimpleStateMachineStorage. StateMachineUpdater.reload() calls
// getLatestSnapshot() immediately after reinitialize() and requires a non-null result.
// We write an empty marker file (ArcadeDB's snapshot is the database files themselves,
// not a Ratis-managed snapshot file) and update the in-memory latestSnapshot reference.
final File snapshotFile = storage.getSnapshotFile(snapshotTerm, snapshotIndex);
snapshotFile.getParentFile().mkdirs();
if (!snapshotFile.exists())
snapshotFile.createNewFile();
storage.updateLatestSnapshot(new SingleFileSnapshotInfo(
new FileInfo(snapshotFile.toPath(), null), snapshotTerm, snapshotIndex));

// Advance the local applied-index to the snapshot point so that the StateMachineUpdater
// knows which log entries have been consumed by this install.
lastAppliedIndex.set(snapshotIndex);
updateLastAppliedTermIndex(snapshotTerm, snapshotIndex);
writePersistedAppliedIndex(snapshotIndex);

LogManager.instance().log(this, Level.INFO, "Full resync from leader completed (snapshotIndex=%d)", snapshotIndex);
clearDivergedState();
return firstTermIndexInLog;
return installedTermIndex;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

Race Condition / Concurrent Snapshot Installation Risk

There is a potential race condition between notifyInstallSnapshotFromLeader and the background watchdog/lag recovery mechanisms.

notifyInstallSnapshotFromLeader runs asynchronously via CompletableFuture.supplyAsync. However, it does not set the snapshotDownloadInProgress flag or needsSnapshotDownload flag while it is running.

If the snapshot installation takes some time, the HealthMonitor (which runs periodically) might detect that the follower is persistently lagging and call recoverFromPersistentLag(). Since isSnapshotDownloadPending() checks both needsSnapshotDownload and snapshotDownloadInProgress (both of which will be false during notifyInstallSnapshotFromLeader), it will proceed to submit triggerSnapshotDownload() to the lifecycleExecutor.

This will result in SnapshotInstaller.install(...) running concurrently with reconciler.reconcileDatabasesFromLeader(...), leading to file locking conflicts, database corruption, or installation failures.

Recommended Fix

Set snapshotDownloadInProgress to true at the start of the asynchronous task in notifyInstallSnapshotFromLeader and clear it in a finally block. This ensures that isSnapshotDownloadPending() correctly returns true and prevents any concurrent snapshot downloads.

Since the start of the supplyAsync block is outside the current diff hunk, you can implement this by wrapping the body of the lambda in notifyInstallSnapshotFromLeader like so:

return CompletableFuture.supplyAsync(() -> {
  if (!snapshotDownloadInProgress.compareAndSet(false, true)) {
    throw new RuntimeException("Another snapshot download or installation is already in progress");
  }
  try {
    // ... existing installation logic ...
  } finally {
    snapshotDownloadInProgress.set(false);
  }
});

Comment on lines +717 to +720
final File snapshotFile = storage.getSnapshotFile(snapshotTerm, snapshotIndex);
snapshotFile.getParentFile().mkdirs();
if (!snapshotFile.exists())
snapshotFile.createNewFile();

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Defensive programming: snapshotFile.getParentFile() can theoretically return null if the path does not contain a parent directory component. To prevent a potential NullPointerException, we should perform a null check before calling mkdirs().

Suggested change
final File snapshotFile = storage.getSnapshotFile(snapshotTerm, snapshotIndex);
snapshotFile.getParentFile().mkdirs();
if (!snapshotFile.exists())
snapshotFile.createNewFile();
final File snapshotFile = storage.getSnapshotFile(snapshotTerm, snapshotIndex);
final File parentDir = snapshotFile.getParentFile();
if (parentDir != null) {
parentDir.mkdirs();
}
if (!snapshotFile.exists())
snapshotFile.createNewFile();

@claude

claude Bot commented Jun 27, 2026

Copy link
Copy Markdown

Review: fix(#4754) StateMachineUpdater.reload() after snapshot install

Thanks for the thorough write-up and precise root-cause analysis. The lifecycle reasoning is sound: the NEW->STARTING->RUNNING start in initialize(), the RUNNING->PAUSING->PAUSED in pause(), and the PAUSED->STARTING->RUNNING resume in reinitialize() all match Ratis's LifeCycle valid-transition table, and the idempotent pause() guard is a nice touch. Wiring updateLatestSnapshot(...) plus returning getIndex()-1 as the snapshot TermIndex also looks correct against the Ratis SnapshotInstallationHandler/reload() contract.

A few things worth addressing:

1. Docs claim a test change that isn't in this PR (medium). docs/4754-...md says: 'Updated test: RaftFullSnapshotResyncIT.java - Now enabled - tests the full 3-node path'. But this PR does NOT touch RaftFullSnapshotResyncIT.java, and that test was already enabled (not @disabled) on main before this PR. The PR body's test plan instead cites Issue4749SnapshotInstallClosesRegisteredDatabaseIT. The docs and the PR body disagree. Please either include the IT change you intended, or correct the docs so they describe what actually shipped.

2. Unit test doesn't actually cover initialize() (medium). ArcadeStateMachineLifecycleTest manually drives getLifeCycle().transition(STARTING)/transition(RUNNING) to simulate initialize(), rather than invoking initialize(). So the core fix (that initialize() starts the lifecycle) is never asserted; a regression deleting those two transition lines from initialize() would leave all three tests green. The real end-to-end coverage rests on RaftFullSnapshotResyncIT, which is an IT (not run in the unit phase) and is unchanged here. Worth either asserting initialize()'s post-condition directly, or confirming the IT genuinely exercises this path in CI.

3. Snapshot term may be paired with the wrong index (minor). snapshotIndex = firstTermIndexInLog.getIndex() - 1 but snapshotTerm = firstTermIndexInLog.getTerm(). firstTermIndexInLog is the first entry AFTER the snapshot, so its term is that entry's term, yet it's attached to firstAvailable-1, which may belong to an earlier term. The resulting (term, index) pair is propagated to updateLatestSnapshot, updateLastAppliedTermIndex, and writePersistedAppliedIndex. Likely benign since the real data is reconciled from the leader, but the pairing is technically inconsistent; a short justifying comment (or deriving the correct term) would help.

4. new FileInfo(snapshotFile.toPath(), null) passes a null digest (minor). Confirm no Ratis path (e.g. getLatestSnapshot() consumers, future snapshot verification) ever tries to validate the checksum of this empty marker file; a null FileDigest could surface later as an NPE/verification failure on a path not exercised by the current tests.

5. Ignored return values (minor/style). snapshotFile.getParentFile().mkdirs() and snapshotFile.createNewFile() ignore their boolean results. The createNewFile() IOException still propagates and fails the install loudly (good), but a failed mkdirs() is silent; consider checking/logging.

6. Check-then-act on the lifecycle (minor). pause() and reinitialize() read getLifeCycleState() then call transition(...) non-atomically. Safe today because Ratis serializes install/reload on a single thread, but if that ever changes, transition() would throw an uncaught IllegalStateException. The LifeCycle atomic helpers (e.g. compareAndTransition) would be more robust.

Nit: pause() doesn't call super.pause(). Fine while it's a no-op, but a future Ratis version could add behavior there.

Overall a well-targeted fix with good inline documentation. Items 1 and 2 are the ones I'd most like to see resolved before merge, since they affect whether the change is actually verified end-to-end.

…shot downloads

- docs: remove incorrect claim that RaftFullSnapshotResyncIT was modified in this PR
- test: add initializeStartsLifecycleToRunning using real RaftStorage + JDK proxy stub to
  directly assert that initialize() transitions lifecycle to RUNNING (closes review gap where
  manually-started lifecycle tests wouldn't catch a regression in initialize() itself)
- fix: set snapshotDownloadInProgress flag during notifyInstallSnapshotFromLeader so
  HealthMonitor's recoverFromPersistentLag() does not start a concurrent triggerSnapshotDownload()
  that would race on the same database files
- fix: null-check snapshotFile.getParentFile() before calling mkdirs()
- docs: add inline comments explaining term/index pairing choice, null FileDigest safety,
  and lifecycle thread-safety invariant
@claude

claude Bot commented Jun 27, 2026

Copy link
Copy Markdown

PR Review: LifeCycle transitions for StateMachineUpdater.reload() after snapshot install

Thanks for the detailed write-up - the root-cause analysis (no-op BaseStateMachine.pause() leaving the lifecycle in NEW, and the wrong return value / missing storage update in notifyInstallSnapshotFromLeader) is clear and the fix is well targeted. The lifecycle transitions themselves (NEW->STARTING->RUNNING, RUNNING->PAUSING->PAUSED, PAUSED->STARTING->RUNNING) are all valid Ratis LifeCycle edges, so the contract StateMachineUpdater.reload() expects is now satisfied. A few things worth a look before merge:

Correctness / potential bugs

  1. Asymmetric single-flight guard on snapshotDownloadInProgress (race). notifyInstallSnapshotFromLeader now does snapshotDownloadInProgress.set(true) (unconditional) with finally { ...set(false) }, while triggerSnapshotDownload uses compareAndSet(false, true). The comment claims this prevents a concurrent install, and it does in one ordering, but not the other:

    • If triggerSnapshotDownload already won the CAS and is mid-install, notify will still set(true) (no-op) and run reconcileDatabasesFromLeader concurrently with the in-flight SnapshotInstaller.install() on the same DB files - the exact race the comment says is prevented.
    • Worse, when notify's finally runs set(false), it clears the flag the still-running triggerSnapshotDownload "owns", allowing a third path to CAS true and start yet another concurrent download.

    Ratis serializes the notify path, but the watchdog / HealthMonitor paths fire independently, so this window is real (though narrow). Consider having notify participate in the same CAS protocol (only set(false) in finally if it actually acquired the flag), or otherwise serialize the two entry points.

  2. Empty marker file left in the SimpleStateMachineStorage snapshot dir. You createNewFile() an empty file named via storage.getSnapshotFile(term, index) and register it through updateLatestSnapshot. updateLatestSnapshot only sets the in-memory reference, but on the next restart SimpleStateMachineStorage.getLatestSnapshot() re-scans that directory and parses files by name (and may attempt to read/verify them). Worth confirming an empty, digest-less file there does not trip recovery (e.g. an md5 read or chunk verification) on a subsequent restart. If it can, a short note or a follow-up cleanup is warranted.

  3. close() doesn't transition the lifecycle. Now that initialize() drives the lifecycle to RUNNING/PAUSED, close() only shuts the executor and calls super.close(), leaving the lifecycle non-CLOSED. Probably benign (BaseStateMachine does not assert here), but a CLOSING->CLOSED transition would keep the state consistent and avoid surprises if Ratis ever inspects it.

Style / minor

  • pause() reads getLifeCycleState() then calls transition() in two steps - fine because Ratis calls it single-threaded, but if you want to be defensive against an unexpected concurrent transition, LifeCycle offers compare-and-transition helpers.
  • parentDir.mkdirs() return value is ignored; if it fails, createNewFile() throws IOException (caught), so behavior is acceptable - just noting it relies on the later failure.
  • The snapshot-term choice (firstTermIndexInLog.getTerm() as an upper bound) is well documented and only affects metadata - good call-out in the comment.

Test coverage

  • The new ArcadeStateMachineLifecycleTest cleanly nails the lifecycle contract (pause -> PAUSED, reinitialize -> RUNNING, idempotent pause, initialize via a proxy RaftServer). Nice.
  • Gap: none of the notifyInstallSnapshotFromLeader changes are unit-covered - the TermIndex math (getIndex()-1, the Math.max(0,...) floor), the storage updateLatestSnapshot write, and the applied-index advancement all rely solely on the IT. Given the bug history here, a focused unit test asserting the returned TermIndex and that getLatestSnapshot() is non-null after the call would be valuable and cheap.
  • Doc/PR-description mismatch: the PR body says "3 tests" but the test class and docs/...md say 4. Worth aligning.

Overall a solid, well-reasoned fix for an unrecoverable-zombie scenario. The race in point (1) is the main item I'd want resolved (or explicitly argued safe) before merge.

… file safety

- fix: use compareAndSet(false,true) in notifyInstallSnapshotFromLeader instead of
  unconditional set(true), and only clear the flag in finally if we won the CAS;
  prevents clearing a flag owned by a concurrently running triggerSnapshotDownload()
  (the previous unconditional set+clear would let a third path start a third install)
- docs: explain that the empty snapshot marker file is safe across restarts because
  SimpleStateMachineStorage.loadLatestSnapshot() tolerates a null MD5 hash, and
  ArcadeDB never validates snapshot content via Ratis's chunk-verification path
@claude

claude Bot commented Jun 27, 2026

Copy link
Copy Markdown

Code Review - PR #4755: LifeCycle transitions for StateMachineUpdater.reload()

Thanks for the detailed write-up - the root-cause analysis (no-op BaseStateMachine.pause() leaving the lifecycle in NEW, so the getLifeCycleState() == PAUSED precondition in StateMachineUpdater.reload() throws and zombifies the follower) is clear and the fix is well targeted. The inline comments and the docs/ note make the intent easy to follow. A few things worth a closer look before merge.

1. pause() override must be paired with a resume on every path (most important)

The override now performs a real RUNNING -> PAUSING -> PAUSED transition, but the only place that transitions back to RUNNING is reinitialize(), which Ratis only calls from reload(). So the correctness of this change rests entirely on the invariant:

Ratis never calls StateMachine.pause() except immediately before a reload() that calls reinitialize().

If Ratis 3.2.2 calls pause() on any other path (e.g. some snapshot/takeSnapshot or shutdown bookkeeping flow) that does not lead to reinitialize(), the state machine gets stuck in PAUSED permanently - arguably worse than the current bug because it is silent. The previous no-op pause() was tolerant of being called anywhere. Could you confirm (ideally cite the call sites in SnapshotInstallationHandler / RaftServerImpl) that pause() in 3.2.2 is only ever invoked on the reload path? If it can be called elsewhere, consider resuming more defensively rather than only inside reinitialize().

2. The single-flight comment overstates the ordering guarantee

In notifyInstallSnapshotFromLeader, when the CAS loses (a triggerSnapshotDownload() is already running), the method still proceeds to reconciler.reconcileDatabasesFromLeader(...). The comment claims the in-flight watchdog download "will have installed current data before we call reconcileDatabasesFromLeader()" - but nothing synchronizes the two; both can be installing/swapping the same databases concurrently. This was already possible before this PR (the old code never touched the flag), so it is not a regression, but the comment reads as if ordering is guaranteed when it is not. I'd soften the comment, or better, have the notify path wait for / short-circuit on the in-flight download instead of racing it.

3. Snapshot term is approximated across a possible term boundary

snapshotTerm = firstTermIndexInLog.getTerm() is used as the term for the entry at getIndex() - 1. If a term boundary falls exactly there, the recorded term is off. You've documented that this only feeds the marker-file name and Ratis snapshot-index metadata, which is fair - but worth confirming Ratis never compares this term against the log when deciding what to purge in reloadStateMachine(...), since a mismatch there could affect log truncation.

4. Ignored mkdirs() / createNewFile() return values

if (parentDir != null) parentDir.mkdirs();
if (!snapshotFile.exists()) snapshotFile.createNewFile();

Both return booleans that are dropped. If either fails (permissions, disk full), storage.updateLatestSnapshot(...) registers a SingleFileSnapshotInfo pointing at a file that does not exist, and the failure only surfaces later and confusingly. Consider checking the results and failing fast, consistent with the SEVERE handling already in the catch block.

5. Test coverage gap on the notifyInstallSnapshotFromLeader changes

ArcadeStateMachineLifecycleTest nicely locks in the pause()/reinitialize() contract. But the other half of this PR - the Math.max(0, getIndex()-1) index computation, the marker-file creation, the updateLatestSnapshot(...) call, and the CAS-flag acquire/finally-clear semantics - has no direct unit test; it relies solely on the existing IT. The index math (the off-by-one was the prior bug) and the flag ownership logic are exactly what a focused unit test should pin down. Worth adding one even if it needs small storage stubbing.

6. Minor

  • Em dash in committed code/docs: ArcadeStateMachine.java:731 and :746 (and the new docs/ file) use the em-dash character. CLAUDE.md asks to avoid it; use a normal dash.
  • Doc/PR count drift: the PR description says "3 tests" and "151 tests", while docs/4754-...md says "4 tests" and "153 tests". The test file actually has 4 methods. Worth reconciling.
  • Lifecycle start ordering in initialize(): the STARTING -> RUNNING transitions happen before storage.init() / reinitialize(). If storage.init() throws, the lifecycle is left RUNNING on a half-initialized SM. Minor, but transitioning only after storage/reinit succeeds would be cleaner.

Summary

Solid, well-documented fix for a real zombie-follower bug, and the lifecycle test is a good regression guard. The blocking question is #1 - please verify the pause()-without-reinitialize() invariant holds in Ratis 3.2.2, since the whole fix depends on it. #4 and #5 are low-effort hardening I'd recommend before merge; #2/#3/#6 are clarifications/nits.

Review by Claude (Opus 4.8). I could not unpack the Ratis 3.2.2 jar in this environment to confirm the pause() call sites myself, so please treat #1 and #3 as verification requests rather than confirmed defects.

…m-dashes, log mkdirs failure

- pause(): add Javadoc documenting the Ratis 3.2.2 invariant that all 3 call sites
  of StateMachine.pause() are paired with reinitialize(); verified against source of
  SnapshotInstallationHandler, ServerState, and RaftServerImpl
- notifyInstallSnapshotFromLeader: soften CAS comment to not overstate ordering guarantee
  when the CAS fails (the two concurrent installs are not serialized in that case)
- notifyInstallSnapshotFromLeader: check and log if mkdirs() fails instead of silently
  relying on a later createNewFile() IOException
- remove em-dash characters from ArcadeStateMachine.java, test, and tracking doc per CLAUDE.md
@claude

claude Bot commented Jun 27, 2026

Copy link
Copy Markdown

Code Review - PR 4755

Reviewed the lifecycle fix and the notifyInstallSnapshotFromLeader rework. Overall this is a well-targeted, carefully documented fix for a real and nasty failure mode (zombie follower after snapshot install). The root-cause analysis is convincing and the inline comments are excellent. A few points worth considering before merge.

Correctness / potential issues

  1. close() no longer transitions the lifecycle out of RUNNING/PAUSED. Before this PR the lifecycle stayed in NEW for the whole life of the object; now initialize() drives it to RUNNING. ArcadeStateMachine.close() only does lifecycleExecutor.shutdownNow() + super.close(), and BaseStateMachine.close() does not move the lifecycle. So a closed state machine is left in RUNNING (or PAUSED). Harmless today since nothing asserts CLOSED, but it is an inconsistency that could bite a future caller. Consider transitioning RUNNING/PAUSING/PAUSED -> CLOSING -> CLOSED in close() (guarded, like pause() is) to keep the lifecycle honest.

  2. Snapshot term is the term of the entry AFTER the snapshot, not at the snapshot index. snapshotTerm = firstTermIndexInLog.getTerm() is the term of index snapshotIndex+1. The reported snapshot TermIndex becomes (termOf(snapshotIndex+1), snapshotIndex). In the common case (no term boundary exactly between snapshotIndex and snapshotIndex+1) this equals the true term at snapshotIndex, but if a leader change landed exactly at that boundary the reported term is off by a term. The comment frames this as a safe upper bound that does not affect data correctness - true for ArcadeDBs HTTP file-shipping data path, but the value is also what Ratis stores as lastIncludedTerm and uses for the AppendEntries consistency check at prevLogIndex = snapshotIndex. Worth confirming this cannot cause a reject/decrement loop on the next heartbeat in the term-boundary case. If firstTermIndexInLog can expose the prior entrys term, that would be the more correct value to store.

  3. Residual concurrent-install race when the CAS is lost. When compareAndSet(false, true) fails, the method still proceeds to reconcileDatabasesFromLeader() while another triggerSnapshotDownload() is installing the SAME database files concurrently. The comment acknowledges this and leans on SnapshotInstaller being crash-safe with atomic directory swaps, but two installs racing on the same target/temp staging is a stronger claim than crash-safety. Since notifyInstallSnapshotFromLeader must return a valid TermIndex (it cannot simply skip), a follow-up that serializes on a per-database mutex or awaits the in-flight download would be safer than running both. At minimum this residual race deserves a tracked issue rather than only an inline comment.

  4. Applied-index monotonicity. lastAppliedIndex.set(snapshotIndex) + updateLastAppliedTermIndex(snapshotTerm, snapshotIndex) assume the snapshot point is at or ahead of the current applied index. For a lagging follower that holds, but if an install is ever triggered when local apply is already past snapshotIndex, BaseStateMachine.updateLastAppliedTermIndex can assert on a non-monotonic update. A guard (only advance if snapshotIndex > current) would make this robust.

Test coverage

  • The new ArcadeStateMachineLifecycleTest nicely pins the lifecycle contract, and initializeStartsLifecycleToRunning closes the obvious regression gap in initialize() itself. Good.
  • However, the most bug-prone new code - the notifyInstallSnapshotFromLeader metadata registration (term/index computation, storage.updateLatestSnapshot, marker-file creation, the CAS finally block) - has no unit coverage. The PR relies on the existing IT end-to-end. A focused unit test driving notifyInstallSnapshotFromLeader against a real SimpleStateMachineStorage and asserting (a) getLatestSnapshot() is non-null with the expected (term, index), (b) the returned TermIndex equals firstTermIndexInLog.getIndex()-1, and (c) the flag is cleared in finally only when acquired, would protect exactly what this PR changes. The firstTermIndexInLog.getIndex() == 0 -> Math.max(0, -1) edge is also untested.

Minor / nits

  • snapshotFile.createNewFile() return value is ignored (you do check mkdirs()). Low risk since IOException is caught, but logging a false return would be consistent with the mkdirs() handling.
  • pause() logging an unexpected state and silently skipping is reasonable, but if it ever fires the subsequent StateMachineUpdater.reload() precondition still throws - so that WARNING is the only breadcrumb before a hard failure. Just flagging that the WARNING is load-bearing for diagnosis.
  • Keep the doc test counts (docs/4754-...md) in sync if more tests are added later.

Style / conventions

Adheres to CLAUDE.md: final used throughout, classes imported (no FQNs), no em-dashes. The HA concurrency rules note notifyInstallSnapshotFromLeader already runs on the JDK common pool with a NOTE (concurrency) tag; this PR does not change that, which is consistent.

Nice work on the analysis and documentation density. The lifecycle fix itself looks correct; my main asks are (2) the term-boundary question and (3)/(4) hardening the concurrent-install and monotonicity edges, plus a unit test around the snapshot-metadata path.

Generated with Claude Code

@claude

claude Bot commented Jun 27, 2026

Copy link
Copy Markdown

PR Review: fix(#4754) LifeCycle transitions for StateMachineUpdater.reload()

Thorough, well-documented fix. The root-cause analysis (no-op BaseStateMachine.pause() leaving the lifecycle in NEW, so Ratis's reload() precondition getLifeCycleState() == PAUSED throws and zombifies the follower) is convincing, and the three-part lifecycle fix (initialize start, pause override, reinitialize resume) matches the contract Ratis drives. Code style follows CLAUDE.md (braceless single-statement ifs, final params/locals, no System.out). A few items below.

Test coverage (main gap)

  • The new ArcadeStateMachineLifecycleTest only exercises the pause() / reinitialize() lifecycle contract. The riskier changes in notifyInstallSnapshotFromLeader() have no direct test: the snapshot TermIndex computation (getIndex()-1, Math.max(0L, …)), the marker-file creation + storage.updateLatestSnapshot(...), the advance of lastAppliedIndex/writePersistedAppliedIndex, and the CAS single-flight finally. These are exactly the parts most likely to regress (filesystem + return-value correctness). Consider a test that invokes notifyInstallSnapshotFromLeader (with a stubbed reconciler) and asserts getLatestSnapshot() is non-null and the returned TermIndex equals (term, firstIndex-1).
  • Minor: the PR body says "3 tests" but the file has 4 test methods (and the docs file says 4). Worth reconciling.

Correctness / robustness

  1. pause() in an unexpected state doesn't actually prevent the crash. When current is neither RUNNING nor PAUSED (e.g. NEW), the method logs a WARNING and skips the transition - but StateMachineUpdater.reload() then still asserts == PAUSED and throws IllegalStateException, reproducing the very failure this PR fixes (just with a log line first). The javadoc claims this avoids "crashing the caller", which is only true for the already-PAUSED case. Please soften the javadoc so it isn't read as a guarantee.

  2. Snapshot term is an upper bound (firstTermIndexInLog.getTerm()). The true term of the entry at snapshotIndex = firstIndex-1 can be lower than the first-remaining entry's term if a term boundary sits exactly there. The comment asserts this term only feeds file naming / metadata. Please double-check that reloadStateMachine(...) / log-purge in Ratis 3.2.2 never uses this term for log matching - if it does, an inflated term could mis-purge. Low risk, but it's the one value here that is a deliberate approximation.

  3. Residual concurrent-install race (already documented). If the CAS loses (acquiredSnapshotFlag == false), the method still runs reconcileDatabasesFromLeader() concurrently with the in-flight triggerSnapshotDownload(), both writing the same DB directories. You rely on SnapshotInstaller's atomic swap, but two installs racing on the same target can still interleave swaps. The win/lose ownership of the finally clear is correct; the deferral is reasonable - just make sure a follow-up issue tracks closing this rather than leaving it only in a code comment.

  4. Empty marker file persists across restarts. loadLatestSnapshot() will rediscover this zero-byte file. The comment argues ArcadeDB never uses Ratis chunk verification, so it's safe. That holds only while the notification-based install path is the only one in use; if a config ever enables the chunk-based path, streaming an empty snapshot to a peer would be wrong. Worth a guard or at least a pointer to where that assumption is enforced.

Minor

  • snapshotFile.createNewFile() return value is ignored (acceptable - guarded by exists() and IOException flows to the outer catch).
  • The docs/4754-…md file embeds a "Review Cycles" table (SHAs, bot outcomes, "TIMEOUT") and a "Final State" note. That's process metadata, not durable design docs - consider trimming those sections before merge so the file stays useful to future readers.

Performance / security

No concerns. The added work (one mkdirs + empty-file create per snapshot install) runs on a rare, already-async path. No sensitive data logged.

Overall: solid, narrowly-scoped fix with good inline rationale. The chief ask is direct test coverage for the notifyInstallSnapshotFromLeader changes, plus the small pause() javadoc clarification.

@robfrank robfrank merged commit c04b9fb into main Jun 27, 2026
21 of 25 checks passed
@robfrank robfrank deleted the fix/4754-statemachineupdater-reload-after-snapshot branch June 27, 2026 19:31
@codecov

codecov Bot commented Jun 27, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 32.25806% with 21 lines in your changes missing coverage. Please review.
✅ Project coverage is 65.25%. Comparing base (23b1350) to head (e2c957c).
⚠️ Report is 1 commits behind head on main.

Files with missing lines Patch % Lines
...om/arcadedb/server/ha/raft/ArcadeStateMachine.java 32.25% 20 Missing and 1 partial ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main    #4755      +/-   ##
============================================
- Coverage     65.26%   65.25%   -0.01%     
  Complexity      635      635              
============================================
  Files          1675     1675              
  Lines        132613   132642      +29     
  Branches      28366    28372       +6     
============================================
+ Hits          86549    86556       +7     
- Misses        34138    34160      +22     
  Partials      11926    11926              

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

HA: follower never rejoins — StateMachineUpdater.reload() throws IllegalStateException immediately after a successful snapshot install

1 participant