Summary
On a 3-node Raft HA cluster, when a follower performs a full snapshot resync, the snapshot install now completes ("Full resync from leader completed"), but immediately afterwards Ratis's StateMachineUpdater.reload() fails a precondition and throws java.lang.IllegalStateException. This kills the StateMachineUpdater thread, the division transitions to CLOSING/CLOSED, and the follower then rejects all AppendEntries/heartbeats with ServerNotReadyException: ... current state is CLOSED.
The follower ends up as a zombie replica: the snapshot data is present and queryable, but its Raft division is dead, its commit index is frozen (the leader reports a fixed lag, e.g. 21125), and it never rejoins quorum. It does not self-heal, and because arcadedb.ha.raftPersistStorage defaults to false, restarting the follower forces another full snapshot install, which hits this same assertion again — so the follower is unrecoverable on this build.
This is the next failure on the resync path after #4749: with #4749 fixed, the install no longer crashes on ServerDatabase.close(), so it now runs to completion and trips this reload precondition instead.
Version / environment
- … (arcadesplit-0/1/2), Raft, Docker, -Xmx6g per node
- … TestBigDB and the internal graph
Symptom
Follower (arcadesplit-2) log — install succeeds, then the assertion fires in the very next millisecond:
[ArcadeStateMachine] Snapshot installation requested from leader (firstLogIndex=(t:1, i:99974)). Starting full resync...
[ArcadeStateMachine] Installing snapshot for database 'TestBigDB' from leader arcadesplit-0:2480...
[ArcadeStateMachine] Installing snapshot for database 'graph' from leader arcadesplit-0:2480...
[ArcadeStateMachine] Full resync from leader completed
SEVER [StateMachineUpdater] arcadesplit-2_2434@group-…-StateMachineUpdater caught a Throwable.
java.lang.IllegalStateException
at org.apache.ratis.util.Preconditions.assertTrue(Preconditions.java:35)
at org.apache.ratis.server.impl.StateMachineUpdater.reload(StateMachineUpdater.java:230)
at org.apache.ratis.server.impl.StateMachineUpdater.run(StateMachineUpdater.java:191)
at java.base/java.lang.Thread.run(Thread.java:1583)
Then, repeating forever, the division is closed and rejects replication:
SEVER [RaftServer$Division] arcadesplit-2_2434@group-…: Failed appendEntries* arcadesplit-0_2434->arcadesplit-2_2434 …,leaderCommit=99974,…,HEARTBEAT
org.apache.ratis.protocol.exceptions.ServerNotReadyException: arcadesplit-2_2434@group-… is not in [STARTING, RUNNING]: current state is CLOSING (then CLOSED)
Cluster status confirms the zombie state — node-2 holds the data but its lag never moves:
arcadesplit-0 (leader): replicaLags { arcadesplit-1_2434: 0, arcadesplit-2_2434: 21125 }
all three nodes report the same Transaction edge count (no writes since the break)
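The "lag never moves" signature above can be checked mechanically. The following is a minimal sketch, assuming periodic samples of one follower's lag have already been collected (for example from the leader's replicaLags output). StuckFollowerDetector and looksStuck are hypothetical names for illustration, not part of ArcadeDB or Ratis.

```java
/** Hypothetical helper: flags a follower whose replication lag is non-zero and never shrinks. */
final class StuckFollowerDetector {
    private StuckFollowerDetector() {}

    /**
     * @param lagSamples lag values for one follower, oldest first
     * @param minSamples minimum number of samples required before judging
     * @return true if there are enough samples, every sample is > 0,
     *         and no sample is lower than its predecessor
     */
    static boolean looksStuck(long[] lagSamples, int minSamples) {
        if (lagSamples.length < minSamples) {
            return false; // not enough evidence yet
        }
        for (int i = 0; i < lagSamples.length; i++) {
            if (lagSamples[i] <= 0) {
                return false; // follower has caught up at some point
            }
            if (i > 0 && lagSamples[i] < lagSamples[i - 1]) {
                return false; // lag is shrinking, so the follower is still applying entries
            }
        }
        return true;
    }
}
```

A lag that stays fixed (as with 21125 here) or only grows would be flagged, while a follower that is slowly catching up would not. The window size is a judgment call and depends on how often the lag is sampled.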
Mechanism / root-cause hypothesis
StateMachineUpdater.reload() (Ratis) runs after an install-snapshot to reinitialize the state machine from the freshly installed snapshot, and asserts an invariant about the reloaded state — typically that the reinitialized state machine exposes a valid latest snapshot (non-null, index > 0 / consistent with lastAppliedIndex), or that the updater is in the expected RELOAD state. The IllegalStateException (a bare Preconditions.assertTrue failure at StateMachineUpdater.java:230) means that invariant is false right after ArcadeDB reports "Full resync from leader completed".
So the ArcadeDB side (ArcadeStateMachine install/reinitialize path) appears to finish the resync without leaving Ratis in the state reload() requires — e.g. not updating/exposing the installed snapshot's index, or completing the install on a path that doesn't drive the updater through the expected RELOAD → RUNNING transition. (I couldn't capture the exact failing precondition from the logs — only the two Installing snapshot … INFO lines precede Full resync from leader completed — so the specific invariant is a hypothesis; a Ratis-side debug log or the assertion message would pin it.)
Impact
follower needs full snapshot -> install completes ("Full resync from leader completed")
-> StateMachineUpdater.reload() Preconditions.assertTrue -> IllegalStateException
-> StateMachineUpdater thread dies -> division CLOSING/CLOSED
-> follower rejects all AppendEntries (ServerNotReadyException: CLOSED)
-> commit index frozen, lag never decreases, follower never rejoins
-> cluster permanently degraded to 2/3 (no fault tolerance); no self-heal
Because raftPersistStorage defaults to false, a restart drops the follower's Raft state and forces another full snapshot → the assertion fires again → the follower cannot be recovered on this build, short of rebuilding the whole cluster from empty so that no node ever needs to resync.
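The chain above can be illustrated with a toy model. This is only a sketch of the observed behaviour, assuming the updater loop closes the division on any uncaught Throwable, as the log suggests. DivisionModel and its methods are hypothetical and do not mirror the real Ratis classes.

```java
/** Toy model only: none of these types are Ratis or ArcadeDB classes. */
final class DivisionModel {
    enum State { STARTING, RUNNING, CLOSING, CLOSED }

    private volatile State state = State.RUNNING;

    State state() {
        return state;
    }

    /** Models one pass of the updater after a snapshot install. */
    void runUpdaterOnce(boolean reloadInvariantHolds) {
        try {
            if (!reloadInvariantHolds) {
                // Bare, message-less failure, as seen in the follower log.
                throw new IllegalStateException();
            }
        } catch (Throwable t) {
            // "caught a Throwable": the updater stops and the division shuts down.
            state = State.CLOSING;
            state = State.CLOSED;
        }
    }

    /** Models a heartbeat or AppendEntries call arriving at the follower. */
    boolean appendEntries() {
        if (state != State.STARTING && state != State.RUNNING) {
            throw new IllegalStateException(
                "not in [STARTING, RUNNING]: current state is " + state);
        }
        return true;
    }
}
```

In this model nothing ever moves the state back to RUNNING, which matches the report that the follower does not self-heal.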
Reproduction
- 3-node HA cluster, -Xmx6g, replicationFactor=3.
- Build a multi-million-edge graph under load.
- Force a follower into a full snapshot resync — stop it, remove its data volume, start it (so it must full-install from the leader). (Also reproduces by leaving a follower far enough behind that the leader ships a snapshot.)
- The install completes, then StateMachineUpdater.reload() throws IllegalStateException; the follower's division closes and it never rejoins. 100% reproducible across attempts.
Suggested investigation
- In ArcadeStateMachine, ensure the install-snapshot / reinitialize completion path leaves the state machine reporting the installed snapshot (correct getLatestSnapshot() index ≥ the install index) and drives the Ratis updater through the expected RELOAD → RUNNING transition before any further apply/heartbeat is processed.
- Add the failing assertion's message (or a Ratis debug log around StateMachineUpdater.reload) so the exact invariant is visible.
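As an illustration of the two suggestions above, a post-install check with a descriptive message might look like the sketch below. It assumes the hypothesized invariant (latest snapshot index ≥ install index) is the relevant one, which this report has not confirmed. PostInstallCheck and requireSnapshotVisible are hypothetical names, not ArcadeDB or Ratis APIs.

```java
/** Hypothetical helper: not an ArcadeDB or Ratis API. */
final class PostInstallCheck {
    private PostInstallCheck() {}

    /**
     * Fails with a descriptive message if the installed snapshot is not visible.
     *
     * @param latestSnapshotIndex index reported by the state machine after the install,
     *                            or a negative value if it reports no snapshot
     * @param installedIndex      index of the snapshot the follower was asked to install
     */
    static void requireSnapshotVisible(long latestSnapshotIndex, long installedIndex) {
        if (latestSnapshotIndex < 0) {
            throw new IllegalStateException(
                "no latest snapshot reported after install (expected index >= "
                    + installedIndex + ")");
        }
        if (latestSnapshotIndex < installedIndex) {
            throw new IllegalStateException(
                "latest snapshot index " + latestSnapshotIndex
                    + " is behind installed index " + installedIndex);
        }
    }
}
```

Running a check like this at the end of the install path, before control returns to Ratis, would turn the bare IllegalStateException into a message that names the violated condition, or rule this invariant out.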
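Related
- … notifyInstallSnapshotFromLeader via ServerDatabase.close() (fixed). This is the next failure once the install runs to completion.
- … AppendEntries (separate; this report is a deterministic post-install assertion, not memory).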
Happy to provide full follower/leader logs or a heap-independent repro.