HA: follower never rejoins — StateMachineUpdater.reload() throws IllegalStateException immediately after a successful snapshot install #4754

Description

@robfrank

Summary

On a 3-node Raft HA cluster, when a follower performs a full snapshot resync, the snapshot install now completes ("Full resync from leader completed"), but immediately afterwards Ratis's StateMachineUpdater.reload() fails a precondition and throws java.lang.IllegalStateException. This kills the StateMachineUpdater thread, the division transitions to CLOSING/CLOSED, and the follower then rejects all AppendEntries/heartbeats with ServerNotReadyException: ... current state is CLOSED.

The follower ends up as a zombie replica: the snapshot data is present and queryable, but its Raft division is dead, its commit index is frozen (the leader reports a fixed lag, e.g. 21125), and it never rejoins quorum. It does not self-heal, and because arcadedb.ha.raftPersistStorage defaults to false, restarting the follower forces another full snapshot install, which hits this same assertion again — so the follower is unrecoverable on this build.

This is the next failure on the resync path after #4749: with #4749 fixed the install no longer crashes on ServerDatabase.close(), so it now runs to completion — and trips this reload precondition instead.

Version / environment

Symptom

Follower (arcadesplit-2) log — install succeeds, then the assertion fires in the very next millisecond:

[ArcadeStateMachine] Snapshot installation requested from leader (firstLogIndex=(t:1, i:99974)). Starting full resync...
[ArcadeStateMachine] Installing snapshot for database 'TestBigDB' from leader arcadesplit-0:2480...
[ArcadeStateMachine] Installing snapshot for database 'graph' from leader arcadesplit-0:2480...
[ArcadeStateMachine] Full resync from leader completed
SEVER [StateMachineUpdater] arcadesplit-2_2434@group-…-StateMachineUpdater caught a Throwable.
java.lang.IllegalStateException
    at org.apache.ratis.util.Preconditions.assertTrue(Preconditions.java:35)
    at org.apache.ratis.server.impl.StateMachineUpdater.reload(StateMachineUpdater.java:230)
    at org.apache.ratis.server.impl.StateMachineUpdater.run(StateMachineUpdater.java:191)
    at java.base/java.lang.Thread.run(Thread.java:1583)

Then, repeating forever, the division is closed and rejects replication:

SEVER [RaftServer$Division] arcadesplit-2_2434@group-…: Failed appendEntries* arcadesplit-0_2434->arcadesplit-2_2434 …,leaderCommit=99974,…,HEARTBEAT
org.apache.ratis.protocol.exceptions.ServerNotReadyException: arcadesplit-2_2434@group-… is not in [STARTING, RUNNING]: current state is CLOSING   (then CLOSED)

Cluster status confirms the zombie state — node-2 holds the data but its lag never moves:

arcadesplit-0 (leader): replicaLags { arcadesplit-1_2434: 0, arcadesplit-2_2434: 21125 }
all three nodes report the same Transaction edge count (no writes since the break)
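A stuck replica like this can be spotted by polling the leader's replicaLags and checking that a non-zero lag never shrinks. The following is a minimal sketch of that heuristic, not ArcadeDB code: the class FrozenLagDetector and its method are hypothetical, and the way the lag samples are collected from the cluster status is assumed to exist elsewhere.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

/** Hypothetical helper (not part of ArcadeDB): flags replicas whose lag looks stuck. */
final class FrozenLagDetector {

    /**
     * Each sample is one poll of the leader's replicaLags map (peer id -> lag).
     * A peer is reported when its lag is non-zero in every sample and never decreases.
     */
    static List<String> frozenReplicas(List<Map<String, Long>> samples) {
        List<String> frozen = new ArrayList<>();
        if (samples.size() < 2) {
            return frozen; // a single poll cannot show that a lag is stuck
        }
        // Sort peer ids so the result order is stable.
        for (String peer : new TreeSet<>(samples.get(0).keySet())) {
            long previous = -1;
            boolean stuck = true;
            for (Map<String, Long> sample : samples) {
                Long lag = sample.get(peer);
                if (lag == null || lag == 0 || lag < previous) {
                    stuck = false; // caught up, missing, or making progress
                    break;
                }
                previous = lag;
            }
            if (stuck) {
                frozen.add(peer);
            }
        }
        return frozen;
    }

    public static void main(String[] args) {
        // Lag values taken from the cluster status above, repeated over three polls.
        Map<String, Long> poll = Map.of("arcadesplit-1_2434", 0L, "arcadesplit-2_2434", 21125L);
        System.out.println(frozenReplicas(List.of(poll, poll, poll))); // prints [arcadesplit-2_2434]
    }
}
```

This is only a heuristic: under sustained write load a healthy follower's lag can also grow for a while, so the polling interval and number of samples would need tuning.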

Mechanism / root-cause hypothesis

StateMachineUpdater.reload() (Ratis) runs after an install-snapshot to reinitialize the state machine from the freshly installed snapshot, and asserts an invariant about the reloaded state — typically that the reinitialized state machine exposes a valid latest snapshot (non-null, index > 0 / consistent with lastAppliedIndex), or that the updater is in the expected RELOAD state. The IllegalStateException (a bare Preconditions.assertTrue failure at StateMachineUpdater.java:230) means that invariant is false right after ArcadeDB reports "Full resync from leader completed".

So the ArcadeDB side (the ArcadeStateMachine install/reinitialize path) appears to finish the resync without leaving Ratis in the state reload() requires, e.g. by not updating/exposing the installed snapshot's index, or by completing the install on a path that doesn't drive the updater through the expected RELOAD → RUNNING transition. (I couldn't capture the exact failing precondition from the logs: only the two Installing snapshot … INFO lines precede Full resync from leader completed, so the specific invariant is a hypothesis; a Ratis-side debug log or the assertion message would pin it down.)

Impact

follower needs full snapshot -> install completes ("Full resync from leader completed")
  -> StateMachineUpdater.reload() Preconditions.assertTrue -> IllegalStateException
    -> StateMachineUpdater thread dies -> division CLOSING/CLOSED
      -> follower rejects all AppendEntries (ServerNotReadyException: CLOSED)
        -> commit index frozen, lag never decreases, follower never rejoins
          -> cluster permanently degraded to 2/3 (no fault tolerance); no self-heal
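The chain above can be reproduced in a toy model. The sketch below uses no Ratis or ArcadeDB classes; every type and method name in it is hypothetical, and the "reload invariant" is just a boolean standing in for whatever precondition actually fails. It only illustrates why one uncaught failure in the updater thread leaves the follower rejecting every later heartbeat.

```java
import java.util.concurrent.atomic.AtomicReference;

/** Toy model only: none of these types are Ratis or ArcadeDB classes. */
final class ZombieFollowerModel {
    enum DivisionState { RUNNING, CLOSED }

    private final AtomicReference<DivisionState> division =
        new AtomicReference<>(DivisionState.RUNNING);
    private final boolean reloadInvariantHolds;
    private long commitIndex;

    ZombieFollowerModel(long commitIndex, boolean reloadInvariantHolds) {
        this.commitIndex = commitIndex;
        this.reloadInvariantHolds = reloadInvariantHolds;
    }

    /** Stands in for the post-install reload step and its bare precondition. */
    private void reload() {
        if (!reloadInvariantHolds) {
            throw new IllegalStateException(); // no message, as in the reported trace
        }
    }

    /** Runs the updater on its own thread; any Throwable closes the division. */
    void runUpdaterOnce() {
        Thread updater = new Thread(() -> {
            try {
                reload();
            } catch (Throwable t) {
                division.set(DivisionState.CLOSED);
            }
        }, "toy-StateMachineUpdater");
        updater.start();
        try {
            updater.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    /** Returns true if the heartbeat was accepted and the commit index advanced. */
    boolean appendEntries(long leaderCommit) {
        if (division.get() != DivisionState.RUNNING) {
            return false; // models the ServerNotReadyException rejection
        }
        commitIndex = Math.max(commitIndex, leaderCommit);
        return true;
    }

    long commitIndex() { return commitIndex; }

    DivisionState state() { return division.get(); }

    public static void main(String[] args) {
        // Starting index 0 is illustrative; 99974 is the leaderCommit from the log above.
        ZombieFollowerModel follower = new ZombieFollowerModel(0L, false);
        follower.runUpdaterOnce();
        System.out.println(follower.state() + " accepted=" + follower.appendEntries(99974L)
            + " commitIndex=" + follower.commitIndex());
    }
}
```

In the model, as in the report, nothing ever moves the division back to RUNNING, so the commit index stays frozen until the process is rebuilt.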

Because raftPersistStorage defaults to false, a restart drops the follower's Raft state and forces another full snapshot → the assertion fires again → the follower cannot be recovered on this build short of rebuilding the whole cluster from empty, so that no node ever has to resync.

Reproduction

  1. 3-node HA cluster, -Xmx6g, replicationFactor=3.
  2. Build a multi-million-edge graph under load.
  3. Force a follower into a full snapshot resync — stop it, remove its data volume, start it (so it must full-install from the leader). (Also reproduces by leaving a follower far enough behind that the leader ships a snapshot.)
  4. The install completes, then StateMachineUpdater.reload() throws IllegalStateException; the follower's division closes and it never rejoins. 100% reproducible across attempts.

Suggested investigation

  • In ArcadeStateMachine, ensure the install-snapshot / reinitialize completion path leaves the state machine reporting the installed snapshot (correct getLatestSnapshot() index ≥ the install index) and drives the Ratis updater through the expected RELOAD → RUNNING transition before any further apply/heartbeat is processed.
  • Add the failing assertion's message (or a Ratis debug log around StateMachineUpdater.reload) so the exact invariant is visible.
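To illustrate the second point, here is a minimal sketch of a precondition that carries a message, so the stack trace names the violated invariant instead of showing a bare IllegalStateException. The helper class and the snapshot-index invariant are hypothetical: they mirror the hypothesis above, not the actual check inside StateMachineUpdater.reload().

```java
import java.util.function.Supplier;

/** Hypothetical helper: a precondition check that explains itself when it fails. */
final class DescriptiveChecks {

    /** Throws IllegalStateException with a lazily built message when the condition is false. */
    static void check(boolean condition, Supplier<String> message) {
        if (!condition) {
            throw new IllegalStateException(message.get());
        }
    }

    /** Assumed invariant: the reported snapshot index must cover the installed one. */
    static void checkSnapshotCoversInstall(long latestSnapshotIndex, long installedIndex) {
        check(latestSnapshotIndex >= installedIndex,
            () -> "latest snapshot index " + latestSnapshotIndex
                + " is behind installed index " + installedIndex);
    }

    public static void main(String[] args) {
        try {
            // -1 is an illustrative "no snapshot reported" value; 99974 comes from the log above.
            checkSnapshotCoversInstall(-1L, 99974L);
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
            // prints: latest snapshot index -1 is behind installed index 99974
        }
    }
}
```

A check like this on the ArcadeDB side, placed just before the install is reported as complete, would surface the mismatch in the follower log even if the Ratis assertion itself stays message-less.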

Related

Happy to provide full follower/leader logs or a heap-independent repro.
