HA: follower never rejoins - StateMachineUpdater.reload() throws IllegalStateException immediately after a successful snapshot install #4754

## Summary

On a 3-node Raft HA cluster, when a follower performs a full snapshot resync, the snapshot install now **completes** ("Full resync from leader completed"), but **immediately afterwards** Ratis's `StateMachineUpdater.reload()` fails a precondition and throws `java.lang.IllegalStateException`. This kills the `StateMachineUpdater` thread, the division transitions to `CLOSING`/`CLOSED`, and the follower then rejects all `AppendEntries`/heartbeats with `ServerNotReadyException: ... current state is CLOSED`.

The follower ends up as a **zombie replica**: the snapshot data is present and queryable, but its Raft division is dead, its commit index is frozen (the leader reports a fixed lag, e.g. `21125`), and it **never rejoins quorum**. It does **not** self-heal, and because `arcadedb.ha.raftPersistStorage` defaults to false, restarting the follower forces another full snapshot install, which hits this same assertion again — so the follower is **unrecoverable** on this build.

This is the **next failure on the resync path after #4749**: with #4749 fixed the install no longer crashes on `ServerDatabase.close()`, so it now runs to completion — and trips this reload precondition instead.

## Version / environment

- ArcadeDB **26.7.1-SNAPSHOT** (a build including the #4749 fix; install completes)
- 3-node HA cluster (`arcadesplit-0/1/2`), Raft, Docker, `-Xmx6g` per node
- Databases `TestBigDB` and the internal `graph`

## Symptom

Follower (`arcadesplit-2`) log — install succeeds, then the assertion fires in the very next millisecond:

```
[ArcadeStateMachine] Snapshot installation requested from leader (firstLogIndex=(t:1, i:99974)). Starting full resync...
[ArcadeStateMachine] Installing snapshot for database 'TestBigDB' from leader arcadesplit-0:2480...
[ArcadeStateMachine] Installing snapshot for database 'graph' from leader arcadesplit-0:2480...
[ArcadeStateMachine] Full resync from leader completed
SEVER [StateMachineUpdater] arcadesplit-2_2434@group-…-StateMachineUpdater caught a Throwable.
java.lang.IllegalStateException
    at org.apache.ratis.util.Preconditions.assertTrue(Preconditions.java:35)
    at org.apache.ratis.server.impl.StateMachineUpdater.reload(StateMachineUpdater.java:230)
    at org.apache.ratis.server.impl.StateMachineUpdater.run(StateMachineUpdater.java:191)
    at java.base/java.lang.Thread.run(Thread.java:1583)
```

Then, repeating forever, the division is closed and rejects replication:

```
SEVER [RaftServer$Division] arcadesplit-2_2434@group-…: Failed appendEntries* arcadesplit-0_2434->arcadesplit-2_2434 …,leaderCommit=99974,…,HEARTBEAT
org.apache.ratis.protocol.exceptions.ServerNotReadyException: arcadesplit-2_2434@group-… is not in [STARTING, RUNNING]: current state is CLOSING   (then CLOSED)
```

Cluster status confirms the zombie state — node-2 holds the data but its lag never moves:

```
arcadesplit-0 (leader): replicaLags { arcadesplit-1_2434: 0, arcadesplit-2_2434: 21125 }
all three nodes report the same Transaction edge count (no writes since the break)
```
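The "lag never moves" check can be automated from successive status samples. The following is a minimal sketch, not ArcadeDB code: `FrozenLagDetector` and its input shape (one map of replica id to lag per sample, oldest first) are hypothetical, and the caller is assumed to have already extracted `replicaLags` from the leader's cluster status. A replica is flagged when its lag is non-zero in every sample and never decreases, which covers both the idle case above (constant `21125`) and a cluster that is still taking writes (lag growing).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

// Hypothetical helper, not part of ArcadeDB or Ratis.
final class FrozenLagDetector {

    /**
     * Returns, in sorted order, the replicas whose lag is present and non-zero
     * in every sample and never decreases from one sample to the next.
     * Samples must be in chronological order; fewer than two samples flag nothing.
     */
    static List<String> stuckReplicas(List<Map<String, Long>> samples) {
        List<String> stuck = new ArrayList<>();
        if (samples.size() < 2) {
            return stuck;
        }
        for (String replica : new TreeSet<>(samples.get(0).keySet())) {
            boolean isStuck = true;
            long previous = -1;
            for (Map<String, Long> sample : samples) {
                Long lag = sample.get(replica);
                // Missing, caught up, or catching up: not a zombie.
                if (lag == null || lag == 0 || (previous >= 0 && lag < previous)) {
                    isStuck = false;
                    break;
                }
                previous = lag;
            }
            if (isStuck) {
                stuck.add(replica);
            }
        }
        return stuck;
    }

    public static void main(String[] args) {
        // Three samples taken some time apart, using the values from this report.
        Map<String, Long> sample = Map.of("arcadesplit-1_2434", 0L, "arcadesplit-2_2434", 21125L);
        System.out.println(stuckReplicas(List.of(sample, sample, sample)));
    }
}
```

A healthy follower that is merely behind shows a decreasing lag and is not flagged, so the check separates "slow" from "dead division".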

## Mechanism / root-cause hypothesis

`StateMachineUpdater.reload()` (Ratis) runs after an install-snapshot to reinitialize the state machine from the freshly installed snapshot, and it asserts an invariant about the reloaded state. Plausible candidates are that the reinitialized state machine exposes a valid latest snapshot (non-null, index > 0, consistent with `lastAppliedIndex`), or that the updater is in the expected `RELOAD` state. The `IllegalStateException` is a bare, message-less `Preconditions.assertTrue` failure at `StateMachineUpdater.java:230`, so whichever invariant that line checks is false right after ArcadeDB reports "Full resync from leader completed".

So the ArcadeDB side (`ArcadeStateMachine` install/reinitialize path) appears to finish the resync **without leaving Ratis in the state `reload()` requires**, e.g. by not updating/exposing the installed snapshot's index, or by completing the install on a path that doesn't drive the updater through the expected `RELOAD` → `RUNNING` transition. The specific invariant is a hypothesis: I couldn't capture the exact failing precondition from the logs, because only the two `Installing snapshot …` INFO lines precede `Full resync from leader completed` and the exception carries no message. A Ratis-side debug log or the assertion message would pin it.

## Impact

```
follower needs full snapshot -> install completes ("Full resync from leader completed")
  -> StateMachineUpdater.reload() Preconditions.assertTrue -> IllegalStateException
    -> StateMachineUpdater thread dies -> division CLOSING/CLOSED
      -> follower rejects all AppendEntries (ServerNotReadyException: CLOSED)
        -> commit index frozen, lag never decreases, follower never rejoins
          -> cluster permanently degraded to 2/3 (no fault tolerance); no self-heal
```

Because `arcadedb.ha.raftPersistStorage` defaults to false, a restart drops the follower's Raft state and forces another full snapshot install, where the assertion fires again. **The follower cannot be recovered on this build**, short of rebuilding the whole cluster from empty so that no node ever needs a snapshot resync.
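The failure chain in this section can be modeled in a few lines. This is a toy model under stated assumptions, not Ratis or ArcadeDB code: `ZombieFollowerModel`, its `DivisionState` enum and `NotReadyException` are hypothetical stand-ins, and the unknown invariant is reduced to a boolean flag. It shows only why one failed precondition on the updater thread is enough to freeze the commit index permanently.

```java
import java.util.concurrent.atomic.AtomicReference;

// Toy model only: none of these types are Ratis or ArcadeDB classes.
final class ZombieFollowerModel {
    enum DivisionState { RUNNING, CLOSING, CLOSED }

    static final class NotReadyException extends RuntimeException {
        NotReadyException(String message) { super(message); }
    }

    private final AtomicReference<DivisionState> state =
            new AtomicReference<>(DivisionState.RUNNING);
    private long commitIndex;

    /** Stand-in for the post-install reload; the real invariant is unknown, so it is a flag. */
    private static void reload(boolean invariantHolds) {
        if (!invariantHolds) {
            throw new IllegalStateException(); // bare and message-less, like the trace in this report
        }
    }

    /** Runs the modeled updater on its own thread; any Throwable closes the division. */
    void runUpdaterAfterInstall(boolean invariantHolds) {
        Thread updater = new Thread(() -> {
            try {
                reload(invariantHolds);
            } catch (Throwable t) {
                state.set(DivisionState.CLOSING);
                state.set(DivisionState.CLOSED);
            }
        }, "StateMachineUpdater-model");
        updater.start();
        try {
            updater.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for the updater", e);
        }
    }

    /** Accepts replication only while RUNNING; otherwise the commit index stays frozen. */
    void appendEntries(long leaderCommit) {
        DivisionState current = state.get();
        if (current != DivisionState.RUNNING) {
            throw new NotReadyException("not in [STARTING, RUNNING]: current state is " + current);
        }
        commitIndex = Math.max(commitIndex, leaderCommit);
    }

    long commitIndex() { return commitIndex; }

    DivisionState state() { return state.get(); }

    public static void main(String[] args) {
        ZombieFollowerModel follower = new ZombieFollowerModel();
        follower.runUpdaterAfterInstall(false); // the invariant is false after the install
        try {
            follower.appendEntries(99974);
        } catch (NotReadyException e) {
            System.out.println("rejected: " + e.getMessage());
        }
        System.out.println("commitIndex=" + follower.commitIndex());
    }
}
```

Nothing in the model ever moves the division back to `RUNNING`, which mirrors the observed absence of self-healing: once the updater is gone, every later heartbeat is rejected.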

## Reproduction

1. 3-node HA cluster, `-Xmx6g`, `replicationFactor=3`.
2. Build a multi-million-edge graph under load.
3. Force a follower into a full snapshot resync — stop it, remove its data volume, start it (so it must full-install from the leader). (Also reproduces by leaving a follower far enough behind that the leader ships a snapshot.)
4. The install completes, then `StateMachineUpdater.reload()` throws `IllegalStateException`; the follower's division closes and it never rejoins. 100% reproducible across attempts.
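To confirm a reproduction without reading the follower log by hand, the signature from the Symptom section can be matched mechanically. The sketch below is a hypothetical helper (`ResyncLogCheck` is not an ArcadeDB tool) and assumes the three message fragments appear exactly as quoted in this report; it looks for the completion line, then the updater's "caught a Throwable" line, then the `IllegalStateException` line, in that order.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

// Hypothetical helper for this report; matches the log fragments quoted above.
final class ResyncLogCheck {

    /**
     * Returns true if the log contains, in order: a completed full resync,
     * a StateMachineUpdater Throwable, and a java.lang.IllegalStateException line.
     * Other lines may appear in between; a later resync completion restarts the match.
     */
    static boolean reproduced(List<String> lines) {
        boolean installCompleted = false;
        boolean updaterFailed = false;
        for (String line : lines) {
            if (line.contains("Full resync from leader completed")) {
                installCompleted = true;
                updaterFailed = false;
            } else if (installCompleted && line.contains("StateMachineUpdater caught a Throwable")) {
                updaterFailed = true;
            } else if (updaterFailed && line.trim().startsWith("java.lang.IllegalStateException")) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java ResyncLogCheck.java <follower-log-file>");
            return;
        }
        System.out.println(reproduced(Files.readAllLines(Path.of(args[0]))));
    }
}
```

Requiring the completion line first is what distinguishes this issue from #4749, where the install itself crashed before "Full resync from leader completed" was logged.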

## Suggested investigation

- In `ArcadeStateMachine`, ensure the install-snapshot / `reinitialize` completion path leaves the state machine reporting the installed snapshot (correct `getLatestSnapshot()` index ≥ the install index) and drives the Ratis updater through the expected `RELOAD` → `RUNNING` transition before any further apply/heartbeat is processed.
- Add the failing assertion's message (or a Ratis debug log around `StateMachineUpdater.reload`) so the exact invariant is visible.
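As an illustration of the second point, the sketch below shows how per-invariant messages would have made the failing condition visible in the very first stack trace. It is self-contained and hypothetical: `ReloadInvariants` is not a Ratis class, and the checks are only the candidates named in this report, plus one further unverified candidate (that the state machine lifecycle must be `PAUSED` before reload), all modeled as plain parameters rather than real Ratis calls.

```java
import java.util.function.Supplier;

// Hypothetical sketch; the invariants listed are candidates, not confirmed Ratis checks.
final class ReloadInvariants {

    /** Like a bare assertTrue, but the exception says which condition failed. */
    static void check(boolean condition, Supplier<String> message) {
        if (!condition) {
            throw new IllegalStateException(message.get());
        }
    }

    /**
     * Checks each candidate invariant in turn.
     *
     * @param updaterState        modeled updater state, expected to be "RELOAD"
     * @param lifeCyclePaused     assumption: lifecycle must be PAUSED (unverified for this build)
     * @param latestSnapshotIndex index of the latest snapshot, or null if none is exposed
     * @param installedIndex      index of the snapshot that was just installed
     */
    static void verify(String updaterState, boolean lifeCyclePaused,
                       Long latestSnapshotIndex, long installedIndex) {
        check("RELOAD".equals(updaterState),
                () -> "updater state is " + updaterState + ", expected RELOAD");
        check(lifeCyclePaused,
                () -> "state machine lifecycle is not PAUSED before reload");
        check(latestSnapshotIndex != null,
                () -> "latest snapshot is null after install");
        check(latestSnapshotIndex >= installedIndex,
                () -> "latest snapshot index " + latestSnapshotIndex
                        + " < installed index " + installedIndex);
    }

    public static void main(String[] args) {
        try {
            verify("RELOAD", true, null, 99974L);
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

With messages like these, the trace at `StateMachineUpdater.java:230` would identify the invariant directly instead of leaving it to be inferred from the surrounding INFO lines.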

## Related

- #4749 — install crashed in `notifyInstallSnapshotFromLeader` via `ServerDatabase.close()` (fixed). This is the next failure once the install runs to completion.
- #4752 / PR #4753 — follower OOM deserializing catch-up `AppendEntries` (separate; this report is a deterministic post-install assertion, not memory).
- #4729 — HA snapshot serving timeout (same resync area, leader side).

Happy to provide full follower/leader logs or a heap-independent repro.
