Summary
On a 3-node HA cluster (Kubernetes StatefulSet), when a follower is told by the leader to perform a full snapshot resync, the snapshot installation crashes on every attempt because ArcadeStateMachine.notifyInstallSnapshotFromLeader calls ServerDatabase.close() on a shared, server-managed database — and ServerDatabase.close() is hardcoded to throw UnsupportedOperationException.
The follower can therefore never complete the snapshot install, never rejoins the Raft group, and the cluster permanently loses write quorum. All writes then fail with QuorumNotReachedException, and the node spins in a tight retry loop (we observed ~2.77M log lines in a 7-minute window, roughly 6,600 lines per second).
Version
- ArcadeDB 26.6.1
- 3-node HA cluster (*-0, *-1, *-2) on Kubernetes, Raft replication
What happens
Leader is node -1. Followers -0 and -2 are asked to do a full resync (firstLogIndex=(t:11, i:98885)) and fail repeatedly:
Installing snapshot for database '.raft' from leader <node-1>:2480...
Snapshot installation requested from leader (firstLogIndex=(t:11, i:98885)). Starting full resync...
Error during snapshot installation from leader
<node>_2434@group-XXXX: Failed to notify StateMachine to InstallSnapshot.
Exception: java.lang.RuntimeException: Error during Raft snapshot installation
Underlying cause (logged at SEVERE):
java.lang.UnsupportedOperationException: Embedded database taken from the server are shared and therefore cannot be closed
at com.arcadedb.server.ServerDatabase.close(ServerDatabase.java:103)
at com.arcadedb.server.ha.raft.ArcadeStateMachine.lambda$notifyInstallSnapshotFromLeader$4(ArcadeStateMachine.java:506)
Root cause
ServerDatabase.close() intentionally rejects closing a shared server database:
// ServerDatabase.java:103
throw new UnsupportedOperationException(
"Embedded database taken from the server are shared and therefore cannot be closed");
But the snapshot-install path (ArcadeStateMachine.notifyInstallSnapshotFromLeader, ~line 506) tries to close() that very database before swapping in the snapshot. So the install always aborts. This is deterministic, not transient — once a follower needs a full snapshot resync, it can never succeed.
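To illustrate why this is deterministic, here is a minimal standalone sketch of the failure pattern. `Database`, `SharedDatabase`, and `SnapshotInstallSketch` are hypothetical stand-ins written for this report, not the ArcadeDB classes; the only behavior taken from the trace above is that `close()` on the shared handle always throws.

```java
// Hypothetical stand-ins for illustration only; these are NOT the ArcadeDB classes.
interface Database extends AutoCloseable {
  @Override
  void close();
}

// Mimics a server-owned handle that refuses to be closed by its callers.
final class SharedDatabase implements Database {
  @Override
  public void close() {
    throw new UnsupportedOperationException("shared database cannot be closed");
  }
}

final class SnapshotInstallSketch {
  // Mimics an install step that closes the database before swapping in the snapshot.
  static boolean installSnapshot(Database db) {
    try {
      db.close();   // always throws for SharedDatabase
      return true;  // the snapshot swap would happen here; never reached
    } catch (UnsupportedOperationException e) {
      return false; // install aborted
    }
  }

  public static void main(String[] args) {
    Database db = new SharedDatabase();
    // Retrying cannot help: no state changes between attempts.
    for (int attempt = 1; attempt <= 3; attempt++) {
      System.out.println("attempt " + attempt + " succeeded: " + installSnapshot(db));
    }
  }
}
```

Every attempt takes the same path and fails the same way, which matches the repeated "Error during snapshot installation from leader" lines in the log.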
Impact
notifyInstallSnapshotFromLeader -> ServerDatabase.close() -> UnsupportedOperationException
-> "Error during Raft snapshot installation" (every attempt, crash-loop)
-> follower never rejoins -> cluster loses write quorum
-> commits fail: com.arcadedb.network.binary.QuorumNotReachedException: Group commit entry failed: TimeoutException
-> all client writes return HTTP 500
The cluster does not self-heal; it requires manual intervention (recycling/wiping the stuck followers).
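The quorum loss follows from simple arithmetic, sketched below under the assumption that writes need a majority of the three nodes (the usual Raft rule; the actual quorum setting of this cluster is not shown in the logs). `QuorumSketch` is a hypothetical helper, not part of ArcadeDB.

```java
// Hypothetical helper for illustration; assumes a majority write quorum.
final class QuorumSketch {
  // Smallest number of nodes that forms a majority of clusterSize.
  static int majority(int clusterSize) {
    return clusterSize / 2 + 1;
  }

  // True when enough nodes are in sync to acknowledge a write.
  static boolean canCommit(int clusterSize, int healthyNodes) {
    return healthyNodes >= majority(clusterSize);
  }

  public static void main(String[] args) {
    // Both followers stuck in snapshot install: only the leader is left.
    System.out.println(canCommit(3, 1)); // false
    // One follower back in sync would be enough to restore writes.
    System.out.println(canCommit(3, 2)); // true
  }
}
```

With followers -0 and -2 both stuck, the leader alone cannot reach the majority of 2, so every commit times out.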
Expected behavior
During snapshot installation the state machine should release/replace the shared server database through a supported path (e.g. the server's database-management API) rather than calling ServerDatabase.close(), so a follower can complete a full resync and rejoin the quorum.
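A rough sketch of the shape such a supported path could take, assuming the server owns the database lifecycle and exposes an owner-level swap. All names here (`ManagedDb`, `DbRegistry`, `replace`, `closeByOwner`) are hypothetical and invented for this report; this is not the ArcadeDB server API, and the real fix may look quite different.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// Hypothetical names throughout; this is NOT the ArcadeDB server API.
final class ManagedDb {
  final String name;
  final long snapshotIndex;
  private boolean open = true;

  ManagedDb(String name, long snapshotIndex) {
    this.name = name;
    this.snapshotIndex = snapshotIndex;
  }

  // Callers holding the shared handle still may not close it.
  void close() {
    throw new UnsupportedOperationException("shared; ask the owner to release it");
  }

  // Intended for the owning registry only (package-private in this sketch).
  void closeByOwner() {
    open = false;
  }

  boolean isOpen() {
    return open;
  }
}

final class DbRegistry {
  private final Map<String, ManagedDb> databases = new ConcurrentHashMap<>();

  void register(ManagedDb db) {
    databases.put(db.name, db);
  }

  ManagedDb get(String name) {
    return databases.get(name);
  }

  // Owner-level swap: retire the old instance, then register the one built from the snapshot.
  synchronized ManagedDb replace(String name, Supplier<ManagedDb> fromSnapshot) {
    ManagedDb old = databases.remove(name);
    if (old != null) {
      old.closeByOwner();
    }
    ManagedDb fresh = fromSnapshot.get();
    databases.put(name, fresh);
    return fresh;
  }
}

final class ResyncSketch {
  public static void main(String[] args) {
    DbRegistry registry = new DbRegistry();
    registry.register(new ManagedDb("mydb", 10L));
    // The state machine asks the owner to swap instead of calling close() itself.
    ManagedDb fresh = registry.replace("mydb", () -> new ManagedDb("mydb", 98885L));
    System.out.println("index=" + fresh.snapshotIndex + " open=" + fresh.isOpen());
  }
}
```

The point of the design is that the shared handle keeps rejecting `close()` for ordinary callers, while the component that owns the lifecycle can still retire and replace the instance during a resync.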
Possibly related
- ... DatabaseReconciler out of ArcadeStateMachine), touches the same class
Happy to provide more logs if useful.