
HA: Raft snapshot install crashes in notifyInstallSnapshotFromLeader — ServerDatabase.close() throws UnsupportedOperationException (26.6.1), follower never rejoins, cluster loses write quorum #4749

Description

@ivanfrias

Summary

On a 3-node HA cluster (Kubernetes StatefulSet), when the leader tells a follower to perform a full snapshot resync, the snapshot installation crashes on every attempt. ArcadeStateMachine.notifyInstallSnapshotFromLeader calls ServerDatabase.close() on a shared, server-managed database, and ServerDatabase.close() is hardcoded to throw UnsupportedOperationException.

The follower can therefore never complete the snapshot install, never rejoins the Raft group, and the cluster permanently loses write quorum. All writes then fail with QuorumNotReachedException, and the node spins in a tight retry loop (we observed ~2.77M log lines in a 7-minute window).

Version

  • ArcadeDB 26.6.1
  • 3-node HA cluster (*-0, *-1, *-2) on Kubernetes, Raft replication

What happens

The leader is node -1. Followers -0 and -2 are asked to do a full resync (firstLogIndex=(t:11, i:98885)) and fail on every attempt:

Installing snapshot for database '.raft' from leader <node-1>:2480...
Snapshot installation requested from leader (firstLogIndex=(t:11, i:98885)). Starting full resync...
Error during snapshot installation from leader
<node>_2434@group-XXXX: Failed to notify StateMachine to InstallSnapshot.
    Exception: java.lang.RuntimeException: Error during Raft snapshot installation

Underlying cause (logged at SEVERE):

java.lang.UnsupportedOperationException: Embedded database taken from the server are shared and therefore cannot be closed
	at com.arcadedb.server.ServerDatabase.close(ServerDatabase.java:103)
	at com.arcadedb.server.ha.raft.ArcadeStateMachine.lambda$notifyInstallSnapshotFromLeader$4(ArcadeStateMachine.java:506)

Root cause

ServerDatabase.close() intentionally rejects closing a shared server database:

// ServerDatabase.java:103
throw new UnsupportedOperationException(
    "Embedded database taken from the server are shared and therefore cannot be closed");

However, the snapshot-install path (ArcadeStateMachine.notifyInstallSnapshotFromLeader, ~line 506) calls close() on that very database before swapping in the snapshot, so the install always aborts. The failure is deterministic, not transient: once a follower needs a full snapshot resync, it can never succeed.
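To illustrate why retrying cannot help, here is a minimal, self-contained sketch of the pattern described above. It does not use ArcadeDB code: `DemoDb`, `SharedDemoDb`, and `SnapshotInstallRepro` are hypothetical stand-ins, and the only behavior they mimic is "close() always throws" and "the install step calls close() first".

```java
// Hypothetical stand-ins; none of these are ArcadeDB's real classes.
interface DemoDb {
  String name();
  void close();
}

// Mimics a server-managed wrapper that refuses to be closed by callers.
final class SharedDemoDb implements DemoDb {
  private final String name;

  SharedDemoDb(String name) {
    this.name = name;
  }

  @Override
  public String name() {
    return name;
  }

  @Override
  public void close() {
    throw new UnsupportedOperationException("shared database cannot be closed");
  }
}

final class SnapshotInstallRepro {
  // Mimics an install step that closes the database before swapping files.
  // Returns true only if the close (and therefore the install) went through.
  static boolean installSnapshot(DemoDb db) {
    try {
      db.close(); // always throws for SharedDemoDb
      return true;
    } catch (UnsupportedOperationException e) {
      return false;
    }
  }

  // Counts how many of the given number of attempts succeed.
  static int successes(DemoDb db, int attempts) {
    int ok = 0;
    for (int i = 0; i < attempts; i++) {
      if (installSnapshot(db)) {
        ok++;
      }
    }
    return ok;
  }

  public static void main(String[] args) {
    // No attempt can succeed, no matter how many times the install is retried.
    System.out.println(successes(new SharedDemoDb("demo"), 5));
  }
}
```

Because the exception does not depend on timing or cluster state, the number of retries makes no difference in this sketch, which matches the crash loop we see in the logs.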

Impact

notifyInstallSnapshotFromLeader -> ServerDatabase.close() -> UnsupportedOperationException
  -> "Error during Raft snapshot installation" (every attempt, crash-loop)
    -> follower never rejoins -> cluster loses write quorum
      -> commits fail: com.arcadedb.network.binary.QuorumNotReachedException: Group commit entry failed: TimeoutException
        -> all client writes return HTTP 500

The cluster does not self-heal; it requires manual intervention (recycling/wiping the stuck followers).
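The quorum loss follows from simple arithmetic, sketched below under the assumption that the cluster uses a majority write quorum (the quorum setting in our deployment may be configured differently; `QuorumMath` is a hypothetical helper, not an ArcadeDB class). With 3 nodes the majority is 2, so a leader with both followers stuck in snapshot install has only itself.

```java
// Hypothetical helper for majority-quorum arithmetic; not part of ArcadeDB.
final class QuorumMath {
  // Smallest number of nodes that forms a strict majority.
  static int majority(int clusterSize) {
    return clusterSize / 2 + 1;
  }

  // True if the nodes able to acknowledge a write reach a majority.
  static boolean canCommit(int clusterSize, int ackingNodes) {
    return ackingNodes >= majority(clusterSize);
  }

  public static void main(String[] args) {
    int clusterSize = 3;
    // Leader plus one healthy follower: quorum is still reachable.
    System.out.println(canCommit(clusterSize, 2));
    // Leader alone, both followers stuck in snapshot install: no quorum.
    System.out.println(canCommit(clusterSize, 1));
  }
}
```

Under that assumption, a single stuck follower would still leave the cluster writable; it is the second follower hitting the same deterministic failure that takes writes down.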

Expected behavior

During snapshot installation the state machine should release/replace the shared server database through a supported path (e.g. the server's database-management API) rather than calling ServerDatabase.close(), so a follower can complete a full resync and rejoin the quorum.
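As a rough illustration of what a "supported path" could look like, the sketch below has the owner of the database lifecycle perform the close and swap, so the state machine never calls close() on the shared wrapper. Everything here is hypothetical (`OwnedDb`, `DbRegistry`, `replaceFromSnapshot`); I have not checked what the server's actual database-management API offers, so treat this as a shape for discussion, not a proposed patch.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical internal handle that only its owner is allowed to close.
final class OwnedDb {
  private final String name;
  private final String version;
  private boolean open = true;

  OwnedDb(String name, String version) {
    this.name = name;
    this.version = version;
  }

  String name() {
    return name;
  }

  String version() {
    return version;
  }

  boolean isOpen() {
    return open;
  }

  // Owner-only close; callers holding a shared wrapper never reach this.
  void closeInternal() {
    open = false;
  }
}

// Hypothetical server-side registry that owns database lifecycles.
final class DbRegistry {
  private final Map<String, OwnedDb> dbs = new ConcurrentHashMap<>();

  void register(OwnedDb db) {
    dbs.put(db.name(), db);
  }

  OwnedDb get(String name) {
    return dbs.get(name);
  }

  // Closes the current instance through the owner's path, then publishes
  // a fresh instance standing in for the database opened from the snapshot.
  // Returns the previous instance, or null if none was registered.
  synchronized OwnedDb replaceFromSnapshot(String name, String snapshotVersion) {
    OwnedDb old = dbs.get(name);
    if (old != null) {
      old.closeInternal();
    }
    dbs.put(name, new OwnedDb(name, snapshotVersion));
    return old;
  }

  public static void main(String[] args) {
    DbRegistry registry = new DbRegistry();
    registry.register(new OwnedDb("demo", "v1"));
    registry.replaceFromSnapshot("demo", "v2");
    System.out.println(registry.get("demo").version());
  }
}
```

The point of the design is only that the close happens where ownership lives; how the real server exposes that (and how in-flight users of the old instance are drained) is for the maintainers to decide.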

Possibly related

Happy to provide more logs if useful.
