HA: follower OOMs deserializing inbound AppendEntries during snapshot/catch-up resync (replication batch not bounded to heap)

## Summary

On a 3-node Raft HA cluster, when a follower is far behind and must catch up (e.g. after it was wiped / rejoins and the leader streams a backlog), the follower **runs out of heap while deserializing an inbound `AppendEntries` gRPC message**. The Raft batch of log entries shipped by the leader is too large to parse within the follower's heap, so the follower dies with `OutOfMemoryError`. It then stops resolving on the network, the cluster is left degraded, and on restart it OOMs again on the next catch-up attempt — it does not self-heal.

This is the **next failure in the resync path after #4749**: with the #4749 fix in, the snapshot install no longer crashes on `ServerDatabase.close()`, so resync now *progresses* — and immediately hits this unbounded-batch OOM.

## Version / environment

- ArcadeDB **26.7.1-SNAPSHOT** (a build that includes the #4749 fix, i.e. after 2026-06-26)
- 3-node HA cluster (`arcadesplit-0/1/2`), Raft replication, Docker
- **`-Xmx6g`** heap per node
- DB on the order of a few million edges; relevant HA config in effect:
  `arcadedb.ha.replicationChunkMaxSize=16MB`, `arcadedb.ha.snapshotMaxEntrySize=10GB`,
  `arcadedb.ha.snapshotThreshold=100000`

## Symptom

Follower (`arcadesplit-2`) log, while catching up:

```
SEVER [StateMachineUpdater] arcadesplit-2_2434@group-…-StateMachineUpdater caught a Throwable.
java.lang.OutOfMemoryError: Java heap space
Exception in thread "grpc-default-executor-1" java.lang.OutOfMemoryError: Java heap space
    at java.base/java.util.Arrays.copyOfRangeByte(Arrays.java:3863)
    at java.base/java.util.Arrays.copyOfRange(Arrays.java:3854)
    at org.apache.ratis.thirdparty.com.google.protobuf.ByteString$ArraysByteArrayCopier.copyFrom(ByteString.java:105)
    at org.apache.ratis.thirdparty.com.google.protobuf.ByteString.copyFrom(ByteString.java:399)
    at org.apache.ratis.thirdparty.com.google.protobuf.CodedInputStream$ArrayDecoder.readBytes(CodedInputStream.java:887)
    at org.apache.ratis.proto.RaftProtos$StateMachineLogEntryProto$Builder.mergeFrom(RaftProtos.java:8097)
    at org.apache.ratis.proto.RaftProtos$LogEntryProto$Builder.mergeFrom(RaftProtos.java:9719)
    at org.apache.ratis.proto.RaftProtos$AppendEntriesRequestProto$Builder.mergeFrom(RaftProtos.java:18528)
    at org.apache.ratis.thirdparty.com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:63)
    at org.apache.ratis.thirdparty.io.grpc.protobuf.lite.ProtoLiteUtils$MessageMarshaller.parse(ProtoLiteUtils.java:236)
    at org.apache.ratis.thirdparty.io.grpc.MethodDescriptor.parseRequest(MethodDescriptor.java:307)
```

The OOM is in the **inbound** gRPC marshaller parsing an `AppendEntriesRequestProto` → `LogEntryProto` → `StateMachineLogEntryProto` (the application transaction payload), where `ByteString.copyFrom` makes a full-array copy.

## Impact

```
follower catch-up -> leader ships large AppendEntries batch
  -> follower OOM in gRPC parse of AppendEntriesRequestProto
    -> StateMachineUpdater "caught a Throwable" (OOM) -> follower process dies
      -> its network alias stops resolving; leader logs
         "UNAVAILABLE: Unable to resolve host arcadesplit-2 / UnknownHostException"
        -> cluster degraded; on restart the follower OOMs again on the next catch-up (loop)
```

Two concerns:
1. The replication/catch-up path ships a batch large enough to OOM the receiver — it is **not bounded to the follower's heap**.
2. `OutOfMemoryError` is caught as a generic `Throwable` by `StateMachineUpdater`; an OOM is not safely recoverable and shouldn't be swallowed — better to never produce a batch that large.

## Reproduction

1. 3-node HA cluster, `-Xmx6g` per node, `replicationFactor=3`.
2. Build a few-million-edge graph under load.
3. Force a far-behind follower to resync — e.g. stop a follower, remove its data volume, start it so it must full-catch-up from the leader.
4. The follower OOMs while parsing inbound `AppendEntries` and dies; raising the heap (e.g. `-Xmx16g`) lets it get further / through, which confirms the batch sizing is the issue rather than a leak.

## Suggested fix

- Bound the in-flight catch-up/snapshot transfer to a fraction of the follower heap: cap the per-`AppendEntries` entry count and total bytes, and the gRPC max inbound message size, so a single message can always be parsed within heap.
- Stream/page large entries and snapshot chunks rather than buffering+copying whole `ByteString`s.
- Reconsider `snapshotMaxEntrySize` defaulting to 10 GB — a single entry larger than the heap can never be applied.
- Don't rely on `StateMachineUpdater` catching `OutOfMemoryError`; prevent the oversized allocation instead.

## Related

- #4749 — snapshot install crashed in `notifyInstallSnapshotFromLeader` via `ServerDatabase.close()` (fixed). This issue is the **next** failure on the same resync path once #4749 is in.
- #4729 — HA snapshot serving timed out (same resync area, leader side).

Happy to provide a heap dump / full follower log if useful.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

HA: follower OOMs deserializing inbound AppendEntries during snapshot/catch-up resync (replication batch not bounded to heap) #4752

Summary

Version / environment

Symptom

Impact

Reproduction

Suggested fix

Related

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Uh oh!

Uh oh!

HA: follower OOMs deserializing inbound AppendEntries during snapshot/catch-up resync (replication batch not bounded to heap) #4752

Description

Summary

Version / environment

Symptom

Impact

Reproduction

Suggested fix

Related

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions