Summary
On a 3-node Raft HA cluster, when a follower is far behind and must catch up (e.g. after it was wiped / rejoins and the leader streams a backlog), the follower runs out of heap while deserializing an inbound AppendEntries gRPC message. The Raft batch of log entries shipped by the leader is too large to parse within the follower's heap, so the follower dies with OutOfMemoryError. It then stops resolving on the network, the cluster is left degraded, and on restart it OOMs again on the next catch-up attempt — it does not self-heal.
This is the next failure in the resync path after #4749: with the #4749 fix in, the snapshot install no longer crashes on ServerDatabase.close(), so resync now progresses — and immediately hits this unbounded-batch OOM.
Version / environment
Symptom
Follower (arcadesplit-2) log, while catching up:
SEVER [StateMachineUpdater] arcadesplit-2_2434@group-…-StateMachineUpdater caught a Throwable.
java.lang.OutOfMemoryError: Java heap space
Exception in thread "grpc-default-executor-1" java.lang.OutOfMemoryError: Java heap space
at java.base/java.util.Arrays.copyOfRangeByte(Arrays.java:3863)
at java.base/java.util.Arrays.copyOfRange(Arrays.java:3854)
at org.apache.ratis.thirdparty.com.google.protobuf.ByteString$ArraysByteArrayCopier.copyFrom(ByteString.java:105)
at org.apache.ratis.thirdparty.com.google.protobuf.ByteString.copyFrom(ByteString.java:399)
at org.apache.ratis.thirdparty.com.google.protobuf.CodedInputStream$ArrayDecoder.readBytes(CodedInputStream.java:887)
at org.apache.ratis.proto.RaftProtos$StateMachineLogEntryProto$Builder.mergeFrom(RaftProtos.java:8097)
at org.apache.ratis.proto.RaftProtos$LogEntryProto$Builder.mergeFrom(RaftProtos.java:9719)
at org.apache.ratis.proto.RaftProtos$AppendEntriesRequestProto$Builder.mergeFrom(RaftProtos.java:18528)
at org.apache.ratis.thirdparty.com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:63)
at org.apache.ratis.thirdparty.io.grpc.protobuf.lite.ProtoLiteUtils$MessageMarshaller.parse(ProtoLiteUtils.java:236)
at org.apache.ratis.thirdparty.io.grpc.MethodDescriptor.parseRequest(MethodDescriptor.java:307)
The OOM is in the inbound gRPC marshaller parsing an AppendEntriesRequestProto → LogEntryProto → StateMachineLogEntryProto (the application transaction payload), where ByteString.copyFrom makes a full-array copy.
Impact
follower catch-up -> leader ships large AppendEntries batch
-> follower OOM in gRPC parse of AppendEntriesRequestProto
-> StateMachineUpdater "caught a Throwable" (OOM) -> follower process dies
-> its network alias stops resolving; leader logs
"UNAVAILABLE: Unable to resolve host arcadesplit-2 / UnknownHostException"
-> cluster degraded; on restart the follower OOMs again on the next catch-up (loop)
Two concerns:
- The replication/catch-up path ships a batch large enough to OOM the receiver — it is not bounded to the follower's heap.
OutOfMemoryError is caught as a generic Throwable by StateMachineUpdater; an OOM is not safely recoverable and shouldn't be swallowed — better to never produce a batch that large.
Reproduction
- 3-node HA cluster,
-Xmx6g per node, replicationFactor=3.
- Build a few-million-edge graph under load.
- Force a far-behind follower to resync — e.g. stop a follower, remove its data volume, start it so it must full-catch-up from the leader.
- The follower OOMs while parsing inbound
AppendEntries and dies; raising the heap (e.g. -Xmx16g) lets it get further / through, which confirms the batch sizing is the issue rather than a leak.
Suggested fix
- Bound the in-flight catch-up/snapshot transfer to a fraction of the follower heap: cap the per-
AppendEntries entry count and total bytes, and the gRPC max inbound message size, so a single message can always be parsed within heap.
- Stream/page large entries and snapshot chunks rather than buffering+copying whole
ByteStrings.
- Reconsider
snapshotMaxEntrySize defaulting to 10 GB — a single entry larger than the heap can never be applied.
- Don't rely on
StateMachineUpdater catching OutOfMemoryError; prevent the oversized allocation instead.
Related
Happy to provide a heap dump / full follower log if useful.
Summary
On a 3-node Raft HA cluster, when a follower is far behind and must catch up (e.g. after it was wiped / rejoins and the leader streams a backlog), the follower runs out of heap while deserializing an inbound
AppendEntriesgRPC message. The Raft batch of log entries shipped by the leader is too large to parse within the follower's heap, so the follower dies withOutOfMemoryError. It then stops resolving on the network, the cluster is left degraded, and on restart it OOMs again on the next catch-up attempt — it does not self-heal.This is the next failure in the resync path after #4749: with the #4749 fix in, the snapshot install no longer crashes on
ServerDatabase.close(), so resync now progresses — and immediately hits this unbounded-batch OOM.Version / environment
arcadesplit-0/1/2), Raft replication, Docker-Xmx6gheap per nodearcadedb.ha.replicationChunkMaxSize=16MB,arcadedb.ha.snapshotMaxEntrySize=10GB,arcadedb.ha.snapshotThreshold=100000Symptom
Follower (
arcadesplit-2) log, while catching up:The OOM is in the inbound gRPC marshaller parsing an
AppendEntriesRequestProto→LogEntryProto→StateMachineLogEntryProto(the application transaction payload), whereByteString.copyFrommakes a full-array copy.Impact
Two concerns:
OutOfMemoryErroris caught as a genericThrowablebyStateMachineUpdater; an OOM is not safely recoverable and shouldn't be swallowed — better to never produce a batch that large.Reproduction
-Xmx6gper node,replicationFactor=3.AppendEntriesand dies; raising the heap (e.g.-Xmx16g) lets it get further / through, which confirms the batch sizing is the issue rather than a leak.Suggested fix
AppendEntriesentry count and total bytes, and the gRPC max inbound message size, so a single message can always be parsed within heap.ByteStrings.snapshotMaxEntrySizedefaulting to 10 GB — a single entry larger than the heap can never be applied.StateMachineUpdatercatchingOutOfMemoryError; prevent the oversized allocation instead.Related
notifyInstallSnapshotFromLeaderviaServerDatabase.close()(fixed). This issue is the next failure on the same resync path once HA: Raft snapshot install crashes in notifyInstallSnapshotFromLeader — ServerDatabase.close() throws UnsupportedOperationException (26.6.1), follower never rejoins, cluster loses write quorum #4749 is in.Happy to provide a heap dump / full follower log if useful.