HA: follower OOMs deserializing inbound AppendEntries during snapshot/catch-up resync (replication batch not bounded to heap) #4752

Description

@robfrank

Summary

On a 3-node Raft HA cluster, when a follower is far behind and must catch up (e.g. after it was wiped / rejoins and the leader streams a backlog), the follower runs out of heap while deserializing an inbound AppendEntries gRPC message. The Raft batch of log entries shipped by the leader is too large to parse within the follower's heap, so the follower dies with OutOfMemoryError. It then stops resolving on the network, the cluster is left degraded, and on restart it OOMs again on the next catch-up attempt — it does not self-heal.

This is the next failure in the resync path after #4749: with the #4749 fix in, the snapshot install no longer crashes on ServerDatabase.close(), so resync now progresses — and immediately hits this unbounded-batch OOM.

Version / environment

Symptom

Follower (arcadesplit-2) log, while catching up:

SEVER [StateMachineUpdater] arcadesplit-2_2434@group-…-StateMachineUpdater caught a Throwable.
java.lang.OutOfMemoryError: Java heap space
Exception in thread "grpc-default-executor-1" java.lang.OutOfMemoryError: Java heap space
    at java.base/java.util.Arrays.copyOfRangeByte(Arrays.java:3863)
    at java.base/java.util.Arrays.copyOfRange(Arrays.java:3854)
    at org.apache.ratis.thirdparty.com.google.protobuf.ByteString$ArraysByteArrayCopier.copyFrom(ByteString.java:105)
    at org.apache.ratis.thirdparty.com.google.protobuf.ByteString.copyFrom(ByteString.java:399)
    at org.apache.ratis.thirdparty.com.google.protobuf.CodedInputStream$ArrayDecoder.readBytes(CodedInputStream.java:887)
    at org.apache.ratis.proto.RaftProtos$StateMachineLogEntryProto$Builder.mergeFrom(RaftProtos.java:8097)
    at org.apache.ratis.proto.RaftProtos$LogEntryProto$Builder.mergeFrom(RaftProtos.java:9719)
    at org.apache.ratis.proto.RaftProtos$AppendEntriesRequestProto$Builder.mergeFrom(RaftProtos.java:18528)
    at org.apache.ratis.thirdparty.com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:63)
    at org.apache.ratis.thirdparty.io.grpc.protobuf.lite.ProtoLiteUtils$MessageMarshaller.parse(ProtoLiteUtils.java:236)
    at org.apache.ratis.thirdparty.io.grpc.MethodDescriptor.parseRequest(MethodDescriptor.java:307)

The OOM is in the inbound gRPC marshaller parsing an AppendEntriesRequestProto -> LogEntryProto -> StateMachineLogEntryProto (the application transaction payload), where ByteString.copyFrom makes a full-array copy.

Impact

follower catch-up -> leader ships large AppendEntries batch
  -> follower OOM in gRPC parse of AppendEntriesRequestProto
    -> StateMachineUpdater "caught a Throwable" (OOM) -> follower process dies
      -> its network alias stops resolving; leader logs
         "UNAVAILABLE: Unable to resolve host arcadesplit-2 / UnknownHostException"
        -> cluster degraded; on restart the follower OOMs again on the next catch-up (loop)

Two concerns:

  1. The replication/catch-up path ships a batch large enough to OOM the receiver — it is not bounded to the follower's heap.
  2. OutOfMemoryError is caught as a generic Throwable by StateMachineUpdater; an OOM is not safely recoverable and shouldn't be swallowed — better to never produce a batch that large.
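
To illustrate the second concern, here is a minimal sketch of a handler that separates JVM-level errors from ordinary failures instead of treating every Throwable alike. FatalErrorPolicy, isFatal and runGuarded are hypothetical names made up for this example; this is not how Ratis' StateMachineUpdater is implemented, only one way the distinction could be expressed.

```java
import java.util.function.Consumer;

/** Hypothetical helper for illustration; not part of Ratis or ArcadeDB. */
final class FatalErrorPolicy {
  private FatalErrorPolicy() {}

  /** Returns true if the throwable, or any of its causes, is a JVM-level error. */
  static boolean isFatal(Throwable t) {
    // Walk the cause chain: an OOM is often wrapped by executor or RPC layers.
    for (Throwable c = t; c != null; c = c.getCause()) {
      // VirtualMachineError covers OutOfMemoryError and StackOverflowError.
      if (c instanceof VirtualMachineError) {
        return true;
      }
    }
    return false;
  }

  /**
   * Runs the task. Ordinary failures are passed to the handler; fatal errors
   * are rethrown so the caller can fail fast instead of continuing with a
   * heap that may already be exhausted.
   */
  static void runGuarded(Runnable task, Consumer<Throwable> onRecoverable) {
    try {
      task.run();
    } catch (Throwable t) {
      if (isFatal(t)) {
        throw t; // precise rethrow: the try block can only throw unchecked types
      }
      onRecoverable.accept(t);
    }
  }
}
```

Even with such a policy in place, the process still dies on an oversized batch; it only dies more predictably. Bounding the batch remains the actual fix.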

Reproduction

  1. 3-node HA cluster, -Xmx6g per node, replicationFactor=3.
  2. Build a few-million-edge graph under load.
  3. Force a far-behind follower to resync — e.g. stop a follower, remove its data volume, start it so it must full-catch-up from the leader.
  4. The follower OOMs while parsing inbound AppendEntries and dies; raising the heap (e.g. -Xmx16g) lets it get further or complete the catch-up, which confirms that batch sizing is the issue rather than a leak.

Suggested fix

  • Bound the in-flight catch-up/snapshot transfer to a fraction of the follower heap: cap the per-AppendEntries entry count and total bytes, and the gRPC max inbound message size, so a single message can always be parsed within heap.
  • Stream/page large entries and snapshot chunks rather than buffering+copying whole ByteStrings.
  • Reconsider snapshotMaxEntrySize defaulting to 10 GB — a single entry larger than the heap can never be applied.
  • Don't rely on StateMachineUpdater catching OutOfMemoryError; prevent the oversized allocation instead.
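
A rough sketch of the first bullet, in plain Java with no Ratis dependency: derive a per-message byte budget from the heap, then split a backlog into batches that respect both an entry-count cap and that byte budget. BoundedBatcher, byteBudget and split are hypothetical names, and the heap fraction, the copy factor of 2 (wire buffer plus the ByteString.copyFrom copy seen in the stack trace) and the hard cap are assumptions to tune, not measured values. Ratis appears to expose comparable limits as raft.server.log.appender.buffer.byte-limit, raft.server.log.appender.buffer.element-limit and raft.grpc.message.size.max; those key names should be verified against the Ratis version in use.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch for illustration; not the Ratis or ArcadeDB implementation. */
final class BoundedBatcher {
  private BoundedBatcher() {}

  /**
   * Per-message byte budget: a fraction of the max heap, divided by the number
   * of simultaneous copies the parser is assumed to hold, limited by a hard cap.
   */
  static long byteBudget(long maxHeapBytes, double heapFraction, int copies, long hardCapBytes) {
    long budget = (long) (maxHeapBytes * heapFraction) / copies;
    return Math.max(1L, Math.min(budget, hardCapBytes));
  }

  /**
   * Splits entries (given by their serialized sizes) into consecutive batches of
   * entry indices, each holding at most maxEntries entries and maxBytes bytes.
   */
  static List<List<Integer>> split(int[] entrySizes, int maxEntries, long maxBytes) {
    if (maxEntries < 1 || maxBytes < 1) {
      throw new IllegalArgumentException("limits must be positive");
    }
    List<List<Integer>> batches = new ArrayList<>();
    List<Integer> current = new ArrayList<>();
    long currentBytes = 0;
    for (int i = 0; i < entrySizes.length; i++) {
      int size = entrySizes[i];
      if (size > maxBytes) {
        // A single entry above the budget can never be shipped safely;
        // it would need chunked streaming, which this sketch does not cover.
        throw new IllegalArgumentException("entry " + i + " exceeds byte budget: " + size);
      }
      boolean full = current.size() >= maxEntries || currentBytes + size > maxBytes;
      if (!current.isEmpty() && full) {
        batches.add(current);
        current = new ArrayList<>();
        currentBytes = 0;
      }
      current.add(i);
      currentBytes += size;
    }
    if (!current.isEmpty()) {
      batches.add(current);
    }
    return batches;
  }
}
```

For example, with the -Xmx6g from the reproduction, byteBudget(6L << 30, 0.05, 2, 64L << 20) returns 67108864 (the 64 MiB hard cap wins), and split(new int[] {4, 4, 4, 10, 1}, 2, 10) yields [[0, 1], [2], [3], [4]]. On the receiving side the budget would come from Runtime.getRuntime().maxMemory(); the leader cannot see the follower's heap, so in practice the limit would have to be configured conservatively or advertised by the follower.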

Related

Happy to provide a heap dump / full follower log if useful.
