Skip to content

fix(#4752): make AppendEntries element limit configurable, default 64#4753

Merged
robfrank merged 4 commits into
mainfrom
fix/4752-bound-appendentries-catch-up-batch
Jun 27, 2026
Merged

fix(#4752): make AppendEntries element limit configurable, default 64#4753
robfrank merged 4 commits into
mainfrom
fix/4752-bound-appendentries-catch-up-batch

Conversation

@robfrank

@robfrank robfrank commented Jun 27, 2026

Copy link
Copy Markdown
Collaborator

Closes #4752

Summary

  • Adds HA_APPEND_ELEMENT_LIMIT (arcadedb.ha.appendElementLimit) to GlobalConfiguration with a default of 64, replacing the previously hardcoded 256 in RaftPropertiesBuilder.
  • Wires the new config into RaftServerConfigKeys.Log.Appender.setBufferElementLimit so operators can tune the per-batch entry count.
  • Validates the value is >= 1 at properties-build time (a value of 0 or negative would silently defeat the per-batch element bound).
  • Adds RaftPropertiesBuilderTest (8 tests) covering the default/custom element limit, byte limit, gRPC message size, write-buffer validation, and the zero/negative element-limit rejection.

The lower default (64 vs 256) is a secondary cap that bounds the per-batch entry count during catch-up resync. The byte limit (arcadedb.ha.appendBufferSize, 4MB default) remains the dominant per-batch heap bound; the element count governs only when entries are small enough that many fit under the byte limit. Together with the existing arcadedb.ha.grpcMessageSizeMax knob, all three catch-up bounds the issue requested are now operator-tunable.

Test plan

  • RaftPropertiesBuilderTest - 8 unit tests: default element limit is 64, custom value applied, byte limit and gRPC message size read back correctly, write-buffer validation (happy path + exception), and zero/negative element-limit rejection.
  • RaftHAServerTest (45 tests) and RaftHAConfigurationIT (5 tests) pass without change.
  • GlobalConfigurationTest (10 tests) passes - new enum constant is valid.

Note

The root-cause dimension (entry count vs byte/message size) that drove the OOM in the incident is documented as a deferred item in the tracking doc and needs the incident heap dump to confirm; this PR makes all three bounds tunable regardless.

Add HA_APPEND_ELEMENT_LIMIT (arcadedb.ha.appendElementLimit) to
GlobalConfiguration with a default of 64, replacing the hardcoded 256
in RaftPropertiesBuilder. This bounds the per-batch in-memory footprint
on the follower during catch-up resync, where many batches may queue
before the state machine can apply them, reducing peak heap pressure and
mitigating the OutOfMemoryError observed on 6 GB nodes.

Adds RaftPropertiesBuilderTest to verify the config wiring.
@mergify

mergify Bot commented Jun 27, 2026

Copy link
Copy Markdown
Contributor

Tick the box to add this pull request to the merge queue (same as @mergifyio queue).

  • Queue this pull request

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a new configuration parameter HA_APPEND_ELEMENT_LIMIT to bound the maximum number of Raft log entries per AppendEntries batch, resolving follower OOM issues during catch-up resync. The review feedback correctly identifies that setting this limit to 0 does not disable the limit in Apache Ratis but instead freezes replication. It is recommended to update the configuration description, add validation to ensure the limit is strictly greater than 0, and include a corresponding unit test to verify this validation behavior.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment on lines +740 to +746
HA_APPEND_ELEMENT_LIMIT("arcadedb.ha.appendElementLimit", SCOPE.SERVER,
"""
Maximum number of Raft log entries per AppendEntries batch. Bounds the per-batch in-memory \
footprint on the follower during catch-up resync, where many batches may queue before the \
state machine can apply them. Lowering this value reduces peak heap pressure on followers \
catching up from a far-behind state. 0 means no element-count limit (only the byte limit applies).""",
Integer.class, 64),

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

Setting arcadedb.ha.appendElementLimit to 0 will not disable the element-count limit in Apache Ratis; instead, it will treat the buffer limit as zero, which will immediately mark the buffer as full and completely freeze replication. The description should be updated to reflect that the value must be greater than 0, and validation should be added to prevent invalid configurations.

Suggested change
HA_APPEND_ELEMENT_LIMIT("arcadedb.ha.appendElementLimit", SCOPE.SERVER,
"""
Maximum number of Raft log entries per AppendEntries batch. Bounds the per-batch in-memory \
footprint on the follower during catch-up resync, where many batches may queue before the \
state machine can apply them. Lowering this value reduces peak heap pressure on followers \
catching up from a far-behind state. 0 means no element-count limit (only the byte limit applies).""",
Integer.class, 64),
HA_APPEND_ELEMENT_LIMIT("arcadedb.ha.appendElementLimit", SCOPE.SERVER,
"""
Maximum number of Raft log entries per AppendEntries batch. Bounds the per-batch in-memory \
footprint on the follower during catch-up resync, where many batches may queue before the \
state machine can apply them. Lowering this value reduces peak heap pressure on followers \
catching up from a far-behind state. Must be greater than 0.""",
Integer.class, 64),

Comment on lines +103 to +104
final int appendElementLimit = configuration.getValueAsInteger(GlobalConfiguration.HA_APPEND_ELEMENT_LIMIT);
RaftServerConfigKeys.Log.Appender.setBufferElementLimit(properties, appendElementLimit);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

Add validation to ensure that appendElementLimit is strictly greater than 0. If it is set to 0 or a negative value, it will cause replication to freeze or fail because Apache Ratis does not support 0 as a 'no limit' indicator for bufferElementLimit.

Suggested change
final int appendElementLimit = configuration.getValueAsInteger(GlobalConfiguration.HA_APPEND_ELEMENT_LIMIT);
RaftServerConfigKeys.Log.Appender.setBufferElementLimit(properties, appendElementLimit);
final int appendElementLimit = configuration.getValueAsInteger(GlobalConfiguration.HA_APPEND_ELEMENT_LIMIT);
if (appendElementLimit <= 0) {
throw new ConfigurationException(
"arcadedb.ha.appendElementLimit (" + appendElementLimit + ") must be greater than 0");
}
RaftServerConfigKeys.Log.Appender.setBufferElementLimit(properties, appendElementLimit);

Comment on lines +75 to +77
assertThat(RaftServerConfigKeys.Log.writeBufferSize(props).getSizeInt())
.isEqualTo(8 * 1024 * 1024);
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Add a unit test to verify that an exception is thrown when HA_APPEND_ELEMENT_LIMIT is configured with an invalid value (less than or equal to 0).

    assertThat(RaftServerConfigKeys.Log.writeBufferSize(props).getSizeInt())
        .isEqualTo(8 * 1024 * 1024);
  }

  @Test
  void invalidElementLimitThrowsException() {
    final ContextConfiguration config = new ContextConfiguration();
    config.setValue(GlobalConfiguration.HA_APPEND_ELEMENT_LIMIT, 0);
    org.assertj.core.api.Assertions.assertThatThrownBy(() -> RaftPropertiesBuilder.build(config))
        .isInstanceOf(com.arcadedb.exception.ConfigurationException.class)
        .hasMessageContaining("must be greater than 0");
  }

@codacy-production

codacy-production Bot commented Jun 27, 2026

Copy link
Copy Markdown

Up to standards ✅

🟢 Issues 0 issues

Results:
0 new issues

View in Codacy

🟢 Metrics 0 complexity

Metric Results
Complexity 0

View in Codacy

🟢 Coverage 100.00% diff coverage · -7.31% coverage variation

Metric Results
Coverage variation -7.31% coverage variation
Diff coverage 100.00% diff coverage

View coverage diff in Codacy

Coverage variation details
Coverable lines Covered lines Coverage
Common ancestor commit (1b1c074) 132608 98502 74.28%
Head commit (0c70fa7) 164416 (+31808) 110103 (+11601) 66.97% (-7.31%)

Coverage variation is the difference between the coverage for the head and common ancestor commits of the pull request branch: <coverage of head commit> - <coverage of common ancestor commit>

Diff coverage details
Coverable lines Covered lines Diff coverage
Pull request (#4753) 6 6 100.00%

Diff coverage is the percentage of lines that are covered by tests out of the coverable lines that the pull request added or modified: <covered lines added or modified>/<coverable lines added or modified> * 100%

NEW Get contextual insights on your PRs based on Codacy's metrics, along with PR and Jira context, without leaving GitHub. Enable AI reviewer
TIP This summary will be updated as you push new changes.

@claude

claude Bot commented Jun 27, 2026

Copy link
Copy Markdown

Code Review: PR #4753 — configurable AppendEntries element limit

Reviewed the diff against the repo conventions in CLAUDE.md. Overall this is a clean, well-scoped, well-documented change: it replaces a hardcoded magic number with a proper GlobalConfiguration knob, adds a regression test, and includes a design doc consistent with the existing docs/<issue>.md convention. Nice work. A few points below, mostly questions and minor suggestions.

Technical question on the root-cause mechanism (most important)

The design doc attributes the follower OOM to entries "each up to several MB of compressed WAL data" accumulating in memory, and frames lowering the element limit 256 -> 64 as the fix. But bufferByteLimit (default 4MB) already bounds each AppendEntries batch on the byte dimension. For multi-MB entries, the byte limit caps a batch to ~1-2 entries long before the element count (64 or 256) ever binds - so for that specific scenario, changing the element limit shouldn't materially change peak heap.

The element limit really only becomes the governing dimension when entries are small (so that 64-256 of them fit under 4MB). The comment hints at the real lever - "many batches may queue before the state machine can apply them" - i.e. accumulation across multiple in-flight batches, not within one batch. But total in-flight memory is roughly batches_in_flight x min(byteLimit, elements x avgEntrySize); when entries are multi-MB the byteLimit term dominates and the element count drops out.

Could you confirm from the heap dump / OOM analysis which dimension actually dominated? If the OOM was driven by large entries, the more effective knob may be HA_APPEND_BUFFER_SIZE (byte limit) or bounding the number of concurrent in-flight appends, rather than the element count. This doesn't make the change wrong - having the knob is still valuable - but it's worth making sure the documented default (64) targets the dimension that actually caused the incident.

No validation on the new value

HA_APPEND_BUFFER_SIZE / HA_WRITE_BUFFER_SIZE get validated (writeBuffer >= appendBuffer + 8), but appendElementLimit is passed straight through to setBufferElementLimit(properties, appendElementLimit) with no guard. A negative value (e.g. a typo -1) would be forwarded to Ratis with undefined behavior. Consider a small guard rejecting negatives (or clamping), consistent with how the sibling buffer configs validate their inputs.

"0 means no limit" is documented but untested

Both the config description and the design doc state 0 disables the element-count limit. That behavior depends on the Ratis version's interpretation of bufferElementLimit == 0, and the default has varied across Ratis releases. Since it's a documented contract, it'd be worth either:

  • adding a test asserting bufferElementLimit(props) == 0 round-trips when configured to 0, and ideally a comment pointing at the Ratis code that treats 0 as unlimited, or
  • softening the wording if you can't verify it against the pinned Ratis version.

Default change is a behavior change for healthy clusters

Lowering the default 256 -> 64 reduces entries-per-round-trip ~4x for all HA clusters, not just ones in the OOM/catch-up path. On a healthy cluster replicating many small transactions, this can increase replication round-trips and catch-up latency. The trade-off is reasonable and documented, but please call it out in the release notes so operators tuning replication throughput aren't surprised. Operators wanting the old behavior can set arcadedb.ha.appendElementLimit=256.

Things done well

Minor

  • The test writeBufferSizeMustExceedAppendBufferPlusFraming exercises the happy path; consider also asserting the ConfigurationException is thrown when writeBuffer < appendBuffer + 8, since that validation branch is currently untested.

Nothing here is blocking from a code-quality standpoint; the main ask is confirming the element-limit knob actually addresses the dimension that caused the OOM in #4752.

Restrict arcadedb.ha.appendElementLimit to positive values (>= 1):
setting it to 0 or a negative number would disable the per-batch bound
that protects against follower OOM during catch-up, defeating the fix.
Update the configuration description to document the constraint.
Add two regression tests for the new validation.
@claude

claude Bot commented Jun 27, 2026

Copy link
Copy Markdown

Review: fix(#4752) make AppendEntries element limit configurable, default 64

Reviewed the diff against the repo conventions in CLAUDE.md. Overall this is a clean, well-scoped change: it removes a magic number, exposes it as a tunable SCOPE.SERVER config, validates the input, and ships focused unit tests plus a design note. Nice work. A few observations below, mostly minor.

Strengths

  • Style/conventions: matches surrounding code well - final on locals/params, no fully-qualified names (imports added correctly, ConfigurationException already in scope), text-block javadoc consistent with the adjacent HA_WRITE_BUFFER_SIZE entry. The validation pattern (if (... < 1) throw new ConfigurationException(...)) mirrors the existing writeBufferSize check a few lines down.
  • Single source of truth: confirmed the previously hardcoded 256 is the only occurrence; no other call site sets bufferElementLimit.
  • Validation at build time: rejecting < 1 eagerly in RaftPropertiesBuilder.build() is the right place - it fails fast on startup rather than producing a confusing Ratis error later.
  • Tests: assertions use the assertThat(...).isEqualTo(...) / assertThatThrownBy style requested in CLAUDE.md, and they read back through the real Ratis RaftServerConfigKeys accessors rather than asserting on internal state - good round-trip coverage including the zero/negative error paths.

Minor points

  1. Test count mismatch in the PR description: the body says "RaftPropertiesBuilderTest (5 tests)" / "Adds ... (5 tests)", but the committed file has 7 @Test methods (the extras being defaultGrpcMessageSizeMaxIs128MB, writeBufferSizeMustExceedAppendBufferPlusFraming, and the negative-limit case). Worth correcting the description for accuracy.

  2. No upper bound / interaction with HA_APPEND_BUFFER_SIZE: the lower bound is enforced but there's no upper guard, and the element limit and the 4 MB byte limit interact - whichever is hit first caps the batch. That's fine and arguably intentional (the byte limit is the real heap guard), but since the stated goal is "bound per-batch in-memory footprint," it may be worth a one-line note in the config doc that the byte limit remains the dominant safety bound and the element limit is the secondary cap. Not blocking.

  3. Throughput trade-off of the new default (256 -> 64): lowering the default reduces catch-up batch size by 4x, which bounds heap but can also slow resync throughput for far-behind followers (more round-trips). The design doc acknowledges operators can raise it, so this is a reasonable conservative default - just flagging that the catch-up-time effect isn't covered by a test. The existing RaftHAConfigurationIT/RaftHAServerTest passing confirms correctness but not catch-up performance; acceptable for this fix.

Things I verified

  • ConfigurationException import present and the two throw sites compile against it.
  • getValueAsInteger(GlobalConfiguration.HA_APPEND_ELEMENT_LIMIT) is consistent with other Integer.class configs (autoboxed 64 default matches the existing pattern, e.g. TYPE_DEFAULT_BUCKETS).
  • SCOPE.SERVER is the correct scope (cluster/replication-level, not per-database).

No security concerns, and no GC/perf hazard in a hot path (this runs once at properties-build time). LGTM pending the description fix.

…er validation

- Note in the appendElementLimit description that arcadedb.ha.appendBufferSize
  (byte limit) remains the dominant per-batch heap bound and the element count
  is the secondary cap.
- Add a regression test for the writeBufferSize < appendBufferSize + 8
  ConfigurationException branch.
- Record review history and the deferred root-cause-dimension question (needs
  the incident heap dump) in the tracking doc.
@claude

claude Bot commented Jun 27, 2026

Copy link
Copy Markdown

Code Review - PR 4753

Reviewed the diff against the repo conventions in CLAUDE.md. Overall this is a clean, well-scoped, well-tested change. The new HA_APPEND_ELEMENT_LIMIT config follows the existing GlobalConfiguration patterns exactly (SCOPE, text-block description, Integer.class default), the validation throws the same ConfigurationException used elsewhere in RaftPropertiesBuilder, and the new test class mirrors the project's JUnit5 + AssertJ style. Nice touch keeping final on locals and using assertThatThrownBy for the negative cases.

Strongest concern: does lowering the default actually address the OOM?

The PR honestly flags this in the deferred-items section, and I want to amplify it because it is the crux of the fix. The element limit only binds AFTER the 4MB byte limit (HA_APPEND_BUFFER_SIZE). For the incident's described workload (entries up to several MB of compressed WAL data), the 4MB byte cap already truncates a batch to ~1-2 entries long before 64 vs 256 elements ever matters. In that regime the element-limit change is a no-op for peak heap, while the lower default DOES impose a real cost on healthy clusters with many small transactions (more AppendEntries round-trips during catch-up).

So the change as-is is safe and makes the knob tunable (good), but I would push back on flipping the default from 256 to 64 without the heap-dump confirmation that entry-count was the dominant dimension. Two options:

  • Keep the default at 256 (preserve current behavior), ship only the new tunable knob, and let operators lower it. This avoids a silent behavioral regression for the common small-transaction case.
  • Keep 64, but only after the heap dump confirms entry-count dominated.

Either way this is a judgment call for the maintainer; the code itself is correct.

Minor

  • Validation has a lower bound but no upper bound. appendElementLimit < 1 is rejected, but an operator could set e.g. 1_000_000 and re-introduce the very OOM this guards against. Not blocking, but a sanity ceiling (or at least a doc note) would make the bound symmetric with intent.
  • Test layer is right. Verifying the property is translated into Ratis correctly at the unit level (vs. a slow HA integration test) is the correct call.
  • Per-issue doc file docs/4752-bound-appendentries-catch-up-batch.md is thorough and the review-cycle reasoning is useful, but it partly reads like scratch notes (e.g. references to the automated loop). Confirm the project intends these per-issue working docs to live in the repo long-term.

Test coverage - good

Default (64), custom (128), byte limit (4MB), gRPC message size (128MB), write-buffer happy path, write-buffer rejection, and zero/negative element-limit rejection are all covered. The negative-path assertions check both exception type and message substrings, which is the right level of specificity.

Security / performance

No security concerns - all values are server-scoped config, no untrusted input path. Performance impact is limited to replication batching, addressed in the default-change discussion above.

Nice work overall - the validation, the matching test, and the candid deferred-items writeup are all above bar. The one decision worth a second look before merge is the default value flip (256 to 64).

Note: I could not run the Maven build/test in this environment (command approval restricted), so this review is based on code reading; the test structure looks correct and should pass as the PR claims.

@claude

claude Bot commented Jun 27, 2026

Copy link
Copy Markdown

Review: PR #4753 - make AppendEntries element limit configurable, default 64

Thanks for this. It is a clean, well-scoped change with strong test coverage and unusually thorough documentation of the reasoning. Overall it looks correct and ready, with one decision worth settling before merge.

What is good

  • Tight, isolated change. A single hardcoded 256 is lifted into GlobalConfiguration and wired through RaftPropertiesBuilder with no collateral edits. I grepped the module and confirmed setBufferElementLimit has no other call sites that still hardcode the value.
  • Validation + tests. The >= 1 guard correctly throws ConfigurationException, and the new RaftPropertiesBuilderTest covers default, custom, byte limit, gRPC message size, write-buffer validation (happy + failure), and the zero/negative rejection. Style follows project conventions (AssertJ assertThat(...).isEqualTo(...), final params, license header).
  • Honest write-up. The docs/ note records that the gemini "0 freezes replication" claim was verified against Ratis source (DataQueue.offer, 0 = unlimited) and the literal claim was wrong, while still applying the sound underlying advice. That traceability is great. The doc file also matches the existing repo convention (4689, 4740).

The one thing to settle: default 64 vs 256

This is the substantive call, and you have already flagged it as the open decision. I want to reinforce it rather than wave it through. By your own analysis:

  • The 4MB byte limit (appendBufferSize) is the dominant per-batch heap bound. For the incident workload (multi-MB compressed WAL entries), the byte cap truncates a batch to ~1-2 entries long before 64 vs 256 ever binds, so the default flip is a no-op for peak heap in exactly the regime that triggered the OOM.
  • Meanwhile the lower default does cost healthy clusters extra AppendEntries round-trips for streams of small transactions.

So shipping a behavior change whose benefit is unconfirmed (the heap dump is not available) while its cost to healthy clusters is concrete leans toward the strictly safer option: keep the default at 256 and ship only the tunable knob. Operators who actually hit the OOM lower it themselves; nobody else regresses. The knob and validation are correct either way, so this is purely a default-value decision for the maintainer/issue author once the heap dump confirms whether entry-count was ever the governing dimension. If 64 stays, please add the release note you mention (the change from 256 will surprise anyone tuning replication throughput).

Minor / non-blocking

  • No upper bound on appendElementLimit: an operator can set a huge value and re-introduce the OOM. You note this; the byte limit is the real ceiling, so an arbitrary cap is not worth a judgment call here. Fine as-is.
  • Validation runs at properties-build (server start) time, not at config-set time. That is consistent with the adjacent writeBufferSize check in the same method, so it is the right place; just noting the invalid value surfaces at startup, which is acceptable.
  • The docs/ note carries a fair amount of process/meta-commentary ("review cycles", per-reviewer notes). It matches the existing pattern, but consider trimming to the durable root-cause/fix content so the file stays useful long-term.

Test coverage

Good for the unit surface. The one gap is that nothing exercises the actual catch-up/OOM scenario end-to-end (understandably hard without the heap dump and a large dataset), so the fix real-world effectiveness rests on the manual reasoning in the doc rather than a regression test. Not a blocker given the constraints, but worth acknowledging that this PR makes the knob tunable; it does not prove it resolves the issue.

Nice work overall. The code is mergeable; the default value is the only thing I would want a maintainer to confirm first.

@codecov

codecov Bot commented Jun 27, 2026

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 65.25%. Comparing base (1b1c074) to head (0c70fa7).

Additional details and impacted files
@@            Coverage Diff            @@
##               main    #4753   +/-   ##
=========================================
  Coverage     65.25%   65.25%           
- Complexity      625      629    +4     
=========================================
  Files          1675     1675           
  Lines        132608   132613    +5     
  Branches      28365    28366    +1     
=========================================
+ Hits          86530    86542   +12     
+ Misses        34143    34141    -2     
+ Partials      11935    11930    -5     

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

HA: follower OOMs deserializing inbound AppendEntries during snapshot/catch-up resync (replication batch not bounded to heap)

1 participant