fix(#4752): make AppendEntries element limit configurable, default 64#4753
Conversation
Add HA_APPEND_ELEMENT_LIMIT (arcadedb.ha.appendElementLimit) to GlobalConfiguration with a default of 64, replacing the hardcoded 256 in RaftPropertiesBuilder. This bounds the per-batch in-memory footprint on the follower during catch-up resync, where many batches may queue before the state machine can apply them, reducing peak heap pressure and mitigating the OutOfMemoryError observed on 6 GB nodes. Adds RaftPropertiesBuilderTest to verify the config wiring.
|
Tick the box to add this pull request to the merge queue (same as
|
There was a problem hiding this comment.
Code Review
This pull request introduces a new configuration parameter HA_APPEND_ELEMENT_LIMIT to bound the maximum number of Raft log entries per AppendEntries batch, resolving follower OOM issues during catch-up resync. The review feedback correctly identifies that setting this limit to 0 does not disable the limit in Apache Ratis but instead freezes replication. It is recommended to update the configuration description, add validation to ensure the limit is strictly greater than 0, and include a corresponding unit test to verify this validation behavior.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
| HA_APPEND_ELEMENT_LIMIT("arcadedb.ha.appendElementLimit", SCOPE.SERVER, | ||
| """ | ||
| Maximum number of Raft log entries per AppendEntries batch. Bounds the per-batch in-memory \ | ||
| footprint on the follower during catch-up resync, where many batches may queue before the \ | ||
| state machine can apply them. Lowering this value reduces peak heap pressure on followers \ | ||
| catching up from a far-behind state. 0 means no element-count limit (only the byte limit applies).""", | ||
| Integer.class, 64), |
There was a problem hiding this comment.
Setting arcadedb.ha.appendElementLimit to 0 will not disable the element-count limit in Apache Ratis; instead, it will treat the buffer limit as zero, which will immediately mark the buffer as full and completely freeze replication. The description should be updated to reflect that the value must be greater than 0, and validation should be added to prevent invalid configurations.
| HA_APPEND_ELEMENT_LIMIT("arcadedb.ha.appendElementLimit", SCOPE.SERVER, | |
| """ | |
| Maximum number of Raft log entries per AppendEntries batch. Bounds the per-batch in-memory \ | |
| footprint on the follower during catch-up resync, where many batches may queue before the \ | |
| state machine can apply them. Lowering this value reduces peak heap pressure on followers \ | |
| catching up from a far-behind state. 0 means no element-count limit (only the byte limit applies).""", | |
| Integer.class, 64), | |
| HA_APPEND_ELEMENT_LIMIT("arcadedb.ha.appendElementLimit", SCOPE.SERVER, | |
| """ | |
| Maximum number of Raft log entries per AppendEntries batch. Bounds the per-batch in-memory \ | |
| footprint on the follower during catch-up resync, where many batches may queue before the \ | |
| state machine can apply them. Lowering this value reduces peak heap pressure on followers \ | |
| catching up from a far-behind state. Must be greater than 0.""", | |
| Integer.class, 64), |
| final int appendElementLimit = configuration.getValueAsInteger(GlobalConfiguration.HA_APPEND_ELEMENT_LIMIT); | ||
| RaftServerConfigKeys.Log.Appender.setBufferElementLimit(properties, appendElementLimit); |
There was a problem hiding this comment.
Add validation to ensure that appendElementLimit is strictly greater than 0. If it is set to 0 or a negative value, it will cause replication to freeze or fail because Apache Ratis does not support 0 as a 'no limit' indicator for bufferElementLimit.
| final int appendElementLimit = configuration.getValueAsInteger(GlobalConfiguration.HA_APPEND_ELEMENT_LIMIT); | |
| RaftServerConfigKeys.Log.Appender.setBufferElementLimit(properties, appendElementLimit); | |
| final int appendElementLimit = configuration.getValueAsInteger(GlobalConfiguration.HA_APPEND_ELEMENT_LIMIT); | |
| if (appendElementLimit <= 0) { | |
| throw new ConfigurationException( | |
| "arcadedb.ha.appendElementLimit (" + appendElementLimit + ") must be greater than 0"); | |
| } | |
| RaftServerConfigKeys.Log.Appender.setBufferElementLimit(properties, appendElementLimit); |
| assertThat(RaftServerConfigKeys.Log.writeBufferSize(props).getSizeInt()) | ||
| .isEqualTo(8 * 1024 * 1024); | ||
| } |
There was a problem hiding this comment.
Add a unit test to verify that an exception is thrown when HA_APPEND_ELEMENT_LIMIT is configured with an invalid value (less than or equal to 0).
assertThat(RaftServerConfigKeys.Log.writeBufferSize(props).getSizeInt())
.isEqualTo(8 * 1024 * 1024);
}
@Test
void invalidElementLimitThrowsException() {
final ContextConfiguration config = new ContextConfiguration();
config.setValue(GlobalConfiguration.HA_APPEND_ELEMENT_LIMIT, 0);
org.assertj.core.api.Assertions.assertThatThrownBy(() -> RaftPropertiesBuilder.build(config))
.isInstanceOf(com.arcadedb.exception.ConfigurationException.class)
.hasMessageContaining("must be greater than 0");
}
Up to standards ✅🟢 Issues
|
| Metric | Results |
|---|---|
| Complexity | 0 |
🟢 Coverage 100.00% diff coverage · -7.31% coverage variation
Metric Results Coverage variation ✅ -7.31% coverage variation Diff coverage ✅ 100.00% diff coverage Coverage variation details
Coverable lines Covered lines Coverage Common ancestor commit (1b1c074) 132608 98502 74.28% Head commit (0c70fa7) 164416 (+31808) 110103 (+11601) 66.97% (-7.31%) Coverage variation is the difference between the coverage for the head and common ancestor commits of the pull request branch:
<coverage of head commit> - <coverage of common ancestor commit>Diff coverage details
Coverable lines Covered lines Diff coverage Pull request (#4753) 6 6 100.00% Diff coverage is the percentage of lines that are covered by tests out of the coverable lines that the pull request added or modified:
<covered lines added or modified>/<coverable lines added or modified> * 100%
NEW Get contextual insights on your PRs based on Codacy's metrics, along with PR and Jira context, without leaving GitHub. Enable AI reviewer
TIP This summary will be updated as you push new changes.
Code Review: PR #4753 — configurable AppendEntries element limitReviewed the diff against the repo conventions in Technical question on the root-cause mechanism (most important)The design doc attributes the follower OOM to entries "each up to several MB of compressed WAL data" accumulating in memory, and frames lowering the element limit The element limit really only becomes the governing dimension when entries are small (so that 64-256 of them fit under 4MB). The comment hints at the real lever - "many batches may queue before the state machine can apply them" - i.e. accumulation across multiple in-flight batches, not within one batch. But total in-flight memory is roughly Could you confirm from the heap dump / OOM analysis which dimension actually dominated? If the OOM was driven by large entries, the more effective knob may be No validation on the new value
"0 means no limit" is documented but untestedBoth the config description and the design doc state
Default change is a behavior change for healthy clustersLowering the default Things done well
Minor
Nothing here is blocking from a code-quality standpoint; the main ask is confirming the element-limit knob actually addresses the dimension that caused the OOM in #4752. |
Restrict arcadedb.ha.appendElementLimit to positive values (>= 1): setting it to 0 or a negative number would disable the per-batch bound that protects against follower OOM during catch-up, defeating the fix. Update the configuration description to document the constraint. Add two regression tests for the new validation.
Review: fix(#4752) make AppendEntries element limit configurable, default 64Reviewed the diff against the repo conventions in Strengths
Minor points
Things I verified
No security concerns, and no GC/perf hazard in a hot path (this runs once at properties-build time). LGTM pending the description fix. |
…er validation - Note in the appendElementLimit description that arcadedb.ha.appendBufferSize (byte limit) remains the dominant per-batch heap bound and the element count is the secondary cap. - Add a regression test for the writeBufferSize < appendBufferSize + 8 ConfigurationException branch. - Record review history and the deferred root-cause-dimension question (needs the incident heap dump) in the tracking doc.
Code Review - PR 4753Reviewed the diff against the repo conventions in Strongest concern: does lowering the default actually address the OOM? The PR honestly flags this in the deferred-items section, and I want to amplify it because it is the crux of the fix. The element limit only binds AFTER the 4MB byte limit ( So the change as-is is safe and makes the knob tunable (good), but I would push back on flipping the default from 256 to 64 without the heap-dump confirmation that entry-count was the dominant dimension. Two options:
Either way this is a judgment call for the maintainer; the code itself is correct. Minor
Test coverage - good Default (64), custom (128), byte limit (4MB), gRPC message size (128MB), write-buffer happy path, write-buffer rejection, and zero/negative element-limit rejection are all covered. The negative-path assertions check both exception type and message substrings, which is the right level of specificity. Security / performance No security concerns - all values are server-scoped config, no untrusted input path. Performance impact is limited to replication batching, addressed in the default-change discussion above. Nice work overall - the validation, the matching test, and the candid deferred-items writeup are all above bar. The one decision worth a second look before merge is the default value flip (256 to 64). Note: I could not run the Maven build/test in this environment (command approval restricted), so this review is based on code reading; the test structure looks correct and should pass as the PR claims. |
|
Review: PR #4753 - make AppendEntries element limit configurable, default 64 Thanks for this. It is a clean, well-scoped change with strong test coverage and unusually thorough documentation of the reasoning. Overall it looks correct and ready, with one decision worth settling before merge. What is good
The one thing to settle: default 64 vs 256 This is the substantive call, and you have already flagged it as the open decision. I want to reinforce it rather than wave it through. By your own analysis:
So shipping a behavior change whose benefit is unconfirmed (the heap dump is not available) while its cost to healthy clusters is concrete leans toward the strictly safer option: keep the default at 256 and ship only the tunable knob. Operators who actually hit the OOM lower it themselves; nobody else regresses. The knob and validation are correct either way, so this is purely a default-value decision for the maintainer/issue author once the heap dump confirms whether entry-count was ever the governing dimension. If 64 stays, please add the release note you mention (the change from 256 will surprise anyone tuning replication throughput). Minor / non-blocking
Test coverage Good for the unit surface. The one gap is that nothing exercises the actual catch-up/OOM scenario end-to-end (understandably hard without the heap dump and a large dataset), so the fix real-world effectiveness rests on the manual reasoning in the doc rather than a regression test. Not a blocker given the constraints, but worth acknowledging that this PR makes the knob tunable; it does not prove it resolves the issue. Nice work overall. The code is mergeable; the default value is the only thing I would want a maintainer to confirm first. |
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #4753 +/- ##
=========================================
Coverage 65.25% 65.25%
- Complexity 625 629 +4
=========================================
Files 1675 1675
Lines 132608 132613 +5
Branches 28365 28366 +1
=========================================
+ Hits 86530 86542 +12
+ Misses 34143 34141 -2
+ Partials 11935 11930 -5 ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
Closes #4752
Summary
HA_APPEND_ELEMENT_LIMIT(arcadedb.ha.appendElementLimit) toGlobalConfigurationwith a default of 64, replacing the previously hardcoded256inRaftPropertiesBuilder.RaftServerConfigKeys.Log.Appender.setBufferElementLimitso operators can tune the per-batch entry count.>= 1at properties-build time (a value of 0 or negative would silently defeat the per-batch element bound).RaftPropertiesBuilderTest(8 tests) covering the default/custom element limit, byte limit, gRPC message size, write-buffer validation, and the zero/negative element-limit rejection.The lower default (64 vs 256) is a secondary cap that bounds the per-batch entry count during catch-up resync. The byte limit (
arcadedb.ha.appendBufferSize, 4MB default) remains the dominant per-batch heap bound; the element count governs only when entries are small enough that many fit under the byte limit. Together with the existingarcadedb.ha.grpcMessageSizeMaxknob, all three catch-up bounds the issue requested are now operator-tunable.Test plan
RaftPropertiesBuilderTest- 8 unit tests: default element limit is 64, custom value applied, byte limit and gRPC message size read back correctly, write-buffer validation (happy path + exception), and zero/negative element-limit rejection.RaftHAServerTest(45 tests) andRaftHAConfigurationIT(5 tests) pass without change.GlobalConfigurationTest(10 tests) passes - new enum constant is valid.Note
The root-cause dimension (entry count vs byte/message size) that drove the OOM in the incident is documented as a deferred item in the tracking doc and needs the incident heap dump to confirm; this PR makes all three bounds tunable regardless.