Reduce impact of background merge/compress to ChainTip#18995
Merged
Conversation
taratorio
previously approved these changes
Feb 6, 2026
oh wait, I think there might be some unintended changes to the execution-spec-tests and node/interfaces submodules?
Member
taratorio
approved these changes
Feb 6, 2026
github-merge-queue Bot
pushed a commit
that referenced
this pull request
Apr 10, 2026
## Reduce impact of synchronized aggregation across fleet nodes

### Problem

When running multiple Erigon nodes syncing the same chain, all nodes cross snapshot step boundaries at nearly the same time (within seconds of each other). This triggers `BuildFilesInBackground` simultaneously on every node, and the resulting aggregation I/O stalls block execution on all nodes at once.

In a load-balanced fleet this causes a total service outage: every backend falls behind the chain tip simultaneously, and the proxy has zero healthy backends to route traffic to.

### Real-world incident (April 7 2026)

We operate a 3-node fleet. After ~2 months of stable operation, all nodes hit aggregation step 2193 within 20 seconds of each other:

| Node | `BuildFilesInBackground step=2193` | Aggregation duration |
|------|-------------------------------------|---------------------|
| node-1 | 09:59:34 | 2m30s |
| node-2 | 09:59:28 | 2m29s |
| node-3 | 09:59:48 | still aggregating, was restarted |

During the aggregation, block execution throughput dropped from ~20 Mgas/s to ~1-5 Mgas/s. All nodes fell behind the chain tip. At 10:07:33 the fleet had **0 out of 3 healthy backends** for 60 seconds.

The aggregation step itself evicted ~16GB of page cache (RSS dropped from 48GB to 32GB on one node), starving block execution of I/O bandwidth. Each node recovered on its own within 10-15 minutes, but the synchronized nature of the stall meant there was no healthy node to absorb traffic during the event.

### Root cause

`BuildFilesInBackground` is triggered when `txNum` crosses a step boundary. Since all nodes process the same chain in real time, they all cross the boundary on the same block. The trigger is deterministic: there is no jitter or per-node offset.

### Solution

Add a configurable delay (`ERIGON_AGGREGATION_DELAY_MS`, default 0) at the start of `BuildFilesInBackground`, before the build loop begins. This follows the same pattern as the existing `COMPRESS_WORKERS` env var in `common/dbg/experiments.go`.

Operators running multi-node fleets can set different values per node to desynchronize aggregation:

```
node-1: ERIGON_AGGREGATION_DELAY_MS=0
node-2: ERIGON_AGGREGATION_DELAY_MS=60000
node-3: ERIGON_AGGREGATION_DELAY_MS=120000
```

This guarantees at least 60 seconds between each node starting its aggregation, which would have completely prevented the 0/3 healthy window in the incident above. Single-node operators are unaffected (default is 0).

### Notes

- This is complementary to `COMPRESS_WORKERS` (PR #18995) which reduces I/O pressure *within* each aggregation step. This PR addresses the *timing* of when aggregation starts across nodes.
- No impact on single-node deployments or initial sync (default delay is 0).

---------

Signed-off-by: Peter Lemenkov <lemenkov@gmail.com>
Co-authored-by: Alexey Sharov <AskAlexSharov@gmail.com>
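For readers who want a concrete picture of the start-up delay described in the commit message above, here is a minimal Go sketch. It is not the code merged into Erigon: `parseDelayMs`, `sleepCtx`, and `buildFilesWithDelay` are hypothetical names made up for illustration, and only the `ERIGON_AGGREGATION_DELAY_MS` variable name comes from the commit message. The sketch assumes the delay should be cancellable, so that shutdown is not blocked by a long sleep.

```go
package main

import (
	"context"
	"fmt"
	"os"
	"strconv"
	"time"
)

// parseDelayMs converts a millisecond string into a duration.
// Empty, malformed, or negative input yields 0 (no delay), which
// mirrors the "default 0" behaviour described in the commit message.
func parseDelayMs(raw string) time.Duration {
	ms, err := strconv.Atoi(raw)
	if err != nil || ms < 0 {
		return 0
	}
	return time.Duration(ms) * time.Millisecond
}

// sleepCtx waits for d, or returns early if ctx is cancelled.
// It reports whether the caller should go on with its work.
func sleepCtx(ctx context.Context, d time.Duration) bool {
	if d <= 0 {
		return ctx.Err() == nil
	}
	timer := time.NewTimer(d)
	defer timer.Stop()
	select {
	case <-ctx.Done():
		return false
	case <-timer.C:
		return true
	}
}

// buildFilesWithDelay is a hypothetical stand-in for the start of a
// background build: wait for the per-node delay, then run build.
func buildFilesWithDelay(ctx context.Context, delay time.Duration, build func()) bool {
	if !sleepCtx(ctx, delay) {
		return false // cancelled before the build started
	}
	build()
	return true
}

func main() {
	// Variable name taken from the commit message; unset means no delay.
	delay := parseDelayMs(os.Getenv("ERIGON_AGGREGATION_DELAY_MS"))
	ran := buildFilesWithDelay(context.Background(), delay, func() {
		fmt.Println("starting build after delay:", delay)
	})
	fmt.Println("build ran:", ran)
}
```

With the variable unset the build starts immediately; with it set to `60000` on one node and `120000` on another, each node waits that long after crossing the step boundary before it starts building.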
github-merge-queue Bot
pushed a commit
that referenced
this pull request
Apr 11, 2026
…ure (#20486)

### Problem

When Erigon is running at chain tip, `MergeLoop` executes merge steps back-to-back with no pause between iterations. Each merge step involves heavy disk I/O (reading, compressing, and writing state files). Running these steps consecutively saturates the disk, starving block execution of I/O bandwidth.

The result is periodic block processing stalls: the node's reported block number freezes for minutes at a time while background merges consume all available I/O, then bursts forward when a merge step completes. During these stalls the node falls behind the chain tip and is marked unhealthy by load balancers.

### Observed behavior

On a production fleet running Erigon v3.3.x on AWS Graviton instances (64GB RAM, EBS gp3 volumes), we observed the following pattern during MergeLoop activity on individual nodes:

- Block execution throughput drops from ~20 Mgas/s to 1-5 Mgas/s
- Node block number freezes for 8-16 minutes per merge step
- Page cache eviction of 16GB+ as merge I/O displaces cached state data
- Lag accumulates at ~5 blocks/minute during each stall
- Worst observed: 164 blocks behind over a 188-minute period of continuous merge activity

The node always recovers eventually, but the stalls cause the node to be removed from load balancer rotation, reducing fleet capacity.

### Solution

Add a configurable delay between `MergeLoop` iterations via the `MERGE_THROTTLE_MS` environment variable (default 0, preserving current behavior). The delay is inserted after each successful `mergeLoopStep`, giving block execution a window to access the disk before the next merge step begins.

```
Before (current):
  mergeLoopStep() → heavy I/O
  mergeLoopStep() → immediately, more heavy I/O
  mergeLoopStep() → immediately, more heavy I/O

After (with ERIGON_MERGE_THROTTLE_MS=2000):
  mergeLoopStep() → heavy I/O
  sleep(2s)       → block execution catches up
  mergeLoopStep() → heavy I/O
  sleep(2s)       → block execution catches up
```

### Production results

We have been running this patch on a 3-node production fleet since December 2025. Results:

- Individual node availability during merge-heavy periods improved from ~90% to >99%
- Block execution stalls reduced from 8-16 minutes to under 5 minutes
- Nodes maintain chain tip proximity during merge activity
- No negative impact on merge completion time (merges still finish, just spread over a slightly longer window)
- Fleet-wide availability (via load-balanced proxy) is near 99.99%, with the remaining downtime caused by synchronized stalls that this patch and `AGGREGATION_DELAY_MS` (PR #20391) address together

Recommended values based on our testing:

| Use case | Value | Effect |
|----------|-------|--------|
| Default (no throttle) | 0 | Current behavior, no change |
| Light throttle | 500 | Slight breathing room between merges |
| Production RPC nodes | 2000 | Good balance of merge progress and block execution |
| Heavy RPC workload | 5000 | Prioritize block execution over merge speed |

### Notes

- This is complementary to `COMPRESS_WORKERS` (PR #18995) which reduces I/O pressure *within* each merge step by limiting worker parallelism. This PR addresses I/O pressure *between* merge steps.
- This is also complementary to `AGGREGATION_DELAY_MS` (PR #20391, merged) which staggers the *start time* of aggregation across fleet nodes.
- No impact on single-node deployments or initial sync (default delay is 0).

Signed-off-by: Peter Lemenkov <lemenkov@gmail.com>
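The throttling idea in the commit message above (pause after each successful merge step) can be sketched in Go as follows. This is an illustration under stated assumptions, not the merged Erigon code: `runMergeLoop` and `simulateMerges` are hypothetical names, and the `step` callback is assumed to report whether it actually merged something, standing in for `mergeLoopStep`, whose real signature is not shown in this conversation.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// runMergeLoop calls step until it reports no more work or fails.
// After every successful step it pauses for throttle, so other disk
// users get a window before the next step begins. A throttle of 0
// keeps the back-to-back behaviour. It returns the number of steps done.
func runMergeLoop(ctx context.Context, throttle time.Duration, step func() (bool, error)) (int, error) {
	steps := 0
	for {
		if err := ctx.Err(); err != nil {
			return steps, err
		}
		merged, err := step()
		if err != nil {
			return steps, err
		}
		if !merged {
			return steps, nil // nothing left to merge
		}
		steps++
		if throttle <= 0 {
			continue
		}
		timer := time.NewTimer(throttle)
		select {
		case <-ctx.Done():
			timer.Stop()
			return steps, ctx.Err()
		case <-timer.C:
		}
	}
}

// simulateMerges runs the loop over n fake merge steps with the given
// throttle in milliseconds and returns how many steps completed.
func simulateMerges(n int, throttleMs int) int {
	remaining := n
	step := func() (bool, error) {
		if remaining == 0 {
			return false, nil
		}
		remaining--
		return true, nil
	}
	done, _ := runMergeLoop(context.Background(), time.Duration(throttleMs)*time.Millisecond, step)
	return done
}

func main() {
	// Three fake merge steps with a 10 ms pause after each one.
	fmt.Println("merge steps:", simulateMerges(3, 10))
}
```

Sleeping only after a step that did work means an idle loop exits without delay, and selecting on the context keeps the pause from holding up shutdown.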
No description provided.