
Reduce impact of background merge/compress to ChainTip #18995

Merged
AskAlexSharov merged 3 commits into release/3.3 from alex/comp_workers_reduce_33 on Feb 6, 2026

Conversation

@AskAlexSharov

Collaborator

No description provided.

taratorio
taratorio previously approved these changes Feb 6, 2026
@taratorio taratorio dismissed their stale review February 6, 2026 01:41

oh wait, I think there might be some unintended changes to the execution-spec-tests and node/interfaces submodules?

@taratorio

Member

I think there might be some unintended changes to the execution-spec-tests and node/interfaces submodules?

@AskAlexSharov AskAlexSharov enabled auto-merge (squash) February 6, 2026 02:05
@AskAlexSharov AskAlexSharov merged commit b2dc316 into release/3.3 Feb 6, 2026
11 checks passed
@AskAlexSharov AskAlexSharov deleted the alex/comp_workers_reduce_33 branch February 6, 2026 02:10
github-merge-queue Bot pushed a commit that referenced this pull request Apr 10, 2026
## Reduce impact of synchronized aggregation across fleet nodes
 
### Problem
 
When running multiple Erigon nodes syncing the same chain, all nodes
cross snapshot step boundaries at nearly the same time (within seconds
of each other). This triggers `BuildFilesInBackground` simultaneously on
every node, and the resulting aggregation I/O stalls block execution on
all nodes at once.
 
In a load-balanced fleet this causes a total service outage — every
backend falls behind the chain tip simultaneously, and the proxy has
zero healthy backends to route traffic to.
 
### Real-world incident (April 7, 2026)
 
We operate a 3-node fleet. After ~2 months of stable operation, all
nodes hit aggregation step 2193 within 20 seconds of each other:
 
| Node | `BuildFilesInBackground step=2193` | Aggregation duration |
|------|-------------------------------------|---------------------|
| node-1 | 09:59:34 | 2m30s |
| node-2 | 09:59:28 | 2m29s |
| node-3 | 09:59:48 | still aggregating, was restarted |
 
During the aggregation, block execution throughput dropped from ~20
Mgas/s to ~1-5 Mgas/s. All nodes fell behind the chain tip. At 10:07:33
the fleet had **0 out of 3 healthy backends** for 60 seconds.
 
The aggregation step itself evicted ~16GB of page cache (RSS dropped
from 48GB to 32GB on one node), starving block execution of I/O
bandwidth.
 
Each node recovered on its own within 10-15 minutes, but the
synchronized nature of the stall meant there was no healthy node to
absorb traffic during the event.
 
### Root cause
 
`BuildFilesInBackground` is triggered when `txNum` crosses a step
boundary. Since all nodes process the same chain in real time, they all
cross the boundary on the same block. The trigger is deterministic —
there is no jitter or per-node offset.
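To make the root cause concrete, here is a minimal Go sketch of a boundary check keyed purely on `txNum`. All names and the step size are hypothetical, chosen for illustration rather than taken from Erigon. Because every node replays the same canonical sequence of transaction numbers, the check fires on the same `txNum` on every node, independent of wall-clock time or node identity.

```go
package main

import "fmt"

// stepSize is a hypothetical number of transactions per snapshot step,
// kept small for illustration; it is not Erigon's actual configuration.
const stepSize uint64 = 1000

// crossesStepBoundary reports whether advancing from prevTxNum to txNum
// moves into a new step, i.e. the point where a background build would fire.
func crossesStepBoundary(prevTxNum, txNum uint64) bool {
	return txNum/stepSize > prevTxNum/stepSize
}

// firstTrigger returns the first txNum in the sequence at which a step
// boundary is crossed, or -1 if none is crossed.
func firstTrigger(txNums []uint64) int64 {
	for i := 1; i < len(txNums); i++ {
		if crossesStepBoundary(txNums[i-1], txNums[i]) {
			return int64(txNums[i])
		}
	}
	return -1
}

func main() {
	// Every node executes the same canonical chain, so each one observes the
	// same sequence of txNum values at block boundaries.
	chain := []uint64{950, 980, 1003, 1040}
	for _, node := range []string{"node-1", "node-2", "node-3"} {
		fmt.Println(node, "triggers at txNum", firstTrigger(chain))
	}
}
```

Nothing in the check depends on the node, which is why adding a per-node offset has to happen outside the trigger itself.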
 
### Solution
 
Add a configurable delay (`ERIGON_AGGREGATION_DELAY_MS`, default 0) at
the start of `BuildFilesInBackground`, before the build loop begins.
This follows the same pattern as the existing `COMPRESS_WORKERS` env var
in `common/dbg/experiments.go`.
 
Operators running multi-node fleets can set different values per node to
desynchronize aggregation:
 
```
node-1: ERIGON_AGGREGATION_DELAY_MS=0
node-2: ERIGON_AGGREGATION_DELAY_MS=60000
node-3: ERIGON_AGGREGATION_DELAY_MS=120000
```
 
Assuming the nodes cross the step boundary within a few seconds of each
other, this staggers the start of aggregation by roughly 60 seconds per
node, which would likely have prevented the 0/3 healthy window in the
incident above. Single-node operators are unaffected (default is 0).
 

### Notes
 
- This is complementary to `COMPRESS_WORKERS` (PR #18995) which reduces
I/O pressure *within* each aggregation step. This PR addresses the
*timing* of when aggregation starts across nodes.
- No impact on single-node deployments or initial sync (default delay is
0).

---------

Signed-off-by: Peter Lemenkov <lemenkov@gmail.com>
Co-authored-by: Alexey Sharov <AskAlexSharov@gmail.com>
github-merge-queue Bot pushed a commit that referenced this pull request Apr 11, 2026
…ure (#20486)

### Problem
 
When Erigon is running at chain tip, `MergeLoop` executes merge steps
back-to-back with no pause between iterations. Each merge step involves
heavy disk I/O (reading, compressing, and writing state files). Running
these steps consecutively saturates the disk, starving block execution
of I/O bandwidth.
 
The result is periodic block processing stalls: the node's reported
block number freezes for minutes at a time while background merges
consume all available I/O, then bursts forward when a merge step
completes. During these stalls the node falls behind the chain tip and
is marked unhealthy by load balancers.
 
### Observed behavior
 
On a production fleet running Erigon v3.3.x on AWS Graviton instances
(64GB RAM, EBS gp3 volumes), we observed the following pattern during
MergeLoop activity on individual nodes:
 
- Block execution throughput drops from ~20 Mgas/s to 1-5 Mgas/s
- Node block number freezes for 8-16 minutes per merge step
- Page cache eviction of 16GB+ as merge I/O displaces cached state data
- Lag accumulates at ~5 blocks/minute during each stall
- Worst observed: 164 blocks behind over a 188-minute period of
continuous merge activity
 
The node always recovers eventually, but the stalls cause the node to be
removed from load balancer rotation, reducing fleet capacity.
 
### Solution
 
Add a configurable delay between `MergeLoop` iterations via the
`MERGE_THROTTLE_MS` environment variable (default 0, preserving current
behavior). The delay is inserted after each successful `mergeLoopStep`,
giving block execution a window to access the disk before the next merge
step begins.
 
```
Before (current):
  mergeLoopStep()  → heavy I/O
  mergeLoopStep()  → immediately, more heavy I/O
  mergeLoopStep()  → immediately, more heavy I/O
 
After (with ERIGON_MERGE_THROTTLE_MS=2000):
  mergeLoopStep()  → heavy I/O
  sleep(2s)        → block execution catches up
  mergeLoopStep()  → heavy I/O
  sleep(2s)        → block execution catches up
```
 
### Production results
 
We have been running this patch on a 3-node production fleet since
December 2025. Results:
 
- Individual node availability during merge-heavy periods improved from
~90% to >99%
- Block execution stalls reduced from 8-16 minutes to under 5 minutes
- Nodes maintain chain tip proximity during merge activity
- No negative impact on merge completion time (merges still finish, just
spread over a slightly longer window)
- Fleet-wide availability (via load-balanced proxy) is near 99.99%, with
the remaining downtime caused by synchronized stalls that this patch and
`AGGREGATION_DELAY_MS` (PR #20391) address together
 
Recommended values based on our testing:
 
| Use case | Value (ms) | Effect |
|----------|------------|--------|
| Default (no throttle) | 0 | Current behavior, no change |
| Light throttle | 500 | Slight breathing room between merges |
| Production RPC nodes | 2000 | Good balance of merge progress and block execution |
| Heavy RPC workload | 5000 | Prioritize block execution over merge speed |
 
### Notes
 
- This is complementary to `COMPRESS_WORKERS` (PR #18995) which reduces
I/O pressure *within* each merge step by limiting worker parallelism.
This PR addresses I/O pressure *between* merge steps.
- This is also complementary to `AGGREGATION_DELAY_MS` (PR #20391,
merged) which staggers the *start time* of aggregation across fleet
nodes.
- No impact on single-node deployments or initial sync (default delay is
0).

Signed-off-by: Peter Lemenkov <lemenkov@gmail.com>