Fix accounts/abi/bind tests #24
Merged
AlexeyAkhunov pushed a commit that referenced this pull request on Apr 23, 2021
AlexeyAkhunov added a commit that referenced this pull request on Apr 24, 2021
* Initial commit * Add sentry gRPC interface * p2psentry directory * Update README.md * Update README.md * Update README.md * Add go package * Correct syntax * add external downloader interface (#2) * Add txpool (#3) * Add private API (#4) * Invert control.proto, add PeerMinBlock, Separare incoming Tx message into a separate stream (#5) Co-authored-by: Alexey Sharp <alexeysharp@Alexeys-iMac.local> * Separate upload messages into its own stream (#6) Co-authored-by: Alexey Sharp <alexeysharp@Alexeys-iMac.local> * Only send changed accounts to listeners (#7) * Txpool interface doc (#9) * Add architecture diagram source and picture (#10) * Typed hashes (#11) * Typed hashes * Fix PeerId * 64-bit tx nonce * Add proper golang packages, max_block into p2p sentry Status (#12) * Add proper golang packages, max_block into p2p sentry Status * Change EtherReply to address Co-authored-by: Alexey Sharp <alexeysharp@Alexeys-iMac.local> * Add Rust infrastructure (#13) * DB stats methods removed by #1665 * more p2p methods (#15) * add mining methods (#16) * First draft of Consensus gRPC interface (#14) * Update Rust build * Fix interfaces in architecture diagram (#17) * Fix KV interface provider * Fix Consensus interface provider * drop java attributes (#18) * tx pool remove unused import (#19) * ethbackend: add protocol version and client version (#20) * Add missing ethbackend I/F (#21) * Add interface versioning mechanism (#23) Add versioning in KV interface Co-authored-by: Artem Vorotnikov <artem@vorotnikov.me> * spec of tx pool method (#24) * spec of tx pool method (#25) * Update version.proto * Refactor interface versioning * Refactor interface versioning * Testing interface * Remove tree * Fix * Build testing protos * Fix * Fix * Update to the newer interfaces * Add ProtocolVersion and ClientVersion stubs * Hook up ProtocolVersion and ClientVersion * Remove service * Add compatibility checks for RPC daemon * Fix typos * Properly update DB schema version * Fix test * Add test 
for KV compatibility * Info messages about compatibility for RPC daemon * DB schema version to be one key * Update release instructions Co-authored-by: Artem Vorotnikov <artem@vorotnikov.me> Co-authored-by: b00ris <b00ris@mail.ru> Co-authored-by: Alexey Sharp <alexeysharp@Alexeys-iMac.local> Co-authored-by: lightclient <14004106+lightclient@users.noreply.github.com> Co-authored-by: canepat <16927169+canepat@users.noreply.github.com> Co-authored-by: Alex Sharov <AskAlexSharov@gmail.com> Co-authored-by: canepat <tullio.canepa@gmail.com> Co-authored-by: Alex Sharp <alexsharp@Alexs-MacBook-Pro.local>
pgebal pushed a commit to imapp-pl/erigon that referenced this pull request on Jan 16, 2023
pcw109550 pushed a commit to sunnyside-io/erigon that referenced this pull request on May 19, 2023
Fix hive CI pipeline
battlmonstr pushed a commit that referenced this pull request on Sep 14, 2023
Pool: handle pooled transactions package
revitteth referenced this pull request in 0xPolygon/cdk-erigon on Dec 12, 2023
changed hashing function
github-merge-queue Bot pushed a commit that referenced this pull request on Mar 27, 2026
This PR introduces a two-level admission control system to protect the
Staged Sync pipeline from being starved or delayed by high RPC load.
Root Cause Analysis:
Under heavy RPC traffic, the node accumulates a large number of
goroutines blocked on roTxsLimiter.Acquire. When DB slots become
available, the backlog drains in a way that starves the staged sync
pipeline. The goroutine pile-up also causes a significant spike in
virtual memory and overall system instability.
Solution:
Two gates work in tandem:
1. HTTP admission handler (rpcAdmissionHandler) — outer gate installed
at the top of every HTTP RPC stack, before CORS, Gzip, or JSON decoding.
If the number of inflight requests exceeds the configured limit, the
request is rejected immediately with HTTP 503. This prevents goroutine
accumulation at the source. On every admitted request the handler tags
the context with WithRPCContext (limit value) so the DB layer can
identify the caller.
2. BeginRo inner gate — if the context carries a positive RPC limit,
BeginRo uses TryAcquire on roTxsLimiter and returns ErrServerOverloaded
immediately if the semaphore is full. Internal callers (staged sync,
background workers) always use blocking Acquire and are never rejected.
This two-level approach means most overload is shed at the HTTP layer
(goroutines never enter the system), while any RPC requests that slip
through under transient concurrency spikes are still fail-fast at the DB
layer rather than piling up behind the semaphore.
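The outer gate from item 1 above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the PR's actual `rpcAdmissionHandler`: the type `admissionHandler`, its fields, and the `demoStatuses` helper are hypothetical names, and the context-tagging step is left out.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
)

// admissionHandler is a hypothetical outer gate: it counts in-flight
// requests and answers 503 once the configured limit is exceeded.
type admissionHandler struct {
	next     http.Handler
	limit    int64        // max concurrent requests; <= 0 disables the gate
	inflight atomic.Int64 // requests currently being served
}

func (h *admissionHandler) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	if h.limit > 0 {
		// Count the request first, then check; undo the count on rejection.
		if h.inflight.Add(1) > h.limit {
			h.inflight.Add(-1)
			http.Error(w, "server overloaded", http.StatusServiceUnavailable)
			return
		}
		defer h.inflight.Add(-1)
	}
	h.next.ServeHTTP(w, r)
}

// demoStatuses holds one slow request in flight, sends a second request
// while the only slot is taken, and returns both status codes.
func demoStatuses() (first, second int) {
	entered := make(chan struct{})
	release := make(chan struct{})
	slow := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		close(entered) // the slot is now occupied
		<-release      // keep it occupied until the second request is done
		w.WriteHeader(http.StatusOK)
	})
	h := &admissionHandler{next: slow, limit: 1}

	done := make(chan int)
	go func() {
		rec := httptest.NewRecorder()
		h.ServeHTTP(rec, httptest.NewRequest(http.MethodPost, "/", nil))
		done <- rec.Code
	}()

	<-entered
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodPost, "/", nil))
	second = rec.Code

	close(release)
	first = <-done
	return first, second
}

func main() {
	first, second := demoStatuses()
	fmt.Println(first, second) // prints "200 503"
}
```

Because the rejection happens before the wrapped handler runs, a rejected request never reaches whatever sits behind the gate (CORS, Gzip, or JSON decoding in the description above).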
Configuration:
- --rpc.max.concurrency: HTTP admission limit.
  - 0 (default): uses --db.read.concurrency (auto-tuned to GOMAXPROCS ×
    64, capped at 9000)
  - > 0: explicit limit
  - -1: unlimited (admission control disabled; BeginRo falls back to
    blocking Acquire, as in the old behaviour)
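The flag semantics listed above can be illustrated with a small sketch. The function names `defaultDBReadConcurrency` and `effectiveRPCLimit` are hypothetical, and treating every negative value like -1 is an assumption; only the three cases in the list come from the description.

```go
package main

import (
	"fmt"
	"runtime"
)

// defaultDBReadConcurrency applies the auto-tuning rule quoted above:
// GOMAXPROCS x 64, capped at 9000. The helper name is hypothetical.
func defaultDBReadConcurrency(procs int) int {
	n := procs * 64
	if n > 9000 {
		n = 9000
	}
	return n
}

// effectiveRPCLimit maps the flag value to the limit the HTTP gate would
// enforce. enabled is false when admission control is switched off.
func effectiveRPCLimit(rpcMaxConcurrency, dbReadConcurrency int) (limit int, enabled bool) {
	switch {
	case rpcMaxConcurrency > 0: // explicit limit
		return rpcMaxConcurrency, true
	case rpcMaxConcurrency == 0: // default: follow --db.read.concurrency
		return dbReadConcurrency, true
	default: // -1: unlimited (assumption: any negative value behaves the same)
		return 0, false
	}
}

func main() {
	db := defaultDBReadConcurrency(runtime.GOMAXPROCS(0))
	for _, flag := range []int{0, 500, -1} {
		limit, enabled := effectiveRPCLimit(flag, db)
		fmt.Printf("--rpc.max.concurrency=%d -> limit=%d enabled=%v\n", flag, limit, enabled)
	}
}
```

On a machine where GOMAXPROCS is 32, the default works out to 32 × 64 = 2048 concurrent requests.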
### Summary of Resource Management Improvements
| Resource | Result |
| :--- | :--- |
| **Goroutine pile-up** | ✅ Requests rejected at HTTP layer before CORS, Gzip, or JSON decoding |
| **Staged sync starvation** | ✅ Internal callers (staged sync, workers) use blocking `Acquire` and are never rejected; RPC uses `TryAcquire` fail-fast |
| **Transient overload spikes** | ✅ `BeginRo` inner gate catches RPC requests that pass the HTTP layer during concurrency spikes |
| **Scalability** | ✅ Default limit auto-tuned to `GOMAXPROCS × 64` (capped at 9000) via `--db.read.concurrency` |
| **Configuration** | ✅ Zero required config, one optional flag (`--rpc.max.concurrency`) |
Benchmark & Stress Test Results
Setup: 32 Cores, 64GB RAM, 70GB Swap. Minimal Node in Sync. Parallel
eth_call stress tests (28k QPS).
<details>
<summary><b>Click to expand: Benchmark Data (Before vs After on local
node)</b></summary>
### Current SW (main release)
CPU
03:23:56 PM all 29.55 0.00 22.30 34.33 0.00 13.83
03:24:06 PM all 56.41 0.00 15.44 10.83 0.00 17.32
03:24:16 PM all 75.60 0.00 13.36 2.86 0.00 8.18
03:24:26 PM all 73.19 0.00 14.35 2.82 0.00 9.63
03:24:36 PM all 73.35 0.00 14.56 2.75 0.00 9.34
Memory
15:23:30 rss=31.89GB vsz=7.65TB proc_swap=11.81GB sys_swap=27.21/72.00GB
MemAvail=1.15GB SwapAvail=44.79GB
15:23:40 rss=32.74GB vsz=7.65TB proc_swap=11.00GB sys_swap=27.02/72.00GB
MemAvail=1.50GB SwapAvail=44.98GB
15:23:50 rss=33.83GB vsz=7.65TB proc_swap=9.89GB sys_swap=25.65/72.00GB
MemAvail=1.44GB SwapAvail=46.35GB
15:24:00 rss=36.33GB vsz=7.65TB proc_swap=7.60GB sys_swap=23.55/72.00GB
MemAvail=1.67GB SwapAvail=48.45GB
15:24:10 rss=37.85GB vsz=7.65TB proc_swap=6.91GB sys_swap=21.83/72.00GB
MemAvail=5.10GB SwapAvail=50.17GB
15:24:20 rss=39.30GB vsz=7.65TB proc_swap=6.69GB sys_swap=20.23/72.00GB
MemAvail=7.28GB SwapAvail=51.77GB
15:24:30 rss=40.40GB vsz=7.65TB proc_swap=6.20GB sys_swap=17.94/72.00GB
MemAvail=10.20GB SwapAvail=54.06GB
15:24:40 rss=41.44GB vsz=7.65TB proc_swap=5.23GB sys_swap=14.95/72.00GB
MemAvail=20.01GB SwapAvail=57.05GB
15:24:50 rss=41.68GB vsz=7.65TB proc_swap=5.20GB sys_swap=14.92/72.00GB
MemAvail=16.14GB SwapAvail=57.08GB
15:25:00 rss=42.77GB vsz=7.65TB proc_swap=4.95GB sys_swap=14.87/72.00GB
MemAvail=11.41GB SwapAvail=57.13GB
15:25:11 rss=42.78GB vsz=7.65TB proc_swap=5.26GB sys_swap=15.55/72.00GB
MemAvail=8.58GB SwapAvail=56.45GB
15:25:21 rss=40.79GB vsz=7.65TB proc_swap=6.88GB sys_swap=17.46/72.00GB
MemAvail=5.65GB SwapAvail=54.54GB
TIP Tracking
[15:21:44] block #24,656,279 ts=2026-03-14 15:19:47 lag=+117.8s ALERT:
lag=117.8s — node is behind the tip!
[15:21:44] block #24,656,280 ts=2026-03-14 15:19:59 lag=+105.8s ALERT:
lag=105.8s — node is behind the tip!
[15:21:44] block #24,656,281 ts=2026-03-14 15:20:11 lag=+93.8s ALERT:
lag=93.8s — node is behind the tip!
[15:21:44] block #24,656,282 ts=2026-03-14 15:20:23 lag=+81.8s ALERT:
lag=81.8s — node is behind the tip!
[15:21:44] block #24,656,283 ts=2026-03-14 15:20:47 lag=+57.8s ALERT:
lag=57.8s — node is behind the tip!
[15:21:57] block #24,656,284 ts=2026-03-14 15:20:59 lag=+58.0s ALERT:
lag=58.0s — node is behind the tip!
[15:21:57] block #24,656,285 ts=2026-03-14 15:21:11 lag=+46.0s ALERT:
lag=46.0s — node is behind the tip!
[15:21:57] block #24,656,286 ts=2026-03-14 15:21:23 lag=+34.0s ALERT:
lag=34.0s — node is behind the tip!
[15:21:57] block #24,656,287 ts=2026-03-14 15:21:35 lag=+22.0s ALERT:
lag=22.0s — node is behind the tip!
[15:21:57] block #24,656,288 ts=2026-03-14 15:21:47 lag=+10.0s OK
[15:22:07] block #24,656,289 ts=2026-03-14 15:21:59 lag=+8.0s OK
[15:22:19] block #24,656,290 ts=2026-03-14 15:22:11 lag=+8.3s OK
[15:22:32] block #24,656,291 ts=2026-03-14 15:22:23 lag=+9.3s OK
[15:23:02] ALERT: no new block for 30s (last block #24656291) — node may
be losing the tip!
[15:23:32] ALERT: no new block for 60s (last block #24656291) — node may
be losing the tip!
[15:24:02] ALERT: no new block for 90s (last block #24656291) — node may
be losing the tip!
[15:24:24] block #24,656,292 ts=2026-03-14 15:22:35 lag=+109.5s ALERT:
lag=109.5s — node is behind the tip!
[15:24:24] block #24,656,293 ts=2026-03-14 15:22:47 lag=+97.5s ALERT:
lag=97.5s — node is behind the tip!
[15:24:24] block #24,656,294 ts=2026-03-14 15:22:59 lag=+85.5s ALERT:
lag=85.5s — node is behind the tip!
[15:24:24] block #24,656,295 ts=2026-03-14 15:23:11 lag=+73.5s ALERT:
lag=73.5s — node is behind the tip!
[15:24:54] ALERT: no new block for 30s (last block #24656295) — node may
be losing the tip!
[15:25:17] block #24,656,296 ts=2026-03-14 15:23:23 lag=+114.2s ALERT:
lag=114.2s — node is behind the tip!
[15:25:17] block #24,656,297 ts=2026-03-14 15:23:35 lag=+102.2s ALERT:
lag=102.2s — node is behind the tip!
[15:25:17] block #24,656,298 ts=2026-03-14 15:23:47 lag=+90.2s ALERT:
lag=90.2s — node is behind the tip!
[15:25:17] block #24,656,299 ts=2026-03-14 15:23:59 lag=+78.2s ALERT:
lag=78.2s — node is behind the tip!
[15:25:17] block #24,656,300 ts=2026-03-14 15:24:11 lag=+66.2s ALERT:
lag=66.2s — node is behind the tip!
[15:25:17] block #24,656,301 ts=2026-03-14 15:24:23 lag=+54.2s ALERT:
lag=54.2s — node is behind the tip!
[15:25:17] block #24,656,302 ts=2026-03-14 15:24:35 lag=+42.2s ALERT:
lag=42.2s — node is behind the tip!
[15:25:17] block #24,656,303 ts=2026-03-14 15:24:47 lag=+30.2s ALERT:
lag=30.2s — node is behind the tip!
> ./run_perf_tests.py -p
pattern/mainnet/stress_test_eth_call_001_latest.tar -t 28000:60 -y
eth_call -m 2 -r 100 -Z
Performance Test started
Test repetitions: 100 on sequence: 28000:60 for pattern:
pattern/mainnet/stress_test_eth_call_001_latest.tar
Test on port: http://localhost:8545
[1. 1] daemon: executes test qps: 28000 time: 60 -> [R=100.00%
max=1m39s]
[1. 2] daemon: executes test qps: 28000 time: 60 -> [R=100.00%
max=1m46s]
[1. 3] daemon: executes test qps: 28000 time: 60 -> [R=100.00%
max=1m38s]
> ./run_perf_tests.py -p
pattern/mainnet/stress_test_eth_call_001_latest.tar -t 28000:60 -y
eth_call -m 2 -r 100 -Z
Performance Test started
Test repetitions: 100 on sequence: 28000:60 for pattern:
pattern/mainnet/stress_test_eth_call_001_latest.tar
Test on port: http://localhost:8545
[1. 1] daemon: executes test qps: 28000 time: 60 -> [R=100.00%
max=1m39s]
[1. 2] daemon: executes test qps: 28000 time: 60 -> [R=100.00%
max=1m45s]
[1. 3] daemon: executes test qps: 28000 time: 60 -> [R=100.00%
max=1m40s]
### NEW Software (with PR)
CPU
07:58:51 AM all 51.09 0.00 6.16 0.35 0.00 42.40
07:58:56 AM all 49.26 0.00 5.82 0.03 0.00 44.89
07:59:01 AM all 50.34 0.00 5.95 0.20 0.00 43.51
07:59:06 AM all 51.60 0.00 5.88 0.04 0.00 42.47
07:59:11 AM all 48.97 0.00 5.90 0.06 0.00 45.07
07:59:16 AM all 49.59 0.00 6.11 0.36 0.00 43.93
07:59:21 AM all 48.69 0.00 5.78 0.03 0.00 45.51
07:59:26 AM all 53.50 0.00 6.66 0.26 0.00 39.59
07:59:31 AM all 50.45 0.00 6.37 0.02 0.00 43.16
07:59:36 AM all 48.71 0.00 6.18 0.03 0.00 45.08
07:59:41 AM all 53.58 0.00 6.45 0.15 0.00 39.81
07:59:46 AM all 53.74 0.00 6.13 0.05 0.00 40.07
07:59:51 AM all 31.76 0.00 3.95 0.23 0.00 64.06
07:59:56 AM all 37.20 0.00 5.05 0.03 0.00 57.71
08:00:01 AM all 77.10 0.00 12.95 0.01 0.00 9.94
08:00:06 AM all 78.22 0.00 12.58 0.08 0.00 9.11
08:00:11 AM all 77.64 0.00 12.50 0.00 0.00 9.86
08:00:16 AM all 77.48 0.00 12.61 0.08 0.00 9.83
08:00:21 AM all 77.61 0.00 12.47 0.01 0.00 9.90
08:00:26 AM all 77.35 0.00 12.89 0.06 0.00 9.70
08:00:31 AM all 77.85 0.00 12.92 0.04 0.00 9.19
08:00:36 AM all 77.73 0.00 12.80 0.02 0.00 9.44
08:00:41 AM all 78.42 0.00 12.95 0.05 0.00 8.59
08:00:46 AM all 78.52 0.00 12.55 0.01 0.00 8.93
08:00:51 AM all 78.42 0.00 12.77 0.19 0.00 8.62
08:00:56 AM all 56.98 0.00 8.64 0.11 0.00 34.28
Memory
2026-03-20 08:00:36 pid=1117840 rss=30.04GB vsz=7.49TB proc_swap=0.00GB
sys_swap=0.98/72.00GB MemAvail=39.93GB SwapAvail=71.02GB
2026-03-20 08:00:41 pid=1117840 rss=30.20GB vsz=7.49TB proc_swap=0.00GB
sys_swap=0.98/72.00GB MemAvail=39.86GB SwapAvail=71.02GB
2026-03-20 08:00:41 pid=1117840 rss=30.20GB vsz=7.49TB proc_swap=0.00GB
sys_swap=0.98/72.00GB MemAvail=39.86GB SwapAvail=71.02GB
2026-03-20 08:00:46 pid=1117840 rss=30.20GB vsz=7.49TB proc_swap=0.00GB
sys_swap=0.98/72.00GB MemAvail=39.90GB SwapAvail=71.02GB
2026-03-20 08:00:46 pid=1117840 rss=30.20GB vsz=7.49TB proc_swap=0.00GB
sys_swap=0.98/72.00GB MemAvail=39.90GB SwapAvail=71.02GB
2026-03-20 08:00:51 pid=1117840 rss=30.28GB vsz=7.49TB proc_swap=0.00GB
sys_swap=0.98/72.00GB MemAvail=39.88GB SwapAvail=71.02GB
2026-03-20 08:00:51 pid=1117840 rss=30.28GB vsz=7.49TB proc_swap=0.00GB
sys_swap=0.98/72.00GB MemAvail=39.88GB SwapAvail=71.02GB
2026-03-20 08:00:56 pid=1117840 rss=30.54GB vsz=7.49TB proc_swap=0.00GB
sys_swap=0.98/72.00GB MemAvail=40.39GB SwapAvail=71.02GB
2026-03-20 08:00:56 pid=1117840 rss=30.54GB vsz=7.49TB proc_swap=0.00GB
sys_swap=0.98/72.00GB MemAvail=40.39GB SwapAvail=71.02GB
2026-03-20 08:01:02 pid=1117840 rss=30.61GB vsz=7.49TB proc_swap=0.00GB
sys_swap=0.98/72.00GB MemAvail=40.25GB SwapAvail=71.02GB
2026-03-20 08:01:02 pid=1117840 rss=30.61GB vsz=7.49TB proc_swap=0.00GB
sys_swap=0.98/72.00GB MemAvail=40.25GB SwapAvail=71.02GB
2026-03-20 08:01:07 pid=1117840 rss=30.61GB vsz=7.49TB proc_swap=0.00GB
sys_swap=0.98/72.00GB MemAvail=39.97GB SwapAvail=71.02GB
2026-03-20 08:01:07 pid=1117840 rss=30.61GB vsz=7.49TB proc_swap=0.00GB
sys_swap=0.98/72.00GB MemAvail=39.97GB SwapAvail=71.02GB
2026-03-20 08:01:12 pid=1117840 rss=30.62GB vsz=7.49TB proc_swap=0.00GB
sys_swap=0.98/72.00GB MemAvail=39.48GB SwapAvail=71.02GB
2026-03-20 08:01:12 pid=1117840 rss=30.62GB vsz=7.49TB proc_swap=0.00GB
sys_swap=0.98/72.00GB MemAvail=39.48GB SwapAvail=71.02GB
2026-03-20 08:01:17 pid=1117840 rss=30.71GB vsz=7.49TB proc_swap=0.00GB
sys_swap=0.98/72.00GB MemAvail=39.57GB SwapAvail=71.02GB
2026-03-20 08:01:17 pid=1117840 rss=30.71GB vsz=7.49TB proc_swap=0.00GB
sys_swap=0.98/72.00GB MemAvail=39.57GB SwapAvail=71.02GB
TIP Tracking
[07:56:10] block #24,697,055 ts=2026-03-20 07:55:59 lag=+12.0s OK
[07:56:15] block #24,697,056 ts=2026-03-20 07:56:11 lag=+4.5s OK
[07:56:25] block #24,697,057 ts=2026-03-20 07:56:23 lag=+2.5s OK
[07:56:38] block #24,697,058 ts=2026-03-20 07:56:35 lag=+3.4s OK
[07:56:50] block #24,697,059 ts=2026-03-20 07:56:47 lag=+3.5s OK
[07:57:02] block #24,697,060 ts=2026-03-20 07:56:59 lag=+3.6s OK
[07:57:16] block #24,697,061 ts=2026-03-20 07:57:11 lag=+5.6s OK
[07:57:27] block #24,697,062 ts=2026-03-20 07:57:23 lag=+4.7s OK
[07:57:39] block #24,697,063 ts=2026-03-20 07:57:35 lag=+4.3s OK
[07:57:49] block #24,697,064 ts=2026-03-20 07:57:47 lag=+2.4s OK
[07:58:01] block #24,697,065 ts=2026-03-20 07:57:59 lag=+2.9s OK
[07:58:13] block #24,697,066 ts=2026-03-20 07:58:11 lag=+2.8s OK
[07:58:25] block #24,697,067 ts=2026-03-20 07:58:23 lag=+2.4s OK
[07:58:37] block #24,697,068 ts=2026-03-20 07:58:35 lag=+2.7s OK
[07:58:49] block #24,697,069 ts=2026-03-20 07:58:47 lag=+2.3s OK
[07:59:01] block #24,697,070 ts=2026-03-20 07:58:59 lag=+2.1s OK
[07:59:15] block #24,697,071 ts=2026-03-20 07:59:11 lag=+4.3s OK
[07:59:25] block #24,697,072 ts=2026-03-20 07:59:23 lag=+2.6s OK
[07:59:40] block #24,697,073 ts=2026-03-20 07:59:35 lag=+5.3s OK
[08:00:02] block #24,697,074 ts=2026-03-20 07:59:59 lag=+3.9s OK
[08:00:13] block #24,697,075 ts=2026-03-20 08:00:11 lag=+2.8s OK
./run_perf_tests.py -p
pattern/mainnet/stress_test_eth_call_001_latest.tar -t 28000:60 -y
eth_call -m 2 -r 100 -Z
Performance Test started
Test repetitions: 100 on sequence: 28000:60 for pattern:
pattern/mainnet/stress_test_eth_call_001_latest.tar
Test on port: http://localhost:8545
[1. 1] daemon: executes test qps: 28000 time: 60 -> [R=51.39%
max=605.449ms error=503 Service Unavailable]
[1. 2] daemon: executes test qps: 28000 time: 60 -> [R=51.55%
max=442.974ms error=503 Service Unavailable]
[1. 3] daemon: executes test qps: 28000 time: 60 -> [R=49.52%
max=440.405ms error=503 Service Unavailable]
[1. 4] daemon: executes test qps: 28000 time: 60 -> [R=51.01%
max=440.004ms error=503 Service Unavailable]
[1. 5] daemon: executes test qps: 28000 time: 60 -> [R=49.66%
max=597.333ms error=503 Service Unavailable]
./run_perf_tests.py -p
pattern/mainnet/stress_test_eth_call_001_latest.tar -t 28000:60 -y
eth_call -m 2 -r 100 -Z
Performance Test started
Test repetitions: 100 on sequence: 28000:60 for pattern:
pattern/mainnet/stress_test_eth_call_001_latest.tar
Test on port: http://localhost:8545
[1. 1] daemon: executes test qps: 28000 time: 60 -> [R=51.51%
max=581.793ms error=503 Service Unavailable]
[1. 2] daemon: executes test qps: 28000 time: 60 -> [R=51.61%
max=431.222ms error=503 Service Unavailable]
[1. 3] daemon: executes test qps: 28000 time: 60 -> [R=49.48%
max=495.57ms error=503 Service Unavailable]
[1. 4] daemon: executes test qps: 28000 time: 60 -> [R=50.91%
max=433.208ms error=503 Service Unavailable]
[1. 5] daemon: executes test qps: 28000 time: 60 -> [R=49.57%
max=538.283ms error=503 Service Unavailable]
Verified on the CI tip-tracking infrastructure. Previous software versions
experienced "TIP lost" at 3,000 QPS. With these changes, the system now
successfully handles up to 6,000 QPS without any TIP loss or
degradation.
</details>
Stress Test Observations (main release)
- Chain Tip Loss: Under heavy load, the node fails to stay synced and
the Chain Tip is lost, as the staged sync pipeline is starved of DB read
slots by queued RPC goroutines.
- Virtual Memory Pressure: The system experiences severe VM pressure,
with process swap usage reaching 11.81 GB. The massive accumulation of
goroutines blocked on roTxsLimiter.Acquire causes excessive paging and
swapping. This state is highly unstable and frequently leads to the
process being terminated by the OOM Killer, causing total node downtime.
- Request Satisfaction (100%): Despite the performance degradation, all
requests are eventually satisfied. However, this is achieved at the cost
of system stability and synchronization.
- Increased Latency: Request latency increases dramatically due to deep
queuing, with response times reaching up to 1m 40s.
---
Stress Test Observations (with PR)
- Chain Tip Stability: The two-level admission control prevents
goroutine accumulation entirely. The HTTP outer gate rejects excess
requests before any processing; the BeginRo inner gate ensures that any
RPC request that does enter the system uses TryAcquire (fail-fast)
rather than blocking. Internal callers (staged sync, background workers)
always use blocking Acquire and are never rejected, so the pipeline
makes continuous progress.
- Virtual Memory Pressure: Significantly lower memory footprint. By
eliminating request queuing at the HTTP layer, the system avoids
excessive paging and swapping (0.00 GB swap), keeping the OS stable.
- Request Satisfaction (~50%): Approximately 50% of requests are
satisfied; the remainder are immediately rejected with 503 Service
Unavailable. This is the intended fail-fast behavior — goroutines never
accumulate, DB slots are never exhausted.
- Latency Consistency: Response latency remains consistently low. By
refusing to queue requests beyond the system's capacity, the node avoids
the massive latency spikes (previously up to 1m 40s) seen before the
fix.
This behavior is aligned with Nethermind, which returns 503 Service
Unavailable under high load, prioritizing node health over request
queuing.
---
Final Observation
By adopting a fail-fast strategy at two levels — HTTP admission before
any expensive processing, and TryAcquire inside BeginRo for RPC callers
— we enforce resource isolation at the core level. Internal execution
paths retain guaranteed access to DB read slots via blocking Acquire,
while external RPC pressure is shed immediately. This approach shifts
congestion management responsibility to the external infrastructure
(load balancers, proxies),
which is better equipped to handle buffering, ensuring that the Erigon
node remains stable and synchronized regardless of external RPC load.
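The inner gate summarized above (fail-fast for RPC callers, blocking for internal ones) can be sketched as below. This is an illustration under assumptions rather than the actual `BeginRo` code: a buffered channel stands in for `roTxsLimiter`, and `rpcLimitKey`, `withRPCLimit`, `roLimiter`, and `demoInnerGate` are hypothetical names.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// errServerOverloaded stands in for the ErrServerOverloaded named above.
var errServerOverloaded = errors.New("server overloaded")

// rpcLimitKey is a hypothetical context key marking RPC-originated calls.
type rpcLimitKey struct{}

func withRPCLimit(ctx context.Context, limit int) context.Context {
	return context.WithValue(ctx, rpcLimitKey{}, limit)
}

// roLimiter is a counting semaphore built on a buffered channel
// (an assumption; the real limiter may be implemented differently).
type roLimiter chan struct{}

// begin takes one read slot. RPC callers fail fast; everyone else blocks.
func (l roLimiter) begin(ctx context.Context) error {
	if limit, ok := ctx.Value(rpcLimitKey{}).(int); ok && limit > 0 {
		select {
		case l <- struct{}{}: // non-blocking acquire succeeded
			return nil
		default: // semaphore full: reject instead of queuing
			return errServerOverloaded
		}
	}
	select {
	case l <- struct{}{}: // blocking acquire for internal callers
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

func (l roLimiter) release() { <-l }

// demoInnerGate fills the only slot with an internal caller, then checks
// that an RPC caller is rejected while the slot is held and admitted
// once it has been released.
func demoInnerGate() (rejectedWhenFull, admittedAfterRelease bool) {
	lim := make(roLimiter, 1)
	if err := lim.begin(context.Background()); err != nil {
		return false, false
	}
	rpcCtx := withRPCLimit(context.Background(), 1)
	rejectedWhenFull = errors.Is(lim.begin(rpcCtx), errServerOverloaded)
	lim.release()
	admittedAfterRelease = lim.begin(rpcCtx) == nil
	return rejectedWhenFull, admittedAfterRelease
}

func main() {
	rejected, admitted := demoInnerGate()
	fmt.Println(rejected, admitted) // prints "true true"
}
```

Only the tagged context changes the behaviour, so internal callers need no code changes to keep their blocking path.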
## 🚀 RPC Concurrency & Resource Management Comparison
| Feature | Erigon (main) | **Erigon (with PR)** |
| :--- | :--- | :--- |
| **Admission control** | ❌ None | ✅ **HTTP outer gate** (`rpcAdmissionHandler`) |
| **Overload response** | Unlimited queuing | ✅ **Immediate HTTP 503** |
| **Rejection point** | ❌ None | ✅ Pre-CORS, Gzip, JSON decode |
| **Goroutine accumulation** | ⚠️ Yes, unlimited | ✅ **Eliminated**: goroutines don't enter the system |
| **Internal pipeline protection** | ❌ RPC and staged sync compete for slots | ✅ **Internal callers** use blocking `Acquire` |
| **DB slots protection** | ❌ None; RPC exhausts slots | ✅ `TryAcquire` in `BeginRo` for RPC |
| **Memory under load** | ❌ Critical: swap up to 11.81 GB, OOM | ✅ **Stable** (0.00 GB swap in test) |
| **Latency under overload** | High (~1m 40s) | ✅ **Consistently low** (fail-fast) |
| **Configuration required** | ❌ No concurrency flags | ✅ **Zero config**; `--rpc.max.concurrency` optional |
| **Execution isolation** | ❌ Chain tip lost under load | ✅ **Guaranteed by design** |
### 📊 Performance Comparison: Main (18/03) vs. PR
This benchmark compares the current `main` branch against this PR using
the same set of APIs under heavy load.
| API | main (18/03) post_exec p50 | PR post_exec p50 | Improvement |
| :--- | :---: | :---: | :---: |
| **eth_call** @ 3000 QPS | 6.82s ✅ | 5.89s ✅ | **−14%** |
| **eth_getBlockByNumber** @ 3000 QPS | 13.73s ⚠️ | 5.23s ✅ | **−62%** |
| **eth_getProof** @ 1000–3000 QPS | 49.12s (tip lost) | 2.84s ✅ | **−94%** |
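The Improvement column follows from the two p50 columns as a relative change, (PR − main) / main. A short sketch that recomputes it from the table values; `pctChange` is a hypothetical helper name:

```go
package main

import (
	"fmt"
	"math"
)

// pctChange returns the relative change from before to after, in percent,
// rounded to the nearest integer.
func pctChange(before, after float64) int {
	return int(math.Round((after - before) / before * 100))
}

func main() {
	rows := []struct {
		api           string
		before, after float64 // post_exec p50 in seconds, taken from the table
	}{
		{"eth_call", 6.82, 5.89},
		{"eth_getBlockByNumber", 13.73, 5.23},
		{"eth_getProof", 49.12, 2.84},
	}
	for _, r := range rows {
		fmt.Printf("%-22s %d%%\n", r.api, pctChange(r.before, r.after))
	}
}
```

The three results are -14, -62, and -94, matching the table.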
---
### 🔍 Key Observations
* **eth_call**: Neither `main` nor the PR caused a chain tip loss. Since
`eth_call` is read-only and light on DB slots, it is inherently more
stable, but the PR still delivers a **14% reduction** in p50 latency.
* **eth_getBlockByNumber**: Remains stable up to **6000 QPS** with no
actual tip loss. Any observed `sync=0` periods during testing were
identified as monitoring false negatives rather than actual node desync.
* **eth_getProof**: This is the most impactful result. While `main` lost
the chain tip at only 1000 QPS (p50=49s), the **PR successfully holds up
to 3000 QPS** with a p50 of 2.84s, a **94% reduction** in p50 latency.
### 🏆 Overall Conclusion
The final PR successfully **eliminates chain tip loss** across all
tested APIs and QPS levels. No real tip loss was observed in any
production-level test run, ensuring much higher node reliability under
stress.
---------
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
lupin012 added a commit that referenced this pull request on Apr 2, 2026
This PR introduces a two-level admission control system to protect the
Staged Sync pipeline from being starved or delayed by high RPC load.
Root Cause Analysis:
Under heavy RPC traffic, the node accumulates a large number of
goroutines blocked on roTxsLimiter.Acquire. When DB slots become
available, the backlog drains in a way that starves the staged sync
pipeline. The goroutine pile-up also causes a significant spike in
virtual memory and overall system instability.
Solution:
Two gates work in tandem:
1. HTTP admission handler (rpcAdmissionHandler) — outer gate installed
at the top of every HTTP RPC stack, before CORS, Gzip, or JSON decoding.
If the number of inflight requests exceeds the configured limit, the
request is rejected immediately with HTTP 503. This prevents goroutine
accumulation at the source. On every admitted request the handler tags
the context with WithRPCContext (limit value) so the DB layer can
identify the caller.
2. BeginRo inner gate — if the context carries a positive RPC limit,
BeginRo uses TryAcquire on roTxsLimiter and returns ErrServerOverloaded
immediately if the semaphore is full. Internal callers (staged sync,
background workers) always use blocking Acquire and are never rejected.
This two-level approach means most overload is shed at the HTTP layer
(goroutines never enter the system), while any RPC requests that slip
through under transient concurrency spikes are still fail-fast at the DB
layer rather than piling up behind the semaphore.
Configuration:
- --rpc.max.concurrency: HTTP admission limit.
  - 0 (default): uses --db.read.concurrency (auto-tuned to GOMAXPROCS ×
    64, capped at 9000)
  - > 0: explicit limit
  - -1: unlimited (admission control disabled; BeginRo falls back to
    blocking Acquire, as in the old behaviour)
### Summary of Resource Management Improvements
| Resource | Result |
| :--- | :--- |
| **Goroutine pile-up** | ✅ Requests rejected at HTTP layer before CORS, Gzip, or JSON decoding |
| **Staged sync starvation** | ✅ Internal callers (staged sync, workers) use blocking `Acquire` and are never rejected; RPC uses `TryAcquire` fail-fast |
| **Transient overload spikes** | ✅ `BeginRo` inner gate catches RPC requests that pass the HTTP layer during concurrency spikes |
| **Scalability** | ✅ Default limit auto-tuned to `GOMAXPROCS × 64` (capped at 9000) via `--db.read.concurrency` |
| **Configuration** | ✅ Zero required config, one optional flag (`--rpc.max.concurrency`) |
Benchmark & Stress Test Results
Setup: 32 Cores, 64GB RAM, 70GB Swap. Minimal Node in Sync. Parallel
eth_call stress tests (28k QPS).
<details>
<summary><b>Click to expand: Benchmark Data (Before vs After on local
node)</b></summary>
### Current SW (main release)
CPU
03:23:56 PM all 29.55 0.00 22.30 34.33 0.00 13.83
03:24:06 PM all 56.41 0.00 15.44 10.83 0.00 17.32
03:24:16 PM all 75.60 0.00 13.36 2.86 0.00 8.18
03:24:26 PM all 73.19 0.00 14.35 2.82 0.00 9.63
03:24:36 PM all 73.35 0.00 14.56 2.75 0.00 9.34
Memory
15:23:30 rss=31.89GB vsz=7.65TB proc_swap=11.81GB sys_swap=27.21/72.00GB
MemAvail=1.15GB SwapAvail=44.79GB
15:23:40 rss=32.74GB vsz=7.65TB proc_swap=11.00GB sys_swap=27.02/72.00GB
MemAvail=1.50GB SwapAvail=44.98GB
15:23:50 rss=33.83GB vsz=7.65TB proc_swap=9.89GB sys_swap=25.65/72.00GB
MemAvail=1.44GB SwapAvail=46.35GB
15:24:00 rss=36.33GB vsz=7.65TB proc_swap=7.60GB sys_swap=23.55/72.00GB
MemAvail=1.67GB SwapAvail=48.45GB
15:24:10 rss=37.85GB vsz=7.65TB proc_swap=6.91GB sys_swap=21.83/72.00GB
MemAvail=5.10GB SwapAvail=50.17GB
15:24:20 rss=39.30GB vsz=7.65TB proc_swap=6.69GB sys_swap=20.23/72.00GB
MemAvail=7.28GB SwapAvail=51.77GB
15:24:30 rss=40.40GB vsz=7.65TB proc_swap=6.20GB sys_swap=17.94/72.00GB
MemAvail=10.20GB SwapAvail=54.06GB
15:24:40 rss=41.44GB vsz=7.65TB proc_swap=5.23GB sys_swap=14.95/72.00GB
MemAvail=20.01GB SwapAvail=57.05GB
15:24:50 rss=41.68GB vsz=7.65TB proc_swap=5.20GB sys_swap=14.92/72.00GB
MemAvail=16.14GB SwapAvail=57.08GB
15:25:00 rss=42.77GB vsz=7.65TB proc_swap=4.95GB sys_swap=14.87/72.00GB
MemAvail=11.41GB SwapAvail=57.13GB
15:25:11 rss=42.78GB vsz=7.65TB proc_swap=5.26GB sys_swap=15.55/72.00GB
MemAvail=8.58GB SwapAvail=56.45GB
15:25:21 rss=40.79GB vsz=7.65TB proc_swap=6.88GB sys_swap=17.46/72.00GB
MemAvail=5.65GB SwapAvail=54.54GB
TIP Tracking
[15:21:44] block #24,656,279 ts=2026-03-14 15:19:47 lag=+117.8s ALERT:
lag=117.8s — node is behind the tip!
[15:21:44] block #24,656,280 ts=2026-03-14 15:19:59 lag=+105.8s ALERT:
lag=105.8s — node is behind the tip!
[15:21:44] block #24,656,281 ts=2026-03-14 15:20:11 lag=+93.8s ALERT:
lag=93.8s — node is behind the tip!
[15:21:44] block #24,656,282 ts=2026-03-14 15:20:23 lag=+81.8s ALERT:
lag=81.8s — node is behind the tip!
[15:21:44] block #24,656,283 ts=2026-03-14 15:20:47 lag=+57.8s ALERT:
lag=57.8s — node is behind the tip!
[15:21:57] block #24,656,284 ts=2026-03-14 15:20:59 lag=+58.0s ALERT:
lag=58.0s — node is behind the tip!
[15:21:57] block #24,656,285 ts=2026-03-14 15:21:11 lag=+46.0s ALERT:
lag=46.0s — node is behind the tip!
[15:21:57] block #24,656,286 ts=2026-03-14 15:21:23 lag=+34.0s ALERT:
lag=34.0s — node is behind the tip!
[15:21:57] block #24,656,287 ts=2026-03-14 15:21:35 lag=+22.0s ALERT:
lag=22.0s — node is behind the tip!
[15:21:57] block #24,656,288 ts=2026-03-14 15:21:47 lag=+10.0s OK
[15:22:07] block #24,656,289 ts=2026-03-14 15:21:59 lag=+8.0s OK
[15:22:19] block #24,656,290 ts=2026-03-14 15:22:11 lag=+8.3s OK
[15:22:32] block #24,656,291 ts=2026-03-14 15:22:23 lag=+9.3s OK
[15:23:02] ALERT: no new block for 30s (last block #24656291) — node may
be losing the tip!
[15:23:32] ALERT: no new block for 60s (last block #24656291) — node may
be losing the tip!
[15:24:02] ALERT: no new block for 90s (last block #24656291) — node may
be losing the tip!
[15:24:24] block #24,656,292 ts=2026-03-14 15:22:35 lag=+109.5s ALERT:
lag=109.5s — node is behind the tip!
[15:24:24] block #24,656,293 ts=2026-03-14 15:22:47 lag=+97.5s ALERT:
lag=97.5s — node is behind the tip!
[15:24:24] block #24,656,294 ts=2026-03-14 15:22:59 lag=+85.5s ALERT:
lag=85.5s — node is behind the tip!
[15:24:24] block #24,656,295 ts=2026-03-14 15:23:11 lag=+73.5s ALERT:
lag=73.5s — node is behind the tip!
[15:24:54] ALERT: no new block for 30s (last block #24656295) — node may
be losing the tip!
[15:25:17] block #24,656,296 ts=2026-03-14 15:23:23 lag=+114.2s ALERT:
lag=114.2s — node is behind the tip!
[15:25:17] block #24,656,297 ts=2026-03-14 15:23:35 lag=+102.2s ALERT:
lag=102.2s — node is behind the tip!
[15:25:17] block #24,656,298 ts=2026-03-14 15:23:47 lag=+90.2s ALERT:
lag=90.2s — node is behind the tip!
[15:25:17] block #24,656,299 ts=2026-03-14 15:23:59 lag=+78.2s ALERT:
lag=78.2s — node is behind the tip!
[15:25:17] block #24,656,300 ts=2026-03-14 15:24:11 lag=+66.2s ALERT:
lag=66.2s — node is behind the tip!
[15:25:17] block #24,656,301 ts=2026-03-14 15:24:23 lag=+54.2s ALERT:
lag=54.2s — node is behind the tip!
[15:25:17] block #24,656,302 ts=2026-03-14 15:24:35 lag=+42.2s ALERT:
lag=42.2s — node is behind the tip!
[15:25:17] block #24,656,303 ts=2026-03-14 15:24:47 lag=+30.2s ALERT:
lag=30.2s — node is behind the tip!
> ./run_perf_tests.py -p
pattern/mainnet/stress_test_eth_call_001_latest.tar -t 28000:60 -y
eth_call -m 2 -r 100 -Z
Performance Test started
Test repetitions: 100 on sequence: 28000:60 for pattern:
pattern/mainnet/stress_test_eth_call_001_latest.tar
Test on port: http://localhost:8545
[1. 1] daemon: executes test qps: 28000 time: 60 -> [R=100.00%
max=1m39s]
[1. 2] daemon: executes test qps: 28000 time: 60 -> [R=100.00%
max=1m46s]
[1. 3] daemon: executes test qps: 28000 time: 60 -> [R=100.00%
max=1m38s]
> ./run_perf_tests.py -p
pattern/mainnet/stress_test_eth_call_001_latest.tar -t 28000:60 -y
eth_call -m 2 -r 100 -Z
Performance Test started
Test repetitions: 100 on sequence: 28000:60 for pattern:
pattern/mainnet/stress_test_eth_call_001_latest.tar
Test on port: http://localhost:8545
[1. 1] daemon: executes test qps: 28000 time: 60 -> [R=100.00%
max=1m39s]
[1. 2] daemon: executes test qps: 28000 time: 60 -> [R=100.00%
max=1m45s]
[1. 3] daemon: executes test qps: 28000 time: 60 -> [R=100.00%
max=1m40s]
### NEW Software (with PR)
CPU
07:58:51 AM all 51.09 0.00 6.16 0.35 0.00 42.40
07:58:56 AM all 49.26 0.00 5.82 0.03 0.00 44.89
07:59:01 AM all 50.34 0.00 5.95 0.20 0.00 43.51
07:59:06 AM all 51.60 0.00 5.88 0.04 0.00 42.47
07:59:11 AM all 48.97 0.00 5.90 0.06 0.00 45.07
07:59:16 AM all 49.59 0.00 6.11 0.36 0.00 43.93
07:59:21 AM all 48.69 0.00 5.78 0.03 0.00 45.51
07:59:26 AM all 53.50 0.00 6.66 0.26 0.00 39.59
07:59:31 AM all 50.45 0.00 6.37 0.02 0.00 43.16
07:59:36 AM all 48.71 0.00 6.18 0.03 0.00 45.08
07:59:41 AM all 53.58 0.00 6.45 0.15 0.00 39.81
07:59:46 AM all 53.74 0.00 6.13 0.05 0.00 40.07
07:59:51 AM all 31.76 0.00 3.95 0.23 0.00 64.06
07:59:56 AM all 37.20 0.00 5.05 0.03 0.00 57.71
08:00:01 AM all 77.10 0.00 12.95 0.01 0.00 9.94
08:00:06 AM all 78.22 0.00 12.58 0.08 0.00 9.11
08:00:11 AM all 77.64 0.00 12.50 0.00 0.00 9.86
08:00:16 AM all 77.48 0.00 12.61 0.08 0.00 9.83
08:00:21 AM all 77.61 0.00 12.47 0.01 0.00 9.90
08:00:26 AM all 77.35 0.00 12.89 0.06 0.00 9.70
08:00:31 AM all 77.85 0.00 12.92 0.04 0.00 9.19
08:00:36 AM all 77.73 0.00 12.80 0.02 0.00 9.44
08:00:41 AM all 78.42 0.00 12.95 0.05 0.00 8.59
08:00:46 AM all 78.52 0.00 12.55 0.01 0.00 8.93
08:00:51 AM all 78.42 0.00 12.77 0.19 0.00 8.62
08:00:56 AM all 56.98 0.00 8.64 0.11 0.00 34.28
Memory
2026-03-20 08:00:36 pid=1117840 rss=30.04GB vsz=7.49TB proc_swap=0.00GB
sys_swap=0.98/72.00GB MemAvail=39.93GB SwapAvail=71.02GB
2026-03-20 08:00:41 pid=1117840 rss=30.20GB vsz=7.49TB proc_swap=0.00GB
sys_swap=0.98/72.00GB MemAvail=39.86GB SwapAvail=71.02GB
2026-03-20 08:00:41 pid=1117840 rss=30.20GB vsz=7.49TB proc_swap=0.00GB
sys_swap=0.98/72.00GB MemAvail=39.86GB SwapAvail=71.02GB
2026-03-20 08:00:46 pid=1117840 rss=30.20GB vsz=7.49TB proc_swap=0.00GB
sys_swap=0.98/72.00GB MemAvail=39.90GB SwapAvail=71.02GB
2026-03-20 08:00:46 pid=1117840 rss=30.20GB vsz=7.49TB proc_swap=0.00GB
sys_swap=0.98/72.00GB MemAvail=39.90GB SwapAvail=71.02GB
2026-03-20 08:00:51 pid=1117840 rss=30.28GB vsz=7.49TB proc_swap=0.00GB
sys_swap=0.98/72.00GB MemAvail=39.88GB SwapAvail=71.02GB
2026-03-20 08:00:51 pid=1117840 rss=30.28GB vsz=7.49TB proc_swap=0.00GB
sys_swap=0.98/72.00GB MemAvail=39.88GB SwapAvail=71.02GB
2026-03-20 08:00:56 pid=1117840 rss=30.54GB vsz=7.49TB proc_swap=0.00GB
sys_swap=0.98/72.00GB MemAvail=40.39GB SwapAvail=71.02GB
2026-03-20 08:00:56 pid=1117840 rss=30.54GB vsz=7.49TB proc_swap=0.00GB
sys_swap=0.98/72.00GB MemAvail=40.39GB SwapAvail=71.02GB
2026-03-20 08:01:02 pid=1117840 rss=30.61GB vsz=7.49TB proc_swap=0.00GB
sys_swap=0.98/72.00GB MemAvail=40.25GB SwapAvail=71.02GB
2026-03-20 08:01:02 pid=1117840 rss=30.61GB vsz=7.49TB proc_swap=0.00GB
sys_swap=0.98/72.00GB MemAvail=40.25GB SwapAvail=71.02GB
2026-03-20 08:01:07 pid=1117840 rss=30.61GB vsz=7.49TB proc_swap=0.00GB
sys_swap=0.98/72.00GB MemAvail=39.97GB SwapAvail=71.02GB
2026-03-20 08:01:07 pid=1117840 rss=30.61GB vsz=7.49TB proc_swap=0.00GB
sys_swap=0.98/72.00GB MemAvail=39.97GB SwapAvail=71.02GB
2026-03-20 08:01:12 pid=1117840 rss=30.62GB vsz=7.49TB proc_swap=0.00GB
sys_swap=0.98/72.00GB MemAvail=39.48GB SwapAvail=71.02GB
2026-03-20 08:01:12 pid=1117840 rss=30.62GB vsz=7.49TB proc_swap=0.00GB
sys_swap=0.98/72.00GB MemAvail=39.48GB SwapAvail=71.02GB
2026-03-20 08:01:17 pid=1117840 rss=30.71GB vsz=7.49TB proc_swap=0.00GB
sys_swap=0.98/72.00GB MemAvail=39.57GB SwapAvail=71.02GB
2026-03-20 08:01:17 pid=1117840 rss=30.71GB vsz=7.49TB proc_swap=0.00GB
sys_swap=0.98/72.00GB MemAvail=39.57GB SwapAvail=71.02GB
TIP Tracking
[07:56:10] block #24,697,055 ts=2026-03-20 07:55:59 lag=+12.0s OK
[07:56:15] block #24,697,056 ts=2026-03-20 07:56:11 lag=+4.5s OK
[07:56:25] block #24,697,057 ts=2026-03-20 07:56:23 lag=+2.5s OK
[07:56:38] block #24,697,058 ts=2026-03-20 07:56:35 lag=+3.4s OK
[07:56:50] block #24,697,059 ts=2026-03-20 07:56:47 lag=+3.5s OK
[07:57:02] block #24,697,060 ts=2026-03-20 07:56:59 lag=+3.6s OK
[07:57:16] block #24,697,061 ts=2026-03-20 07:57:11 lag=+5.6s OK
[07:57:27] block #24,697,062 ts=2026-03-20 07:57:23 lag=+4.7s OK
[07:57:39] block #24,697,063 ts=2026-03-20 07:57:35 lag=+4.3s OK
[07:57:49] block #24,697,064 ts=2026-03-20 07:57:47 lag=+2.4s OK
[07:58:01] block #24,697,065 ts=2026-03-20 07:57:59 lag=+2.9s OK
[07:58:13] block #24,697,066 ts=2026-03-20 07:58:11 lag=+2.8s OK
[07:58:25] block #24,697,067 ts=2026-03-20 07:58:23 lag=+2.4s OK
[07:58:37] block #24,697,068 ts=2026-03-20 07:58:35 lag=+2.7s OK
[07:58:49] block #24,697,069 ts=2026-03-20 07:58:47 lag=+2.3s OK
[07:59:01] block #24,697,070 ts=2026-03-20 07:58:59 lag=+2.1s OK
[07:59:15] block #24,697,071 ts=2026-03-20 07:59:11 lag=+4.3s OK
[07:59:25] block #24,697,072 ts=2026-03-20 07:59:23 lag=+2.6s OK
[07:59:40] block #24,697,073 ts=2026-03-20 07:59:35 lag=+5.3s OK
[08:00:02] block #24,697,074 ts=2026-03-20 07:59:59 lag=+3.9s OK
[08:00:13] block #24,697,075 ts=2026-03-20 08:00:11 lag=+2.8s OK
./run_perf_tests.py -p
pattern/mainnet/stress_test_eth_call_001_latest.tar -t 28000:60 -y
eth_call -m 2 -r 100 -Z
Performance Test started
Test repetitions: 100 on sequence: 28000:60 for pattern:
pattern/mainnet/stress_test_eth_call_001_latest.tar
Test on port: http://localhost:8545
[1. 1] daemon: executes test qps: 28000 time: 60 -> [R=51.39%
max=605.449ms error=503 Service Unavailable]
[1. 2] daemon: executes test qps: 28000 time: 60 -> [R=51.55%
max=442.974ms error=503 Service Unavailable]
[1. 3] daemon: executes test qps: 28000 time: 60 -> [R=49.52%
max=440.405ms error=503 Service Unavailable]
[1. 4] daemon: executes test qps: 28000 time: 60 -> [R=51.01%
max=440.004ms error=503 Service Unavailable]
[1. 5] daemon: executes test qps: 28000 time: 60 -> [R=49.66%
max=597.333ms error=503 Service Unavailable]
./run_perf_tests.py -p
pattern/mainnet/stress_test_eth_call_001_latest.tar -t 28000:60 -y
eth_call -m 2 -r 100 -Z
Performance Test started
Test repetitions: 100 on sequence: 28000:60 for pattern:
pattern/mainnet/stress_test_eth_call_001_latest.tar
Test on port: http://localhost:8545
[1. 1] daemon: executes test qps: 28000 time: 60 -> [R=51.51%
max=581.793ms error=503 Service Unavailable]
[1. 2] daemon: executes test qps: 28000 time: 60 -> [R=51.61%
max=431.222ms error=503 Service Unavailable]
[1. 3] daemon: executes test qps: 28000 time: 60 -> [R=49.48%
max=495.57ms error=503 Service Unavailable]
[1. 4] daemon: executes test qps: 28000 time: 60 -> [R=50.91%
max=433.208ms error=503 Service Unavailable]
[1. 5] daemon: executes test qps: 28000 time: 60 -> [R=49.57%
max=538.283ms error=503 Service Unavailable]
Verified on the CI tip-tracking infrastructure. Previous software
versions lost the tip at 3,000 QPS. With these changes, the node
handles up to 6,000 QPS without any tip loss or degradation.
</details>
Stress Test Observations (main release)
- Chain Tip Loss: Under heavy load, the node fails to stay synced and
the Chain Tip is lost, as the staged sync pipeline is starved of DB read
slots by queued RPC goroutines.
- Virtual Memory Pressure: The system experiences severe VM pressure,
with process swap usage reaching 11.81 GB. The massive accumulation of
goroutines blocked on roTxsLimiter.Acquire causes excessive paging and
swapping. This state is highly unstable and frequently leads to the
process being terminated by the OOM Killer, causing total node downtime.
- Request Satisfaction (100%): Despite the performance degradation, all
requests are eventually satisfied. However, this is achieved at the cost
of system stability and synchronization.
- Increased Latency: Request latency increases dramatically due to deep
queuing, with response times reaching up to 1m 40s.
---
Stress Test Observations (with PR)
- Chain Tip Stability: The two-level admission control prevents
goroutine accumulation entirely. The HTTP outer gate rejects excess
requests before any processing; the BeginRo inner gate ensures that any
RPC request that does enter the system uses TryAcquire (fail-fast)
rather than blocking. Internal callers (staged sync, background workers)
always use blocking Acquire
and are never rejected, so the pipeline makes continuous progress.
- Virtual Memory Pressure: Significantly lower memory footprint. By
eliminating request queuing at the HTTP layer, the system avoids
excessive paging and swapping (0.00 GB swap), keeping the OS stable.
- Request Satisfaction (~50%): Approximately 50% of requests are
satisfied; the remainder are immediately rejected with 503 Service
Unavailable. This is the intended fail-fast behavior — goroutines never
accumulate, DB slots are never exhausted.
- Latency Consistency: Response latency remains consistently low. By
refusing to queue requests beyond the system's capacity, the node avoids
the massive latency spikes (previously up to 1m 40s) seen before the
fix.
This behavior is aligned with Nethermind, which returns 503 Service
Unavailable under high load, prioritizing node health over request
queuing.
---
Final Observation
By adopting a fail-fast strategy at two levels — HTTP admission before
any expensive processing, and TryAcquire inside BeginRo for RPC callers
— we enforce resource isolation at the core level. Internal execution
paths retain guaranteed access to DB read slots via blocking Acquire,
while external RPC pressure is shed immediately. This approach shifts
congestion management
responsibility to the external infrastructure (load balancers, proxies),
which is better equipped to handle buffering, ensuring that the Erigon
node remains stable and synchronized regardless of external RPC load.
## 🚀 RPC Concurrency & Resource Management Comparison
| Feature | Erigon (main) | **Erigon (with PR)** |
| :--- | :--- | :--- |
| **Admission control** | ❌ None | ✅ **HTTP outer gate** (`rpcAdmissionHandler`) |
| **Overload response** | Unlimited queuing | ✅ **Immediate HTTP 503** |
| **Rejection point** | ❌ None | ✅ Before CORS, Gzip, and JSON decoding |
| **Goroutine accumulation** | ⚠️ Yes, unlimited | ✅ **Eliminated**: goroutines don't enter the system |
| **Internal pipeline protection** | ❌ RPC and staged sync compete for slots | ✅ **Internal callers** use blocking `Acquire` |
| **DB slots protection** | ❌ None; RPC exhausts slots | ✅ `TryAcquire` in `BeginRo` for RPC |
| **Memory under load** | ❌ Critical: swap up to 11.81 GB, OOM | ✅ **Stable** (0.00 GB swap in test) |
| **Latency under overload** | High (~1m 40s) | ✅ **Consistently low** (fail-fast) |
| **Configuration required** | ❌ No concurrency flags | ✅ **Zero config**; `--rpc.max.concurrency` optional |
| **Execution isolation** | ❌ Chain tip lost under load | ✅ **Guaranteed by design** |
### 📊 Performance Comparison: Main (18/03) vs. PR
This benchmark compares the current `main` branch against this PR using
the same set of APIs under heavy load.
| API | main (18/03) post_exec p50 | PR post_exec p50 | Improvement |
| :--- | :---: | :---: | :---: |
| **eth_call** @ 3000 QPS | 6.82s ✅ | 5.89s ✅ | **−14%** |
| **eth_getBlockByNumber** @ 3000 QPS | 13.73s ⚠️ | 5.23s ✅ | **−62%** |
| **eth_getProof** @ 1000–3000 QPS | 49.12s (tip lost) | 2.84s ✅ | **−94%** |
---
### 🔍 Key Observations
* **eth_call**: Neither `main` nor the PR caused a chain tip loss. Since
`eth_call` is read-only and light on DB slots, it is inherently more
stable, but the PR still delivers a **14% reduction** in p50 latency.
* **eth_getBlockByNumber**: Remains stable up to **6000 QPS** with no
actual tip loss. Any observed `sync=0` periods during testing were
identified as monitoring false negatives rather than actual node desync.
* **eth_getProof**: This is the most impactful result. While `main` lost
the chain tip at only 1000 QPS (p50=49s), the **PR successfully holds up
to 3000 QPS** with a p50 of 2.84s, a **94% reduction in p50 latency**.
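
The Improvement column in the table above can be reproduced from the two p50 columns as a relative change, (PR − main) / main. The short sketch below does that arithmetic. The function name `pctChange` is only illustrative.

```go
package main

import (
	"fmt"
	"math"
)

// pctChange returns the relative change from before to after,
// rounded to the nearest whole percent (negative means lower latency).
func pctChange(before, after float64) float64 {
	return math.Round((after - before) / before * 100)
}

func main() {
	fmt.Println(pctChange(6.82, 5.89))  // eth_call: -14
	fmt.Println(pctChange(13.73, 5.23)) // eth_getBlockByNumber: -62
	fmt.Println(pctChange(49.12, 2.84)) // eth_getProof: -94
}
```

Note that the eth_getProof row compares `main` at 1000 QPS with the PR at up to 3000 QPS, so its percentage understates the difference at equal load.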
### 🏆 Overall Conclusion
The final PR successfully **eliminates chain tip loss** across all
tested APIs and QPS levels. No real tip loss was observed in any
production-level test run, ensuring much higher node reliability under
stress.
---------
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
lupin012
added a commit
that referenced
this pull request
Apr 2, 2026
This PR introduces a two-level admission control system to protect the
Staged Sync pipeline from being starved or delayed by high RPC load.
Root Cause Analysis:
Under heavy RPC traffic, the node accumulates a large number of
goroutines blocked on roTxsLimiter.Acquire. When DB slots become
available, the backlog drains in a way that starves the staged sync
pipeline. The goroutine pile-up also causes a significant spike in
virtual memory and overall system instability.
Solution:
Two gates work in tandem:
1. HTTP admission handler (rpcAdmissionHandler) — outer gate installed
at the top of every HTTP RPC stack, before CORS, Gzip, or JSON decoding.
If the number of inflight requests exceeds the configured limit, the
request is rejected immediately with HTTP 503. This prevents goroutine
accumulation at the source. On every admitted request the handler tags
the context with
WithRPCContext (limit value) so the DB layer can identify the caller.
2. BeginRo inner gate — if the context carries a positive RPC limit,
BeginRo uses TryAcquire on roTxsLimiter and returns ErrServerOverloaded
immediately if the semaphore is full. Internal callers (staged sync,
background workers) always use blocking Acquire and are never rejected.
This two-level approach means most overload is shed at the HTTP layer
(goroutines never enter the system), while any RPC requests that slip
through under transient concurrency spikes are still fail-fast at the DB
layer rather than piling up behind the semaphore.
Configuration:
- `--rpc.max.concurrency`: HTTP admission limit.
  - `0` (default): uses `--db.read.concurrency` (auto-tuned to GOMAXPROCS ×
    64, capped at 9000)
  - `> 0`: explicit limit
  - `-1`: unlimited (admission control disabled; BeginRo falls back to
    blocking Acquire, as in the old behaviour)
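
How the flag value maps to an effective limit can be sketched as below. The function names are hypothetical, and treating every negative value like `-1` is an assumption, since only `-1` is described above.

```go
package main

import (
	"fmt"
	"runtime"
)

// defaultDBReadConcurrency mirrors the stated auto-tuning rule:
// GOMAXPROCS × 64, capped at 9000.
func defaultDBReadConcurrency(procs int) int {
	n := procs * 64
	if n > 9000 {
		n = 9000
	}
	return n
}

// effectiveRPCLimit resolves the admission limit from the flag value.
// enabled=false means admission control is off (blocking behaviour).
func effectiveRPCLimit(flagValue, dbReadConcurrency int) (limit int, enabled bool) {
	switch {
	case flagValue < 0: // -1: unlimited (assumption: any negative value)
		return 0, false
	case flagValue == 0: // default: follow the DB read concurrency
		return dbReadConcurrency, true
	default: // explicit limit
		return flagValue, true
	}
}

func main() {
	db := defaultDBReadConcurrency(runtime.GOMAXPROCS(0))
	for _, v := range []int{0, 500, -1} {
		limit, enabled := effectiveRPCLimit(v, db)
		fmt.Printf("flag=%d limit=%d enabled=%v\n", v, limit, enabled)
	}
}
```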
### Summary of Resource Management Improvements
| Resource | Result |
| :--- | :--- |
| **Goroutine pile-up** | ✅ Requests rejected at HTTP layer before CORS, Gzip, or JSON decoding |
| **Staged sync starvation** | ✅ Internal callers (staged sync, workers) use blocking `Acquire` and are never rejected; RPC uses `TryAcquire` fail-fast |
| **Transient overload spikes** | ✅ `BeginRo` inner gate catches RPC requests that pass the HTTP layer during concurrency spikes |
| **Scalability** | ✅ Default limit auto-tuned to `GOMAXPROCS × 64` (capped at 9000) via `--db.read.concurrency` |
| **Configuration** | ✅ Zero required config, one optional flag (`--rpc.max.concurrency`) |
Benchmark & Stress Test Results
Setup: 32 Cores, 64GB RAM, 70GB Swap. Minimal Node in Sync. Parallel
eth_call stress tests (28k QPS).
<details>
<summary><b>Click to expand: Benchmark Data (Before vs After on local
node)</b></summary>
### Current SW (main release)
CPU
03:23:56 PM all 29.55 0.00 22.30 34.33 0.00 13.83
03:24:06 PM all 56.41 0.00 15.44 10.83 0.00 17.32
03:24:16 PM all 75.60 0.00 13.36 2.86 0.00 8.18
03:24:26 PM all 73.19 0.00 14.35 2.82 0.00 9.63
03:24:36 PM all 73.35 0.00 14.56 2.75 0.00 9.34
Memory
15:23:30 rss=31.89GB vsz=7.65TB proc_swap=11.81GB sys_swap=27.21/72.00GB
MemAvail=1.15GB SwapAvail=44.79GB
15:23:40 rss=32.74GB vsz=7.65TB proc_swap=11.00GB sys_swap=27.02/72.00GB
MemAvail=1.50GB SwapAvail=44.98GB
15:23:50 rss=33.83GB vsz=7.65TB proc_swap=9.89GB sys_swap=25.65/72.00GB
MemAvail=1.44GB SwapAvail=46.35GB
15:24:00 rss=36.33GB vsz=7.65TB proc_swap=7.60GB sys_swap=23.55/72.00GB
MemAvail=1.67GB SwapAvail=48.45GB
15:24:10 rss=37.85GB vsz=7.65TB proc_swap=6.91GB sys_swap=21.83/72.00GB
MemAvail=5.10GB SwapAvail=50.17GB
15:24:20 rss=39.30GB vsz=7.65TB proc_swap=6.69GB sys_swap=20.23/72.00GB
MemAvail=7.28GB SwapAvail=51.77GB
15:24:30 rss=40.40GB vsz=7.65TB proc_swap=6.20GB sys_swap=17.94/72.00GB
MemAvail=10.20GB SwapAvail=54.06GB
15:24:40 rss=41.44GB vsz=7.65TB proc_swap=5.23GB sys_swap=14.95/72.00GB
MemAvail=20.01GB SwapAvail=57.05GB
15:24:50 rss=41.68GB vsz=7.65TB proc_swap=5.20GB sys_swap=14.92/72.00GB
MemAvail=16.14GB SwapAvail=57.08GB
15:25:00 rss=42.77GB vsz=7.65TB proc_swap=4.95GB sys_swap=14.87/72.00GB
MemAvail=11.41GB SwapAvail=57.13GB
15:25:11 rss=42.78GB vsz=7.65TB proc_swap=5.26GB sys_swap=15.55/72.00GB
MemAvail=8.58GB SwapAvail=56.45GB
15:25:21 rss=40.79GB vsz=7.65TB proc_swap=6.88GB sys_swap=17.46/72.00GB
MemAvail=5.65GB SwapAvail=54.54GB
TIP Tracking
[15:21:44] block #24,656,279 ts=2026-03-14 15:19:47 lag=+117.8s ALERT:
lag=117.8s — node is behind the tip!
[15:21:44] block #24,656,280 ts=2026-03-14 15:19:59 lag=+105.8s ALERT:
lag=105.8s — node is behind the tip!
[15:21:44] block #24,656,281 ts=2026-03-14 15:20:11 lag=+93.8s ALERT:
lag=93.8s — node is behind the tip!
[15:21:44] block #24,656,282 ts=2026-03-14 15:20:23 lag=+81.8s ALERT:
lag=81.8s — node is behind the tip!
[15:21:44] block #24,656,283 ts=2026-03-14 15:20:47 lag=+57.8s ALERT:
lag=57.8s — node is behind the tip!
[15:21:57] block #24,656,284 ts=2026-03-14 15:20:59 lag=+58.0s ALERT:
lag=58.0s — node is behind the tip!
[15:21:57] block #24,656,285 ts=2026-03-14 15:21:11 lag=+46.0s ALERT:
lag=46.0s — node is behind the tip!
[15:21:57] block #24,656,286 ts=2026-03-14 15:21:23 lag=+34.0s ALERT:
lag=34.0s — node is behind the tip!
[15:21:57] block #24,656,287 ts=2026-03-14 15:21:35 lag=+22.0s ALERT:
lag=22.0s — node is behind the tip!
[15:21:57] block #24,656,288 ts=2026-03-14 15:21:47 lag=+10.0s OK
[15:22:07] block #24,656,289 ts=2026-03-14 15:21:59 lag=+8.0s OK
[15:22:19] block #24,656,290 ts=2026-03-14 15:22:11 lag=+8.3s OK
[15:22:32] block #24,656,291 ts=2026-03-14 15:22:23 lag=+9.3s OK
[15:23:02] ALERT: no new block for 30s (last block #24656291) — node may
be losing the tip!
[15:23:32] ALERT: no new block for 60s (last block #24656291) — node may
be losing the tip!
[15:24:02] ALERT: no new block for 90s (last block #24656291) — node may
be losing the tip!
[15:24:24] block #24,656,292 ts=2026-03-14 15:22:35 lag=+109.5s ALERT:
lag=109.5s — node is behind the tip!
[15:24:24] block #24,656,293 ts=2026-03-14 15:22:47 lag=+97.5s ALERT:
lag=97.5s — node is behind the tip!
[15:24:24] block #24,656,294 ts=2026-03-14 15:22:59 lag=+85.5s ALERT:
lag=85.5s — node is behind the tip!
[15:24:24] block #24,656,295 ts=2026-03-14 15:23:11 lag=+73.5s ALERT:
lag=73.5s — node is behind the tip!
[15:24:54] ALERT: no new block for 30s (last block #24656295) — node may
be losing the tip!
[15:25:17] block #24,656,296 ts=2026-03-14 15:23:23 lag=+114.2s ALERT:
lag=114.2s — node is behind the tip!
[15:25:17] block #24,656,297 ts=2026-03-14 15:23:35 lag=+102.2s ALERT:
lag=102.2s — node is behind the tip!
[15:25:17] block #24,656,298 ts=2026-03-14 15:23:47 lag=+90.2s ALERT:
lag=90.2s — node is behind the tip!
[15:25:17] block #24,656,299 ts=2026-03-14 15:23:59 lag=+78.2s ALERT:
lag=78.2s — node is behind the tip!
[15:25:17] block #24,656,300 ts=2026-03-14 15:24:11 lag=+66.2s ALERT:
lag=66.2s — node is behind the tip!
[15:25:17] block #24,656,301 ts=2026-03-14 15:24:23 lag=+54.2s ALERT:
lag=54.2s — node is behind the tip!
[15:25:17] block #24,656,302 ts=2026-03-14 15:24:35 lag=+42.2s ALERT:
lag=42.2s — node is behind the tip!
[15:25:17] block #24,656,303 ts=2026-03-14 15:24:47 lag=+30.2s ALERT:
lag=30.2s — node is behind the tip!
> ./run_perf_tests.py -p
pattern/mainnet/stress_test_eth_call_001_latest.tar -t 28000:60 -y
eth_call -m 2 -r 100 -Z
Performance Test started
Test repetitions: 100 on sequence: 28000:60 for pattern:
pattern/mainnet/stress_test_eth_call_001_latest.tar
Test on port: http://localhost:8545
[1. 1] daemon: executes test qps: 28000 time: 60 -> [R=100.00%
max=1m39s]
[1. 2] daemon: executes test qps: 28000 time: 60 -> [R=100.00%
max=1m46s]
[1. 3] daemon: executes test qps: 28000 time: 60 -> [R=100.00%
max=1m38s]
> ./run_perf_tests.py -p
pattern/mainnet/stress_test_eth_call_001_latest.tar -t 28000:60 -y
eth_call -m 2 -r 100 -Z
Performance Test started
Test repetitions: 100 on sequence: 28000:60 for pattern:
pattern/mainnet/stress_test_eth_call_001_latest.tar
Test on port: http://localhost:8545
[1. 1] daemon: executes test qps: 28000 time: 60 -> [R=100.00%
max=1m39s]
[1. 2] daemon: executes test qps: 28000 time: 60 -> [R=100.00%
max=1m45s]
[1. 3] daemon: executes test qps: 28000 time: 60 -> [R=100.00%
max=1m40s]
### NEW Software (with PR)
CPU
07:58:51 AM all 51.09 0.00 6.16 0.35 0.00 42.40
07:58:56 AM all 49.26 0.00 5.82 0.03 0.00 44.89
07:59:01 AM all 50.34 0.00 5.95 0.20 0.00 43.51
07:59:06 AM all 51.60 0.00 5.88 0.04 0.00 42.47
07:59:11 AM all 48.97 0.00 5.90 0.06 0.00 45.07
07:59:16 AM all 49.59 0.00 6.11 0.36 0.00 43.93
07:59:21 AM all 48.69 0.00 5.78 0.03 0.00 45.51
07:59:26 AM all 53.50 0.00 6.66 0.26 0.00 39.59
07:59:31 AM all 50.45 0.00 6.37 0.02 0.00 43.16
07:59:36 AM all 48.71 0.00 6.18 0.03 0.00 45.08
07:59:41 AM all 53.58 0.00 6.45 0.15 0.00 39.81
07:59:46 AM all 53.74 0.00 6.13 0.05 0.00 40.07
07:59:51 AM all 31.76 0.00 3.95 0.23 0.00 64.06
07:59:56 AM all 37.20 0.00 5.05 0.03 0.00 57.71
08:00:01 AM all 77.10 0.00 12.95 0.01 0.00 9.94
08:00:06 AM all 78.22 0.00 12.58 0.08 0.00 9.11
08:00:11 AM all 77.64 0.00 12.50 0.00 0.00 9.86
08:00:16 AM all 77.48 0.00 12.61 0.08 0.00 9.83
08:00:21 AM all 77.61 0.00 12.47 0.01 0.00 9.90
08:00:26 AM all 77.35 0.00 12.89 0.06 0.00 9.70
08:00:31 AM all 77.85 0.00 12.92 0.04 0.00 9.19
08:00:36 AM all 77.73 0.00 12.80 0.02 0.00 9.44
08:00:41 AM all 78.42 0.00 12.95 0.05 0.00 8.59
08:00:46 AM all 78.52 0.00 12.55 0.01 0.00 8.93
08:00:51 AM all 78.42 0.00 12.77 0.19 0.00 8.62
08:00:56 AM all 56.98 0.00 8.64 0.11 0.00 34.28
Memory
2026-03-20 08:00:36 pid=1117840 rss=30.04GB vsz=7.49TB proc_swap=0.00GB
sys_swap=0.98/72.00GB MemAvail=39.93GB SwapAvail=71.02GB
2026-03-20 08:00:41 pid=1117840 rss=30.20GB vsz=7.49TB proc_swap=0.00GB
sys_swap=0.98/72.00GB MemAvail=39.86GB SwapAvail=71.02GB
2026-03-20 08:00:41 pid=1117840 rss=30.20GB vsz=7.49TB proc_swap=0.00GB
sys_swap=0.98/72.00GB MemAvail=39.86GB SwapAvail=71.02GB
2026-03-20 08:00:46 pid=1117840 rss=30.20GB vsz=7.49TB proc_swap=0.00GB
sys_swap=0.98/72.00GB MemAvail=39.90GB SwapAvail=71.02GB
2026-03-20 08:00:46 pid=1117840 rss=30.20GB vsz=7.49TB proc_swap=0.00GB
sys_swap=0.98/72.00GB MemAvail=39.90GB SwapAvail=71.02GB
2026-03-20 08:00:51 pid=1117840 rss=30.28GB vsz=7.49TB proc_swap=0.00GB
sys_swap=0.98/72.00GB MemAvail=39.88GB SwapAvail=71.02GB
2026-03-20 08:00:51 pid=1117840 rss=30.28GB vsz=7.49TB proc_swap=0.00GB
sys_swap=0.98/72.00GB MemAvail=39.88GB SwapAvail=71.02GB
2026-03-20 08:00:56 pid=1117840 rss=30.54GB vsz=7.49TB proc_swap=0.00GB
sys_swap=0.98/72.00GB MemAvail=40.39GB SwapAvail=71.02GB
2026-03-20 08:00:56 pid=1117840 rss=30.54GB vsz=7.49TB proc_swap=0.00GB
sys_swap=0.98/72.00GB MemAvail=40.39GB SwapAvail=71.02GB
2026-03-20 08:01:02 pid=1117840 rss=30.61GB vsz=7.49TB proc_swap=0.00GB
sys_swap=0.98/72.00GB MemAvail=40.25GB SwapAvail=71.02GB
2026-03-20 08:01:02 pid=1117840 rss=30.61GB vsz=7.49TB proc_swap=0.00GB
sys_swap=0.98/72.00GB MemAvail=40.25GB SwapAvail=71.02GB
2026-03-20 08:01:07 pid=1117840 rss=30.61GB vsz=7.49TB proc_swap=0.00GB
sys_swap=0.98/72.00GB MemAvail=39.97GB SwapAvail=71.02GB
2026-03-20 08:01:07 pid=1117840 rss=30.61GB vsz=7.49TB proc_swap=0.00GB
sys_swap=0.98/72.00GB MemAvail=39.97GB SwapAvail=71.02GB
2026-03-20 08:01:12 pid=1117840 rss=30.62GB vsz=7.49TB proc_swap=0.00GB
sys_swap=0.98/72.00GB MemAvail=39.48GB SwapAvail=71.02GB
2026-03-20 08:01:12 pid=1117840 rss=30.62GB vsz=7.49TB proc_swap=0.00GB
sys_swap=0.98/72.00GB MemAvail=39.48GB SwapAvail=71.02GB
2026-03-20 08:01:17 pid=1117840 rss=30.71GB vsz=7.49TB proc_swap=0.00GB
sys_swap=0.98/72.00GB MemAvail=39.57GB SwapAvail=71.02GB
2026-03-20 08:01:17 pid=1117840 rss=30.71GB vsz=7.49TB proc_swap=0.00GB
sys_swap=0.98/72.00GB MemAvail=39.57GB SwapAvail=71.02GB
TIP Tracking
[07:56:10] block #24,697,055 ts=2026-03-20 07:55:59 lag=+12.0s OK
[07:56:15] block #24,697,056 ts=2026-03-20 07:56:11 lag=+4.5s OK
[07:56:25] block #24,697,057 ts=2026-03-20 07:56:23 lag=+2.5s OK
[07:56:38] block #24,697,058 ts=2026-03-20 07:56:35 lag=+3.4s OK
[07:56:50] block #24,697,059 ts=2026-03-20 07:56:47 lag=+3.5s OK
[07:57:02] block #24,697,060 ts=2026-03-20 07:56:59 lag=+3.6s OK
[07:57:16] block #24,697,061 ts=2026-03-20 07:57:11 lag=+5.6s OK
[07:57:27] block #24,697,062 ts=2026-03-20 07:57:23 lag=+4.7s OK
[07:57:39] block #24,697,063 ts=2026-03-20 07:57:35 lag=+4.3s OK
[07:57:49] block #24,697,064 ts=2026-03-20 07:57:47 lag=+2.4s OK
[07:58:01] block #24,697,065 ts=2026-03-20 07:57:59 lag=+2.9s OK
[07:58:13] block #24,697,066 ts=2026-03-20 07:58:11 lag=+2.8s OK
[07:58:25] block #24,697,067 ts=2026-03-20 07:58:23 lag=+2.4s OK
[07:58:37] block #24,697,068 ts=2026-03-20 07:58:35 lag=+2.7s OK
[07:58:49] block #24,697,069 ts=2026-03-20 07:58:47 lag=+2.3s OK
[07:59:01] block #24,697,070 ts=2026-03-20 07:58:59 lag=+2.1s OK
[07:59:15] block #24,697,071 ts=2026-03-20 07:59:11 lag=+4.3s OK
[07:59:25] block #24,697,072 ts=2026-03-20 07:59:23 lag=+2.6s OK
[07:59:40] block #24,697,073 ts=2026-03-20 07:59:35 lag=+5.3s OK
[08:00:02] block #24,697,074 ts=2026-03-20 07:59:59 lag=+3.9s OK
[08:00:13] block #24,697,075 ts=2026-03-20 08:00:11 lag=+2.8s OK
./run_perf_tests.py -p
pattern/mainnet/stress_test_eth_call_001_latest.tar -t 28000:60 -y
eth_call -m 2 -r 100 -Z
Performance Test started
Test repetitions: 100 on sequence: 28000:60 for pattern:
pattern/mainnet/stress_test_eth_call_001_latest.tar
Test on port: http://localhost:8545
[1. 1] daemon: executes test qps: 28000 time: 60 -> [R=51.39%
max=605.449ms error=503 Service Unavailable]
[1. 2] daemon: executes test qps: 28000 time: 60 -> [R=51.55%
max=442.974ms error=503 Service Unavailable]
[1. 3] daemon: executes test qps: 28000 time: 60 -> [R=49.52%
max=440.405ms error=503 Service Unavailable]
[1. 4] daemon: executes test qps: 28000 time: 60 -> [R=51.01%
max=440.004ms error=503 Service Unavailable]
[1. 5] daemon: executes test qps: 28000 time: 60 -> [R=49.66%
max=597.333ms error=503 Service Unavailable]
./run_perf_tests.py -p
pattern/mainnet/stress_test_eth_call_001_latest.tar -t 28000:60 -y
eth_call -m 2 -r 100 -Z
Performance Test started
Test repetitions: 100 on sequence: 28000:60 for pattern:
pattern/mainnet/stress_test_eth_call_001_latest.tar
Test on port: http://localhost:8545
[1. 1] daemon: executes test qps: 28000 time: 60 -> [R=51.51%
max=581.793ms error=503 Service Unavailable]
[1. 2] daemon: executes test qps: 28000 time: 60 -> [R=51.61%
max=431.222ms error=503 Service Unavailable]
[1. 3] daemon: executes test qps: 28000 time: 60 -> [R=49.48%
max=495.57ms error=503 Service Unavailable]
[1. 4] daemon: executes test qps: 28000 time: 60 -> [R=50.91%
max=433.208ms error=503 Service Unavailable]
[1. 5] daemon: executes test qps: 28000 time: 60 -> [R=49.57%
max=538.283ms error=503 Service Unavailable]
Verified on the CI tip-tracking infrastructure. Previous software
versions lost the tip at 3,000 QPS. With these changes, the node
handles up to 6,000 QPS without any tip loss or degradation.
</details>
Stress Test Observations (main release)
- Chain Tip Loss: Under heavy load, the node fails to stay synced and
the Chain Tip is lost, as the staged sync pipeline is starved of DB read
slots by queued RPC goroutines.
- Virtual Memory Pressure: The system experiences severe VM pressure,
with process swap usage reaching 11.81 GB. The massive accumulation of
goroutines blocked on roTxsLimiter.Acquire causes excessive paging and
swapping. This state is highly unstable and frequently leads to the
process being terminated by the OOM Killer, causing total node downtime.
- Request Satisfaction (100%): Despite the performance degradation, all
requests are eventually satisfied. However, this is achieved at the cost
of system stability and synchronization.
- Increased Latency: Request latency increases dramatically due to deep
queuing, with response times reaching up to 1m 40s.
---
Stress Test Observations (with PR)
- Chain Tip Stability: The two-level admission control prevents
goroutine accumulation entirely. The HTTP outer gate rejects excess
requests before any processing; the BeginRo inner gate ensures that any
RPC request that does enter the system uses TryAcquire (fail-fast)
rather than blocking. Internal callers (staged sync, background workers)
always use blocking Acquire
and are never rejected, so the pipeline makes continuous progress.
- Virtual Memory Pressure: Significantly lower memory footprint. By
eliminating request queuing at the HTTP layer, the system avoids
excessive paging and swapping (0.00 GB swap), keeping the OS stable.
- Request Satisfaction (~50%): Approximately 50% of requests are
satisfied; the remainder are immediately rejected with 503 Service
Unavailable. This is the intended fail-fast behavior — goroutines never
accumulate, DB slots are never exhausted.
- Latency Consistency: Response latency remains consistently low. By
refusing to queue requests beyond the system's capacity, the node avoids
the massive latency spikes (previously up to 1m 40s) seen before the
fix.
This behavior is aligned with Nethermind, which returns 503 Service
Unavailable under high load, prioritizing node health over request
queuing.
---
Final Observation
By adopting a fail-fast strategy at two levels — HTTP admission before
any expensive processing, and TryAcquire inside BeginRo for RPC callers
— we enforce resource isolation at the core level. Internal execution
paths retain guaranteed access to DB read slots via blocking Acquire,
while external RPC pressure is shed immediately. This approach shifts
congestion management
responsibility to the external infrastructure (load balancers, proxies),
which is better equipped to handle buffering, ensuring that the Erigon
node remains stable and synchronized regardless of external RPC load.
## 🚀 RPC Concurrency & Resource Management Comparison
| Feature | Erigon (main) | **Erigon (with PR)** |
| :--- | :--- | :--- |
| **Admission control** | ❌ None | ✅ **HTTP outer gate** (`rpcAdmissionHandler`) |
| **Overload response** | Unlimited queuing | ✅ **Immediate HTTP 503** |
| **Rejection point** | ❌ None | ✅ Before CORS, Gzip, and JSON decoding |
| **Goroutine accumulation** | ⚠️ Yes, unlimited | ✅ **Eliminated**: goroutines don't enter the system |
| **Internal pipeline protection** | ❌ RPC and staged sync compete for slots | ✅ **Internal callers** use blocking `Acquire` |
| **DB slots protection** | ❌ None; RPC exhausts slots | ✅ `TryAcquire` in `BeginRo` for RPC |
| **Memory under load** | ❌ Critical: swap up to 11.81 GB, OOM | ✅ **Stable** (0.00 GB swap in test) |
| **Latency under overload** | High (~1m 40s) | ✅ **Consistently low** (fail-fast) |
| **Configuration required** | ❌ No concurrency flags | ✅ **Zero config**; `--rpc.max.concurrency` optional |
| **Execution isolation** | ❌ Chain tip lost under load | ✅ **Guaranteed by design** |
### 📊 Performance Comparison: Main (18/03) vs. PR
This benchmark compares the current `main` branch against this PR using
the same set of APIs under heavy load.
| API | main (18/03) post_exec p50 | PR post_exec p50 | Improvement |
| :--- | :---: | :---: | :---: |
| **eth_call** @ 3000 QPS | 6.82s ✅ | 5.89s ✅ | **−14%** |
| **eth_getBlockByNumber** @ 3000 QPS | 13.73s ⚠️ | 5.23s ✅ | **−62%** |
| **eth_getProof** @ 1000–3000 QPS | 49.12s (tip lost) | 2.84s ✅ | **−94%** |
---
### 🔍 Key Observations
* **eth_call**: Neither `main` nor the PR caused a chain tip loss. Since
`eth_call` is read-only and light on DB slots, it is inherently more
stable, but the PR still delivers a **14% reduction** in p50 latency.
* **eth_getBlockByNumber**: Remains stable up to **6000 QPS** with no
actual tip loss. Any observed `sync=0` periods during testing were
identified as monitoring false negatives rather than actual node desync.
* **eth_getProof**: This is the most impactful result. While `main` lost
the chain tip at only 1000 QPS (p50=49s), the **PR successfully holds up
to 3000 QPS** with a p50 of 2.84s, a **94% reduction in p50 latency**.
### 🏆 Overall Conclusion
The final PR **eliminates chain tip loss** across all tested APIs and
QPS levels. No real tip loss was observed in any production-level test
run, which indicates much higher node reliability under stress.
---------
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
This was referenced Apr 16, 2026
AskAlexSharov
pushed a commit
that referenced
this pull request
Apr 18, 2026
Adds FAQ entries for the MCP server to the Help Center.

## Changes

- `docs/gitbook-help/frequently-asked-questions-faqs.md`: FAQ #23 (what is the MCP server) and FAQ #24 (how to connect Claude Desktop)

## Notes

- `mcp.md` already exists and is complete; no changes
- Port 8553 and MCP flags are already in `default-ports.md` and `configuring-erigon/README.md`
- Second PR targeting `main`: #20605
No description provided.