fix(l1): p2p sync stall fixes and discovery hardening by edg-l · Pull Request #6394 · lambdaclass/ethrex

edg-l · 2026-03-23T08:37:15Z

Summary

Fixes sync stalls and discovery issues found while testing on mainnet-7.

discv4: remove ping amplification and drop unsolicited Neighbors (`discv4/server.rs`)

handle_neighbors was pinging every returned node immediately -- up to 16 PINGs per FindNode response (1:16 amplification). Removed the ping loop; new contacts are now validated by the revalidation loop instead.

Added pending_find_node tracking so only Neighbors responses to our own FindNode requests are accepted, preventing unsolicited table injection.

Reduced REVALIDATION_CHECK_INTERVAL from 12h to 30s as a required companion: without immediate pings, new contacts need the revalidation loop to validate them before they become usable for lookups.

snap sync: exponential backoff on pivot retry (`sync/snap_sync.rs`)

update_pivot was retrying the same peer thousands of times with no delay, tanking its score from 50 to -50. Now uses exponential backoff (1s/2s/4s/.../30s), caps at 3 retries per peer, resets backoff when a different peer is found, and bails out after 15 total failures.

download tasks: reserve request slots before spawning (`peer_handler.rs`, `snap/client.rs`)

Header/account/storage/bytecode download loops spawned tasks without incrementing the peer's request count first. Since inc_requests only ran inside the spawned task, get_best_peer kept returning the same peer, flooding it with dozens of concurrent requests in a single tick. Workers now call outgoing_request directly instead of make_request to avoid double inc_requests.

sync head resolution: prevent multi-hour stall (`peer_handler.rs`, `snap/constants.rs`, `sync/full.rs`)

request_block_headers asked all peers sequentially (42 peers x 15s timeout = 10+ min per retry) with 10 inner retries and 100 outer attempts. When no peer recognized the sync head hash, the node stalled for hours at "0 headers/s". Fixed by capping to 5 peers per retry, 3 inner retries, 5 outer attempts with 5s delay between them. MAX_HEADER_FETCH_ATTEMPTS reduced from 100 to 5.

Testing

Deployed to mainnet-7. Reaches 100 peers in ~6 minutes (same as main), headers download at 12,354/s, sync completes all phases without stalling.

Test plan

Deployed to mainnet-7, verified peer discovery speed matches main
cargo test -p ethrex-p2p

github-actions · 2026-03-23T08:39:11Z

🤖 Kimi Code Review

This PR introduces a UDP rate limiter and discovery traffic shaping to avoid hosting provider scan detection. The implementation is generally sound but has a few issues:

Critical Issues

1. Resource leak in rate limiter background task
crates/networking/p2p/discovery/rate_limiter.rs:35-46

The replenishment task runs indefinitely even when all RateLimitedSocket instances are dropped. It holds an Arc<Semaphore> clone, preventing cleanup:

tokio::spawn(async move {
    let mut ticker = time::interval(interval);
    loop {
        ticker.tick().await;
        if sem.available_permits() < MAX_PACKETS_PER_SECOND as usize {
            sem.add_permits(1);
        }
    }
});

Fix: Use a tokio_util::sync::CancellationToken or check if the Arc<Semaphore> strong count is 1 (indicating no sockets remain) to exit the loop.

2. Logic error in peer table pruning
crates/networking/p2p/peer_table.rs:845-850

The comment states "Remove contacts that have been repeatedly failing ... or that are disposable" but the code uses &&:

(c.disposable && c.failed_attempts >= MAX_FAILED_ATTEMPTS_BEFORE_PRUNE)

This only prunes disposable contacts that have failed 5+ times, leaving other disposable contacts in the table indefinitely (contrary to the previous behavior and comment).

Fix: Change to (c.disposable || c.failed_attempts >= MAX_FAILED_ATTEMPTS_BEFORE_PRUNE) or update the comment to match the intended behavior.

Minor Issues

3. Division by zero risk
crates/networking/p2p/discovery/rate_limiter.rs:33

let interval = Duration::from_micros(1_000_000 / MAX_PACKETS_PER_SECOND as u64);

If MAX_PACKETS_PER_SECOND is ever changed to 0 (e.g., via configuration), this panics. Use checked_div or ensure the constant is validated.

4. Unnecessary double iteration
crates/networking/p2p/peer_table.rs:920-945

get_contacts_to_revalidate iterates the contacts list twice (once for timeouts, once for selection). While O(2N) is acceptable for 5,000 contacts, this could be

Automated review by Kimi (Moonshot AI) · kimi-k2.5 · custom prompt

github-actions · 2026-03-23T08:39:56Z

Lines of code report

Total lines added: 73
Total lines removed: 17
Total lines changed: 90

Detailed view

+------------------------------------------------+-------+------+
| File                                           | Lines | Diff |
+------------------------------------------------+-------+------+
| ethrex/crates/networking/p2p/discv4/server.rs  | 692   | +25  |
+------------------------------------------------+-------+------+
| ethrex/crates/networking/p2p/peer_handler.rs   | 548   | +1   |
+------------------------------------------------+-------+------+
| ethrex/crates/networking/p2p/snap/client.rs    | 1166  | -17  |
+------------------------------------------------+-------+------+
| ethrex/crates/networking/p2p/sync/full.rs      | 248   | +5   |
+------------------------------------------------+-------+------+
| ethrex/crates/networking/p2p/sync/snap_sync.rs | 1018  | +42  |
+------------------------------------------------+-------+------+

github-actions · 2026-03-23T08:40:55Z

🤖 Codex Code Review

Timed-out pings are turned into "validated" contacts. In crates/networking/p2p/peer_table.rs#L939 the timeout path records a failure and clears ping_id, but it does not clear validation_timestamp. Since was_validated() is just validation_timestamp.is_some() && !has_pending_ping() at crates/networking/p2p/peer_table.rs#L111, a peer that never answered our ping becomes "validated" again immediately after timing out. That feeds directly into amplification gating in validate_contact() at crates/networking/p2p/peer_table.rs#L961 and into discv4 ENR lookups at crates/networking/p2p/peer_table.rs#L902. This should either clear validation_timestamp on timeout or split "last ping sent" from "last successful validation".
Disposable contacts are no longer actually pruned in the common case. prune() now removes only disposable && failed_attempts >= 5 at crates/networking/p2p/peer_table.rs#L844, but set_disposable() only increments the counter once at crates/networking/p2p/peer_table.rs#L1575, and all selection paths stop touching disposable entries at crates/networking/p2p/peer_table.rs#L887 and crates/networking/p2p/peer_table.rs#L1164. So a node marked disposable after a send error in crates/networking/p2p/discv4/server.rs#L275, crates/networking/p2p/discv4/server.rs#L686, or crates/networking/p2p/discv5/server.rs#L449 will usually stay in contacts forever, and later rediscovery only hits the occupied-entry path at crates/networking/p2p/peer_table.rs#L1025 / crates/networking/p2p/peer_table.rs#L1053. If the intent is delayed eviction, something still needs to keep driving failed_attempts upward for disposable contacts.

I couldn’t run the Rust tests in this sandbox because cargo/rustup tried to write under /home/runner/.rustup, which is read-only here.

Automated review by OpenAI Codex · gpt-5.4 · custom prompt

github-actions · 2026-03-23T08:41:40Z

🤖 Claude Code Review

Now I have enough context for a thorough review.

Review: fix(l1): p2p rate limit outgoing (PR #6394)

The motivation is solid and the PR description is excellent — the root-cause analysis is unusually thorough. Most of the individual fixes are correct. A few issues worth addressing before merge.

Critical: Disposable contacts are never pruned (memory leak)

prune() in peer_table.rs:852 now requires both disposable == true AND failed_attempts >= 5:

(c.disposable && c.failed_attempts >= MAX_FAILED_ATTEMPTS_BEFORE_PRUNE)
    .then_some(*c_id)

But is_validation_needed at line 1164 already contains !contact.disposable, which means disposable contacts are filtered out of the second pass of get_contacts_to_revalidate. Since they never get revalidated, they never accumulate more ping-timeout failures. A contact marked disposable via SetDisposable will have failed_attempts = 1 (from record_failure() in the same handler) — and will never be pruned.

Before this PR, every disposable contact was removed on the next 5-second prune cycle. Now they stay in the table forever (up to TARGET_CONTACTS). The condition should be:

// prune if disposable regardless of failure count,
// OR if too many failures even without being flagged disposable
(c.disposable || c.failed_attempts >= MAX_FAILED_ATTEMPTS_BEFORE_PRUNE)
    .then_some(*c_id)

Minor: Background replenishment task leaks on socket drop

RateLimitedSocket::new() at rate_limiter.rs:34 holds a strong Arc clone inside the spawned task:

let sem = semaphore.clone(); // strong clone
tokio::spawn(async move {
    loop { ticker.tick().await; sem.add_permits(1); }
});

When all RateLimitedSocket instances are dropped, the task still holds sem, keeping the Semaphore alive and looping forever. In production this doesn't matter (the socket lives as long as the process), but it does mean each call to new() — including every test in discv5_server_tests.rs — leaks a Tokio task.

Use a weak reference so the task can exit cleanly:

let sem = Arc::downgrade(&semaphore);
tokio::spawn(async move {
    let mut ticker = time::interval(interval);
    loop {
        ticker.tick().await;
        let Some(sem) = sem.upgrade() else { break };
        if sem.available_permits() < MAX_PACKETS_PER_SECOND as usize {
            sem.add_permits(1);
        }
    }
});

Minor: Non-random `.take()` in revalidation batch

get_contacts_to_revalidate at peer_table.rs:949:

self.contacts
    .values()
    .filter(...)
    .take(MAX_REVALIDATION_BATCH)
    .cloned()
    .collect()

HashMap::values() iteration order is deterministic for a given HashMap state but not random. If there are always more than 20 contacts that need revalidation, the same contacts at the "front" of the map will always be picked and contacts at the "back" will systematically never be revalidated. Consider shuffling the filtered results before truncating, or using a BTreeMap with an age-based sort key to guarantee fair ordering.

Nit: Misleading comment in `send_to`

rate_limiter.rs:59:

// The permit is dropped here, but tokens are replenished by the
// background task, not by dropping permits.
permit.forget();

forget() explicitly prevents the permit from being dropped in the RAII sense — the comment's phrasing "The permit is dropped here" is the opposite of what happens. Suggest:

// Consume the token permanently; the background task is responsible
// for replenishment, not permit drops.
permit.forget();

Good changes worth acknowledging

Ping amplification fix (discv4/server.rs:497): Removing the immediate ping-per-neighbor loop is the right call. The nodes.clone() → nodes simplification is a free bonus.
Global rate limiter architecture: Creating one RateLimitedSocket in network.rs and cloning it into both servers correctly shares a single budget. The semaphore-based token bucket is simple and correct for this use case.
Exponential backoff in Contact (peer_table.rs:139): The (1u64 << self.failed_attempts.min(12)).min(3600) expression is safe (no overflow), and the .min(12) guard is correct.
validation_timestamp interpretation for timed-out pings: record_ping_sent updates validation_timestamp = Instant::now() at send time, so using it as the ping age reference in the first pass is accurate.
Interval constants: The new values (1 s initial lookup, 3 s steady-state, 30 s revalidation check, 5 min revalidation interval) are proportionate and well-commented.
Arc<UdpSocket> → RateLimitedSocket propagation: cleanly threaded through both servers and the test helpers.

Automated review by Claude (Anthropic) · sonnet · custom prompt

github-actions · 2026-03-23T09:40:09Z

🤖 Codex Code Review

Findings

peer_table.rs and peer_table.rs: when a ping times out, the new code clears ping_id but leaves validation_timestamp intact. Because was_validated() is defined as validation_timestamp.is_some() && !has_pending_ping(), that timeout now makes the contact look validated again. After the backoff expires, the node is not retried until REVALIDATION_INTERVAL elapses from the failed ping send time, which is up to 12 hours for discv4. That turns transient timeout handling into long-lived false positives and undermines the intended exponential backoff. I’d either clear validation_timestamp on timeout or split “last ping sent” from “last successful validation”.
discv4/server.rs together with peer_table.rs and discv4/server.rs: NEIGHBORS entries are no longer pinged on receipt, but get_contact_for_lookup() still selects unvalidated contacts for outgoing FINDNODE. In discv4, peers commonly ignore FINDNODE until the ping/pong handshake has established the sender as valid. With revalidation now only running hourly, freshly discovered neighbors can sit unusable for a long time while lookup budget is spent on nodes that are very unlikely to answer. If the amplification concern is real, I’d either keep eager pinging for discv4 neighbors with a rate cap, or require was_validated() before a contact can be used for lookup.

Assumptions

I’m assuming the existing validate_contact() behavior reflects the intended discv4 trust model for inbound FINDNODE/ENRRequest, and that remote clients behave similarly enough for this to matter in practice.

Testing gaps

No targeted tests were added for the new timeout/backoff or discv4 neighbor flow.
I couldn’t run cargo test here because the sandbox blocks rustup from creating temp files under ~/.rustup.

Summary
Aside from the two discovery regressions above, the rest of the change looks directionally reasonable: the reduced lookup rate and capped discarded-contact set are understandable traffic-control measures.

Automated review by OpenAI Codex · gpt-5.4 · custom prompt

greptile-apps · 2026-03-23T09:41:56Z

Greptile Summary

This PR addresses a real production issue where ethrex nodes on Hetzner were generating ~29k outgoing UDP packets/second on port 30303, triggering abuse detection. The fix is multi-faceted and well-reasoned: it eliminates the root causes rather than adding a rate limiter.

Key changes:

discv4 ping amplification removed: The loop in handle_neighbors that immediately pinged every returned node (up to 16 per FindNode) is gone. New contacts are added to the table and validated by the scheduled revalidation loop only.
TARGET_CONTACTS reduced 100k → 5k: The primary driver of revalidation storms was a 100,000-entry table being revalidated in bulk; 5k is still ~18× geth's max of 272.
Revalidation rate slowed: discv5 interval moved from 30s → 5min (check: 10s → 30s); discv4 check interval moved from 12h → 1h so it's meaningfully shorter than the 12h stale threshold.
Lookup intervals slowed: Both protocols move from 100ms/600ms to 1s/3s.
Revalidation batch capped at 20: Prevents burst of thousands of PINGs per cycle.
Exponential backoff added: Contacts track failed_attempts / last_failure; lookup and revalidation selection skip contacts in backoff. Pruning triggers at 5 consecutive failures.
Bounded discard set: discarded_contacts converted from FxHashSet (unbounded, permanent) to IndexSet capped at 10,000; oldest entries evicted so they can be rediscovered.
sent_ping_is_stale fix: Previously evaluated for every contact regardless of whether a ping was pending; now correctly gated on has_pending_ping().
record_success() on pong receipt: Resets failure counter when a pong is received, allowing backoff to clear for responsive contacts.

Minor notes:

.take(20) in get_contacts_to_revalidate iterates in IndexMap insertion order, meaning the same oldest-inserted contacts are consistently prioritised; randomising the selection would give newer contacts a fair chance.
get_contact_for_enr_lookup does not apply the is_in_backoff() filter present in the other contact-selection paths.
With discv4's 1h check interval, newly discovered contacts via Neighbors responses won't be validated for up to 1 hour. This is an explicit, accepted tradeoff per the PR description; bootnodes and discv5 still bootstrap quickly.

Confidence Score: 4/5

PR is safe to merge; it fixes a real production network-abuse issue with well-scoped changes. Minor improvements remain (revalidation selection fairness, ENR lookup backoff) but none block correctness.
All 53 discovery tests pass, the root-cause analysis is thorough and matches the code changes, and the implemented mechanisms (backoff, batch cap, bounded discard set, reduced targets) are correct. The two P2 suggestions (randomise revalidation selection, apply backoff in ENR lookup) are improvements worth a follow-up but do not affect the primary goal of reducing UDP traffic.
crates/networking/p2p/peer_table.rs — revalidation batch selection and ENR lookup backoff filtering are the only areas needing attention.

Important Files Changed

Filename	Overview
crates/networking/p2p/peer_table.rs	Core change: adds exponential backoff (`failed_attempts`, `last_failure`) to `Contact`, reduces `TARGET_CONTACTS` from 100k to 5k, caps the revalidation batch at 20, bounds the discard set at 10k with LRU eviction, and fixes the `sent_ping_is_stale` logic to gate on `has_pending_ping()`. Timeout-detection first-pass in `get_contacts_to_revalidate` and `record_success()` on pong receipt are well-placed. Minor: `.take(20)` on insertion-ordered iterator may starve newer contacts; `get_contact_for_enr_lookup` lacks the backoff guard present in other selectors.
crates/networking/p2p/discv4/server.rs	Removes the per-neighbor immediate ping loop in `handle_neighbors` (eliminating 1:16 UDP amplification), reduces `REVALIDATION_CHECK_INTERVAL` from 12h to 1h (so it's meaningfully shorter than the 12h stale threshold), and cuts the lookup interval from 100ms/600ms to 1s/3s. Change is clean and intentional.
crates/networking/p2p/discv5/server.rs	Slows revalidation check from 10s→30s and stale threshold from 30s→5min, and halves lookup rates from 100ms/600ms to 1s/3s. Straightforward constant changes aligned with the PR's goal of reducing burst traffic.

Sequence Diagram

sequenceDiagram
    participant RV as RevalidationLoop
    participant PT as PeerTable
    participant Net as UDP Socket
    participant Remote as Remote Node

    Note over RV,PT: Every 30s (discv5) / 1h (discv4)
    RV->>PT: get_contacts_to_revalidate()
    PT->>PT: Pass 1 – detect timed-out pings (>30s)<br/>record_failure(), clear ping_id
    PT->>PT: Pass 2 – collect stale contacts<br/>skip is_in_backoff(), take(20)
    PT-->>RV: [up to 20 contacts]
    loop for each contact
        RV->>Net: send_ping()
        Net->>Remote: PING
        alt Pong received in time
            Remote->>Net: PONG
            Net->>PT: record_pong_received()<br/>record_success() → reset failed_attempts
        else Timeout (>30s, next cycle)
            PT->>PT: record_failure()<br/>failed_attempts++, backoff = 2^(n-1)s
            Note over PT: If failed_attempts >= 5 → prune
        end
    end

    Note over PT: SetDisposable (send error)<br/>record_failure() + disposable=true<br/>→ pruned next PRUNE_INTERVAL (5s)

    Note over PT: discarded_contacts capped at 10k<br/>oldest evicted → re-discoverable

Comments Outside Diff (1)

crates/networking/p2p/peer_table.rs, line 908-921 (link)

get_contact_for_enr_lookup doesn't filter on is_in_backoff()

get_contact_for_lookup and get_contacts_to_revalidate both skip contacts in backoff, but get_contact_for_enr_lookup does not. While ENR requests are infrequent, contacts that failed pings and are in exponential backoff will still receive ENR requests. For consistency and to fully honour the backoff semantics, consider adding the same guard here:

Prompt To Fix With AI

This is a comment left during a code review.
Path: crates/networking/p2p/peer_table.rs
Line: 908-921

Comment:
**`get_contact_for_enr_lookup` doesn't filter on `is_in_backoff()`**

`get_contact_for_lookup` and `get_contacts_to_revalidate` both skip contacts in backoff, but `get_contact_for_enr_lookup` does not. While ENR requests are infrequent, contacts that failed pings and are in exponential backoff will still receive ENR requests. For consistency and to fully honour the backoff semantics, consider adding the same guard here:



How can I resolve this? If you propose a fix, please make it concise.

Prompt To Fix All With AI

This is a comment left during a code review.
Path: crates/networking/p2p/peer_table.rs
Line: 956-965

Comment:
**Revalidation batch may always favour earliest-inserted contacts**

`.take(20)` operates on an `IndexMap` iterated in insertion order, so the same 20 oldest-inserted contacts are always selected for revalidation until they fail and are pruned. With 5,000 contacts, recently-discovered nodes near the tail of the map could go unvalidated for many cycles.

Consider randomising the selection — e.g. shuffle the filtered iterator or use `choose_multiple` — so newer contacts also get a chance to be pinged:

```rust
        // Second pass: collect contacts that need revalidation and aren't in backoff
        let mut candidates: Vec<Contact> = self
            .contacts
            .values()
            .filter(|c| {
                c.supports_protocol(protocol)
                    && Self::is_validation_needed(c, revalidation_interval)
                    && !c.is_in_backoff()
            })
            .cloned()
            .collect();
        candidates.shuffle(&mut rand::rngs::OsRng);
        candidates.truncate(MAX_REVALIDATION_BATCH);
        candidates
```

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: crates/networking/p2p/peer_table.rs
Line: 908-921

Comment:
**`get_contact_for_enr_lookup` doesn't filter on `is_in_backoff()`**

`get_contact_for_lookup` and `get_contacts_to_revalidate` both skip contacts in backoff, but `get_contact_for_enr_lookup` does not. While ENR requests are infrequent, contacts that failed pings and are in exponential backoff will still receive ENR requests. For consistency and to fully honour the backoff semantics, consider adding the same guard here:

```suggestion
    fn get_contact_for_enr_lookup(&mut self) -> Option<Contact> {
        self.contacts
            .values()
            .filter(|c| {
                c.is_discv4
                    && c.was_validated()
                    && !c.has_pending_enr_request()
                    && c.record.is_none()
                    && !c.disposable
                    && !c.is_in_backoff()
            })
            .collect::<Vec<_>>()
            .choose(&mut rand::rngs::OsRng)
            .cloned()
            .cloned()
    }
```

How can I resolve this? If you propose a fix, please make it concise.

_{Reviews (1): Last reviewed commit: "fix(l1): clear validation on ping timeou..." | Re-trigger Greptile}

greptile-apps · 2026-03-23T09:41:59Z

+        }
+
+        // Second pass: collect contacts that need revalidation and aren't in backoff.
+        // Shuffle before truncating so we don't always pick the same contacts
+        // (IndexMap iterates in insertion order).
+        let mut candidates: Vec<Contact> = self
+            .contacts
            .values()
            .filter(|c| {
                c.supports_protocol(protocol)


Revalidation batch may always favour earliest-inserted contacts

.take(20) operates on an IndexMap iterated in insertion order, so the same 20 oldest-inserted contacts are always selected for revalidation until they fail and are pruned. With 5,000 contacts, recently-discovered nodes near the tail of the map could go unvalidated for many cycles.

Consider randomising the selection — e.g. shuffle the filtered iterator or use choose_multiple — so newer contacts also get a chance to be pinged:

// Second pass: collect contacts that need revalidation and aren't in backoff let mut candidates: Vec<Contact> = self .contacts .values() .filter(|c| { c.supports_protocol(protocol) && Self::is_validation_needed(c, revalidation_interval) && !c.is_in_backoff() }) .cloned() .collect(); candidates.shuffle(&mut rand::rngs::OsRng); candidates.truncate(MAX_REVALIDATION_BATCH); candidates

Prompt To Fix With AI

This is a comment left during a code review. Path: crates/networking/p2p/peer_table.rs Line: 956-965 Comment: **Revalidation batch may always favour earliest-inserted contacts** `.take(20)` operates on an `IndexMap` iterated in insertion order, so the same 20 oldest-inserted contacts are always selected for revalidation until they fail and are pruned. With 5,000 contacts, recently-discovered nodes near the tail of the map could go unvalidated for many cycles. Consider randomising the selection — e.g. shuffle the filtered iterator or use `choose_multiple` — so newer contacts also get a chance to be pinged: ```rust // Second pass: collect contacts that need revalidation and aren't in backoff let mut candidates: Vec<Contact> = self .contacts .values() .filter(|c| { c.supports_protocol(protocol) && Self::is_validation_needed(c, revalidation_interval) && !c.is_in_backoff() }) .cloned() .collect(); candidates.shuffle(&mut rand::rngs::OsRng); candidates.truncate(MAX_REVALIDATION_BATCH); candidates ``` How can I resolve this? If you propose a fix, please make it concise.

github-actions · 2026-03-23T09:42:24Z

🤖 Claude Code Review

Now I have enough context to write the review.

Review: `fix(l1): p2p rate limit outgoing` (PR #6394)

This PR addresses a real operational issue — Hetzner abuse detection from excessive UDP packet output — with a set of mostly well-reasoned changes. The core fixes are sound.

Correctness

was_validated() causes double-pinging of in-flight contacts

was_validated() (peer_table.rs:116) returns false when has_pending_ping() is true:

pub fn was_validated(&self) -> bool {
    self.validation_timestamp.is_some() && !self.has_pending_ping()
}

In the second pass of get_contacts_to_revalidate, is_validation_needed evaluates !was_validated() as true for any contact with an in-flight ping, making validation_is_stale = true. This means contacts that have already been pinged (within the last 30s) are re-selected by the second pass. When a new ping is sent via record_ping_sent, the old ping_id is overwritten and validation_timestamp is reset, effectively restarting the timeout clock.

Consequence: a contact that is slow to respond (say, 20s round-trip) gets a new ping every revalidation check cycle (30s), and its timeout clock keeps resetting so it never trips the 30s timeout in the first pass. The fix is to add && !c.has_pending_ping() to the second pass filter, or alternatively change was_validated() to not depend on has_pending_ping().

Misleading prune() comment

peer_table.rs (new code around line 847):

// Remove contacts that are either:
// - disposable (send error) with enough failures, or
// - have accumulated enough ping timeouts (even without disposable flag)

The condition is || c.disposable, so disposable contacts are pruned unconditionally regardless of failed_attempts. The comment implies a failure threshold is required for disposable contacts, which is wrong. Suggested fix:

// Remove contacts that are either:
// - disposable (e.g., send error occurred), or
// - have accumulated >= MAX_FAILED_ATTEMPTS_BEFORE_PRUNE timeouts

SetDisposable redundantly calls record_failure()

peer_table.rs:1593:

contact.disposable = true;
contact.record_failure();

Disposable contacts are pruned on the next prune() cycle (every 5s) unconditionally via || c.disposable. The record_failure() call has no observable effect on the contact's lifetime. This is harmless but could be confusing — a reader might think the failure count matters for eviction of disposable contacts.

Performance

shift_remove_index(0) is O(n)

peer_table.rs in prune():

while self.discarded_contacts.len() >= MAX_DISCARDED_CONTACTS {
    self.discarded_contacts.shift_remove_index(0);
}

IndexSet::shift_remove_index(0) shifts all subsequent elements left — O(n) where n = MAX_DISCARDED_CONTACTS (10,000). In steady state this runs at most once per pruned contact per cycle, so it's tolerable, but a VecDeque<H256> + FxHashSet<H256> would give O(1) eviction and O(1) membership. Not a blocker at this scale, but worth a comment explaining why IndexSet was chosen.

Good Changes

prune() uses || — correct; ensures both disposable and repeatedly-failing contacts get evicted independently.
sent_ping_is_stale gated on has_pending_ping() in is_validation_needed — correct fix; previously contacts without in-flight pings could be incorrectly queued.
Shuffle before truncate in get_contacts_to_revalidate — good; prevents the same first-N contacts from monopolizing revalidation slots.
validation_timestamp = None after timeout — correct; resets the contact to "unvalidated" state after a failed ping.
UDP amplification removal in handle_neighbors — correct; the 1:16 ping amplification was the primary contributor to the packet storm.
Exponential backoff arithmetic is safe: failed_attempts.saturating_sub(1).min(12) caps the shift at 12 bits (max 4096s), then .min(3600) clamps to 1 hour. No overflow risk.
TARGET_CONTACTS 100,000 → 5,000 — still 18× Geth's 272, but a substantial improvement.

Summary

The double-ping issue (was_validated() returning false for in-flight contacts) is the only meaningful correctness concern introduced or left unaddressed. The other points are documentation/minor efficiency. The operational fix (removing the 1:16 amplification, reducing aggressiveness, adding backoff + bounded eviction) is well-motivated and correctly implemented.

Automated review by Claude (Anthropic) · sonnet · custom prompt

Arkenan

There's a single change I would do. The rest looks perfect. I'm only marking it as a request for changes as I don't think it wise to merge without addressing it.

Arkenan

LGTM thanks for the changes

ElFantasma

I have a concern regarding the speed that peers will be discovered. Lookup interval of 1 second seems too slow. That interval is used for FindNode messages on discv4 and discv5, and also to try to establish RLPx connections with new peers.
Also that interval is configurable via CLI, if it is too agressive a user could change it in the command line.
Also discv4 seems to slow to start sending pings to found neighbors. This should be tested with discv5 disabled and see how discv4 alone behaves (and maybe a test with discv5 alone too)

Discovery: - Remove discv4 ping amplification: handle_neighbors no longer pings every returned node (was up to 16 PINGs per FindNode) - Drop unsolicited Neighbors messages using pending_find_node tracking Snap sync pivot: - Exponential backoff on pivot retry (1s/2s/4s/.../30s) - Cap at 3 retries per peer, 15 total before bailing out - Reset backoff when a different peer is found Header download: - Reserve request slots before tokio::spawn to prevent flooding a single peer with dozens of concurrent tasks - Workers call outgoing_request directly (not make_request) to avoid double inc_requests - Cap sync head resolution to 5 peers per retry, 3 inner retries, 5 outer attempts, 5s delay between retries - Reduce MAX_HEADER_FETCH_ATTEMPTS from 100 to 5

Required companion to the ping amplification removal: without immediate pings in handle_neighbors, new contacts need the revalidation loop to validate them. At 12h interval they were stuck unvalidated and invisible to lookups.

ElFantasma

It LGTM now. It should be tested to ensure peers are still discovered at an appropriate speed, but the changes seem OK.

In any case, incoming and outgoing packet rates should be fine-tuned after working on #5990

edg-l · 2026-03-25T17:23:56Z

It LGTM now. It should be tested to ensure peers are still discovered at an appropriate speed, but the changes seem OK.

In any case, incoming and outgoing packet rates should be fine-tuned after working on #5990

Tested on mainned, reached 100 peers in 5-6 minutes like main

## Summary Fixes sync stalls and discovery issues found while testing on mainnet-7. ### discv4: remove ping amplification and drop unsolicited Neighbors (`discv4/server.rs`) `handle_neighbors` was pinging every returned node immediately -- up to 16 PINGs per FindNode response (1:16 amplification). Removed the ping loop; new contacts are now validated by the revalidation loop instead. Added `pending_find_node` tracking so only Neighbors responses to our own FindNode requests are accepted, preventing unsolicited table injection. Reduced `REVALIDATION_CHECK_INTERVAL` from 12h to 30s as a required companion: without immediate pings, new contacts need the revalidation loop to validate them before they become usable for lookups. ### snap sync: exponential backoff on pivot retry (`sync/snap_sync.rs`) `update_pivot` was retrying the same peer thousands of times with no delay, tanking its score from 50 to -50. Now uses exponential backoff (1s/2s/4s/.../30s), caps at 3 retries per peer, resets backoff when a different peer is found, and bails out after 15 total failures. ### download tasks: reserve request slots before spawning (`peer_handler.rs`, `snap/client.rs`) Header/account/storage/bytecode download loops spawned tasks without incrementing the peer's request count first. Since `inc_requests` only ran inside the spawned task, `get_best_peer` kept returning the same peer, flooding it with dozens of concurrent requests in a single tick. Workers now call `outgoing_request` directly instead of `make_request` to avoid double `inc_requests`. ### sync head resolution: prevent multi-hour stall (`peer_handler.rs`, `snap/constants.rs`, `sync/full.rs`) `request_block_headers` asked all peers sequentially (42 peers x 15s timeout = 10+ min per retry) with 10 inner retries and 100 outer attempts. When no peer recognized the sync head hash, the node stalled for hours at "0 headers/s". Fixed by capping to 5 peers per retry, 3 inner retries, 5 outer attempts with 5s delay between them. `MAX_HEADER_FETCH_ATTEMPTS` reduced from 100 to 5. ## Testing Deployed to mainnet-7. Reaches 100 peers in ~6 minutes (same as main), headers download at 12,354/s, sync completes all phases without stalling. ## Test plan - [x] Deployed to mainnet-7, verified peer discovery speed matches main - [ ] `cargo test -p ethrex-p2p` # Conflicts: # crates/networking/p2p/snap/client.rs

Fixes sync stalls and discovery issues found while testing on mainnet-7. (`discv4/server.rs`) `handle_neighbors` was pinging every returned node immediately -- up to 16 PINGs per FindNode response (1:16 amplification). Removed the ping loop; new contacts are now validated by the revalidation loop instead. Added `pending_find_node` tracking so only Neighbors responses to our own FindNode requests are accepted, preventing unsolicited table injection. Reduced `REVALIDATION_CHECK_INTERVAL` from 12h to 30s as a required companion: without immediate pings, new contacts need the revalidation loop to validate them before they become usable for lookups. `update_pivot` was retrying the same peer thousands of times with no delay, tanking its score from 50 to -50. Now uses exponential backoff (1s/2s/4s/.../30s), caps at 3 retries per peer, resets backoff when a different peer is found, and bails out after 15 total failures. (`peer_handler.rs`, `snap/client.rs`) Header/account/storage/bytecode download loops spawned tasks without incrementing the peer's request count first. Since `inc_requests` only ran inside the spawned task, `get_best_peer` kept returning the same peer, flooding it with dozens of concurrent requests in a single tick. Workers now call `outgoing_request` directly instead of `make_request` to avoid double `inc_requests`. `snap/constants.rs`, `sync/full.rs`) `request_block_headers` asked all peers sequentially (42 peers x 15s timeout = 10+ min per retry) with 10 inner retries and 100 outer attempts. When no peer recognized the sync head hash, the node stalled for hours at "0 headers/s". Fixed by capping to 5 peers per retry, 3 inner retries, 5 outer attempts with 5s delay between them. `MAX_HEADER_FETCH_ATTEMPTS` reduced from 100 to 5. Deployed to mainnet-7. Reaches 100 peers in ~6 minutes (same as main), headers download at 12,354/s, sync completes all phases without stalling. - [x] Deployed to mainnet-7, verified peer discovery speed matches main - [ ] `cargo test -p ethrex-p2p`

github-actions Bot assigned edg-l Mar 23, 2026

edg-l changed the title ~~fix(p2p): rate limit outgoing UDP to prevent Hetzner network scan bans~~ fix(l1): p2p rate limit outgoing Mar 23, 2026

github-actions Bot added the L1 Ethereum client label Mar 23, 2026

github-project-automation Bot added this to ethrex_l1 Mar 23, 2026

edg-l moved this to In Progress in ethrex_l1 Mar 23, 2026

edg-l marked this pull request as ready for review March 23, 2026 09:37

edg-l requested a review from a team as a code owner March 23, 2026 09:37

ethrex-project-sync Bot moved this from In Progress to In Review in ethrex_l1 Mar 23, 2026

greptile-apps Bot reviewed Mar 23, 2026

View reviewed changes

edg-l requested a review from ElFantasma March 23, 2026 12:20

Arkenan requested changes Mar 23, 2026

View reviewed changes

Comment thread crates/networking/p2p/peer_table.rs Outdated

github-project-automation Bot moved this from In Review to In Progress in ethrex_l1 Mar 23, 2026

Arkenan approved these changes Mar 24, 2026

View reviewed changes

azteca1998 approved these changes Mar 24, 2026

View reviewed changes

ElFantasma requested changes Mar 24, 2026

View reviewed changes

Comment thread crates/networking/p2p/discv4/server.rs Outdated

edg-l changed the title ~~fix(l1): p2p rate limit outgoing~~ fix(l1): p2p discovery, sync stalls, and request flooding Mar 25, 2026

edg-l changed the title ~~fix(l1): p2p discovery, sync stalls, and request flooding~~ fix(l1): p2p rate limit outgoing Mar 25, 2026

edg-l changed the title ~~fix(l1): p2p rate limit outgoing~~ fix(l1): p2p rate limit outgoing and improve sync Mar 25, 2026

edg-l added 2 commits March 25, 2026 15:52

fix(l1): reduce discv4 revalidation check to 30s

0c5c897

Required companion to the ping amplification removal: without immediate pings in handle_neighbors, new contacts need the revalidation loop to validate them. At 12h interval they were stuck unvalidated and invisible to lookups.

edg-l force-pushed the fix/p2p-udp-rate-limiting branch from 5d869a7 to 0c5c897 Compare March 25, 2026 15:20

edg-l changed the title ~~fix(l1): p2p rate limit outgoing and improve sync~~ fix(l1): p2p sync stall fixes and discovery hardening Mar 25, 2026

ElFantasma approved these changes Mar 25, 2026

View reviewed changes

github-project-automation Bot moved this from In Progress to In Review in ethrex_l1 Mar 25, 2026

edg-l added this pull request to the merge queue Mar 25, 2026

Merged via the queue into main with commit 5837959 Mar 25, 2026
86 of 88 checks passed

edg-l deleted the fix/p2p-udp-rate-limiting branch March 25, 2026 18:16

github-project-automation Bot moved this from In Review to Done in ethrex_l1 Mar 25, 2026

ElFantasma added a commit that referenced this pull request Mar 27, 2026

docs(l1): fix #6394 double-count and #6327 merge date

744de7f

ElFantasma mentioned this pull request Apr 14, 2026

fix(l1): snap sync pivot update crashes due to peer selection bugs #6474

Open

iovoid added a commit that referenced this pull request Apr 20, 2026

heal: reserve peer request slot before spawn (mirror #6394 fix)

0ac2429

iovoid added a commit that referenced this pull request Apr 28, 2026

heal: reserve peer request slot before spawn (mirror #6394 fix)

c5a40a7

Conversation

edg-l commented Mar 23, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

discv4: remove ping amplification and drop unsolicited Neighbors (discv4/server.rs)

snap sync: exponential backoff on pivot retry (sync/snap_sync.rs)

download tasks: reserve request slots before spawning (peer_handler.rs, snap/client.rs)

sync head resolution: prevent multi-hour stall (peer_handler.rs, snap/constants.rs, sync/full.rs)

Testing

Test plan

Uh oh!

github-actions Bot commented Mar 23, 2026

🤖 Kimi Code Review

Critical Issues

Minor Issues

Uh oh!

github-actions Bot commented Mar 23, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Lines of code report

Uh oh!

github-actions Bot commented Mar 23, 2026

🤖 Codex Code Review

Uh oh!

github-actions Bot commented Mar 23, 2026

🤖 Claude Code Review

Review: fix(l1): p2p rate limit outgoing (PR #6394)

Critical: Disposable contacts are never pruned (memory leak)

Minor: Background replenishment task leaks on socket drop

Minor: Non-random .take() in revalidation batch

Nit: Misleading comment in send_to

Good changes worth acknowledging

Uh oh!

github-actions Bot commented Mar 23, 2026

🤖 Codex Code Review

Uh oh!

greptile-apps Bot commented Mar 23, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Greptile Summary

Confidence Score: 4/5

Important Files Changed

Sequence Diagram

Comments Outside Diff (1)

Uh oh!

greptile-apps Bot Mar 23, 2026

Choose a reason for hiding this comment

Uh oh!

github-actions Bot commented Mar 23, 2026

🤖 Claude Code Review

Review: fix(l1): p2p rate limit outgoing (PR #6394)

Correctness

Performance

Good Changes

Summary

Uh oh!

Arkenan left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Arkenan left a comment

Choose a reason for hiding this comment

Uh oh!

ElFantasma left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

ElFantasma left a comment

Choose a reason for hiding this comment

Uh oh!

edg-l commented Mar 25, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

edg-l commented Mar 23, 2026 •

edited

Loading

discv4: remove ping amplification and drop unsolicited Neighbors (`discv4/server.rs`)

snap sync: exponential backoff on pivot retry (`sync/snap_sync.rs`)

download tasks: reserve request slots before spawning (`peer_handler.rs`, `snap/client.rs`)

sync head resolution: prevent multi-hour stall (`peer_handler.rs`, `snap/constants.rs`, `sync/full.rs`)

github-actions Bot commented Mar 23, 2026 •

edited

Loading

Minor: Non-random `.take()` in revalidation batch

Nit: Misleading comment in `send_to`

greptile-apps Bot commented Mar 23, 2026 •

edited

Loading

Review: `fix(l1): p2p rate limit outgoing` (PR #6394)