Skip to content

fix(snap-sync): refresh pivot instead of punishing the only peer (#6803)#11569

Merged
LukaszRozmej merged 5 commits into
NethermindEth:masterfrom
0xDevNinja:feature/6803-peer-quality-guard
May 15, 2026
Merged

fix(snap-sync): refresh pivot instead of punishing the only peer (#6803)#11569
LukaszRozmej merged 5 commits into
NethermindEth:masterfrom
0xDevNinja:feature/6803-peer-quality-guard

Conversation

@0xDevNinja

Copy link
Copy Markdown
Contributor

Fixes #6803

Changes

  • Track in AnalyzeResponsePerPeer whether the recent result window contains entries from any peer other than the one being analyzed.
  • When the only peer observed exceeds AllowedInvalidResponses with no offsetting success, call _snapProvider.UpdatePivot() and clear the log instead of returning LesserQuality. Punishing the only available peer in that case just stalls sync.
  • Multi-peer behavior is untouched — the existing allLastFailures == peerLastFailures punishment branch still fires when other peers are healthy and only this peer is misbehaving.

Types of changes

What types of changes does your code introduce?

  • Bugfix (a non-breaking change that fixes an issue)
  • New feature (a non-breaking change that adds functionality)
  • Breaking change (a change that causes existing functionality not to work as expected)
  • Optimization
  • Refactoring
  • Documentation update
  • Build-related changes
  • Other: Description

Testing

Requires testing

  • Yes
  • No

If yes, did you write tests?

  • Yes
  • No

Notes on testing

Two new cases in AnalyzeResponsePerPeerTests:

  • Single_peer_with_consecutive_failures_refreshes_pivot_instead_of_punishing — exercises the Do not mark peer as low quality too soon if there is only one peer connected #6803 path: six consecutive failures from a sole peer trigger UpdatePivot and never return LesserQuality.
  • Single_peer_with_recent_success_is_not_punished_below_threshold — confirms a sub-threshold failure burst after a success neither punishes nor refreshes.

All four pre-existing tests continue to pass.

Documentation

Requires documentation update

  • Yes
  • No

Requires explanation in Release Notes

  • Yes
  • No

Remarks

Earlier attempts (#6818, #7067) were closed for being too simplistic or lacking tests. This iteration:

  • Uses the result log itself as the single-peer signal (no ISyncPeerPool injection needed), keeping SnapSyncFeed constructor unchanged.
  • Only refreshes the pivot when truly observing one peer with zero successes, leaving the multi-peer punishment branch intact.

Fixes NethermindEth#6803. Track if other peers appear in the recent window; when only
one peer is observed and there are no successes, refresh the pivot
instead of marking it LesserQuality.
Covers NethermindEth#6803: consecutive failures from a sole peer trigger UpdatePivot,
and a sub-threshold burst after a success neither punishes nor refreshes.
{
foreach ((PeerInfo peer, AddRangeResult result) item in _resultLog)
{
if (item.peer != peer)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if item.peer is the first peer?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch — you are right. With the in-loop check, a newcomer that fails its first 6 requests while healthy peers sit further back in the log would never reach those older entries before seenOtherPeer got evaluated, so the guard would refresh the pivot instead of punishing the newcomer.

Moved seenOtherPeer to a pre-scan over the full window in 83a0327946 and added a regression test (New_peer_failing_burst_is_punished_when_log_holds_other_healthy_peers) that exercises exactly this case.

Addresses @asdacap's review on NethermindEth#11569: the in-loop seenOtherPeer check
could fire before reaching entries from older healthy peers, so a
newcomer's first AllowedInvalidResponses failures would refresh the
pivot instead of being demoted. Compute seenOtherPeer over the full
result window before the analysis loop. Adds a regression test for
the multi-peer-with-newcomer case.
@0xDevNinja 0xDevNinja requested a review from asdacap May 13, 2026 17:52
@LukaszRozmej

Copy link
Copy Markdown
Member

@claude review

@claude

claude Bot commented May 14, 2026

Copy link
Copy Markdown
Contributor

Claude finished @LukaszRozmej's task in 4m 31s —— View job


PR Review: fix(snap-sync): refresh pivot instead of punishing the only peer

  • Reading changed files
  • Loading relevant rules
  • Analyzing implementation correctness
  • Checking tests
  • Posting findings

Summary

The fix is logically correct and addresses the original stall scenario (#6803) without disturbing multi-peer behavior. The reviewer-caught race (in-loop seenOtherPeer check firing prematurely for a newcomer) was properly addressed by the pre-scan in commit 83a0327.

Correctness trace

Key invariants verified by manual trace:

Scenario Expected Result
Single peer, ≥6 consecutive failures, no successes in log UpdatePivot + return OK ✓ condition 1 fires before condition 2
Single peer, ≤5 failures No action ✓ threshold never exceeded
New peer failing 6×, healthy peer's OK further back in log Punish (LesserQuality) ✓ pre-scan sets seenOtherPeer=true, falls to condition 2
Multi-peer, one peer consistently bad Punish ✓ condition 2 unchanged
Multi-peer, global failure stream UpdatePivot ✓ condition 3 unchanged

The ordering of conditions 1 → 2 → 3 is correct: condition 1 is a strict single-peer sub-case that is checked first, preventing condition 2 from incorrectly punishing the sole available peer.

Findings

Medium

  • Weak test assertion at line 141Is.Not.EqualTo(LesserQuality) passes for NoProgress, InternalError, etc. Since the input is DifferentRootHash, the expected result after a pivot refresh is OK. The assertion should be Is.EqualTo(SyncResponseHandlingResult.OK). (inline comment posted)

Low

  • Duplicated constant at line 195private const int AllowedInvalidResponses = 5 must stay in sync with the production constant. If the production value changes, the two new tests silently use wrong loop bounds. Consider exposing the constant as internal or adding a tracking note. (inline comment posted)

Pre-existing issues (not introduced by this PR)

  • The while (_resultLog.Count > maxSize) trimming loop reads _resultLog.Count outside lock (_syncLock) — a data race on LinkedList<T>. Not in scope here.
  • The two-lock structure (add in one lock, analyze in a second) means another thread can modify the log between the two acquisitions. Also pre-existing.

Overall

The production logic is sound and the three new tests meaningfully cover the bug scenario and its regression cases. The only blocking-adjacent issue is the imprecise test assertion (Medium); the constant duplication is a minor hygiene point. Both are in tests, not production code.

lastResult = feed.AnalyzeResponsePerPeer(AddRangeResult.DifferentRootHash, peer);
}

Assert.That(lastResult, Is.Not.EqualTo(SyncResponseHandlingResult.LesserQuality));

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Medium — Weak assertion masks potential regressions

Assert.That(lastResult, Is.Not.EqualTo(SyncResponseHandlingResult.LesserQuality));

This passes for any result other than LesserQuality — including NoProgress or InternalError. Since the input is DifferentRootHash (not ExpiredRootHash), the expected post-refresh return value is OK. The assertion should be exact:

Suggested change
Assert.That(lastResult, Is.Not.EqualTo(SyncResponseHandlingResult.LesserQuality));
Assert.That(lastResult, Is.EqualTo(SyncResponseHandlingResult.OK));

Fix this →

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tightened in a5495b409b. Single-peer refresh path returns OK after UpdatePivot, so the assertion now checks that exact value.

snapProvider.DidNotReceive().UpdatePivot();
}

private const int AllowedInvalidResponses = 5;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Low — Duplicated constant drifts silently

AllowedInvalidResponses = 5 must stay in sync with the production constant in SnapSyncFeed. If the production value is changed, the loop bounds in Single_peer_with_recent_success_is_not_punished_below_threshold and New_peer_failing_burst_is_punished_when_log_holds_other_healthy_peers silently test the wrong thresholds.

One approach — expose the production constant as internal (with InternalsVisibleTo already wired in the test project) and reference it directly. Alternatively, add a note that this value must track the production constant.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in a5495b409b — promoted SnapSyncFeed.AllowedInvalidResponses to internal (with [assembly: InternalsVisibleTo("Nethermind.Synchronization.Test")] added) and the test file now references it directly. The local duplicate is gone, so changing the production value updates both loop bounds automatically.

Assert exact OK on single-peer pivot refresh; expose production constant
as internal so test loop bounds cannot drift from SnapSyncFeed.
InternalsVisibleTo("Nethermind.Synchronization.Test") is already declared
in PeerInfo.cs (and other files) in the same assembly, so the duplicate
attribute and its System.Runtime.CompilerServices using are unnecessary.

Rename the pre-scan tuple element from p to peer so probe.peer matches the
naming used by item.peer in the main loop below.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@LukaszRozmej LukaszRozmej merged commit e217f11 into NethermindEth:master May 15, 2026
547 of 551 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Do not mark peer as low quality too soon if there is only one peer connected

5 participants