fix(snap-sync): refresh pivot instead of punishing the only peer (#6803)#11569
Conversation
Fixes NethermindEth#6803. Track if other peers appear in the recent window; when only one peer is observed and there are no successes, refresh the pivot instead of marking it LesserQuality.
Covers NethermindEth#6803: consecutive failures from a sole peer trigger UpdatePivot, and a sub-threshold burst after a success neither punishes nor refreshes.
| { | ||
| foreach ((PeerInfo peer, AddRangeResult result) item in _resultLog) | ||
| { | ||
| if (item.peer != peer) |
There was a problem hiding this comment.
What if item.peer is the first peer?
There was a problem hiding this comment.
Good catch — you are right. With the in-loop check, a newcomer that fails its first 6 requests while healthy peers sit further back in the log would never reach those older entries before seenOtherPeer got evaluated, so the guard would refresh the pivot instead of punishing the newcomer.
Moved seenOtherPeer to a pre-scan over the full window in 83a0327946 and added a regression test (New_peer_failing_burst_is_punished_when_log_holds_other_healthy_peers) that exercises exactly this case.
Addresses @asdacap's review on NethermindEth#11569: the in-loop seenOtherPeer check could fire before reaching entries from older healthy peers, so a newcomer's first AllowedInvalidResponses failures would refresh the pivot instead of being demoted. Compute seenOtherPeer over the full result window before the analysis loop. Adds a regression test for the multi-peer-with-newcomer case.
|
@claude review |
|
Claude finished @LukaszRozmej's task in 4m 31s —— View job PR Review: fix(snap-sync): refresh pivot instead of punishing the only peer
SummaryThe fix is logically correct and addresses the original stall scenario (#6803) without disturbing multi-peer behavior. The reviewer-caught race (in-loop Correctness traceKey invariants verified by manual trace:
The ordering of conditions 1 → 2 → 3 is correct: condition 1 is a strict single-peer sub-case that is checked first, preventing condition 2 from incorrectly punishing the sole available peer. FindingsMedium
Low
Pre-existing issues (not introduced by this PR)
OverallThe production logic is sound and the three new tests meaningfully cover the bug scenario and its regression cases. The only blocking-adjacent issue is the imprecise test assertion (Medium); the constant duplication is a minor hygiene point. Both are in tests, not production code. |
| lastResult = feed.AnalyzeResponsePerPeer(AddRangeResult.DifferentRootHash, peer); | ||
| } | ||
|
|
||
| Assert.That(lastResult, Is.Not.EqualTo(SyncResponseHandlingResult.LesserQuality)); |
There was a problem hiding this comment.
Medium — Weak assertion masks potential regressions
Assert.That(lastResult, Is.Not.EqualTo(SyncResponseHandlingResult.LesserQuality));This passes for any result other than LesserQuality — including NoProgress or InternalError. Since the input is DifferentRootHash (not ExpiredRootHash), the expected post-refresh return value is OK. The assertion should be exact:
| Assert.That(lastResult, Is.Not.EqualTo(SyncResponseHandlingResult.LesserQuality)); | |
| Assert.That(lastResult, Is.EqualTo(SyncResponseHandlingResult.OK)); |
There was a problem hiding this comment.
Tightened in a5495b409b. Single-peer refresh path returns OK after UpdatePivot, so the assertion now checks that exact value.
| snapProvider.DidNotReceive().UpdatePivot(); | ||
| } | ||
|
|
||
| private const int AllowedInvalidResponses = 5; |
There was a problem hiding this comment.
Low — Duplicated constant drifts silently
AllowedInvalidResponses = 5 must stay in sync with the production constant in SnapSyncFeed. If the production value is changed, the loop bounds in Single_peer_with_recent_success_is_not_punished_below_threshold and New_peer_failing_burst_is_punished_when_log_holds_other_healthy_peers silently test the wrong thresholds.
One approach — expose the production constant as internal (with InternalsVisibleTo already wired in the test project) and reference it directly. Alternatively, add a note that this value must track the production constant.
There was a problem hiding this comment.
Done in a5495b409b — promoted SnapSyncFeed.AllowedInvalidResponses to internal (with [assembly: InternalsVisibleTo("Nethermind.Synchronization.Test")] added) and the test file now references it directly. The local duplicate is gone, so changing the production value updates both loop bounds automatically.
Assert exact OK on single-peer pivot refresh; expose production constant as internal so test loop bounds cannot drift from SnapSyncFeed.
InternalsVisibleTo("Nethermind.Synchronization.Test") is already declared
in PeerInfo.cs (and other files) in the same assembly, so the duplicate
attribute and its System.Runtime.CompilerServices using are unnecessary.
Rename the pre-scan tuple element from p to peer so probe.peer matches the
naming used by item.peer in the main loop below.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Fixes #6803
Changes
AnalyzeResponsePerPeerwhether the recent result window contains entries from any peer other than the one being analyzed.AllowedInvalidResponseswith no offsetting success, call_snapProvider.UpdatePivot()and clear the log instead of returningLesserQuality. Punishing the only available peer in that case just stalls sync.allLastFailures == peerLastFailurespunishment branch still fires when other peers are healthy and only this peer is misbehaving.Types of changes
What types of changes does your code introduce?
Testing
Requires testing
If yes, did you write tests?
Notes on testing
Two new cases in
AnalyzeResponsePerPeerTests:Single_peer_with_consecutive_failures_refreshes_pivot_instead_of_punishing— exercises the Do not mark peer as low quality too soon if there is only one peer connected #6803 path: six consecutive failures from a sole peer triggerUpdatePivotand never returnLesserQuality.Single_peer_with_recent_success_is_not_punished_below_threshold— confirms a sub-threshold failure burst after a success neither punishes nor refreshes.All four pre-existing tests continue to pass.
Documentation
Requires documentation update
Requires explanation in Release Notes
Remarks
Earlier attempts (#6818, #7067) were closed for being too simplistic or lacking tests. This iteration:
ISyncPeerPoolinjection needed), keepingSnapSyncFeedconstructor unchanged.