Skip to content

Feature/Rework DiscV4#8616

Merged
flcl42 merged 126 commits into
masterfrom
feature/kademlia-discv4
May 27, 2026
Merged

Feature/Rework DiscV4#8616
flcl42 merged 126 commits into
masterfrom
feature/kademlia-discv4

Conversation

@asdacap

@asdacap asdacap commented May 12, 2025

Copy link
Copy Markdown
Contributor
  • Rework DiscV4 to use a somewhat reusable Kademlia implementation that is also used for Portal.
    • This kademlia implementation uses proper tree routing table.
  • INodeSource from discv4 now actively iterate the kad network for random target instead of randomly find nodes on another thread and waiting for new nodes event to send to INodeSource.
    • Discovery traffic now properly go down as it uses IAsyncEnumerable to block further request when peer manager is full.
  • Additionally, peer now get pinged first before sending to PeerManager.
  • Did not add forkid check in this PR. Experiment shows that confusingly not many peer expose forkid.
  • Although it does populate peer candidate list faster, this is largely due to much higher limits. In fact the efficiency in terms of new candidate per message is lower.
  • Graph is (After, Before).
    Screenshot_2025-05-22_08-10-29

Changes

  • Almost everything Discv4 except for serialization.

How to review.

  • Probably start with DiscoveryApp, follow DiscV4KademliaModule, and then KademliaModule.
  • The important part are probably KademliaDiscv4Adapter.

Types of changes

What types of changes does your code introduce?

  • Optimization
  • Refactoring

Testing

Requires testing

  • Yes
  • No

If yes, did you write tests?

  • Yes
  • No

Notes on testing

  • Hive discv4
  • Mainnet
  • Energyweb

Comment thread src/Nethermind/Nethermind.Network.Discovery/Discv4/EnrResponseHandler.cs Outdated
Comment thread src/Nethermind/Nethermind.Network.Discovery/Discv4/IMessageHandler.cs Outdated
Comment thread src/Nethermind/Nethermind.Network.Discovery/Discv4/KademliaDiscv4Adapter.cs Outdated
Comment thread src/Nethermind/Nethermind.Network.Discovery/Discv4/NodeSession.cs Outdated
Comment thread src/Nethermind/Nethermind.Network.Discovery/Discv4/NodeSession.cs Outdated
Comment thread src/Nethermind/Nethermind.Network.Discovery/Discv4/PongMsgHandler.cs Outdated
Comment thread src/Nethermind/Nethermind.Network.Discovery/Kademlia/DoubleEndedLru.cs Outdated
Comment thread src/Nethermind/Nethermind.Network.Discovery/Kademlia/IteratorNodeLookup.cs Outdated
Comment thread src/Nethermind/Nethermind.Network.Discovery/Kademlia/IteratorNodeLookup.cs Outdated
Comment thread src/Nethermind/Nethermind.Network.Discovery/TalkReqAndRespHandler.cs Outdated
LukaszRozmej and others added 2 commits May 22, 2026 09:51
- NodeSessionTests: collapse three flag-and-timeout tests into one [TestCaseSource]
- IteratorNodeLookupTests: extract routing-table / find-neighbours stub helpers
- KademliaDiscv4AdapterTests: reuse ConfigureBondCallback in ping test
- KademliaSimulation: drop accidentally duplicated [TestFixture(3, 0)]

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- EnrResponseHandler / PongMsgHandler: collapse Handle to expression body
- DoubleEndedLru.GetByHash: collapse to ternary expression body
- IteratorNodeLookup.ShouldStop: drop redundant if, return condition directly
- IteratorNodeLookup.FindNeighbour: collapse unreachable check + send to single return
- NodeSession.RecordStatsFor*Msg: dedupe via shared helper with switch expression
  (relies on Discovery*Out/Discovery*In pairs being adjacent in NodeStatsEventType)
- KademliaDiscv4Adapter: drop #region
- ITaskCompleter<T>: move to its own file
- TalkReqAndRespHandler.HandleResponse: collapse to one line

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@LukaszRozmej LukaszRozmej left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Deep review

Findings below are observations, not pushed changes — please decide which to apply.

Important (real defects, but not blockers)

  1. KBucketTree.cs:59,153 — events fire under McsLock. OnNodeAdded / OnNodeRemoved are invoked synchronously while _lock is held. The internal subscriber (KademliaNodeSource.Handler) uses non-blocking Channel.Writer.TryWrite, but DiscoveryApp.OnKademliaNodeRemoved re-raises a public NodeRemoved event with external (peer-manager) subscribers — any one re-entering the tree deadlocks. Fix: capture the node, release the using scope, then Invoke.

  2. LookupKNearestNeighbour.cs:65,107 — racy TaskCompletionSource reassignment, and the token is misused. Multiple workers race on if (roundComplete.TrySetResult()) roundComplete = new TaskCompletionSource(token); outside any lock. Also new TaskCompletionSource(token) treats CancellationToken as opaque state — cancellation is not actually wired in. Fix: Interlocked.Exchange for the swap, and drop the (token) arg or register cancellation via RegisterToCompletionSource.

  3. Kademlia.cs:103-115Bootstrap only catches OperationCanceledException. Any non-cancellation exception from Ping (transient network error, etc.) propagates through Parallel.ForEachAsync, kills Bootstrap, kills Run's while (true), and lands in ActivateAsync.catch(Exception) (DiscoveryApp.cs:244). Discovery then dies for the rest of the process lifetime. Wrap individual pings the way IteratorNodeLookup.FindNeighbour does, or catch+log inside Run's outer loop and continue.

  4. EnrResponseHandler.cs:12 — accepts any unsolicited ENR response. Unlike PongMsgHandler (which validates pong.PingMdc == ping.Mdc), this handler takes the first ENR seen on the (nodeId, EnrResponse) key. EnrResponseMsg already carries RequestKeccak; capture the request hash on send and compare.

  5. KademliaNodeSource.cs:118-122finally awaits discoverTask before unsubscribing. If anything inside the consumer await foreach throws before cancellation, the producer keeps running and OnNodeAdded stays subscribed until cancel. Either unsubscribe first, or ch.Writer.TryComplete() before the await.

  6. KBucket.GetAllWithHash() / DoubleEndedLru.GetAllWithHash() return the live LinkedList<>. All current callers happen to go through KBucketTree._lock, but IRoutingTable.IterateBuckets() exposes KBucket to the outside — a future caller iterating outside the lock will hit InvalidOperationException: Collection was modified. Either materialize under lock or document the lock requirement.

Nits

  • Typo on public surface: LookupFindNeighbourHardTimout in KademliaConfig and DiscV4KademliaModule.cs:43 — rename to …Timeout.
  • KademliaNodeSource ctor param lookup2 — leftover name from refactor.
  • KBucketTree.GetAllAtDistanceRecursive and KBucketTree.GetKNearestNeighbour use LINQ .Where/.Select/.Take chains on hot lookup paths — convert to manual foreach per project style (.agents/rules/coding-style.md rule: no LINQ on hot paths).
  • KBucketTree.cs:130 .Reverse() on IEnumerable allocates a buffer on every split.
  • Hash256XorUtils.MaxDistance can be public const int instead of a property.
  • Kademlia.cs:61 comment: "logid" → "logic".
  • KademliaDiscv4Adapter.cs:175 comment: "does not seems" → "does not seem".

TODO assessment

Discv4/DiscV4KademliaModule.cs:43LookupFindNeighbourHardTimout = SendNodeTimeout; // TODO: This seems very low.Doable now. SendNodeTimeout defaults to 500ms, used here as the per-FindNode hard timeout inside Kademlia lookup. At 500ms many slow peers get cut off before they respond. Add a dedicated KademliaLookupHardTimeoutMs on IDiscoveryConfig (default ~2000ms), or use 2 * SendNodeTimeout. Combine with the Timout typo rename.

Kademlia/Hash256XorUtils.cs:70// TODO: Just add a min/max range per bucket and randomized between them.Defer. Needs threading per-bucket prefix ranges through IRoutingTable/KBucketTree and changing IKeyOperator.CreateRandomKeyAtDistance signatures. No correctness risk today — the function is covered by Hash256XorUtilsTests.TestGetRandomHash. Code-shape TODO, follow-up.

Kademlia/KBucketTree.cs:28// TODO: Double check and probably make lockless.Defer. Non-trivial lockfree design, would need a lockless DoubleEndedLru and new concurrent stress tests. The lock itself is correct; throughput is the concern. But the concrete defect the TODO hints at — event invocation under the lock — is Important #1 above and can be fixed independently of going lockless.

Test coverage gaps worth filling

  • KBucketTree.SplitBucket LRU order preservation (the .Reverse() claim) — no assertion.
  • KBucketTree.GetAllAtDistance depth > targetDepth "collect all leaves" branch — untested.
  • NodeHealthTracker.TryRefresh — entire Task.Run path (ping head → refresh on success → evict on failure) not exercised.
  • LookupKNearestNeighbour — no mid-flight cancellation test; only one Alpha value implicitly exercised.
  • KademliaDiscv4Adapter.RemoveMessageHandler — CAS retry loop never stressed with concurrent add/remove.
  • EnrResponseHandler — no test that an unsolicited ENR response is rejected (would catch Important #4).
  • DiscoveryPersistenceManager.LoadPersistedNodesNode construction throwing and ping throwing paths not asserted to skip and continue.

LukaszRozmej and others added 2 commits May 22, 2026 10:41
Important:
- KBucketTree: raise OnNodeAdded/OnNodeRemoved after releasing _lock to
  avoid re-entrancy deadlocks via external subscribers.
- LookupKNearestNeighbour: atomic CAS swap of roundComplete TCS and drop
  the misused (token) state argument; mark `finished` access with Volatile.
- Kademlia.Bootstrap: catch non-OCE exceptions from individual ping calls
  and from the Bootstrap iteration in Run, so a transient network error
  no longer kills the discovery loop.
- EnrResponseHandler: validate resp.RequestKeccak against the MDC of the
  EnrRequest just sent. EnrRequestMsg.Hash is now also set by the
  serializer on send.
- KademliaNodeSource: unsubscribe Handler and complete the channel writer
  before awaiting the producer task in the finally block.
- DoubleEndedLru/KBucket: GetAll / GetAllWithHash now materialize arrays
  under the inner lock rather than returning the live LinkedList.

Nits:
- Rename LookupFindNeighbourHardTimout -> LookupFindNeighbourHardTimeout.
- Rename KademliaNodeSource ctor param lookup2 -> lookup.
- Replace LINQ chains on KBucketTree hot paths (GetAllAtDistance,
  GetKNearestNeighbour, SplitBucket) with manual loops.
- Hash256XorUtils.MaxDistance -> public const int.
- Drop two stale comments and fix `does not seems` typo.

Tests:
- KBucketTreeTests: split preserves LRU order; GetAllAtDistance covers
  the deeper-bucket branch.
- LookupKNearestNeighbourTests: mid-flight cancellation across Alpha=1/3.
- NodeHealthTrackerTests: TryRefresh removes node on ping timeout,
  keeps it on success.
- KademliaDiscv4AdapterTests: SendEnrRequest now wires request hash;
  added rejection test for unsolicited / wrong-keccak ENR responses.
- DiscoveryPersistenceManagerTests: skip nodes that fail Node ctor.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
# Conflicts:
#	src/Nethermind/Nethermind.Xdc/XdcModule.cs
LukaszRozmej and others added 3 commits May 22, 2026 10:51
- Extract IdentityNodeHashProvider into a shared file so KBucketTreeTests
  and LookupKNearestNeighbourTests stop redefining it.
- KBucketTreeTests: pull common KBucketTree construction into CreateTree
  and Add helpers; replace duplicated Assert.That index lookups with
  single Is.EqualTo array comparisons.
- LookupKNearestNeighbourTests: extract CreateLookup factory and named
  seed/neighbour hash constants so the two test methods only differ in
  what they're actually asserting.
- NodeHealthTrackerTests: extract a CreateTracker helper (and Self/Remote
  /Stale constants) so each test body only sets the bits unique to its
  scenario; promote StringNodeHashProvider to a singleton.
- DiscoveryPersistenceManagerTests: shared NodeA / NodeB fixtures plus a
  CreateManager helper and a PingReceived assertion helper; drop the
  per-test ad-hoc NetworkNode arrays.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@flcl42 flcl42 merged commit 0bbc395 into master May 27, 2026
546 checks passed
@flcl42 flcl42 deleted the feature/kademlia-discv4 branch May 27, 2026 12:51
@benaadams

Copy link
Copy Markdown
Member

Do we need "Online bootnodes: 2" output every 3 blocks?

image

@asdacap

asdacap commented May 27, 2026

Copy link
Copy Markdown
Contributor Author

That.... is probably a debug leftover

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants