Skip to content

Conversation

@lidel
Copy link
Member

@lidel lidel commented Sep 15, 2025

The main branch fails very very often, especially on Windows.

This PR should stabilize CI by removing some races / deadlocks when tests are run with things like go test -shuffle=on -failfast -race -count=10 ./....

I don't think this will have impact on real world queries but maybe there are better ways of dealing with those?

fix: race condition in TestFindProviders

the test was incorrectly using require.Eventually on already-read results. also, parallelRouter doesn't deduplicate, so peers[0] can appear twice (once with addresses from mr1, once without from mr2). fixed by searching for the record with addresses before verifying filtration.

fix: deadlock in manyIter.Close()

manyIter.Close() would deadlock when goroutines were blocked on unbuffered channel sends, which occured sometimes on CI:

fixed with synchronous drain that seems to be most compatible with http streaming semantics from routing v1 spec (clients ending as soon they got enough providers, drain channel to unblock senders)

fix: deadlock in cacheFallbackIter

dispatchFindPeer goroutines blocked forever trying to send to unbuffered channel when iterator closed early.

unsure if this is the best way of dealing with this, but fixed with buffered channel + non-blocking sends that drop results if nobody is reading (acceptable for background lookups). also added context cancellation in Close() to correctly stop goroutines.

If we dispatch >100 concurrent lookups and the consumer is slow, we lose results silently. However, this is actually acceptable because these are lookups for peers which had no addresses anyway, and if we hit 100, means CID is extremely popular and dropping some of providers does not matter at that point

fix: infinite recursion in cacheFallbackIter.Next()

replaced recursion with loop to avoid stack overflow.

maintains original logic: after dispatching lookup for peer without addresses, continues trying source iterator or waiting for lookup results.

also fixed deadlock when ongoingLookups reaches 0 between check and channel read by using periodic timeout.

the test was incorrectly using require.Eventually on already-read results.
also, parallelRouter doesn't deduplicate, so peers[0] can appear twice
(once with addresses from mr1, once without from mr2). fixed by searching
for the record with addresses before verifying filtration.
@lidel lidel changed the title fix: race condition in TestFindProviders fix(ci): race condition in tests Sep 15, 2025
@lidel lidel changed the title fix(ci): race condition in tests fix(ci): race conditions in tests Sep 15, 2025
manyIter.Close() would deadlock when goroutines were blocked
on unbuffered channel sends. fixed with synchronous drain that
matches http streaming semantics from routing v1 spec:
- idempotent close (safe to call multiple times)
- drain channel to unblock senders (expected for early termination)
- proper cleanup of child iterators

also added missing defer it.Close() in tests.
@lidel lidel marked this pull request as draft September 15, 2025 00:33
dispatchFindPeer goroutines blocked forever trying to send to
unbuffered channel when iterator closed early.

fixed with buffered channel + non-blocking sends that drop
results if nobody is reading (acceptable for background lookups).
also added context cancellation in Close() to stop goroutines.
@lidel lidel requested a review from gammazero September 15, 2025 00:58
added defer it.Close() to all test iterators to prevent
resource leaks that could cause intermittent hangs.
replaced recursion with loop to avoid stack overflow.
maintains original logic: after dispatching lookup for
peer without addresses, continues trying source iterator
or waiting for lookup results.

also fixed deadlock when ongoingLookups reaches 0 between
check and channel read by using periodic timeout.
- removed unnecessary handleRecord closure
- cache ongoingLookups value to avoid race in logging
- use defer timer.Stop() for proper cleanup
- fix iterator leak in dispatchFindPeer by adding defer Close()
- remove redundant goroutine in manyIter.Close() that could
  cause double-close panic. the channel is already closed by
  the goroutine in newManyIter.
@lidel lidel marked this pull request as ready for review September 15, 2025 02:00
@hsanjuan
Copy link
Contributor

Triage:

@guillaumemichel guillaumemichel marked this pull request as draft September 23, 2025 14:29
@gammazero gammazero marked this pull request as ready for review September 23, 2025 20:58
@gammazero gammazero merged commit d67d18e into main Sep 23, 2025
8 checks passed
@gammazero gammazero deleted the fix/race-condition-test branch October 31, 2025 05:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants