fix(tests): de-flake go test races on main#837
Merged
Conversation
Six recent push-to-main runs failed in `test_and_lint / Go test`. Each failure hit a different test, but the same tests recurred — these are races, not regressions. Fixes: - stores/blockchain/sql: TestGenesisHashWrongParams mutated store.chainParams while the New() background goroutine read it. insertGenesisTransaction now takes *chaincfg.Params so the test passes wrong params without mutating shared state. - services/blockassembly: TestBlockAssembly_GetMiningCandidate signalled completion before the assembler had wired the subtree into the mining candidate. Replace the WaitGroup with require.Eventually polling NumTxs == 3. - services/blockassembly: TestHandleReorgWithInvalidBlock_Integration observed bestBlock=B3 during the window between BlockAssembler.reset() storing bestBlock and subtreeProcessor.Reset writing processed_at. The Eventually now also requires processed_at to be flushed on every chain B block before downstream assertions run. - services/blockvalidation: Test_consumerMessageHandler/context_cancellation raced ctx.Done() vs a buffered channel send: when blockHandler returned nil quickly, the buffered send succeeded and select picked errCh=nil. Cancel before constructing the handler and use an unbuffered channel so the handler's select can only fire ctx.Done(). - util/kafka: TestPerf* spin up a Redpanda testcontainer and measure throughput. They are not unit tests, blow the per-job time budget under -race, and the wait condition is a log line that fires before the admin API is reliable. Gate behind //go:build perf so make test skips them; run them with go test -tags perf. - CI: disable testcontainers' Ryuk reaper sidecar in PR/main test workflows. Pulling testcontainers/ryuk:* from Docker Hub on shared runners is the dominant flake source for tests using the aerospike testcontainer (Test_processTransactionInternalAerospike, the validator TestExtendedTx*). Tests already call container.Terminate() explicitly, so the reaper provides no additional safety on ephemeral runners. Also set TESTCONTAINERS_RYUK_DISABLED in test/utils/aerospike/aerospike.go for local dev convenience. Out of scope: TestDaemon_Start_AllServices needs a real getFreePort TOCTOU fix — left for a follow-up.
Contributor
|
🤖 Claude Code Review Status: Complete All files reviewed. Test de-flaking changes look good - each fix directly addresses its documented race condition with appropriate synchronization patterns. Current Review: No issues found. The PR correctly:
All race fixes verified through code inspection against stated failure modes in PR description. |
|
Contributor
Benchmark Comparison ReportBaseline: Current: Summary
All benchmark results (sec/op)
Threshold: >10% with p < 0.05 | Generated: 2026-05-09 16:32 UTC |
sugh01
approved these changes
May 9, 2026
freemans13
approved these changes
May 11, 2026
7 tasks
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.



Summary
Six recent push-to-main runs failed in
test_and_lint / Go test(~18% fail rate over the last 50 main pushes). Same tests recurred across different SHAs — races, not regressions. This PR de-flakes them and reduces shared-runner contention.TestGenesisHashWrongParamsstores/blockchain/sqlstore.chainParamswhileNew()background goroutine read it.insertGenesisTransactionto take*chaincfg.Params; test passes wrong params without mutating shared state.TestBlockAssembly_GetMiningCandidateservices/blockassemblycompleteWg.Done()fired before assembler acked subtree viaErrChan, soGetMiningCandidatecould seeNumTxs<3.require.EventuallypollingNumTxs == 3.TestHandleReorgWithInvalidBlock_Integrationservices/blockassemblyBlockAssembler.reset()storesbestBlock=B3beforesubtreeProcessor.Resetwritesprocessed_atfor moveForward blocks; hash-onlyEventuallycould pass while writes were in flight.Eventuallynow also requiresprocessed_atflushed on every chain B block.Test_consumerMessageHandler/context_cancellationservices/blockvalidationselectracedctx.Done()vs a buffered channel send; on fast-path, send succeeded anderrCh=nilwon.blockFoundChso handler's select can only firectx.Done().TestPerfSyncProducerThroughput(and siblings)util/kafka-race, and wait condition is a log line that fires before Kafka admin API is reliable.//go:build perf;make testskips them. Run withgo test -tags perf -run TestPerf ./util/kafka/.Test_processTransactionInternalAerospike,TestExtendedTxa1f6a4ff…(validator)services/propagation,services/validatortestcontainers/ryuk:0.13.0(the testcontainers cleanup sidecar).TESTCONTAINERS_RYUK_DISABLED=truein PR/main test workflow envs; alsoinit()intest/utils/aerospike/aerospike.gofor local dev. Tests alreadyTerminate()containers explicitly.Each fix verified locally under
-race -count=N— see Test plan below.Out of scope
TestDaemon_Start_AllServices(180s timeout, hit twice in window) — root cause is a TOCTOU race indaemon/test_daemon.go:getFreePort(open-then-close, port grabbed by another test before service binds). Real fix needs port reservation that survives until service binds. Filed as follow-up; the kafka perf build-tag here removes a major source of port contention which may also reduce its frequency.Chain Integrity Testandrun-daemon-testsare persistently red (30/30 nights) for unrelated reasons — ECR auth onteranode-coinbaseremoved in #4042, and a daemon e2e suite blowing the default 10m Go test timeout. Not addressed here per request.Test plan
go test -race -count=20 -run TestGenesisHashWrongParams ./stores/blockchain/sql/— 100/100 passgo test -race -count=10 -run '^TestBlockAssembly_GetMiningCandidate$' ./services/blockassembly/— 20/20 passgo test -race -count=20 -run Test_consumerMessageHandler ./services/blockvalidation/— 80/80 passgo test -race -count=3 -run '^TestHandleReorgWithInvalidBlock_Integration$' ./services/blockassembly/— 3/3 passgo test -timeout 120s ./util/kafka/— 133 pass, perf tests correctly excludedgo test -race -run 'TestGenesisHash|TestInvalidate|TestRevalidate|TestOnMainChain' ./stores/blockchain/sql/— 38 pass (signature change of internalinsertGenesisTransactionconsumed by 3 callers)go build ./...— greengo vet ./...— only pre-existing warnings intest/utils/helper.go