Skip to content

fix(tests): de-flake go test races on main#837

Merged
oskarszoon merged 1 commit into
bsv-blockchain:mainfrom
oskarszoon:flaky-test
May 11, 2026
Merged

fix(tests): de-flake go test races on main#837
oskarszoon merged 1 commit into
bsv-blockchain:mainfrom
oskarszoon:flaky-test

Conversation

@oskarszoon

Copy link
Copy Markdown
Contributor

Summary

Six recent push-to-main runs failed in test_and_lint / Go test (~18% fail rate over the last 50 main pushes). Same tests recurred across different SHAs — races, not regressions. This PR de-flakes them and reduces shared-runner contention.

Test Pkg Root cause Fix
TestGenesisHashWrongParams stores/blockchain/sql Mutated store.chainParams while New() background goroutine read it. Refactor insertGenesisTransaction to take *chaincfg.Params; test passes wrong params without mutating shared state.
TestBlockAssembly_GetMiningCandidate services/blockassembly completeWg.Done() fired before assembler acked subtree via ErrChan, so GetMiningCandidate could see NumTxs<3. Replace WaitGroup with require.Eventually polling NumTxs == 3.
TestHandleReorgWithInvalidBlock_Integration services/blockassembly BlockAssembler.reset() stores bestBlock=B3 before subtreeProcessor.Reset writes processed_at for moveForward blocks; hash-only Eventually could pass while writes were in flight. Eventually now also requires processed_at flushed on every chain B block.
Test_consumerMessageHandler/context_cancellation services/blockvalidation select raced ctx.Done() vs a buffered channel send; on fast-path, send succeeded and errCh=nil won. Cancel before constructing handler + unbuffered blockFoundCh so handler's select can only fire ctx.Done().
TestPerfSyncProducerThroughput (and siblings) util/kafka Not a unit test — boots Redpanda testcontainer and measures throughput; blows the per-job time budget under -race, and wait condition is a log line that fires before Kafka admin API is reliable. Gate behind //go:build perf; make test skips them. Run with go test -tags perf -run TestPerf ./util/kafka/.
Test_processTransactionInternalAerospike, TestExtendedTxa1f6a4ff… (validator) services/propagation, services/validator Docker Hub timeouts pulling testcontainers/ryuk:0.13.0 (the testcontainers cleanup sidecar). Set TESTCONTAINERS_RYUK_DISABLED=true in PR/main test workflow envs; also init() in test/utils/aerospike/aerospike.go for local dev. Tests already Terminate() containers explicitly.

Each fix verified locally under -race -count=N — see Test plan below.

Out of scope

  • TestDaemon_Start_AllServices (180s timeout, hit twice in window) — root cause is a TOCTOU race in daemon/test_daemon.go:getFreePort (open-then-close, port grabbed by another test before service binds). Real fix needs port reservation that survives until service binds. Filed as follow-up; the kafka perf build-tag here removes a major source of port contention which may also reduce its frequency.
  • Nightly Chain Integrity Test and run-daemon-tests are persistently red (30/30 nights) for unrelated reasons — ECR auth on teranode-coinbase removed in #4042, and a daemon e2e suite blowing the default 10m Go test timeout. Not addressed here per request.

Test plan

  • go test -race -count=20 -run TestGenesisHashWrongParams ./stores/blockchain/sql/ — 100/100 pass
  • go test -race -count=10 -run '^TestBlockAssembly_GetMiningCandidate$' ./services/blockassembly/ — 20/20 pass
  • go test -race -count=20 -run Test_consumerMessageHandler ./services/blockvalidation/ — 80/80 pass
  • go test -race -count=3 -run '^TestHandleReorgWithInvalidBlock_Integration$' ./services/blockassembly/ — 3/3 pass
  • go test -timeout 120s ./util/kafka/ — 133 pass, perf tests correctly excluded
  • go test -race -run 'TestGenesisHash|TestInvalidate|TestRevalidate|TestOnMainChain' ./stores/blockchain/sql/ — 38 pass (signature change of internal insertGenesisTransaction consumed by 3 callers)
  • go build ./... — green
  • go vet ./... — only pre-existing warnings in test/utils/helper.go

Six recent push-to-main runs failed in `test_and_lint / Go test`. Each
failure hit a different test, but the same tests recurred — these are
races, not regressions. Fixes:

- stores/blockchain/sql: TestGenesisHashWrongParams mutated
  store.chainParams while the New() background goroutine read it.
  insertGenesisTransaction now takes *chaincfg.Params so the test passes
  wrong params without mutating shared state.

- services/blockassembly: TestBlockAssembly_GetMiningCandidate signalled
  completion before the assembler had wired the subtree into the mining
  candidate. Replace the WaitGroup with require.Eventually polling
  NumTxs == 3.

- services/blockassembly: TestHandleReorgWithInvalidBlock_Integration
  observed bestBlock=B3 during the window between BlockAssembler.reset()
  storing bestBlock and subtreeProcessor.Reset writing processed_at. The
  Eventually now also requires processed_at to be flushed on every chain
  B block before downstream assertions run.

- services/blockvalidation: Test_consumerMessageHandler/context_cancellation
  raced ctx.Done() vs a buffered channel send: when blockHandler returned
  nil quickly, the buffered send succeeded and select picked errCh=nil.
  Cancel before constructing the handler and use an unbuffered channel so
  the handler's select can only fire ctx.Done().

- util/kafka: TestPerf* spin up a Redpanda testcontainer and measure
  throughput. They are not unit tests, blow the per-job time budget under
  -race, and the wait condition is a log line that fires before the admin
  API is reliable. Gate behind //go:build perf so make test skips them;
  run them with go test -tags perf.

- CI: disable testcontainers' Ryuk reaper sidecar in PR/main test
  workflows. Pulling testcontainers/ryuk:* from Docker Hub on shared
  runners is the dominant flake source for tests using the aerospike
  testcontainer (Test_processTransactionInternalAerospike, the validator
  TestExtendedTx*). Tests already call container.Terminate() explicitly,
  so the reaper provides no additional safety on ephemeral runners. Also
  set TESTCONTAINERS_RYUK_DISABLED in test/utils/aerospike/aerospike.go
  for local dev convenience.

Out of scope: TestDaemon_Start_AllServices needs a real getFreePort
TOCTOU fix — left for a follow-up.
@github-actions

github-actions Bot commented May 9, 2026

Copy link
Copy Markdown
Contributor

🤖 Claude Code Review

Status: Complete

All files reviewed. Test de-flaking changes look good - each fix directly addresses its documented race condition with appropriate synchronization patterns.

Current Review:

No issues found. The PR correctly:

  • Eliminates shared state mutation in genesis hash test (stores/blockchain/sql)
  • Replaces WaitGroup with polling for mining candidate assembly
  • Strengthens reorg test to wait for DB flush completion
  • Fixes select-race by canceling context before handler construction
  • Gates performance tests behind build tag to reduce CI contention
  • Disables Ryuk sidecar to avoid Docker Hub timeouts

All race fixes verified through code inspection against stated failure modes in PR description.

@sonarqubecloud

sonarqubecloud Bot commented May 9, 2026

Copy link
Copy Markdown

@github-actions

github-actions Bot commented May 9, 2026

Copy link
Copy Markdown
Contributor

Benchmark Comparison Report

Baseline: main (unknown)

Current: PR-837 (b7d07e6)

Summary

  • Regressions: 0
  • Improvements: 0
  • Unchanged: 142
  • Significance level: p < 0.05
All benchmark results (sec/op)
Benchmark Baseline Current Change p-value
_NewBlockFromBytes-4 1.679µ 1.707µ ~ 1.000
SplitSyncedParentMap_SetIfNotExists/256_buckets-4 61.81n 61.66n ~ 0.700
SplitSyncedParentMap_SetIfNotExists/16_buckets-4 61.64n 61.59n ~ 0.400
SplitSyncedParentMap_SetIfNotExists/1_bucket-4 61.77n 61.72n ~ 1.000
SplitSyncedParentMap_ConcurrentSetIfNotExists/256_buckets... 31.58n 30.49n ~ 0.200
SplitSyncedParentMap_ConcurrentSetIfNotExists/16_buckets_... 53.28n 52.43n ~ 0.200
SplitSyncedParentMap_ConcurrentSetIfNotExists/1_bucket_pa... 107.9n 105.6n ~ 0.300
MiningCandidate_Stringify_Short-4 262.8n 263.3n ~ 0.700
MiningCandidate_Stringify_Long-4 1.901µ 1.909µ ~ 0.100
MiningSolution_Stringify-4 979.1n 981.1n ~ 0.400
BlockInfo_MarshalJSON-4 1.741µ 1.762µ ~ 0.100
NewFromBytes-4 127.3n 127.7n ~ 0.700
Mine_EasyDifficulty-4 60.19µ 60.84µ ~ 0.400
Mine_WithAddress-4 7.450µ 6.756µ ~ 0.700
BlockAssembler_AddTx-4 0.02626n 0.03101n ~ 0.400
AddNode-4 10.92 10.79 ~ 0.700
AddNodeWithMap-4 10.98 10.49 ~ 1.000
DirectSubtreeAdd/4_per_subtree-4 63.28n 62.10n ~ 0.100
DirectSubtreeAdd/64_per_subtree-4 31.44n 31.72n ~ 1.000
DirectSubtreeAdd/256_per_subtree-4 30.30n 30.03n ~ 0.600
DirectSubtreeAdd/1024_per_subtree-4 29.11n 28.87n ~ 0.100
DirectSubtreeAdd/2048_per_subtree-4 28.67n 28.41n ~ 0.100
SubtreeProcessorAdd/4_per_subtree-4 280.6n 280.6n ~ 0.800
SubtreeProcessorAdd/64_per_subtree-4 277.1n 273.6n ~ 1.000
SubtreeProcessorAdd/256_per_subtree-4 278.5n 278.4n ~ 0.700
SubtreeProcessorAdd/1024_per_subtree-4 272.4n 268.6n ~ 0.400
SubtreeProcessorAdd/2048_per_subtree-4 268.1n 269.9n ~ 0.700
SubtreeProcessorRotate/4_per_subtree-4 272.6n 275.7n ~ 0.400
SubtreeProcessorRotate/64_per_subtree-4 272.9n 274.3n ~ 0.700
SubtreeProcessorRotate/256_per_subtree-4 274.0n 273.4n ~ 0.700
SubtreeProcessorRotate/1024_per_subtree-4 272.7n 270.6n ~ 0.400
SubtreeNodeAddOnly/4_per_subtree-4 54.66n 54.39n ~ 0.400
SubtreeNodeAddOnly/64_per_subtree-4 34.56n 34.51n ~ 0.700
SubtreeNodeAddOnly/256_per_subtree-4 33.41n 33.41n ~ 0.800
SubtreeNodeAddOnly/1024_per_subtree-4 32.82n 32.71n ~ 1.000
SubtreeCreationOnly/4_per_subtree-4 113.7n 112.4n ~ 1.000
SubtreeCreationOnly/64_per_subtree-4 401.9n 399.4n ~ 0.400
SubtreeCreationOnly/256_per_subtree-4 1.347µ 1.433µ ~ 0.100
SubtreeCreationOnly/1024_per_subtree-4 4.400µ 4.499µ ~ 0.100
SubtreeCreationOnly/2048_per_subtree-4 8.128µ 8.336µ ~ 0.100
SubtreeProcessorOverheadBreakdown/64_per_subtree-4 269.7n 272.6n ~ 0.700
SubtreeProcessorOverheadBreakdown/1024_per_subtree-4 270.3n 272.3n ~ 0.700
ParallelGetAndSetIfNotExists/1k_nodes-4 581.7µ 592.8µ ~ 0.200
ParallelGetAndSetIfNotExists/10k_nodes-4 1.314m 1.362m ~ 0.100
ParallelGetAndSetIfNotExists/50k_nodes-4 6.700m 6.728m ~ 0.700
ParallelGetAndSetIfNotExists/100k_nodes-4 13.47m 13.47m ~ 1.000
SequentialGetAndSetIfNotExists/1k_nodes-4 657.2µ 661.3µ ~ 0.200
SequentialGetAndSetIfNotExists/10k_nodes-4 2.757m 2.756m ~ 0.700
SequentialGetAndSetIfNotExists/50k_nodes-4 10.49m 10.58m ~ 0.400
SequentialGetAndSetIfNotExists/100k_nodes-4 20.16m 20.03m ~ 0.700
ProcessOwnBlockSubtreeNodesParallel/1k_nodes-4 635.8µ 635.3µ ~ 1.000
ProcessOwnBlockSubtreeNodesParallel/10k_nodes-4 4.172m 4.208m ~ 0.400
ProcessOwnBlockSubtreeNodesParallel/100k_nodes-4 16.62m 16.84m ~ 0.100
ProcessOwnBlockSubtreeNodesSequential/1k_nodes-4 703.2µ 696.5µ ~ 0.400
ProcessOwnBlockSubtreeNodesSequential/10k_nodes-4 5.817m 5.744m ~ 0.100
ProcessOwnBlockSubtreeNodesSequential/100k_nodes-4 38.09m 37.43m ~ 0.100
DiskTxMap_SetIfNotExists-4 3.895µ 3.614µ ~ 0.100
DiskTxMap_SetIfNotExists_Parallel-4 3.464µ 3.411µ ~ 0.100
DiskTxMap_ExistenceOnly-4 314.9n 323.4n ~ 0.100
Queue-4 193.3n 193.8n ~ 0.700
AtomicPointer-4 4.890n 4.803n ~ 1.000
ReorgOptimizations/DedupFilterPipeline/Old/10K-4 872.8µ 872.9µ ~ 1.000
ReorgOptimizations/DedupFilterPipeline/New/10K-4 861.8µ 842.4µ ~ 0.100
ReorgOptimizations/AllMarkFalse/Old/10K-4 119.3µ 111.7µ ~ 0.700
ReorgOptimizations/AllMarkFalse/New/10K-4 62.52µ 62.36µ ~ 1.000
ReorgOptimizations/HashSlicePool/Old/10K-4 64.99µ 71.35µ ~ 0.400
ReorgOptimizations/HashSlicePool/New/10K-4 12.46µ 11.78µ ~ 0.100
ReorgOptimizations/NodeFlags/Old/10K-4 5.342µ 6.076µ ~ 0.100
ReorgOptimizations/NodeFlags/New/10K-4 1.877µ 2.396µ ~ 0.100
ReorgOptimizations/DedupFilterPipeline/Old/100K-4 10.38m 10.86m ~ 0.400
ReorgOptimizations/DedupFilterPipeline/New/100K-4 10.21m 10.52m ~ 0.700
ReorgOptimizations/AllMarkFalse/Old/100K-4 1.176m 1.190m ~ 0.200
ReorgOptimizations/AllMarkFalse/New/100K-4 687.6µ 690.0µ ~ 0.700
ReorgOptimizations/HashSlicePool/Old/100K-4 688.1µ 706.5µ ~ 0.700
ReorgOptimizations/HashSlicePool/New/100K-4 340.8µ 308.2µ ~ 0.700
ReorgOptimizations/NodeFlags/Old/100K-4 54.06µ 52.73µ ~ 0.400
ReorgOptimizations/NodeFlags/New/100K-4 18.79µ 17.84µ ~ 0.100
TxMapSetIfNotExists-4 51.98n 52.04n ~ 1.000
TxMapSetIfNotExistsDuplicate-4 38.43n 38.16n ~ 0.700
ChannelSendReceive-4 607.6n 589.0n ~ 0.100
CalcBlockWork-4 506.5n 499.6n ~ 0.400
CalculateWork-4 683.0n 679.9n ~ 0.200
BuildBlockLocatorString_Helpers/Size_10-4 1.330µ 1.312µ ~ 0.200
BuildBlockLocatorString_Helpers/Size_100-4 15.02µ 12.54µ ~ 0.400
BuildBlockLocatorString_Helpers/Size_1000-4 125.5µ 124.9µ ~ 1.000
CatchupWithHeaderCache-4 104.5m 104.4m ~ 0.700
_prepareTxsPerLevel-4 419.0m 412.2m ~ 0.200
_prepareTxsPerLevelOrdered-4 3.594m 3.902m ~ 0.700
_prepareTxsPerLevel_Comparison/Original-4 423.1m 422.8m ~ 1.000
_prepareTxsPerLevel_Comparison/Optimized-4 3.793m 3.642m ~ 0.100
SubtreeSizes/10k_tx_4_per_subtree-4 1.256m 1.263m ~ 1.000
SubtreeSizes/10k_tx_16_per_subtree-4 302.0µ 294.4µ ~ 0.400
SubtreeSizes/10k_tx_64_per_subtree-4 71.76µ 70.42µ ~ 0.200
SubtreeSizes/10k_tx_256_per_subtree-4 17.85µ 17.53µ ~ 0.200
SubtreeSizes/10k_tx_512_per_subtree-4 8.725µ 8.754µ ~ 0.700
SubtreeSizes/10k_tx_1024_per_subtree-4 4.354µ 4.388µ ~ 0.400
SubtreeSizes/10k_tx_2k_per_subtree-4 2.156µ 2.194µ ~ 0.100
BlockSizeScaling/10k_tx_64_per_subtree-4 68.84µ 69.35µ ~ 0.400
BlockSizeScaling/10k_tx_256_per_subtree-4 17.49µ 17.34µ ~ 0.100
BlockSizeScaling/10k_tx_1024_per_subtree-4 4.328µ 4.310µ ~ 0.400
BlockSizeScaling/50k_tx_64_per_subtree-4 371.4µ 367.2µ ~ 0.700
BlockSizeScaling/50k_tx_256_per_subtree-4 86.75µ 86.56µ ~ 0.700
BlockSizeScaling/50k_tx_1024_per_subtree-4 21.31µ 21.09µ ~ 0.400
SubtreeAllocations/small_subtrees_exists_check-4 148.7µ 148.2µ ~ 0.400
SubtreeAllocations/small_subtrees_data_fetch-4 158.4µ 156.2µ ~ 0.700
SubtreeAllocations/small_subtrees_full_validation-4 306.9µ 305.2µ ~ 0.400
SubtreeAllocations/medium_subtrees_exists_check-4 8.762µ 8.734µ ~ 1.000
SubtreeAllocations/medium_subtrees_data_fetch-4 9.273µ 9.133µ ~ 0.400
SubtreeAllocations/medium_subtrees_full_validation-4 17.43µ 17.32µ ~ 1.000
SubtreeAllocations/large_subtrees_exists_check-4 2.078µ 2.072µ ~ 0.400
SubtreeAllocations/large_subtrees_data_fetch-4 2.196µ 2.226µ ~ 0.200
SubtreeAllocations/large_subtrees_full_validation-4 4.309µ 4.314µ ~ 1.000
_BufferPoolAllocation/16KB-4 5.359µ 3.385µ ~ 0.200
_BufferPoolAllocation/32KB-4 8.294µ 9.107µ ~ 0.100
_BufferPoolAllocation/64KB-4 16.57µ 15.43µ ~ 0.700
_BufferPoolAllocation/128KB-4 30.39µ 29.64µ ~ 0.700
_BufferPoolAllocation/512KB-4 132.0µ 130.0µ ~ 0.400
_BufferPoolConcurrent/32KB-4 19.13µ 18.65µ ~ 0.700
_BufferPoolConcurrent/64KB-4 29.08µ 28.37µ ~ 0.100
_BufferPoolConcurrent/512KB-4 177.6µ 179.6µ ~ 1.000
_SubtreeDeserializationWithBufferSizes/16KB-4 739.9µ 702.8µ ~ 0.100
_SubtreeDeserializationWithBufferSizes/32KB-4 728.6µ 698.5µ ~ 0.100
_SubtreeDeserializationWithBufferSizes/64KB-4 727.0µ 694.1µ ~ 0.100
_SubtreeDeserializationWithBufferSizes/128KB-4 722.1µ 698.0µ ~ 0.100
_SubtreeDeserializationWithBufferSizes/512KB-4 735.0µ 709.4µ ~ 0.100
_SubtreeDataDeserializationWithBufferSizes/16KB-4 38.40m 38.01m ~ 0.700
_SubtreeDataDeserializationWithBufferSizes/32KB-4 38.17m 38.13m ~ 1.000
_SubtreeDataDeserializationWithBufferSizes/64KB-4 38.09m 38.13m ~ 0.400
_SubtreeDataDeserializationWithBufferSizes/128KB-4 38.21m 37.93m ~ 0.100
_SubtreeDataDeserializationWithBufferSizes/512KB-4 38.74m 38.12m ~ 0.200
_PooledVsNonPooled/Pooled-4 824.6n 819.2n ~ 0.100
_PooledVsNonPooled/NonPooled-4 7.363µ 6.902µ ~ 0.100
_MemoryFootprint/Current_512KB_32concurrent-4 8.665µ 9.483µ ~ 0.100
_MemoryFootprint/Proposed_32KB_32concurrent-4 12.34µ 11.85µ ~ 0.100
_MemoryFootprint/Alternative_64KB_32concurrent-4 12.68µ 10.98µ ~ 0.100
StoreBlock_Sequential/BelowCSVHeight-4 328.2µ 324.1µ ~ 0.100
StoreBlock_Sequential/AboveCSVHeight-4 328.1µ 324.2µ ~ 0.400
GetUtxoHashes-4 268.5n 273.3n ~ 0.200
GetUtxoHashes_ManyOutputs-4 45.69µ 50.05µ ~ 0.100
_NewMetaDataFromBytes-4 232.7n 231.0n ~ 0.400
_Bytes-4 606.7n 621.3n ~ 0.200
_MetaBytes-4 566.9n 561.4n ~ 0.700

Threshold: >10% with p < 0.05 | Generated: 2026-05-09 16:32 UTC

@oskarszoon oskarszoon enabled auto-merge (squash) May 9, 2026 16:49
@oskarszoon oskarszoon merged commit ae75cbc into bsv-blockchain:main May 11, 2026
25 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants