Skip to content

fix(test): de-flake blockassembly system tests (FSM/gRPC startup races)#1000

Merged
oskarszoon merged 2 commits into
bsv-blockchain:mainfrom
oskarszoon:fix/teranode-test-flaxy
Jun 1, 2026
Merged

fix(test): de-flake blockassembly system tests (FSM/gRPC startup races)#1000
oskarszoon merged 2 commits into
bsv-blockchain:mainfrom
oskarszoon:fix/teranode-test-flaxy

Conversation

@oskarszoon

Copy link
Copy Markdown
Contributor

Summary

Fixes the six services/blockassembly system tests that fail intermittently under heavy CI load (#997). Both failure signatures originate in one test helper, BlockchainDaemon.StartBlockchainService() (test/nodeHelpers/blockchainDaemon.go), which had two fixed-timeout races that lose when the runner is contended. Test-helper only — no production code, CI, or Makefile changes.

Race 1 — connection refused, retried 3 times. The blockchain gRPC client was created before WaitForServiceToBeReady(), so it dialled before the listener was guaranteed bound and relied on its retry budget to bridge the gap; under load that budget was exhausted. StartGRPCServer binds the socket before readiness is signalled (GetListener → close readyChServe), so moving client creation to after WaitForServiceToBeReady() removes the race structurally.

Race 2 — timeout waiting for FSM to transition to RUNNING state. The helper polled GetFSMCurrentState on a 100ms ticker with a hard 10s deadline. GetFSMCurrentState deliberately reports IDLE until the subscription manager is ready, so under load the poll kept seeing IDLE and timed out. Replaced with WaitUntilFSMTransitionFromIdleState — a server-blocking RPC that returns the instant the FSM leaves IDLE and the subscription manager is ready — bounded by a generous 60s context backstop instead of a fixed budget, then assert the state is RUNNING.

Note: the obvious-looking WaitForFSMtoTransitionToGivenState is not usable here — its proto handler is unregistered and returns codes.Unimplemented over a real gRPC client. WaitUntilFSMTransitionFromIdleState is the implemented, event-driven path (and Run()'s only destination is RUNNING, so the post-wait assertion is exact).

Test Plan

  • go build ./test/nodeHelpers/... ./services/blockassembly/... + go vet ./test/nodeHelpers/... — clean
  • go test -race -tags testtxmetacache -count=10 ./test/nodeHelpers/... — PASS
  • go test -race -tags testtxmetacache -count=10 -run '<the six tests>' ./services/blockassembly/... — PASS, no connection refused / FSM timeout / Unimplemented
  • go test -race -tags testtxmetacache -count=1 ./services/blockassembly/... (whole package) — PASS

Closes #997

The client was created before WaitForServiceToBeReady(), so it dialed the
blockchain gRPC service before the listener was guaranteed bound and relied on
its retry budget to bridge the gap. Under heavy CI load that budget was
exhausted, producing 'connection refused, retried 3 times'. The listener is
bound before readiness is signalled (GetListener -> close readyCh -> Serve), so
moving client creation after WaitForServiceToBeReady() removes the race.

Refs bsv-blockchain#997
…0s poll

StartBlockchainService polled GetFSMCurrentState on a 100ms ticker with a hard
10s deadline; under heavy CI load that poll kept seeing IDLE (GetFSMCurrentState
reports IDLE until the subscription manager is ready) and timed out with
'timeout waiting for FSM to transition to RUNNING state'. Replace the fixed
budget with WaitUntilFSMTransitionFromIdleState, a server-blocking RPC that
returns the instant the FSM leaves IDLE and the subscription manager is ready,
guarded by a generous 60s context backstop, then assert the state is RUNNING.

Refs bsv-blockchain#997
@github-actions

github-actions Bot commented Jun 1, 2026

Copy link
Copy Markdown
Contributor

🤖 Claude Code Review

Status: Complete


Review Summary:

This PR correctly fixes two race conditions in the test helper by:

  1. Race 1 Fix (gRPC connection): Moving client creation to after WaitForServiceToBeReady() eliminates the connection-refused race. The change is structurally sound — StartGRPCServer binds the socket before signaling readiness, so this ordering is correct.

  2. Race 2 Fix (FSM polling): Replacing the fixed 10s poll with WaitUntilFSMTransitionFromIdleState is the right approach. The RPC is event-driven, blocks server-side until the FSM leaves IDLE and the subscription manager is ready, and returns immediately when the condition is met.

The implementation is correct and well-documented. The 60s timeout is appropriately generous for a backstop, and the explicit state assertion after the wait provides defense-in-depth against future FSM changes.

Findings: No issues found. The changes are minimal, targeted, and follow best practices for eliminating timing dependencies in tests.

@sonarqubecloud

sonarqubecloud Bot commented Jun 1, 2026

Copy link
Copy Markdown

@github-actions

github-actions Bot commented Jun 1, 2026

Copy link
Copy Markdown
Contributor

Benchmark Comparison Report

Baseline: main (unknown)

Current: PR-1000 (d4a3bc8)

Summary

  • Regressions: 0
  • Improvements: 0
  • Unchanged: 144
  • Significance level: p < 0.05
All benchmark results (sec/op)
Benchmark Baseline Current Change p-value
_NewBlockFromBytes-4 1.777µ 1.828µ ~ 0.700
SplitSyncedParentMap_SetIfNotExists/256_buckets-4 61.68n 61.52n ~ 0.600
SplitSyncedParentMap_SetIfNotExists/16_buckets-4 61.46n 61.54n ~ 0.700
SplitSyncedParentMap_SetIfNotExists/1_bucket-4 61.49n 61.44n ~ 1.000
SplitSyncedParentMap_ConcurrentSetIfNotExists/256_buckets... 30.94n 29.90n ~ 0.100
SplitSyncedParentMap_ConcurrentSetIfNotExists/16_buckets_... 52.56n 51.69n ~ 0.100
SplitSyncedParentMap_ConcurrentSetIfNotExists/1_bucket_pa... 111.2n 109.2n ~ 0.700
MiningCandidate_Stringify_Short-4 264.5n 260.9n ~ 0.100
MiningCandidate_Stringify_Long-4 1.919µ 1.919µ ~ 1.000
MiningSolution_Stringify-4 977.8n 975.6n ~ 0.700
BlockInfo_MarshalJSON-4 1.799µ 1.790µ ~ 0.100
NewFromBytes-4 130.5n 129.4n ~ 0.200
AddTxBatchColumnar_Validation-4 2.465µ 2.447µ ~ 0.700
OffsetValidationLoop-4 635.9n 640.1n ~ 1.000
Mine_EasyDifficulty-4 52.46µ 52.23µ ~ 1.000
Mine_WithAddress-4 5.539µ 5.515µ ~ 0.400
BlockAssembler_AddTx-4 0.02814n 0.02860n ~ 1.000
AddNode-4 12.02 11.20 ~ 0.100
AddNodeWithMap-4 13.13 11.67 ~ 0.100
DirectSubtreeAdd/4_per_subtree-4 58.06n 56.19n ~ 0.100
DirectSubtreeAdd/64_per_subtree-4 28.79n 28.76n ~ 1.000
DirectSubtreeAdd/256_per_subtree-4 27.68n 28.01n ~ 0.200
DirectSubtreeAdd/1024_per_subtree-4 26.50n 26.41n ~ 0.100
DirectSubtreeAdd/2048_per_subtree-4 26.03n 26.06n ~ 0.800
SubtreeProcessorAdd/4_per_subtree-4 292.8n 294.0n ~ 1.000
SubtreeProcessorAdd/64_per_subtree-4 287.3n 285.7n ~ 0.700
SubtreeProcessorAdd/256_per_subtree-4 289.3n 287.0n ~ 1.000
SubtreeProcessorAdd/1024_per_subtree-4 280.1n 281.4n ~ 0.700
SubtreeProcessorAdd/2048_per_subtree-4 279.9n 281.1n ~ 0.700
SubtreeProcessorRotate/4_per_subtree-4 285.9n 280.2n ~ 0.100
SubtreeProcessorRotate/64_per_subtree-4 282.9n 280.3n ~ 0.100
SubtreeProcessorRotate/256_per_subtree-4 284.8n 279.2n ~ 0.100
SubtreeProcessorRotate/1024_per_subtree-4 285.9n 278.9n ~ 0.100
SubtreeNodeAddOnly/4_per_subtree-4 55.02n 55.10n ~ 1.000
SubtreeNodeAddOnly/64_per_subtree-4 36.03n 36.08n ~ 0.100
SubtreeNodeAddOnly/256_per_subtree-4 35.05n 35.06n ~ 0.700
SubtreeNodeAddOnly/1024_per_subtree-4 34.46n 34.43n ~ 0.200
SubtreeCreationOnly/4_per_subtree-4 110.0n 109.8n ~ 1.000
SubtreeCreationOnly/64_per_subtree-4 348.2n 347.5n ~ 1.000
SubtreeCreationOnly/256_per_subtree-4 1.222µ 1.221µ ~ 0.600
SubtreeCreationOnly/1024_per_subtree-4 3.797µ 3.796µ ~ 0.700
SubtreeCreationOnly/2048_per_subtree-4 6.749µ 6.784µ ~ 0.400
SubtreeProcessorOverheadBreakdown/64_per_subtree-4 279.1n 279.6n ~ 0.700
SubtreeProcessorOverheadBreakdown/1024_per_subtree-4 276.2n 283.2n ~ 0.100
ParallelGetAndSetIfNotExists/1k_nodes-4 1.985m 2.022m ~ 0.100
ParallelGetAndSetIfNotExists/10k_nodes-4 5.184m 5.304m ~ 0.100
ParallelGetAndSetIfNotExists/50k_nodes-4 7.052m 7.182m ~ 0.100
ParallelGetAndSetIfNotExists/100k_nodes-4 9.641m 9.626m ~ 0.400
SequentialGetAndSetIfNotExists/1k_nodes-4 1.783m 1.810m ~ 0.400
SequentialGetAndSetIfNotExists/10k_nodes-4 4.391m 4.516m ~ 0.100
SequentialGetAndSetIfNotExists/50k_nodes-4 13.31m 13.76m ~ 0.100
SequentialGetAndSetIfNotExists/100k_nodes-4 24.70m 25.51m ~ 0.100
ProcessOwnBlockSubtreeNodesParallel/1k_nodes-4 2.032m 2.044m ~ 0.400
ProcessOwnBlockSubtreeNodesParallel/10k_nodes-4 8.414m 8.396m ~ 1.000
ProcessOwnBlockSubtreeNodesParallel/100k_nodes-4 13.13m 13.32m ~ 0.400
ProcessOwnBlockSubtreeNodesSequential/1k_nodes-4 1.818m 1.869m ~ 0.200
ProcessOwnBlockSubtreeNodesSequential/10k_nodes-4 7.916m 8.279m ~ 0.700
ProcessOwnBlockSubtreeNodesSequential/100k_nodes-4 42.73m 43.60m ~ 0.100
DiskTxMap_SetIfNotExists-4 4.093µ 4.270µ ~ 0.100
DiskTxMap_SetIfNotExists_Parallel-4 3.953µ 3.888µ ~ 0.700
DiskTxMap_ExistenceOnly-4 412.1n 398.6n ~ 0.700
Queue-4 208.3n 208.2n ~ 1.000
AtomicPointer-4 8.143n 8.139n ~ 0.400
ReorgOptimizations/DedupFilterPipeline/Old/10K-4 900.7µ 858.1µ ~ 0.100
ReorgOptimizations/DedupFilterPipeline/New/10K-4 833.4µ 775.4µ ~ 0.100
ReorgOptimizations/AllMarkFalse/Old/10K-4 129.6µ 120.1µ ~ 0.100
ReorgOptimizations/AllMarkFalse/New/10K-4 58.85µ 58.36µ ~ 0.700
ReorgOptimizations/HashSlicePool/Old/10K-4 60.70µ 68.80µ ~ 0.400
ReorgOptimizations/HashSlicePool/New/10K-4 11.83µ 11.81µ ~ 0.200
ReorgOptimizations/NodeFlags/Old/10K-4 5.795µ 5.376µ ~ 0.100
ReorgOptimizations/NodeFlags/New/10K-4 1.936µ 1.749µ ~ 0.100
ReorgOptimizations/DedupFilterPipeline/Old/100K-4 13.18m 12.15m ~ 0.200
ReorgOptimizations/DedupFilterPipeline/New/100K-4 13.70m 12.68m ~ 0.100
ReorgOptimizations/AllMarkFalse/Old/100K-4 1.309m 1.280m ~ 0.200
ReorgOptimizations/AllMarkFalse/New/100K-4 742.1µ 739.3µ ~ 0.700
ReorgOptimizations/HashSlicePool/Old/100K-4 683.6µ 623.2µ ~ 0.100
ReorgOptimizations/HashSlicePool/New/100K-4 332.2µ 355.6µ ~ 0.100
ReorgOptimizations/NodeFlags/Old/100K-4 52.32µ 50.97µ ~ 0.700
ReorgOptimizations/NodeFlags/New/100K-4 18.43µ 17.59µ ~ 0.400
TxMapSetIfNotExists-4 52.62n 52.39n ~ 0.400
TxMapSetIfNotExistsDuplicate-4 48.27n 48.08n ~ 0.700
ChannelSendReceive-4 672.3n 662.5n ~ 0.100
CalcBlockWork-4 510.6n 507.8n ~ 0.700
CalculateWork-4 711.3n 694.1n ~ 0.100
BuildBlockLocatorString_Helpers/Size_10-4 1.338µ 1.336µ ~ 0.500
BuildBlockLocatorString_Helpers/Size_100-4 15.15µ 12.87µ ~ 0.200
BuildBlockLocatorString_Helpers/Size_1000-4 127.4µ 160.0µ ~ 0.700
CatchupWithHeaderCache-4 104.6m 104.1m ~ 0.100
_prepareTxsPerLevel-4 442.5m 438.4m ~ 1.000
_prepareTxsPerLevelOrdered-4 4.409m 5.544m ~ 0.700
_prepareTxsPerLevel_Comparison/Original-4 438.5m 433.3m ~ 0.400
_prepareTxsPerLevel_Comparison/Optimized-4 4.298m 5.447m ~ 0.100
_BufferPoolAllocation/16KB-4 3.903µ 3.895µ ~ 1.000
_BufferPoolAllocation/32KB-4 8.922µ 9.562µ ~ 0.700
_BufferPoolAllocation/64KB-4 20.61µ 16.07µ ~ 0.100
_BufferPoolAllocation/128KB-4 37.98µ 29.91µ ~ 0.100
_BufferPoolAllocation/512KB-4 122.5µ 113.2µ ~ 0.100
_BufferPoolConcurrent/32KB-4 18.35µ 21.19µ ~ 0.400
_BufferPoolConcurrent/64KB-4 31.03µ 29.64µ ~ 0.100
_BufferPoolConcurrent/512KB-4 150.5µ 143.8µ ~ 0.100
_SubtreeDeserializationWithBufferSizes/16KB-4 616.1µ 644.8µ ~ 0.100
_SubtreeDeserializationWithBufferSizes/32KB-4 620.6µ 643.9µ ~ 0.100
_SubtreeDeserializationWithBufferSizes/64KB-4 640.6µ 645.7µ ~ 0.400
_SubtreeDeserializationWithBufferSizes/128KB-4 622.7µ 636.2µ ~ 0.200
_SubtreeDeserializationWithBufferSizes/512KB-4 604.9µ 610.3µ ~ 0.700
_SubtreeDataDeserializationWithBufferSizes/16KB-4 36.89m 36.63m ~ 1.000
_SubtreeDataDeserializationWithBufferSizes/32KB-4 36.37m 36.58m ~ 0.400
_SubtreeDataDeserializationWithBufferSizes/64KB-4 36.37m 36.48m ~ 1.000
_SubtreeDataDeserializationWithBufferSizes/128KB-4 36.32m 36.32m ~ 1.000
_SubtreeDataDeserializationWithBufferSizes/512KB-4 35.95m 36.05m ~ 1.000
_PooledVsNonPooled/Pooled-4 830.0n 669.7n ~ 0.100
_PooledVsNonPooled/NonPooled-4 7.807µ 8.133µ ~ 0.100
_MemoryFootprint/Current_512KB_32concurrent-4 6.789µ 6.795µ ~ 0.700
_MemoryFootprint/Proposed_32KB_32concurrent-4 9.535µ 9.506µ ~ 1.000
_MemoryFootprint/Alternative_64KB_32concurrent-4 9.117µ 9.228µ ~ 1.000
SubtreeSizes/10k_tx_4_per_subtree-4 1.271m 1.344m ~ 0.100
SubtreeSizes/10k_tx_16_per_subtree-4 299.9µ 302.4µ ~ 0.700
SubtreeSizes/10k_tx_64_per_subtree-4 70.71µ 70.67µ ~ 1.000
SubtreeSizes/10k_tx_256_per_subtree-4 17.51µ 17.65µ ~ 0.700
SubtreeSizes/10k_tx_512_per_subtree-4 8.812µ 8.934µ ~ 0.700
SubtreeSizes/10k_tx_1024_per_subtree-4 4.381µ 4.400µ ~ 1.000
SubtreeSizes/10k_tx_2k_per_subtree-4 2.172µ 2.173µ ~ 1.000
BlockSizeScaling/10k_tx_64_per_subtree-4 69.41µ 69.85µ ~ 0.100
BlockSizeScaling/10k_tx_256_per_subtree-4 17.50µ 17.31µ ~ 0.100
BlockSizeScaling/10k_tx_1024_per_subtree-4 4.335µ 4.319µ ~ 1.000
BlockSizeScaling/50k_tx_64_per_subtree-4 368.0µ 362.1µ ~ 0.700
BlockSizeScaling/50k_tx_256_per_subtree-4 87.29µ 87.72µ ~ 0.700
BlockSizeScaling/50k_tx_1024_per_subtree-4 21.82µ 21.45µ ~ 0.400
SubtreeAllocations/small_subtrees_exists_check-4 151.1µ 148.0µ ~ 0.700
SubtreeAllocations/small_subtrees_data_fetch-4 159.8µ 159.1µ ~ 0.700
SubtreeAllocations/small_subtrees_full_validation-4 312.9µ 308.0µ ~ 0.400
SubtreeAllocations/medium_subtrees_exists_check-4 8.868µ 8.835µ ~ 0.700
SubtreeAllocations/medium_subtrees_data_fetch-4 9.286µ 9.396µ ~ 0.200
SubtreeAllocations/medium_subtrees_full_validation-4 17.42µ 17.54µ ~ 1.000
SubtreeAllocations/large_subtrees_exists_check-4 2.105µ 2.113µ ~ 0.700
SubtreeAllocations/large_subtrees_data_fetch-4 2.208µ 2.253µ ~ 0.200
SubtreeAllocations/large_subtrees_full_validation-4 4.308µ 4.344µ ~ 0.100
StoreBlock_Sequential/BelowCSVHeight-4 338.4µ 338.0µ ~ 0.700
StoreBlock_Sequential/AboveCSVHeight-4 336.6µ 338.5µ ~ 1.000
GetUtxoHashes-4 257.3n 255.5n ~ 0.500
GetUtxoHashes_ManyOutputs-4 42.62µ 45.90µ ~ 0.100
_NewMetaDataFromBytes-4 228.5n 227.7n ~ 1.000
_Bytes-4 396.2n 393.8n ~ 1.000
_MetaBytes-4 135.8n 136.2n ~ 0.400

Threshold: >10% with p < 0.05 | Generated: 2026-06-01 13:59 UTC

@ordishs ordishs left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approve. Verified all claims against the codebase:

  • Race 1 is structural: StartGRPCServer binds the listener via GetListener (util/grpc.go:37) before Serve (:125), so creating the client after WaitForServiceToBeReady() guarantees the socket is bound.
  • WaitUntilFSMTransitionFromIdleState (Server.go:2747) blocks on both FSM != IDLE and subscriptionManagerReady — exactly the condition that made GetFSMCurrentState return IDLE and starve the old 10s poll. Handler is registered.
  • The note about WaitForFSMtoTransitionToGivenState is accurate: the server method signature doesn't match the proto interface, so it falls through to Unimplemented.
  • RUN is the only event with Dst=RUNNING (fsm.go:48-56), so the post-wait assertion is exact.
  • Error construction is correct: errors.New peels the trailing error as the wrapped cause, then formats the remainder — no verb/arg mismatch.
  • go build + go vet ./test/nodeHelpers/... clean; fmt import still used.

Structural fix (ordering + correct sync primitive) rather than tuned timeouts. Test-helper only, low risk.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Flaky: services/blockassembly system tests (FSM-startup / gRPC connection races under CI load)

3 participants