fix(test): de-flake blockassembly system tests (FSM/gRPC startup races)#1000
Conversation
The client was created before WaitForServiceToBeReady(), so it dialed the blockchain gRPC service before the listener was guaranteed bound and relied on its retry budget to bridge the gap. Under heavy CI load that budget was exhausted, producing 'connection refused, retried 3 times'. The listener is bound before readiness is signalled (GetListener -> close readyCh -> Serve), so moving client creation after WaitForServiceToBeReady() removes the race. Refs bsv-blockchain#997
…0s poll StartBlockchainService polled GetFSMCurrentState on a 100ms ticker with a hard 10s deadline; under heavy CI load that poll kept seeing IDLE (GetFSMCurrentState reports IDLE until the subscription manager is ready) and timed out with 'timeout waiting for FSM to transition to RUNNING state'. Replace the fixed budget with WaitUntilFSMTransitionFromIdleState, a server-blocking RPC that returns the instant the FSM leaves IDLE and the subscription manager is ready, guarded by a generous 60s context backstop, then assert the state is RUNNING. Refs bsv-blockchain#997
|
🤖 Claude Code Review Status: Complete Review Summary: This PR correctly fixes two race conditions in the test helper by:
The implementation is correct and well-documented. The 60s timeout is appropriately generous for a backstop, and the explicit state assertion after the wait provides defense-in-depth against future FSM changes. Findings: No issues found. The changes are minimal, targeted, and follow best practices for eliminating timing dependencies in tests. |
|
Benchmark Comparison ReportBaseline: Current: Summary
All benchmark results (sec/op)
Threshold: >10% with p < 0.05 | Generated: 2026-06-01 13:59 UTC |
ordishs
left a comment
There was a problem hiding this comment.
Approve. Verified all claims against the codebase:
- Race 1 is structural:
StartGRPCServerbinds the listener viaGetListener(util/grpc.go:37) beforeServe(:125), so creating the client afterWaitForServiceToBeReady()guarantees the socket is bound. WaitUntilFSMTransitionFromIdleState(Server.go:2747) blocks on bothFSM != IDLEandsubscriptionManagerReady— exactly the condition that madeGetFSMCurrentStatereturn IDLE and starve the old 10s poll. Handler is registered.- The note about
WaitForFSMtoTransitionToGivenStateis accurate: the server method signature doesn't match the proto interface, so it falls through toUnimplemented. - RUN is the only event with Dst=RUNNING (fsm.go:48-56), so the post-wait assertion is exact.
- Error construction is correct:
errors.Newpeels the trailing error as the wrapped cause, then formats the remainder — no verb/arg mismatch. go build+go vet ./test/nodeHelpers/...clean;fmtimport still used.
Structural fix (ordering + correct sync primitive) rather than tuned timeouts. Test-helper only, low risk.



Summary
Fixes the six
services/blockassemblysystem tests that fail intermittently under heavy CI load (#997). Both failure signatures originate in one test helper,BlockchainDaemon.StartBlockchainService()(test/nodeHelpers/blockchainDaemon.go), which had two fixed-timeout races that lose when the runner is contended. Test-helper only — no production code, CI, or Makefile changes.Race 1 —
connection refused, retried 3 times. The blockchain gRPC client was created beforeWaitForServiceToBeReady(), so it dialled before the listener was guaranteed bound and relied on its retry budget to bridge the gap; under load that budget was exhausted.StartGRPCServerbinds the socket before readiness is signalled (GetListener→ closereadyCh→Serve), so moving client creation to afterWaitForServiceToBeReady()removes the race structurally.Race 2 —
timeout waiting for FSM to transition to RUNNING state. The helper polledGetFSMCurrentStateon a 100ms ticker with a hard 10s deadline.GetFSMCurrentStatedeliberately reportsIDLEuntil the subscription manager is ready, so under load the poll kept seeingIDLEand timed out. Replaced withWaitUntilFSMTransitionFromIdleState— a server-blocking RPC that returns the instant the FSM leaves IDLE and the subscription manager is ready — bounded by a generous 60s context backstop instead of a fixed budget, then assert the state isRUNNING.Note: the obvious-looking
WaitForFSMtoTransitionToGivenStateis not usable here — its proto handler is unregistered and returnscodes.Unimplementedover a real gRPC client.WaitUntilFSMTransitionFromIdleStateis the implemented, event-driven path (andRun()'s only destination isRUNNING, so the post-wait assertion is exact).Test Plan
go build ./test/nodeHelpers/... ./services/blockassembly/...+go vet ./test/nodeHelpers/...— cleango test -race -tags testtxmetacache -count=10 ./test/nodeHelpers/...— PASSgo test -race -tags testtxmetacache -count=10 -run '<the six tests>' ./services/blockassembly/...— PASS, noconnection refused/ FSM timeout /Unimplementedgo test -race -tags testtxmetacache -count=1 ./services/blockassembly/...(whole package) — PASSCloses #997