Six services/blockassembly system tests fail intermittently with daemon-startup timing errors. Confirmed flaky: the same commit (33e17d537) failed, then passed on an immediate re-run with no code change.
Tests
All in services/blockassembly/blockassembly_system_test.go (via test/nodeHelpers/blockchainDaemon.go):
Test_CoinbaseSubsidyHeight
TestShouldAddSubtreesToLongerChain
TestShouldHandleReorg
TestShouldHandleReorgWithLongerChain
TestResetWithBlockchainAhead_Integration
TestHandleReorgWithInvalidBlock_Integration
Symptoms
Two signatures:
timeout waiting for FSM to transition to RUNNING state (current state: IDLE) — blockchainDaemon.go:156
failed to connect to blockchain service for 'localhost:<port>', retried 3 times: ... connection refused
Failing tests run for ~45s (the client connect-retry budget) or ~10s (the FSM poll timeout) before erroring out.
Mechanism
StartBlockchainService() starts the blockchain service in-process (goroutine + gRPC server on a local port), then polls the FSM for RUNNING with a 10s timeout (blockchainDaemon.go:148). Under heavy concurrent CI load (these ran alongside 14 benchmark jobs and the e2e shard matrices), either the in-process gRPC server does not bind in time or the FSM does not reach RUNNING within the deadline.
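To make the failure mode concrete, here is a minimal sketch of a poll-with-deadline loop of the kind described above. It is not the code in blockchainDaemon.go: `State`, `waitForState`, and `simulateStartup` are hypothetical names, and the FSM is replaced by a fake that flips to RUNNING after a configurable delay.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// State is a hypothetical stand-in for the blockchain FSM state.
type State string

const (
	StateIdle    State = "IDLE"
	StateRunning State = "RUNNING"
)

// waitForState polls getState every interval until it returns want,
// or returns an error once ctx expires.
func waitForState(ctx context.Context, getState func() State, want State, interval time.Duration) error {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()

	for {
		current := getState()
		if current == want {
			return nil
		}
		select {
		case <-ctx.Done():
			return fmt.Errorf("timeout waiting for FSM to transition to %s state (current state: %s)", want, current)
		case <-ticker.C:
			// Poll again on the next loop iteration.
		}
	}
}

// simulateStartup models a service whose FSM reaches RUNNING after
// startupDelay and polls it with the given timeout (both assumptions
// made for illustration only).
func simulateStartup(startupDelay, timeout time.Duration) error {
	begin := time.Now()
	getState := func() State {
		if time.Since(begin) >= startupDelay {
			return StateRunning
		}
		return StateIdle
	}

	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()
	return waitForState(ctx, getState, StateRunning, 5*time.Millisecond)
}

func main() {
	// Quiet runner: startup finishes well inside the deadline.
	fmt.Println("fast startup:", simulateStartup(10*time.Millisecond, 200*time.Millisecond))
	// Loaded runner: startup outlasts the deadline, so the poll gives up.
	fmt.Println("slow startup:", simulateStartup(500*time.Millisecond, 50*time.Millisecond))
}
```

The same code passes or fails depending only on how long startup takes relative to the fixed deadline, which is consistent with identical commits producing opposite outcomes.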
Evidence
PR #996, "Teranode PR tests" run 26752608043:
- attempt 1: test job FAILED (these 6)
- attempt 2 (re-run, identical SHA): test job PASSED
Same commit, opposite outcomes ⇒ flaky.
Suggested directions
- Raise the 10s FSM-RUNNING poll timeout (and the client connect-retry budget); 10s is tight on a loaded runner.
- Or wrap these system tests in the existing retry mechanism (the e2e suites use gotestsum_with_retry.sh).
- Longer term: have StartBlockchainService() wait on a readiness signal rather than a fixed timeout.
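As a rough illustration of the readiness-signal direction, the sketch below has the service close a channel once startup completes, and the caller block on that channel with only a generous safety ceiling. Everything here is hypothetical (`fakeService`, `WaitReady`, `startAndWait`); the real service would need to expose an equivalent signal, and the sleep merely stands in for binding the listener and reaching RUNNING.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// fakeService is a hypothetical stand-in for the in-process blockchain service.
type fakeService struct {
	ready        chan struct{} // closed once startup has completed
	startupDelay time.Duration
}

func newFakeService(startupDelay time.Duration) *fakeService {
	return &fakeService{ready: make(chan struct{}), startupDelay: startupDelay}
}

// Start runs startup in a goroutine and closes ready when it is done.
func (s *fakeService) Start() {
	go func() {
		// Stands in for binding the gRPC listener and the FSM reaching RUNNING.
		time.Sleep(s.startupDelay)
		close(s.ready)
	}()
}

// WaitReady blocks until the service signals readiness or ctx is done.
func (s *fakeService) WaitReady(ctx context.Context) error {
	select {
	case <-s.ready:
		return nil
	case <-ctx.Done():
		return fmt.Errorf("service not ready: %w", ctx.Err())
	}
}

// startAndWait starts a fake service and waits for its readiness signal,
// using ceiling only as an upper bound for a hung startup.
func startAndWait(startupDelay, ceiling time.Duration) error {
	svc := newFakeService(startupDelay)
	svc.Start()

	ctx, cancel := context.WithTimeout(context.Background(), ceiling)
	defer cancel()
	return svc.WaitReady(ctx)
}

func main() {
	// Returns as soon as the service is ready, however long that takes,
	// as long as it stays under the ceiling.
	fmt.Println("ready:", startAndWait(100*time.Millisecond, 5*time.Second))
	// A startup that never finishes in time still fails with a clear error.
	fmt.Println("hung:", startAndWait(time.Second, 50*time.Millisecond))
}
```

Because the caller returns as soon as the signal fires, the ceiling can be set high without slowing down the normal case, which a fixed polling timeout cannot offer.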
Not blocking a specific PR — surfaced while profiling the unit-test suite.