test(multinode): un-skip scenario_04 block-assembly isolation#995
Conversation
The three teranode robustness bugs that blocked this split-per-service chaos scenario have all landed: bsv-blockchain#969 - CreateUTXOSet nil-pointer panic at startup bsv-blockchain#971 / bsv-blockchain#979 - utxopersister double-reading the fileformat magic (the misdiagnosed "unknown magic" crash) bsv-blockchain#985 - CreateUTXOSet crashing the core sidecar on the 16-byte footer of a previous UTXO set during consolidation, surfaced by this very scenario Remove the t.Skip and rewrite the blocker comment to record the full story (including that the "unknown magic" was the fileformat double-read, not a legacy wire bug as first assumed). Verified end-to-end against current main: the scenario runs in full (node 3 stalls at baseline while its blockassembly is down, catches up to the survivors after restart, all three nodes converge) - 2m47s, no skip. Runs via 'make network-chaos-test-split'; not part of the standard PR checks.
|
🤖 Claude Code Review Status: Complete No issues found. This PR cleanly un-skips the block-assembly isolation test after all blocking bugs were fixed. The documentation accurately describes the test behavior and correctly references the three merged fix PRs (#969, #971/#979, #985). The test logic is sound: it verifies that killing the blockassembly service stalls block validation on node 3 (proving the WaitForBlockAssemblyReady gate works), then confirms catch-up occurs after restart. |
Benchmark Comparison ReportBaseline: Current: Summary
All benchmark results (sec/op)
Threshold: >10% with p < 0.05 | Generated: 2026-06-01 09:33 UTC |
|
ordishs
left a comment
There was a problem hiding this comment.
Approve — clean, low-risk test-only change. Verified all four referenced fixes (#969, #971, #979, #985) are on main, all harness symbols exist, and go vet -tags network_chaos passes clean. The rewritten comment block preserving the debugging narrative (including the misdiagnosis of the 'unknown magic' crash) is excellent hygiene. No production code touched; gated behind the network_chaos tag.
…6 (subtreevalidation pause) Two new split-topology chaos scenarios building on the harness landed in bsv-blockchain#958 and the un-skip work in bsv-blockchain#995, plus the split-mode settings fix that scenario 05 turns out to depend on. Scenario 05: kills teranode3-validator, mines 5 blocks on node 1, asserts node 3 stalls at baseline (block-tx validation walks blockvalidation -> subtreevalidation -> validator), restarts validator, asserts catch-up and 3-node convergence. This only holds when subtreevalidation calls the standalone validator container over gRPC, NOT when it embeds an in-process validator. settings.conf:1212 ships useLocalValidator=true (the right default for all-in-one), so a vanilla docker.teranode{N}.test context would build the in-process validator and ignore the validator container entirely - making scenario 05 a no-op (raised in PR review on bsv-blockchain#1069). The split-mode overlay generated by compose/cmd/gennodes/templates/settings.conf.tmpl now flips useLocalValidator=false per node so the kill is actually observable. The override is gated on {{if not $.AllInOne}}, so all-in-one mode is unchanged. Scenario 06: PAUSES teranode3-subtreevalidation via docker pause (SIGSTOP) rather than killing, so the gRPC call from blockvalidation hangs (the frozen dependency failure mode, distinct from a process that has exited). First scenario to exercise the pause/unpause verbs. Uses defer UnpauseService so a failed assertion leaves the shared stack healthy for the next scenario's Reset. Both are gated on //go:build network_chaos like the rest of the split-topology suite.
…6 (subtreevalidation pause) Two new split-topology chaos scenarios building on the harness landed in bsv-blockchain#958 and the un-skip work in bsv-blockchain#995, plus the split-mode useLocalValidator override that scenario 05 depends on and a chaos_unpause idempotency fix that scenario 06 depends on. Scenario 05: kills teranode3-validator, mines 5 blocks on node 1, asserts node 3 stalls at baseline (block-tx validation walks blockvalidation -> subtreevalidation -> validator), restarts validator, asserts catch-up and 3-node convergence. This only holds when subtreevalidation calls the standalone validator container over gRPC, NOT when it embeds an in-process validator. settings.conf ships useLocalValidator=true (the right default for all-in-one), so a vanilla docker.teranode{N}.test context would build the in-process validator and ignore the validator container entirely - making scenario 05 a silent no-op (raised in PR review on bsv-blockchain#1069). The split-mode overlay generated by compose/cmd/gennodes/templates/settings.conf.tmpl now flips useLocalValidator=false per node so the kill is actually observable. The override is gated on {{if not $.AllInOne}}, so all-in-one mode is unchanged. Scenario 06: PAUSES teranode3-subtreevalidation via docker pause (SIGSTOP) rather than killing, so the gRPC call from blockvalidation hangs (the frozen dependency failure mode, distinct from a process that has exited). First scenario to exercise the pause/unpause verbs. Uses defer UnpauseService so a failed assertion leaves the shared stack healthy for the next scenario's Reset. Bash fix: chaos_unpause is now idempotent across all three branches. The single-service and bulk-aio branches were previously running a bare 'docker unpause', which exits non-zero on an already-running container. Scenario 06 runs a defensive defer-unpause alongside an explicit one (defer is the safety net for assertion failure; explicit unblocks the catch-up assertion), and the bare docker call turned the second one into t.Fatalf, failing the test on a green run. The bulk-split branch already had the '|| true' pattern; this extends it for consistency. Same fix surfaced independently in PR review on bsv-blockchain#1070 against scenario 08. Idempotent semantics are what 'chaos cleanup' actually wants anyway. Both scenarios are gated on //go:build network_chaos like the rest of the split-topology suite.



Un-skips
scenario_04(block-assembly isolation) in the split-per-service chaos suite. This was the capstone of the chaos-harness work: the scenario was merged skipped (#958) pending teranode robustness fixes, all of which have now landed.What the scenario proves
In the split topology each node runs nine per-service containers, so we can kill
blockassemblyalone. The scenario killsteranode3-blockassembly, mines 5 blocks on node 1, and asserts node 3 stays pinned at baseline while itsblockvalidationcontainer is healthy — proving the hard runtime dependency (WaitForBlockAssemblyReadygates every inbound block). After restartingblockassembly, node 3 catches up and all three converge. This failure mode is invisible in the all-in-one topology, where you can't killblockassemblywithout taking the whole node down.Blockers, all now fixed
Bringing this scenario up surfaced a cascade of utxopersister bugs, each fix exposing the next:
CreateUTXOSetnil-pointer panic at startup (height-1 probe block).CreateUTXOSetcrashing thecoresidecar on the 16-byte footer (txCount+utxoCount) of a previous UTXO set during consolidation, hit when node 3 persists peer blocks while itsblockassemblyis down. This was the last blocker, surfaced by this very scenario.The comment block is rewritten to record this full story.
Verification
Ran end-to-end against current
main(which includes all three fixes):Full run, no skip — node 3 stalls at baseline while
blockassemblyis down, catches up after restart, all three nodes converge.Note: this suite is behind the
network_chaosbuild tag and runs viamake network-chaos-test-split(≈32 containers); it is not part of the standard PR checks, so CI here will not execute it.