test(multinode): add split-topology scenarios 07 (p2p), 08 (propagation freeze), 09 (asset control)#1070
Conversation
|
🤖 Claude Code Review Status: Complete Three new Current Review:
History:
Note: the PR's own test plan still has the end-to-end run against a live split stack unchecked. The fix is sound by inspection, but a real 3-node run is the final confirmation. |
| t.Logf("teranode2 reached height=%d with propagation still paused", caughtUp.Blocks) | ||
|
|
||
| t.Log("unpausing teranode2-propagation...") | ||
| s.UnpauseService(t, 2, "propagation") |
There was a problem hiding this comment.
[Major] Double-unpause fails the test on its own happy path.
This explicit s.UnpauseService(t, 2, "propagation") plus the defer s.UnpauseService(t, 2, "propagation") at line 60 means unpause runs twice on success. After this call the container is already running, so the deferred call shells out to docker unpause on a non-paused container.
The single-service path in compose/multinode.sh (chaos_unpause, line 657) runs docker unpause "$ctr" without the 2>/dev/null || true guard that the bulk path (line 665) has. docker unpause on a running container exits non-zero (Error response from daemon: ... is not paused), which MustRun turns into t.Fatalf. So the deferred cleanup fails the test even when every assertion has passed.
Fix options: drop this explicit unpause and let the defer be the single source of truth (the convergence check only compares block heights, which does not need propagation running); or make chaos_unpause's single-service branch idempotent like the bulk path: docker unpause "$ctr" 2>/dev/null || true. The first keeps the change inside this PR.
There was a problem hiding this comment.
Resolved by the multinode.sh change in this push: chaos_unpause now runs docker unpause ... 2>/dev/null || true on the single-service branch (matching the bulk-split branch), so the explicit + defer double-unpause on the happy path is idempotent and no longer trips MustRun -> t.Fatalf. Marking resolved.
Benchmark Comparison ReportBaseline: Current: Summary
All benchmark results (sec/op)
Threshold: >10% with p < 0.05 | Generated: 2026-06-11 08:20 UTC |
…eeze), 09 (asset isolation control) Three new split-topology chaos scenarios that broaden coverage beyond the validation pipeline (04/05/06) to the relay control plane and the off-path services, plus an idempotency fix to chaos_unpause that scenario 08 turns out to depend on (and scenario 06 in bsv-blockchain#1069 transparently benefits from). Scenario 07 (KillService, expected STALL): kills teranode3-p2p and asserts node 3 stays at baseline because the inter-node block-relay control plane in the test stack is routed exclusively through teranodeN-p2p. Legacy lives in core but is configured for SV bridging, not inter-teranode relay, so killing native p2p genuinely severs node 3 from the mesh. Restart, catch-up, 3-node convergence asserted as usual. Scenario 08 (PauseService, asymmetric expected NO STALL): the suite's first asymmetric scenario. Pauses teranode2-propagation and asserts node 2 still catches up via p2p + blockvalidation, because propagation is the tx-ingress path (RPC sendrawtransaction -> propagation -> validator -> blockassembly), not the block-relay path. Inbound peer blocks reach node 2 via p2p -> blockvalidation -> subtreevalidation; they never touch propagation. Uses PauseService (SIGSTOP) rather than KillService so any accidental gRPC call from a block-path site would hang the caller (frozen-dependency failure mode), making the coupling visible. Has a defer UnpauseService so a failed assertion still leaves the shared stack healthy for the next scenario. Scenario 09 (KillService on asset, control / null hypothesis): kills teranode3-asset and asserts node 3 still catches up. Asset is the HTTP server that exposes a node's subtree/block data outbound to peers and external clients; it is not on the receive-side consensus path. Without a control like this, the suite cannot distinguish "validator/p2p/etc kills stalled node 3 because they're on the consensus path" from "any KillService stalls node 3 because the harness or compose graph is broken." This scenario rules out the latter. If it surprisingly fails - node 3 does stall after asset is killed - that documents a real coupling (subtreevalidation reaching localhost asset rather than the originating peer's asset) worth chasing. Bash fix: chaos_unpause is now idempotent across all three branches. The single-service and bulk-aio branches were previously running a bare 'docker unpause', which exits non-zero on an already-running container. Scenarios 06 and 08 both run a defensive defer-unpause alongside an explicit one on the happy path (defer is the safety net for assertion failure; explicit is needed for the catch-up assertion in scenario 06), and the bare docker call was turning the second one into t.Fatalf, failing the test even when all assertions passed. The bulk-split branch already had the '|| true' pattern; this extends it for consistency. Raised in PR review on bsv-blockchain#1070. Idempotent semantics are what 'chaos cleanup' actually wants anyway. All three scenarios are gated on //go:build network_chaos like the rest of the suite.
f5807a8 to
21ec215
Compare
|
Confirmed. Walked it through:
Fix in Note: scenario 06 in PR #1069 has the same defer + explicit pattern (and is required to keep it - the explicit unpause unblocks the catch-up assertion), so it benefits transparently from this fix. PR #1069 was already amended for the separate |
|
…6 (subtreevalidation pause) Two new split-topology chaos scenarios building on the harness landed in bsv-blockchain#958 and the un-skip work in bsv-blockchain#995, plus the split-mode useLocalValidator override that scenario 05 depends on and a chaos_unpause idempotency fix that scenario 06 depends on. Scenario 05: kills teranode3-validator, mines 5 blocks on node 1, asserts node 3 stalls at baseline (block-tx validation walks blockvalidation -> subtreevalidation -> validator), restarts validator, asserts catch-up and 3-node convergence. This only holds when subtreevalidation calls the standalone validator container over gRPC, NOT when it embeds an in-process validator. settings.conf ships useLocalValidator=true (the right default for all-in-one), so a vanilla docker.teranode{N}.test context would build the in-process validator and ignore the validator container entirely - making scenario 05 a silent no-op (raised in PR review on bsv-blockchain#1069). The split-mode overlay generated by compose/cmd/gennodes/templates/settings.conf.tmpl now flips useLocalValidator=false per node so the kill is actually observable. The override is gated on {{if not $.AllInOne}}, so all-in-one mode is unchanged. Scenario 06: PAUSES teranode3-subtreevalidation via docker pause (SIGSTOP) rather than killing, so the gRPC call from blockvalidation hangs (the frozen dependency failure mode, distinct from a process that has exited). First scenario to exercise the pause/unpause verbs. Uses defer UnpauseService so a failed assertion leaves the shared stack healthy for the next scenario's Reset. Bash fix: chaos_unpause is now idempotent across all three branches. The single-service and bulk-aio branches were previously running a bare 'docker unpause', which exits non-zero on an already-running container. Scenario 06 runs a defensive defer-unpause alongside an explicit one (defer is the safety net for assertion failure; explicit unblocks the catch-up assertion), and the bare docker call turned the second one into t.Fatalf, failing the test on a green run. The bulk-split branch already had the '|| true' pattern; this extends it for consistency. Same fix surfaced independently in PR review on bsv-blockchain#1070 against scenario 08. Idempotent semantics are what 'chaos cleanup' actually wants anyway. Both scenarios are gated on //go:build network_chaos like the rest of the split-topology suite.
ordishs
left a comment
There was a problem hiding this comment.
Approve with minor follow-ups.
Clean, idiomatic, low-risk additive test code. Conventions match scenario 04 precisely (build tag, package, harness API, ctx pattern, short() reuse). The multinode.sh fix is correct: scenario 08 intentionally calls UnpauseService both explicitly and via defer, and a bare docker unpause on an already-running container exits non-zero — which MustRun would turn into a spurious t.Fatalf. Suppressing it makes cleanup idempotent and brings all three chaos_unpause branches in line. The PauseService (SIGSTOP) vs KillService choice in 08 is well-reasoned: a frozen dependency hangs the caller rather than returning connection-refused, so a hidden block-path coupling surfaces as a real stall. Scenario 09 as an explicit null-hypothesis control is genuinely valuable.
Verified locally:
go vet -tags network_chaos ./test/multinode_split/...— clean (exit 0)
Follow-ups (none blocking the code itself):
-
E2E run still pending. Only the build is confirmed; the asymmetric assertions (07 stalls; 08/09 catch up) encode hypotheses about the dependency graph that a compile can't validate. Recommend gating merge on an actual
go test -tags network_chaos -run 'TestP2PIsolation|TestPropagationFreeze|TestAssetIsolation'against a live 3-node split stack (CI or local) — that's the whole value of these scenarios. -
Numbering / merge order. Scenarios 05 and 06 aren't in
mainor this PR (they're on a separate branch), so the body's "validation pipeline tests (04/05/06)" reference and the "first asymmetric scenario" framing in 08 assume a merge order. Againstmainthe files jump 04 → 07/08/09. Worth aligning the merge sequence or noting the dependency. -
Nit: 07/08/09 depend on
short()defined in scenario 04 — fine today (same package/tag), just a silent compile prerequisite to be aware of.



Summary
Three new split-topology chaos scenarios that broaden coverage beyond the validation pipeline tests (04/05/06) to the relay control plane and the off-path services. All gated on
//go:build network_chaos.Scenario 07 — P2P isolation (
KillService, expected stall)teranode3-p2p. The multinode docker stack relays blocks between teranodes only via the native p2p service, so killingteranodeN-p2pgenuinely severs node N from the mesh. (Legacy lives incorebut is configured for SV bridging, not inter-teranode relay.)Scenario 08 — Propagation freeze (
PauseService, asymmetric expected NO stall)teranode2-propagationand asserts node 2 still catches up viap2p+blockvalidation, because propagation is the tx-ingress path (RPC sendrawtransaction -> propagation -> validator -> blockassembly), not the block-relay path.PauseService(SIGSTOP) rather thanKillServiceso any accidental gRPC call from a block-path site would hang the caller (frozen-dependency failure mode), making a hidden coupling visible rather than letting it surface as connection-refused.defer s.UnpauseServiceso a failed assertion still leaves the shared stack healthy for the next scenario.Scenario 09 — Asset isolation (
KillService, control / null hypothesis)teranode3-assetand asserts node 3 still catches up. Asset is the outbound HTTP server that serves a node's subtree/block data to peers and external clients; it is not on the receive-side consensus path.KillServicestalls node 3 because the harness or compose graph is broken." This scenario rules out the latter.Test plan
go build -tags network_chaos ./test/multinode_split/...cleango test -tags network_chaos -run 'TestP2PIsolation|TestPropagationFreeze|TestAssetIsolation'end-to-end