test(multinode): add scenario 05 (validator isolation) and scenario 06 (subtreevalidation pause)#1069
Conversation
|
|
||
| stalledInfo, err := node3.GetBlockchainInfo(ctx) | ||
| require.NoError(t, err, "node 3 RPC must still answer (only validator is down, not core)") | ||
| require.Equal(t, baselineHeight, stalledInfo.Blocks, |
There was a problem hiding this comment.
[Major] Test premise likely contradicts the default config — this assertion may fail on the live stack.
The author's own hypothesis (line 27) resolves against this test from a static read of the wiring:
- subtreevalidation gets its validator via daemonStores.GetValidatorClient (daemon/daemon_services.go:766, passed into subtreevalidation.New at :821/:828).
- GetValidatorClient returns an in-process validator (validator.New) when appSettings.Validator.UseLocalValidator is true, and only a gRPC client (validator.NewClient) otherwise (daemon/daemon_stores.go:194-255).
- The only useLocalValidator setting in the repo is the global default 'useLocalValidator = true' (settings.conf:1212). The split stack runs under SETTINGS_CONTEXT=docker.teranode{N}.test (compose/cmd/gennodes/templates/docker-compose-split.yml.tmpl:144), and the generated settings_multinode.conf only sets validator_grpcAddress, not useLocalValidator — so it resolves to true.
With a local validator embedded in subtreevalidation, killing the standalone teranode3-validator container leaves the block-tx validation path intact, so node 3 would catch up rather than stall — exactly the falsification path the doc comment describes. The assertion require.Equal(t, baselineHeight, stalledInfo.Blocks) would then fail.
Worth confirming on the live stack (the corresponding PR checklist item is still unchecked). If the split topology is meant to force the remote validator, it likely needs useLocalValidator = false for the docker split context.
(Scenario 06 is unaffected: blockvalidation always constructs a gRPC subtreevalidation client at services/blockvalidation/Server.go:579, so pausing that service genuinely hangs the block path.)
There was a problem hiding this comment.
✅ Resolved by this revision. The template now emits useLocalValidator.docker.teranode{N} = false, gated behind {{ if not $.AllInOne }} so it applies only in split mode. That context key resolves for the runtime docker.teranode{N}.test context via gocore hierarchy (same pattern the sibling *_grpcAddress.docker.teranode{N} overrides already rely on), so subtreevalidation builds a gRPC validator client and killing the standalone validator container is now observable. All-in-one keeps the in-process validator. Scenario 05 is no longer a no-op.
|
🤖 Claude Code Review Status: Complete Current Review: Two new
No new issues found. Both prior History:
Note: the PR's end-to-end checklist item (running both tests against a live 3-node split stack) is still unchecked — the static read supports the premise, but a live run is the final confirmation. |
Benchmark Comparison ReportBaseline: Current: Summary
All benchmark results (sec/op)
Threshold: >10% with p < 0.05 | Generated: 2026-06-11 08:54 UTC |
…6 (subtreevalidation pause) Two new split-topology chaos scenarios building on the harness landed in bsv-blockchain#958 and the un-skip work in bsv-blockchain#995, plus the split-mode settings fix that scenario 05 turns out to depend on. Scenario 05: kills teranode3-validator, mines 5 blocks on node 1, asserts node 3 stalls at baseline (block-tx validation walks blockvalidation -> subtreevalidation -> validator), restarts validator, asserts catch-up and 3-node convergence. This only holds when subtreevalidation calls the standalone validator container over gRPC, NOT when it embeds an in-process validator. settings.conf:1212 ships useLocalValidator=true (the right default for all-in-one), so a vanilla docker.teranode{N}.test context would build the in-process validator and ignore the validator container entirely - making scenario 05 a no-op (raised in PR review on bsv-blockchain#1069). The split-mode overlay generated by compose/cmd/gennodes/templates/settings.conf.tmpl now flips useLocalValidator=false per node so the kill is actually observable. The override is gated on {{if not $.AllInOne}}, so all-in-one mode is unchanged. Scenario 06: PAUSES teranode3-subtreevalidation via docker pause (SIGSTOP) rather than killing, so the gRPC call from blockvalidation hangs (the frozen dependency failure mode, distinct from a process that has exited). First scenario to exercise the pause/unpause verbs. Uses defer UnpauseService so a failed assertion leaves the shared stack healthy for the next scenario's Reset. Both are gated on //go:build network_chaos like the rest of the split-topology suite.
2063eef to
fcd6c00
Compare
|
Good catch, confirmed and fixed. Walked through the code:
Net effect was exactly what you described: the standalone The fix in the amended commit ( Scenario 06 left untouched per your read - |
| s.PauseService(t, 3, "subtreevalidation") | ||
| // Ensure the service is thawed even if an assertion below fails, so the | ||
| // shared stack is left healthy for the next scenario's Reset. | ||
| defer s.UnpauseService(t, 3, "subtreevalidation") |
There was a problem hiding this comment.
[Major] Double-unpause fails the test on the happy path.
The deferred UnpauseService (line 52) and the explicit one (line 74) both fire when all assertions pass: line 74 unpauses teranode3-subtreevalidation successfully, then on return the deferred call runs again.
The service-specific branch of chaos_unpause runs docker unpause WITHOUT error suppression (compose/multinode.sh:657), unlike the whole-node branch which swallows errors (line 665). docker unpause on an already-running container exits non-zero ("Error response from daemon: Container ... is not paused"), so MustRun (harness.go:243) calls t.Fatalf — failing a test whose body actually succeeded.
The defer for crash-safety is the right instinct; the problem is it is not idempotent with the explicit unpause. Two fixes:
- Make the defer a no-op once thawed — guard it with a "paused" bool that the explicit unpause flips to false.
- Or make UnpauseService idempotent at the script layer: suppress the not-paused error in the service branch (multinode.sh:657) the same way the whole-node branch already does. This also benefits any future scenario using this defer idiom.
(This is the first scenario to use the pause/unpause verbs, so the non-idempotent service path has not been exercised before.)
There was a problem hiding this comment.
✅ Addressed in this revision via fix option #2: compose/multinode.sh now runs docker unpause ... 2>/dev/null || true in the service-specific branch, making UnpauseService idempotent. The deferred + explicit unpause on the happy path no longer fails the test.
…eeze), 09 (asset isolation control) Three new split-topology chaos scenarios that broaden coverage beyond the validation pipeline (04/05/06) to the relay control plane and the off-path services, plus an idempotency fix to chaos_unpause that scenario 08 turns out to depend on (and scenario 06 in bsv-blockchain#1069 transparently benefits from). Scenario 07 (KillService, expected STALL): kills teranode3-p2p and asserts node 3 stays at baseline because the inter-node block-relay control plane in the test stack is routed exclusively through teranodeN-p2p. Legacy lives in core but is configured for SV bridging, not inter-teranode relay, so killing native p2p genuinely severs node 3 from the mesh. Restart, catch-up, 3-node convergence asserted as usual. Scenario 08 (PauseService, asymmetric expected NO STALL): the suite's first asymmetric scenario. Pauses teranode2-propagation and asserts node 2 still catches up via p2p + blockvalidation, because propagation is the tx-ingress path (RPC sendrawtransaction -> propagation -> validator -> blockassembly), not the block-relay path. Inbound peer blocks reach node 2 via p2p -> blockvalidation -> subtreevalidation; they never touch propagation. Uses PauseService (SIGSTOP) rather than KillService so any accidental gRPC call from a block-path site would hang the caller (frozen-dependency failure mode), making the coupling visible. Has a defer UnpauseService so a failed assertion still leaves the shared stack healthy for the next scenario. Scenario 09 (KillService on asset, control / null hypothesis): kills teranode3-asset and asserts node 3 still catches up. Asset is the HTTP server that exposes a node's subtree/block data outbound to peers and external clients; it is not on the receive-side consensus path. Without a control like this, the suite cannot distinguish "validator/p2p/etc kills stalled node 3 because they're on the consensus path" from "any KillService stalls node 3 because the harness or compose graph is broken." This scenario rules out the latter. If it surprisingly fails - node 3 does stall after asset is killed - that documents a real coupling (subtreevalidation reaching localhost asset rather than the originating peer's asset) worth chasing. Bash fix: chaos_unpause is now idempotent across all three branches. The single-service and bulk-aio branches were previously running a bare 'docker unpause', which exits non-zero on an already-running container. Scenarios 06 and 08 both run a defensive defer-unpause alongside an explicit one on the happy path (defer is the safety net for assertion failure; explicit is needed for the catch-up assertion in scenario 06), and the bare docker call was turning the second one into t.Fatalf, failing the test even when all assertions passed. The bulk-split branch already had the '|| true' pattern; this extends it for consistency. Raised in PR review on bsv-blockchain#1070. Idempotent semantics are what 'chaos cleanup' actually wants anyway. All three scenarios are gated on //go:build network_chaos like the rest of the suite.
ordishs
left a comment
There was a problem hiding this comment.
Request changes — one blocking bug in scenario 06
Nice scenarios, and the useLocalValidator override is genuinely load-bearing (verified: daemon_stores.go:194-249 builds an in-process validator when useLocalValidator=true, and subtreevalidation shares startValidationService, so without the flip scenario 05 would be a silent no-op). The call chain BlockValidation.go:2027 -> CheckBlockSubtrees -> ValidateWithOptions checks out too.
🔴 Blocking: scenario 06 fails on the happy path (non-idempotent unpause)
UnpauseService is called twice on a passing run — explicitly (line 74) and via defer (line 52) — but docker unpause is not idempotent, and the per-service path does not guard it:
# compose/multinode.sh:653-659 (per-service path)
docker unpause "$ctr" # <- no `|| true`The whole-node path right below it (line 665) does tolerate this: docker unpause "$ctr" 2>/dev/null || true.
Sequence on a green run: line 74 thaws the container → test body returns → the deferred call runs docker unpause on an already-running container → Docker exits non-zero (Container ... is not paused) → MustRun calls t.Fatalf (harness.go:243). The test fails even though every assertion passed.
This path has never been exercised: scenario 06 is the first pause/unpause user, the e2e checkbox is unchecked, and StartService (scenarios 04/05) is idempotent via compose up -d, so there was no prior coverage.
Recommended fix (cleanest, benefits all callers, matches StartService's documented idempotency): make the per-service unpause tolerant, mirroring the whole-node path:
docker unpause "$ctr" 2>/dev/null || trueThen the defer-plus-explicit safety idiom works as intended.
🟡 Minor (non-blocking)
- Brittle line reference: scenario 05's comment cites
settings.conf:1212by absolute line number — accurate now, but it will silently drift. Reference the key name (useLocalValidator) instead. - The fixed 15s settle sleeps carry inherent flakiness, but they match the deliberate, empirically-tuned choice in
scenario_04— fine as-is.
Please also close the e2e checkbox (go test -tags network_chaos -run TestSubtreeValidationIsolation) before merge — this bug surfaces on the first real run.
…6 (subtreevalidation pause) Two new split-topology chaos scenarios building on the harness landed in bsv-blockchain#958 and the un-skip work in bsv-blockchain#995, plus the split-mode useLocalValidator override that scenario 05 depends on and a chaos_unpause idempotency fix that scenario 06 depends on. Scenario 05: kills teranode3-validator, mines 5 blocks on node 1, asserts node 3 stalls at baseline (block-tx validation walks blockvalidation -> subtreevalidation -> validator), restarts validator, asserts catch-up and 3-node convergence. This only holds when subtreevalidation calls the standalone validator container over gRPC, NOT when it embeds an in-process validator. settings.conf ships useLocalValidator=true (the right default for all-in-one), so a vanilla docker.teranode{N}.test context would build the in-process validator and ignore the validator container entirely - making scenario 05 a silent no-op (raised in PR review on bsv-blockchain#1069). The split-mode overlay generated by compose/cmd/gennodes/templates/settings.conf.tmpl now flips useLocalValidator=false per node so the kill is actually observable. The override is gated on {{if not $.AllInOne}}, so all-in-one mode is unchanged. Scenario 06: PAUSES teranode3-subtreevalidation via docker pause (SIGSTOP) rather than killing, so the gRPC call from blockvalidation hangs (the frozen dependency failure mode, distinct from a process that has exited). First scenario to exercise the pause/unpause verbs. Uses defer UnpauseService so a failed assertion leaves the shared stack healthy for the next scenario's Reset. Bash fix: chaos_unpause is now idempotent across all three branches. The single-service and bulk-aio branches were previously running a bare 'docker unpause', which exits non-zero on an already-running container. Scenario 06 runs a defensive defer-unpause alongside an explicit one (defer is the safety net for assertion failure; explicit unblocks the catch-up assertion), and the bare docker call turned the second one into t.Fatalf, failing the test on a green run. The bulk-split branch already had the '|| true' pattern; this extends it for consistency. Same fix surfaced independently in PR review on bsv-blockchain#1070 against scenario 08. Idempotent semantics are what 'chaos cleanup' actually wants anyway. Both scenarios are gated on //go:build network_chaos like the rest of the split-topology suite.
fcd6c00 to
de64bf5
Compare
|
Thanks, all three addressed in 🔴 chaos_unpause idempotency — fixed in 🟡 Brittle line reference — scenario 05's comment now reads "settings.conf ships useLocalValidator=true" with no line number. The split-mode override path ( 🟡 15s settle sleep — left as-is, matching scenario 04's deliberate choice. E2E checkbox — honest answer: I have not run the suite end-to-end yet. Local blockers are (a) Aerospike testcontainer rejects the host's default |
|
ordishs
left a comment
There was a problem hiding this comment.
Re-review — blocking issue resolved ✅
The non-idempotent unpause is fixed in de64bf57. chaos_unpause's per-service path now reads:
docker unpause "$ctr" 2>/dev/null || trueand the same guard was applied to the all-in-one path for consistency (the bulk-split path already had it), with a comment explaining exactly why — the defensive defer-plus-explicit unpause pattern in scenario 06 is now safe on the happy path. This is the cleanest of the options I suggested: idempotent semantics are what chaos cleanup actually wants, and it benefits every caller rather than only this test.
Scenario 06 keeps both the defer (line 52) and explicit (line 74) UnpauseService calls, which is correct now that unpause is idempotent — the deferred call is the safety net for a failed assertion.
The substantive premises I verified earlier still hold (the useLocalValidator override is load-bearing, and the ProcessBlock -> CheckBlockSubtrees -> ValidateWithOptions coupling is real), and nothing in this revision touches them.
🟡 Remaining non-blocking nit
- Scenario 05 still cites
settings.conf:1212by absolute line number — fine to leave, but it'll drift over time; referencing the key name would age better.
Approving. Worth running go test -tags network_chaos -run TestSubtreeValidationIsolation against a real 3-node stack to close the e2e checkbox before merge, but the code-level blocker is cleared.



Summary
Two new split-topology chaos scenarios in
test/multinode_split/, building on the harness landed in #958 and the un-skip work in #995. Both are gated on//go:build network_chaos.Scenario 05 — Validator isolation (
KillService)teranode3-validator, mines 5 blocks on node 1, asserts node 3 stays at baseline because the block-tx validation path isblockvalidation.ProcessBlock→subtreevalidation.CheckBlockSubtrees→validator.ValidateWithOptions, so with validator down node 3 cannot validate the txs inside the new blocks.baseline+5and 3-node convergence on node 1's tip.Scenario 06 — Subtreevalidation pause (
PauseService/UnpauseService)teranode3-subtreevalidationviadocker pause(SIGSTOP), so the gRPC call from blockvalidation hangs (no connection-refused, no responder), exercising the frozen / unresponsive dependency failure mode, distinct from a process that has exited.defer s.UnpauseServiceso a failed assertion still leaves the shared stack healthy for the next scenario'sReset.Test plan
go build -tags network_chaos ./test/multinode_split/...cleango test -tags network_chaos -run TestValidatorIsolationand... -run TestSubtreeValidationIsolationend-to-end against a 3-node split stack