Skip to content

test(multinode): add scenario 05 (validator isolation) and scenario 06 (subtreevalidation pause)#1069

Merged
liam merged 1 commit into
bsv-blockchain:mainfrom
liam:liam/multinode-split-scenario-05-06
Jun 11, 2026
Merged

test(multinode): add scenario 05 (validator isolation) and scenario 06 (subtreevalidation pause)#1069
liam merged 1 commit into
bsv-blockchain:mainfrom
liam:liam/multinode-split-scenario-05-06

Conversation

@liam

@liam liam commented Jun 10, 2026

Copy link
Copy Markdown
Collaborator

Summary

Two new split-topology chaos scenarios in test/multinode_split/, building on the harness landed in #958 and the un-skip work in #995. Both are gated on //go:build network_chaos.

Scenario 05 — Validator isolation (KillService)

  • Kills teranode3-validator, mines 5 blocks on node 1, asserts node 3 stays at baseline because the block-tx validation path is blockvalidation.ProcessBlocksubtreevalidation.CheckBlockSubtreesvalidator.ValidateWithOptions, so with validator down node 3 cannot validate the txs inside the new blocks.
  • Restarts validator, asserts catch-up to height baseline+5 and 3-node convergence on node 1's tip.
  • Doc-comment also records a hypothesis worth falsifying: if subtreevalidation ever switches to an in-process validator on the block path, this test fails loudly and the real coupling is documented here rather than assumed.

Scenario 06 — Subtreevalidation pause (PauseService / UnpauseService)

  • PAUSES rather than kills teranode3-subtreevalidation via docker pause (SIGSTOP), so the gRPC call from blockvalidation hangs (no connection-refused, no responder), exercising the frozen / unresponsive dependency failure mode, distinct from a process that has exited.
  • First scenario to use the pause/unpause verbs.
  • Uses defer s.UnpauseService so a failed assertion still leaves the shared stack healthy for the next scenario's Reset.

Test plan

  • go build -tags network_chaos ./test/multinode_split/... clean
  • CI / a local docker stack to run go test -tags network_chaos -run TestValidatorIsolation and ... -run TestSubtreeValidationIsolation end-to-end against a 3-node split stack


stalledInfo, err := node3.GetBlockchainInfo(ctx)
require.NoError(t, err, "node 3 RPC must still answer (only validator is down, not core)")
require.Equal(t, baselineHeight, stalledInfo.Blocks,

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Major] Test premise likely contradicts the default config — this assertion may fail on the live stack.

The author's own hypothesis (line 27) resolves against this test from a static read of the wiring:

  • subtreevalidation gets its validator via daemonStores.GetValidatorClient (daemon/daemon_services.go:766, passed into subtreevalidation.New at :821/:828).
  • GetValidatorClient returns an in-process validator (validator.New) when appSettings.Validator.UseLocalValidator is true, and only a gRPC client (validator.NewClient) otherwise (daemon/daemon_stores.go:194-255).
  • The only useLocalValidator setting in the repo is the global default 'useLocalValidator = true' (settings.conf:1212). The split stack runs under SETTINGS_CONTEXT=docker.teranode{N}.test (compose/cmd/gennodes/templates/docker-compose-split.yml.tmpl:144), and the generated settings_multinode.conf only sets validator_grpcAddress, not useLocalValidator — so it resolves to true.

With a local validator embedded in subtreevalidation, killing the standalone teranode3-validator container leaves the block-tx validation path intact, so node 3 would catch up rather than stall — exactly the falsification path the doc comment describes. The assertion require.Equal(t, baselineHeight, stalledInfo.Blocks) would then fail.

Worth confirming on the live stack (the corresponding PR checklist item is still unchecked). If the split topology is meant to force the remote validator, it likely needs useLocalValidator = false for the docker split context.

(Scenario 06 is unaffected: blockvalidation always constructs a gRPC subtreevalidation client at services/blockvalidation/Server.go:579, so pausing that service genuinely hangs the block path.)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✅ Resolved by this revision. The template now emits useLocalValidator.docker.teranode{N} = false, gated behind {{ if not $.AllInOne }} so it applies only in split mode. That context key resolves for the runtime docker.teranode{N}.test context via gocore hierarchy (same pattern the sibling *_grpcAddress.docker.teranode{N} overrides already rely on), so subtreevalidation builds a gRPC validator client and killing the standalone validator container is now observable. All-in-one keeps the in-process validator. Scenario 05 is no longer a no-op.

@github-actions

github-actions Bot commented Jun 10, 2026

Copy link
Copy Markdown
Contributor

🤖 Claude Code Review

Status: Complete

Current Review:

Two new network_chaos-gated split-topology scenarios plus the supporting config/script changes. The implementation is clean and the doc-comment claims check out against the code:

  • The validator coupling is accurate: blockvalidation.ProcessBlocksubtreevalidation.CheckBlockSubtreesvalidator.ValidateWithOptions (services/subtreevalidation/SubtreeValidation.go:425), and useLocalValidator does select in-process vs. gRPC validator at daemon/daemon_stores.go:194. The useLocalValidator.docker.teranode{N} = false override resolves for the docker.teranode{N}.test runtime context via the same hierarchy the sibling *_grpcAddress overrides rely on, so the kill is genuinely observable.
  • The multinode.sh docker unpause … 2>/dev/null || true change correctly makes UnpauseService idempotent, which is what scenario 06's defer + explicit unpause needs. Error suppression is bounded — assert_known_service validates the name first, so only the "not paused" error is swallowed.
  • Stack.Reset restarts exited and unpauses frozen containers, so scenario 05 leaving the validator killed on an early assertion failure is recovered by the next scenario's Reset (consistent with scenario 04).

No new issues found. Both prior [Major] threads are addressed by this revision (validator-isolation no-op via the template override; double-unpause via the script fix).

History:

  • ✅ Fixed: Scenario 05 useLocalValidator no-op — template now emits useLocalValidator.docker.teranode{N} = false for split mode.
  • ✅ Fixed: Scenario 06 double-unpause t.Fatalfmultinode.sh now suppresses the not-paused error in the service branch.

Note: the PR's end-to-end checklist item (running both tests against a live 3-node split stack) is still unchecked — the static read supports the premise, but a live run is the final confirmation.

@github-actions

github-actions Bot commented Jun 10, 2026

Copy link
Copy Markdown
Contributor

Benchmark Comparison Report

Baseline: main (unknown)

Current: PR-1069 (b6bebb0)

Summary

  • Regressions: 0
  • Improvements: 0
  • Unchanged: 132
  • Significance level: p < 0.05
All benchmark results (sec/op)
Benchmark Baseline Current Change p-value
_NewBlockFromBytes-4 1.597µ 1.599µ ~ 1.000
SplitSyncedParentMap_SetIfNotExists/256_buckets-4 71.10n 71.22n ~ 1.000
SplitSyncedParentMap_SetIfNotExists/16_buckets-4 71.13n 71.72n ~ 0.100
SplitSyncedParentMap_SetIfNotExists/1_bucket-4 71.10n 71.51n ~ 0.400
SplitSyncedParentMap_ConcurrentSetIfNotExists/256_buckets... 32.18n 32.52n ~ 0.100
SplitSyncedParentMap_ConcurrentSetIfNotExists/16_buckets_... 55.37n 55.07n ~ 0.400
SplitSyncedParentMap_ConcurrentSetIfNotExists/1_bucket_pa... 126.9n 127.4n ~ 0.400
MiningCandidate_Stringify_Short-4 221.6n 219.2n ~ 0.100
MiningCandidate_Stringify_Long-4 1.620µ 1.636µ ~ 0.100
MiningSolution_Stringify-4 854.5n 861.0n ~ 0.100
BlockInfo_MarshalJSON-4 1.715µ 1.718µ ~ 0.800
NewFromBytes-4 123.7n 123.9n ~ 1.000
AddTxBatchColumnar_Validation-4 2.366µ 2.355µ ~ 1.000
OffsetValidationLoop-4 718.4n 718.5n ~ 0.700
Mine_EasyDifficulty-4 62.03µ 61.82µ ~ 1.000
Mine_WithAddress-4 7.400µ 7.452µ ~ 1.000
DiskTxMap_SetIfNotExists-4 3.415µ 3.449µ ~ 0.700
DiskTxMap_SetIfNotExists_Parallel-4 3.263µ 3.248µ ~ 0.400
DiskTxMap_ExistenceOnly-4 322.2n 306.7n ~ 0.200
Queue-4 188.6n 189.3n ~ 1.000
AtomicPointer-4 4.715n 4.714n ~ 1.000
TxMapSetIfNotExists-4 52.75n 53.03n ~ 0.400
TxMapSetIfNotExistsDuplicate-4 40.53n 40.03n ~ 0.400
ChannelSendReceive-4 566.8n 564.8n ~ 0.100
BlockAssembler_AddTx-4 0.02430n 0.02356n ~ 1.000
AddNode-4 10.21 11.02 ~ 0.200
AddNodeWithMap-4 10.70 10.69 ~ 1.000
DirectSubtreeAdd/4_per_subtree-4 47.40n 47.88n ~ 0.400
DirectSubtreeAdd/64_per_subtree-4 23.16n 23.17n ~ 1.000
DirectSubtreeAdd/256_per_subtree-4 21.88n 21.89n ~ 1.000
DirectSubtreeAdd/1024_per_subtree-4 20.85n 20.83n ~ 0.900
DirectSubtreeAdd/2048_per_subtree-4 20.51n 20.47n ~ 0.500
SubtreeProcessorAdd/4_per_subtree-4 290.7n 258.7n ~ 0.100
SubtreeProcessorAdd/64_per_subtree-4 277.3n 260.9n ~ 0.400
SubtreeProcessorAdd/256_per_subtree-4 264.9n 241.5n ~ 0.700
SubtreeProcessorAdd/1024_per_subtree-4 280.8n 235.0n ~ 0.100
SubtreeProcessorAdd/2048_per_subtree-4 267.5n 233.5n ~ 0.100
SubtreeProcessorRotate/4_per_subtree-4 283.4n 249.4n ~ 0.100
SubtreeProcessorRotate/64_per_subtree-4 269.1n 230.5n ~ 0.100
SubtreeProcessorRotate/256_per_subtree-4 268.9n 242.7n ~ 0.100
SubtreeProcessorRotate/1024_per_subtree-4 270.4n 238.9n ~ 0.100
SubtreeNodeAddOnly/4_per_subtree-4 43.92n 43.46n ~ 0.200
SubtreeNodeAddOnly/64_per_subtree-4 28.42n 28.41n ~ 1.000
SubtreeNodeAddOnly/256_per_subtree-4 27.42n 27.55n ~ 0.100
SubtreeNodeAddOnly/1024_per_subtree-4 26.97n 26.94n ~ 1.000
SubtreeCreationOnly/4_per_subtree-4 93.59n 101.70n ~ 0.100
SubtreeCreationOnly/64_per_subtree-4 332.5n 361.7n ~ 0.100
SubtreeCreationOnly/256_per_subtree-4 1.143µ 1.090µ ~ 0.100
SubtreeCreationOnly/1024_per_subtree-4 3.605µ 4.084µ ~ 0.100
SubtreeCreationOnly/2048_per_subtree-4 6.602µ 6.928µ ~ 0.100
SubtreeProcessorOverheadBreakdown/64_per_subtree-4 273.8n 233.6n ~ 0.100
SubtreeProcessorOverheadBreakdown/1024_per_subtree-4 258.0n 237.0n ~ 0.100
ParallelGetAndSetIfNotExists/1k_nodes-4 7.548m 8.170m ~ 0.400
ParallelGetAndSetIfNotExists/10k_nodes-4 11.60m 11.79m ~ 0.200
ParallelGetAndSetIfNotExists/50k_nodes-4 14.03m 14.09m ~ 0.700
ParallelGetAndSetIfNotExists/100k_nodes-4 16.11m 17.39m ~ 0.100
SequentialGetAndSetIfNotExists/1k_nodes-4 7.452m 8.996m ~ 0.100
SequentialGetAndSetIfNotExists/10k_nodes-4 12.21m 14.31m ~ 0.100
SequentialGetAndSetIfNotExists/50k_nodes-4 17.73m 21.45m ~ 0.100
SequentialGetAndSetIfNotExists/100k_nodes-4 24.35m 27.28m ~ 0.100
ProcessOwnBlockSubtreeNodesParallel/1k_nodes-4 8.348m 8.348m ~ 0.700
ProcessOwnBlockSubtreeNodesParallel/10k_nodes-4 14.11m 13.21m ~ 0.100
ProcessOwnBlockSubtreeNodesParallel/100k_nodes-4 15.51m 15.63m ~ 1.000
ProcessOwnBlockSubtreeNodesSequential/1k_nodes-4 9.008m 9.619m ~ 0.700
ProcessOwnBlockSubtreeNodesSequential/10k_nodes-4 13.76m 14.55m ~ 0.700
ProcessOwnBlockSubtreeNodesSequential/100k_nodes-4 51.49m 50.95m ~ 0.700
CalcBlockWork-4 510.4n 511.8n ~ 0.700
CalculateWork-4 696.1n 706.6n ~ 0.100
CheckOldBlockIDs/on-chain-prefetch/1000-4 62.55µ 72.33µ ~ 0.700
CheckOldBlockIDs/off-chain-prefetch/1000-4 50.40µ 50.60µ ~ 0.700
CheckOldBlockIDs/on-chain-prefetch/10000-4 450.4µ 459.3µ ~ 0.100
CheckOldBlockIDs/off-chain-prefetch/10000-4 343.4µ 346.7µ ~ 0.700
BuildBlockLocatorString_Helpers/Size_10-4 1.334µ 1.370µ ~ 0.100
BuildBlockLocatorString_Helpers/Size_100-4 12.78µ 13.08µ ~ 0.400
BuildBlockLocatorString_Helpers/Size_1000-4 126.2µ 130.2µ ~ 0.100
CatchupWithHeaderCache-4 103.9m 104.1m ~ 0.100
_prepareTxsPerLevel-4 406.6m 417.3m ~ 0.400
_prepareTxsPerLevelOrdered-4 3.806m 3.830m ~ 1.000
_prepareTxsPerLevel_Comparison/Original-4 419.6m 420.3m ~ 0.700
_prepareTxsPerLevel_Comparison/Optimized-4 3.774m 3.894m ~ 0.200
SubtreeSizes/10k_tx_4_per_subtree-4 1.391m 1.401m ~ 0.700
SubtreeSizes/10k_tx_16_per_subtree-4 327.9µ 325.5µ ~ 0.400
SubtreeSizes/10k_tx_64_per_subtree-4 78.03µ 78.23µ ~ 0.700
SubtreeSizes/10k_tx_256_per_subtree-4 19.42µ 19.60µ ~ 0.700
SubtreeSizes/10k_tx_512_per_subtree-4 9.601µ 9.603µ ~ 1.000
SubtreeSizes/10k_tx_1024_per_subtree-4 4.729µ 4.753µ ~ 1.000
SubtreeSizes/10k_tx_2k_per_subtree-4 2.376µ 2.365µ ~ 0.800
BlockSizeScaling/10k_tx_64_per_subtree-4 75.28µ 75.93µ ~ 0.700
BlockSizeScaling/10k_tx_256_per_subtree-4 19.25µ 19.06µ ~ 0.400
BlockSizeScaling/10k_tx_1024_per_subtree-4 4.777µ 4.718µ ~ 0.700
BlockSizeScaling/50k_tx_64_per_subtree-4 396.7µ 400.3µ ~ 0.700
BlockSizeScaling/50k_tx_256_per_subtree-4 95.39µ 94.67µ ~ 0.200
BlockSizeScaling/50k_tx_1024_per_subtree-4 23.49µ 23.24µ ~ 0.700
SubtreeAllocations/small_subtrees_exists_check-4 164.1µ 165.6µ ~ 1.000
SubtreeAllocations/small_subtrees_data_fetch-4 162.5µ 163.9µ ~ 0.400
SubtreeAllocations/small_subtrees_full_validation-4 327.9µ 326.9µ ~ 0.200
SubtreeAllocations/medium_subtrees_exists_check-4 9.711µ 9.558µ ~ 0.100
SubtreeAllocations/medium_subtrees_data_fetch-4 9.542µ 9.498µ ~ 0.700
SubtreeAllocations/medium_subtrees_full_validation-4 19.07µ 19.01µ ~ 0.700
SubtreeAllocations/large_subtrees_exists_check-4 2.326µ 2.312µ ~ 0.700
SubtreeAllocations/large_subtrees_data_fetch-4 2.331µ 2.325µ ~ 1.000
SubtreeAllocations/large_subtrees_full_validation-4 4.828µ 4.795µ ~ 0.700
_BufferPoolAllocation/16KB-4 3.547µ 2.851µ ~ 0.200
_BufferPoolAllocation/32KB-4 7.204µ 7.744µ ~ 0.400
_BufferPoolAllocation/64KB-4 12.71µ 12.21µ ~ 0.200
_BufferPoolAllocation/128KB-4 24.00µ 26.06µ ~ 0.100
_BufferPoolAllocation/512KB-4 93.11µ 102.55µ ~ 0.400
_BufferPoolConcurrent/32KB-4 14.56µ 17.03µ ~ 0.100
_BufferPoolConcurrent/64KB-4 22.64µ 25.37µ ~ 0.200
_BufferPoolConcurrent/512KB-4 111.3µ 111.2µ ~ 1.000
_SubtreeDeserializationWithBufferSizes/16KB-4 500.7µ 483.4µ ~ 0.100
_SubtreeDeserializationWithBufferSizes/32KB-4 502.2µ 479.7µ ~ 0.100
_SubtreeDeserializationWithBufferSizes/64KB-4 499.2µ 482.2µ ~ 0.100
_SubtreeDeserializationWithBufferSizes/128KB-4 496.2µ 486.8µ ~ 0.100
_SubtreeDeserializationWithBufferSizes/512KB-4 480.3µ 473.3µ ~ 0.700
_SubtreeDataDeserializationWithBufferSizes/16KB-4 28.24m 27.91m ~ 0.100
_SubtreeDataDeserializationWithBufferSizes/32KB-4 27.93m 27.93m ~ 1.000
_SubtreeDataDeserializationWithBufferSizes/64KB-4 28.49m 27.92m ~ 0.100
_SubtreeDataDeserializationWithBufferSizes/128KB-4 28.46m 27.84m ~ 0.100
_SubtreeDataDeserializationWithBufferSizes/512KB-4 29.65m 27.63m ~ 0.100
_PooledVsNonPooled/Pooled-4 651.5n 642.9n ~ 0.100
_PooledVsNonPooled/NonPooled-4 5.709µ 5.939µ ~ 0.700
_MemoryFootprint/Current_512KB_32concurrent-4 5.523µ 5.005µ ~ 0.100
_MemoryFootprint/Proposed_32KB_32concurrent-4 8.544µ 7.134µ ~ 0.100
_MemoryFootprint/Alternative_64KB_32concurrent-4 8.342µ 6.938µ ~ 0.100
StoreBlock_Sequential/BelowCSVHeight-4 338.8µ 342.5µ ~ 1.000
StoreBlock_Sequential/AboveCSVHeight-4 337.0µ 344.5µ ~ 0.100
GetUtxoHashes-4 282.8n 290.0n ~ 0.700
GetUtxoHashes_ManyOutputs-4 45.90µ 50.24µ ~ 0.700
_NewMetaDataFromBytes-4 214.2n 213.8n ~ 1.000
_Bytes-4 391.8n 396.4n ~ 0.700
_MetaBytes-4 139.6n 139.3n ~ 0.700

Threshold: >10% with p < 0.05 | Generated: 2026-06-11 08:54 UTC

liam added a commit to liam/teranode that referenced this pull request Jun 11, 2026
…6 (subtreevalidation pause)

Two new split-topology chaos scenarios building on the harness landed in bsv-blockchain#958
and the un-skip work in bsv-blockchain#995, plus the split-mode settings fix that scenario 05
turns out to depend on.

Scenario 05: kills teranode3-validator, mines 5 blocks on node 1, asserts
node 3 stalls at baseline (block-tx validation walks blockvalidation ->
subtreevalidation -> validator), restarts validator, asserts catch-up and
3-node convergence.

This only holds when subtreevalidation calls the standalone validator
container over gRPC, NOT when it embeds an in-process validator.
settings.conf:1212 ships useLocalValidator=true (the right default for
all-in-one), so a vanilla docker.teranode{N}.test context would build the
in-process validator and ignore the validator container entirely - making
scenario 05 a no-op (raised in PR review on bsv-blockchain#1069). The split-mode overlay
generated by compose/cmd/gennodes/templates/settings.conf.tmpl now flips
useLocalValidator=false per node so the kill is actually observable. The
override is gated on {{if not $.AllInOne}}, so all-in-one mode is unchanged.

Scenario 06: PAUSES teranode3-subtreevalidation via docker pause (SIGSTOP)
rather than killing, so the gRPC call from blockvalidation hangs (the frozen
dependency failure mode, distinct from a process that has exited). First
scenario to exercise the pause/unpause verbs. Uses defer UnpauseService so a
failed assertion leaves the shared stack healthy for the next scenario's Reset.

Both are gated on //go:build network_chaos like the rest of the split-topology
suite.
@liam liam force-pushed the liam/multinode-split-scenario-05-06 branch from 2063eef to fcd6c00 Compare June 11, 2026 06:54
@liam

liam commented Jun 11, 2026

Copy link
Copy Markdown
Collaborator Author

Good catch, confirmed and fixed. Walked through the code:

  • settings.conf:1212 ships useLocalValidator = true.
  • daemon/daemon_stores.go:194-243 shows that path builds the in-process validator wired up to the local utxostore + kafka producers + blockassembly client.
  • compose/cmd/gennodes/templates/settings.conf.tmpl (split-mode overlay) only overrode the gRPC addresses, not this flag.

Net effect was exactly what you described: the standalone teranodeN-validator container would be dead code in split mode, and scenario 05's stall assertion would never fire.

The fix in the amended commit (fcd6c00a3) adds a useLocalValidator.docker.teranode{N} = false per-node override inside the existing {{if not $.AllInOne}} block of the gennodes template. Verified by regenerating with both -allinone=0 (override emitted for each node) and -allinone=1 (override correctly absent, all-in-one mode unchanged). Updated scenario 05's doc-comment to reference this dependency and warn that dropping the override in future will make the test fail loudly and point right at it.

Scenario 06 left untouched per your read - Server.go:579 confirms blockvalidation always uses the gRPC subtreevalidation client, so PauseService genuinely hangs the block path there.

s.PauseService(t, 3, "subtreevalidation")
// Ensure the service is thawed even if an assertion below fails, so the
// shared stack is left healthy for the next scenario's Reset.
defer s.UnpauseService(t, 3, "subtreevalidation")

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Major] Double-unpause fails the test on the happy path.

The deferred UnpauseService (line 52) and the explicit one (line 74) both fire when all assertions pass: line 74 unpauses teranode3-subtreevalidation successfully, then on return the deferred call runs again.

The service-specific branch of chaos_unpause runs docker unpause WITHOUT error suppression (compose/multinode.sh:657), unlike the whole-node branch which swallows errors (line 665). docker unpause on an already-running container exits non-zero ("Error response from daemon: Container ... is not paused"), so MustRun (harness.go:243) calls t.Fatalf — failing a test whose body actually succeeded.

The defer for crash-safety is the right instinct; the problem is it is not idempotent with the explicit unpause. Two fixes:

  1. Make the defer a no-op once thawed — guard it with a "paused" bool that the explicit unpause flips to false.
  2. Or make UnpauseService idempotent at the script layer: suppress the not-paused error in the service branch (multinode.sh:657) the same way the whole-node branch already does. This also benefits any future scenario using this defer idiom.

(This is the first scenario to use the pause/unpause verbs, so the non-idempotent service path has not been exercised before.)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✅ Addressed in this revision via fix option #2: compose/multinode.sh now runs docker unpause ... 2>/dev/null || true in the service-specific branch, making UnpauseService idempotent. The deferred + explicit unpause on the happy path no longer fails the test.

@liam liam requested a review from ordishs June 11, 2026 08:00
liam added a commit to liam/teranode that referenced this pull request Jun 11, 2026
…eeze), 09 (asset isolation control)

Three new split-topology chaos scenarios that broaden coverage beyond the
validation pipeline (04/05/06) to the relay control plane and the off-path
services, plus an idempotency fix to chaos_unpause that scenario 08 turns
out to depend on (and scenario 06 in bsv-blockchain#1069 transparently benefits from).

Scenario 07 (KillService, expected STALL): kills teranode3-p2p and asserts
node 3 stays at baseline because the inter-node block-relay control plane
in the test stack is routed exclusively through teranodeN-p2p. Legacy lives
in core but is configured for SV bridging, not inter-teranode relay, so
killing native p2p genuinely severs node 3 from the mesh. Restart, catch-up,
3-node convergence asserted as usual.

Scenario 08 (PauseService, asymmetric expected NO STALL): the suite's first
asymmetric scenario. Pauses teranode2-propagation and asserts node 2 still
catches up via p2p + blockvalidation, because propagation is the tx-ingress
path (RPC sendrawtransaction -> propagation -> validator -> blockassembly),
not the block-relay path. Inbound peer blocks reach node 2 via p2p ->
blockvalidation -> subtreevalidation; they never touch propagation. Uses
PauseService (SIGSTOP) rather than KillService so any accidental gRPC call
from a block-path site would hang the caller (frozen-dependency failure
mode), making the coupling visible. Has a defer UnpauseService so a failed
assertion still leaves the shared stack healthy for the next scenario.

Scenario 09 (KillService on asset, control / null hypothesis): kills
teranode3-asset and asserts node 3 still catches up. Asset is the HTTP
server that exposes a node's subtree/block data outbound to peers and
external clients; it is not on the receive-side consensus path. Without
a control like this, the suite cannot distinguish "validator/p2p/etc kills
stalled node 3 because they're on the consensus path" from "any KillService
stalls node 3 because the harness or compose graph is broken." This
scenario rules out the latter. If it surprisingly fails - node 3 does
stall after asset is killed - that documents a real coupling
(subtreevalidation reaching localhost asset rather than the originating
peer's asset) worth chasing.

Bash fix: chaos_unpause is now idempotent across all three branches. The
single-service and bulk-aio branches were previously running a bare
'docker unpause', which exits non-zero on an already-running container.
Scenarios 06 and 08 both run a defensive defer-unpause alongside an
explicit one on the happy path (defer is the safety net for assertion
failure; explicit is needed for the catch-up assertion in scenario 06),
and the bare docker call was turning the second one into t.Fatalf, failing
the test even when all assertions passed. The bulk-split branch already
had the '|| true' pattern; this extends it for consistency. Raised in PR
review on bsv-blockchain#1070. Idempotent semantics are what 'chaos cleanup' actually
wants anyway.

All three scenarios are gated on //go:build network_chaos like the rest of
the suite.

@ordishs ordishs left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Request changes — one blocking bug in scenario 06

Nice scenarios, and the useLocalValidator override is genuinely load-bearing (verified: daemon_stores.go:194-249 builds an in-process validator when useLocalValidator=true, and subtreevalidation shares startValidationService, so without the flip scenario 05 would be a silent no-op). The call chain BlockValidation.go:2027 -> CheckBlockSubtrees -> ValidateWithOptions checks out too.

🔴 Blocking: scenario 06 fails on the happy path (non-idempotent unpause)

UnpauseService is called twice on a passing run — explicitly (line 74) and via defer (line 52) — but docker unpause is not idempotent, and the per-service path does not guard it:

# compose/multinode.sh:653-659  (per-service path)
    docker unpause "$ctr"        # <- no `|| true`

The whole-node path right below it (line 665) does tolerate this: docker unpause "$ctr" 2>/dev/null || true.

Sequence on a green run: line 74 thaws the container → test body returns → the deferred call runs docker unpause on an already-running container → Docker exits non-zero (Container ... is not paused) → MustRun calls t.Fatalf (harness.go:243). The test fails even though every assertion passed.

This path has never been exercised: scenario 06 is the first pause/unpause user, the e2e checkbox is unchecked, and StartService (scenarios 04/05) is idempotent via compose up -d, so there was no prior coverage.

Recommended fix (cleanest, benefits all callers, matches StartService's documented idempotency): make the per-service unpause tolerant, mirroring the whole-node path:

    docker unpause "$ctr" 2>/dev/null || true

Then the defer-plus-explicit safety idiom works as intended.

🟡 Minor (non-blocking)

  • Brittle line reference: scenario 05's comment cites settings.conf:1212 by absolute line number — accurate now, but it will silently drift. Reference the key name (useLocalValidator) instead.
  • The fixed 15s settle sleeps carry inherent flakiness, but they match the deliberate, empirically-tuned choice in scenario_04 — fine as-is.

Please also close the e2e checkbox (go test -tags network_chaos -run TestSubtreeValidationIsolation) before merge — this bug surfaces on the first real run.

…6 (subtreevalidation pause)

Two new split-topology chaos scenarios building on the harness landed in bsv-blockchain#958
and the un-skip work in bsv-blockchain#995, plus the split-mode useLocalValidator override
that scenario 05 depends on and a chaos_unpause idempotency fix that
scenario 06 depends on.

Scenario 05: kills teranode3-validator, mines 5 blocks on node 1, asserts
node 3 stalls at baseline (block-tx validation walks blockvalidation ->
subtreevalidation -> validator), restarts validator, asserts catch-up and
3-node convergence.

This only holds when subtreevalidation calls the standalone validator
container over gRPC, NOT when it embeds an in-process validator.
settings.conf ships useLocalValidator=true (the right default for all-in-one),
so a vanilla docker.teranode{N}.test context would build the in-process
validator and ignore the validator container entirely - making scenario 05
a silent no-op (raised in PR review on bsv-blockchain#1069). The split-mode overlay
generated by compose/cmd/gennodes/templates/settings.conf.tmpl now flips
useLocalValidator=false per node so the kill is actually observable. The
override is gated on {{if not $.AllInOne}}, so all-in-one mode is unchanged.

Scenario 06: PAUSES teranode3-subtreevalidation via docker pause (SIGSTOP)
rather than killing, so the gRPC call from blockvalidation hangs (the frozen
dependency failure mode, distinct from a process that has exited). First
scenario to exercise the pause/unpause verbs. Uses defer UnpauseService so a
failed assertion leaves the shared stack healthy for the next scenario's Reset.

Bash fix: chaos_unpause is now idempotent across all three branches. The
single-service and bulk-aio branches were previously running a bare
'docker unpause', which exits non-zero on an already-running container.
Scenario 06 runs a defensive defer-unpause alongside an explicit one (defer
is the safety net for assertion failure; explicit unblocks the catch-up
assertion), and the bare docker call turned the second one into t.Fatalf,
failing the test on a green run. The bulk-split branch already had the
'|| true' pattern; this extends it for consistency. Same fix surfaced
independently in PR review on bsv-blockchain#1070 against scenario 08. Idempotent
semantics are what 'chaos cleanup' actually wants anyway.

Both scenarios are gated on //go:build network_chaos like the rest of the
split-topology suite.
@liam liam force-pushed the liam/multinode-split-scenario-05-06 branch from fcd6c00 to de64bf5 Compare June 11, 2026 08:39
@liam

liam commented Jun 11, 2026

Copy link
Copy Markdown
Collaborator Author

Thanks, all three addressed in de64bf570:

🔴 chaos_unpause idempotency — fixed in compose/multinode.sh, extending the existing bulk-split 2>/dev/null || true pattern to the single-service and bulk-aio branches. Reproduced the docker behaviour first (docker unpause <already-running-container> exits 1 with Container ... is not paused), so MustRun would indeed turn the defer into t.Fatalf on a green run. Same fix was raised independently in PR review on #1070 against scenario 08; the diff there matches this one bit-for-bit, so whichever PR lands first the other rebases cleanly to drop the duplicate.

🟡 Brittle line reference — scenario 05's comment now reads "settings.conf ships useLocalValidator=true" with no line number. The split-mode override path (compose/cmd/gennodes/templates/settings.conf.tmpl) is still named since it's stable. Good catch, line numbers in docstrings are a slow-motion regression.

🟡 15s settle sleep — left as-is, matching scenario 04's deliberate choice.

E2E checkbox — honest answer: I have not run the suite end-to-end yet. Local blockers are (a) Aerospike testcontainer rejects the host's default ulimit -n 1024 (it wants 15000), and (b) Isolate/Heal need passwordless sudo. Both surmountable but I don't want to claim the box is ticked without actually having run go test -tags network_chaos -run TestSubtreeValidationIsolation against a real 3-node split stack. Happy to either set up the local infra and report back here, or leave the box for CI to tick once the network-chaos workflow runs against this branch. Let me know which you'd prefer.

@liam liam requested a review from sugh01 June 11, 2026 08:39
@sonarqubecloud

Copy link
Copy Markdown

@ordishs ordishs left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re-review — blocking issue resolved ✅

The non-idempotent unpause is fixed in de64bf57. chaos_unpause's per-service path now reads:

docker unpause "$ctr" 2>/dev/null || true

and the same guard was applied to the all-in-one path for consistency (the bulk-split path already had it), with a comment explaining exactly why — the defensive defer-plus-explicit unpause pattern in scenario 06 is now safe on the happy path. This is the cleanest of the options I suggested: idempotent semantics are what chaos cleanup actually wants, and it benefits every caller rather than only this test.

Scenario 06 keeps both the defer (line 52) and explicit (line 74) UnpauseService calls, which is correct now that unpause is idempotent — the deferred call is the safety net for a failed assertion.

The substantive premises I verified earlier still hold (the useLocalValidator override is load-bearing, and the ProcessBlock -> CheckBlockSubtrees -> ValidateWithOptions coupling is real), and nothing in this revision touches them.

🟡 Remaining non-blocking nit

  • Scenario 05 still cites settings.conf:1212 by absolute line number — fine to leave, but it'll drift over time; referencing the key name would age better.

Approving. Worth running go test -tags network_chaos -run TestSubtreeValidationIsolation against a real 3-node stack to close the e2e checkbox before merge, but the code-level blocker is cleared.

@liam liam merged commit 71e00be into bsv-blockchain:main Jun 11, 2026
34 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants