Skip to content

test(multinode): add split-topology scenarios 07 (p2p), 08 (propagation freeze), 09 (asset control)#1070

Merged
liam merged 1 commit into
bsv-blockchain:mainfrom
liam:liam/multinode-split-scenarios-07-09
Jun 11, 2026
Merged

test(multinode): add split-topology scenarios 07 (p2p), 08 (propagation freeze), 09 (asset control)#1070
liam merged 1 commit into
bsv-blockchain:mainfrom
liam:liam/multinode-split-scenarios-07-09

Conversation

@liam

@liam liam commented Jun 10, 2026

Copy link
Copy Markdown
Collaborator

Summary

Three new split-topology chaos scenarios that broaden coverage beyond the validation pipeline tests (04/05/06) to the relay control plane and the off-path services. All gated on //go:build network_chaos.

Scenario 07 — P2P isolation (KillService, expected stall)

  • Kills teranode3-p2p. The multinode docker stack relays blocks between teranodes only via the native p2p service, so killing teranodeN-p2p genuinely severs node N from the mesh. (Legacy lives in core but is configured for SV bridging, not inter-teranode relay.)
  • Mine 5 on node 1, assert node 3 stalls at baseline, restart, assert catch-up and 3-node convergence.

Scenario 08 — Propagation freeze (PauseService, asymmetric expected NO stall)

  • The suite's first asymmetric scenario. Pauses teranode2-propagation and asserts node 2 still catches up via p2p + blockvalidation, because propagation is the tx-ingress path (RPC sendrawtransaction -> propagation -> validator -> blockassembly), not the block-relay path.
  • Uses PauseService (SIGSTOP) rather than KillService so any accidental gRPC call from a block-path site would hang the caller (frozen-dependency failure mode), making a hidden coupling visible rather than letting it surface as connection-refused.
  • defer s.UnpauseService so a failed assertion still leaves the shared stack healthy for the next scenario.

Scenario 09 — Asset isolation (KillService, control / null hypothesis)

  • Kills teranode3-asset and asserts node 3 still catches up. Asset is the outbound HTTP server that serves a node's subtree/block data to peers and external clients; it is not on the receive-side consensus path.
  • Without a control like this, the suite cannot distinguish "validator/p2p/etc kills stalled node 3 because they're on the consensus path" from "any KillService stalls node 3 because the harness or compose graph is broken." This scenario rules out the latter.
  • If it surprisingly fails (node 3 does stall after asset is killed), that documents a real coupling worth chasing, namely subtreevalidation reaching localhost asset rather than the originating peer's asset.

Test plan

  • go build -tags network_chaos ./test/multinode_split/... clean
  • CI / a local 3-node split stack to run go test -tags network_chaos -run 'TestP2PIsolation|TestPropagationFreeze|TestAssetIsolation' end-to-end

@github-actions

github-actions Bot commented Jun 10, 2026

Copy link
Copy Markdown
Contributor

🤖 Claude Code Review

Status: Complete

Three new network_chaos-gated split-topology scenarios (07 p2p kill, 08 propagation freeze, 09 asset control). The scenarios are well-constructed: they mirror scenario 04's shape, use real harness APIs (KillService / PauseService / UnpauseService / StartService all exist with matching signatures), and the targeted service names (p2p, propagation, asset) are all in KNOWN_SPLIT_SERVICES. The asymmetric (08) and control/null-hypothesis (09) framing is a genuinely useful addition, and the doc comments are accurate against the harness and multinode.sh.

Current Review:

  • No outstanding issues. The previously-flagged double-unpause failure is now fixed: chaos_unpause in multinode.sh runs docker unpause … 2>/dev/null || true on the single-service branch (matching the bulk-split branch), so scenario 08's explicit + defer unpause is idempotent and no longer trips MustRunt.Fatalf. The accompanying comment documents the rationale clearly.

History:

  • ✅ Fixed: [Major] Scenario 08 double-unpause failing the test on its own happy path — resolved by making chaos_unpause idempotent.

Note: the PR's own test plan still has the end-to-end run against a live split stack unchecked. The fix is sound by inspection, but a real 3-node run is the final confirmation.

t.Logf("teranode2 reached height=%d with propagation still paused", caughtUp.Blocks)

t.Log("unpausing teranode2-propagation...")
s.UnpauseService(t, 2, "propagation")

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Major] Double-unpause fails the test on its own happy path.

This explicit s.UnpauseService(t, 2, "propagation") plus the defer s.UnpauseService(t, 2, "propagation") at line 60 means unpause runs twice on success. After this call the container is already running, so the deferred call shells out to docker unpause on a non-paused container.

The single-service path in compose/multinode.sh (chaos_unpause, line 657) runs docker unpause "$ctr" without the 2>/dev/null || true guard that the bulk path (line 665) has. docker unpause on a running container exits non-zero (Error response from daemon: ... is not paused), which MustRun turns into t.Fatalf. So the deferred cleanup fails the test even when every assertion has passed.

Fix options: drop this explicit unpause and let the defer be the single source of truth (the convergence check only compares block heights, which does not need propagation running); or make chaos_unpause's single-service branch idempotent like the bulk path: docker unpause "$ctr" 2>/dev/null || true. The first keeps the change inside this PR.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Resolved by the multinode.sh change in this push: chaos_unpause now runs docker unpause ... 2>/dev/null || true on the single-service branch (matching the bulk-split branch), so the explicit + defer double-unpause on the happy path is idempotent and no longer trips MustRun -> t.Fatalf. Marking resolved.

@github-actions

github-actions Bot commented Jun 10, 2026

Copy link
Copy Markdown
Contributor

Benchmark Comparison Report

Baseline: main (unknown)

Current: PR-1070 (0f32fec)

Summary

  • Regressions: 0
  • Improvements: 0
  • Unchanged: 132
  • Significance level: p < 0.05
All benchmark results (sec/op)
Benchmark Baseline Current Change p-value
_NewBlockFromBytes-4 1.802µ 1.611µ ~ 0.100
SplitSyncedParentMap_SetIfNotExists/256_buckets-4 71.44n 71.39n ~ 1.000
SplitSyncedParentMap_SetIfNotExists/16_buckets-4 71.30n 71.72n ~ 0.200
SplitSyncedParentMap_SetIfNotExists/1_bucket-4 71.44n 71.16n ~ 0.700
SplitSyncedParentMap_ConcurrentSetIfNotExists/256_buckets... 33.02n 32.65n ~ 0.200
SplitSyncedParentMap_ConcurrentSetIfNotExists/16_buckets_... 55.22n 55.51n ~ 1.000
SplitSyncedParentMap_ConcurrentSetIfNotExists/1_bucket_pa... 129.0n 129.4n ~ 0.600
MiningCandidate_Stringify_Short-4 222.9n 218.8n ~ 0.100
MiningCandidate_Stringify_Long-4 1.640µ 1.621µ ~ 0.400
MiningSolution_Stringify-4 856.7n 846.8n ~ 0.100
BlockInfo_MarshalJSON-4 1.722µ 1.715µ ~ 0.400
NewFromBytes-4 131.1n 132.2n ~ 0.800
AddTxBatchColumnar_Validation-4 2.691µ 2.542µ ~ 0.400
OffsetValidationLoop-4 595.2n 596.5n ~ 0.700
Mine_EasyDifficulty-4 67.22µ 67.09µ ~ 0.700
Mine_WithAddress-4 7.078µ 7.050µ ~ 0.400
BlockAssembler_AddTx-4 0.02328n 0.02189n ~ 0.700
AddNode-4 10.97 11.01 ~ 1.000
AddNodeWithMap-4 11.83 11.25 ~ 0.100
DirectSubtreeAdd/4_per_subtree-4 58.94n 60.08n ~ 1.000
DirectSubtreeAdd/64_per_subtree-4 32.26n 32.60n ~ 0.400
DirectSubtreeAdd/256_per_subtree-4 30.62n 30.62n ~ 1.000
DirectSubtreeAdd/1024_per_subtree-4 29.34n 29.35n ~ 1.000
DirectSubtreeAdd/2048_per_subtree-4 29.04n 28.92n ~ 0.300
SubtreeProcessorAdd/4_per_subtree-4 297.2n 328.6n ~ 0.100
SubtreeProcessorAdd/64_per_subtree-4 318.7n 281.5n ~ 0.100
SubtreeProcessorAdd/256_per_subtree-4 331.6n 299.9n ~ 0.700
SubtreeProcessorAdd/1024_per_subtree-4 293.1n 335.2n ~ 0.400
SubtreeProcessorAdd/2048_per_subtree-4 319.4n 273.9n ~ 0.200
SubtreeProcessorRotate/4_per_subtree-4 295.7n 293.7n ~ 1.000
SubtreeProcessorRotate/64_per_subtree-4 273.5n 292.4n ~ 0.100
SubtreeProcessorRotate/256_per_subtree-4 274.0n 313.8n ~ 0.200
SubtreeProcessorRotate/1024_per_subtree-4 290.3n 296.9n ~ 0.700
SubtreeNodeAddOnly/4_per_subtree-4 53.91n 54.10n ~ 0.400
SubtreeNodeAddOnly/64_per_subtree-4 34.26n 34.29n ~ 1.000
SubtreeNodeAddOnly/256_per_subtree-4 33.29n 33.17n ~ 0.100
SubtreeNodeAddOnly/1024_per_subtree-4 32.60n 32.56n ~ 0.700
SubtreeCreationOnly/4_per_subtree-4 101.9n 122.2n ~ 0.100
SubtreeCreationOnly/64_per_subtree-4 494.5n 542.1n ~ 0.700
SubtreeCreationOnly/256_per_subtree-4 1.443µ 1.411µ ~ 0.100
SubtreeCreationOnly/1024_per_subtree-4 4.427µ 4.516µ ~ 0.100
SubtreeCreationOnly/2048_per_subtree-4 8.393µ 9.841µ ~ 0.100
SubtreeProcessorOverheadBreakdown/64_per_subtree-4 359.2n 288.6n ~ 0.700
SubtreeProcessorOverheadBreakdown/1024_per_subtree-4 309.3n 535.0n ~ 0.100
ParallelGetAndSetIfNotExists/1k_nodes-4 11.33m 14.21m ~ 0.100
ParallelGetAndSetIfNotExists/10k_nodes-4 14.74m 17.81m ~ 0.100
ParallelGetAndSetIfNotExists/50k_nodes-4 17.68m 23.35m ~ 0.100
ParallelGetAndSetIfNotExists/100k_nodes-4 18.35m 18.45m ~ 0.400
SequentialGetAndSetIfNotExists/1k_nodes-4 11.55m 12.25m ~ 0.100
SequentialGetAndSetIfNotExists/10k_nodes-4 15.16m 13.78m ~ 0.100
SequentialGetAndSetIfNotExists/50k_nodes-4 26.26m 24.70m ~ 0.100
SequentialGetAndSetIfNotExists/100k_nodes-4 36.50m 34.25m ~ 0.100
ProcessOwnBlockSubtreeNodesParallel/1k_nodes-4 12.82m 14.69m ~ 0.200
ProcessOwnBlockSubtreeNodesParallel/10k_nodes-4 14.36m 14.34m ~ 1.000
ProcessOwnBlockSubtreeNodesParallel/100k_nodes-4 19.55m 19.25m ~ 1.000
ProcessOwnBlockSubtreeNodesSequential/1k_nodes-4 16.76m 28.43m ~ 0.700
ProcessOwnBlockSubtreeNodesSequential/10k_nodes-4 16.08m 15.56m ~ 0.100
ProcessOwnBlockSubtreeNodesSequential/100k_nodes-4 75.70m 51.01m ~ 0.100
DiskTxMap_SetIfNotExists-4 4.309µ 4.473µ ~ 1.000
DiskTxMap_SetIfNotExists_Parallel-4 3.957µ 3.936µ ~ 0.400
DiskTxMap_ExistenceOnly-4 404.8n 396.7n ~ 1.000
Queue-4 211.1n 209.8n ~ 0.100
AtomicPointer-4 8.147n 8.149n ~ 0.700
TxMapSetIfNotExists-4 54.73n 54.91n ~ 0.700
TxMapSetIfNotExistsDuplicate-4 48.42n 48.25n ~ 0.100
ChannelSendReceive-4 669.8n 658.0n ~ 0.100
CalcBlockWork-4 505.3n 555.2n ~ 0.200
CalculateWork-4 694.4n 693.5n ~ 0.700
CheckOldBlockIDs/on-chain-prefetch/1000-4 63.53µ 71.72µ ~ 0.700
CheckOldBlockIDs/off-chain-prefetch/1000-4 51.57µ 50.70µ ~ 0.100
CheckOldBlockIDs/on-chain-prefetch/10000-4 464.4µ 456.4µ ~ 0.200
CheckOldBlockIDs/off-chain-prefetch/10000-4 349.6µ 349.9µ ~ 0.700
BuildBlockLocatorString_Helpers/Size_10-4 1.368µ 1.334µ ~ 0.100
BuildBlockLocatorString_Helpers/Size_100-4 13.13µ 12.80µ ~ 0.100
BuildBlockLocatorString_Helpers/Size_1000-4 128.4µ 126.5µ ~ 0.100
CatchupWithHeaderCache-4 104.5m 104.5m ~ 1.000
_BufferPoolAllocation/16KB-4 5.043µ 4.014µ ~ 0.400
_BufferPoolAllocation/32KB-4 9.207µ 9.742µ ~ 1.000
_BufferPoolAllocation/64KB-4 17.77µ 17.83µ ~ 1.000
_BufferPoolAllocation/128KB-4 37.99µ 32.29µ ~ 0.100
_BufferPoolAllocation/512KB-4 142.8µ 120.5µ ~ 0.100
_BufferPoolConcurrent/32KB-4 24.03µ 20.52µ ~ 0.100
_BufferPoolConcurrent/64KB-4 35.70µ 32.74µ ~ 0.400
_BufferPoolConcurrent/512KB-4 152.2µ 153.0µ ~ 0.400
_SubtreeDeserializationWithBufferSizes/16KB-4 704.7µ 660.4µ ~ 0.100
_SubtreeDeserializationWithBufferSizes/32KB-4 633.2µ 657.0µ ~ 0.700
_SubtreeDeserializationWithBufferSizes/64KB-4 627.3µ 663.1µ ~ 0.200
_SubtreeDeserializationWithBufferSizes/128KB-4 643.0µ 673.9µ ~ 0.100
_SubtreeDeserializationWithBufferSizes/512KB-4 622.6µ 622.9µ ~ 1.000
_SubtreeDataDeserializationWithBufferSizes/16KB-4 36.64m 36.82m ~ 0.400
_SubtreeDataDeserializationWithBufferSizes/32KB-4 36.54m 36.72m ~ 0.400
_SubtreeDataDeserializationWithBufferSizes/64KB-4 36.57m 36.95m ~ 0.200
_SubtreeDataDeserializationWithBufferSizes/128KB-4 36.59m 36.69m ~ 1.000
_SubtreeDataDeserializationWithBufferSizes/512KB-4 36.25m 36.27m ~ 1.000
_PooledVsNonPooled/Pooled-4 738.9n 740.5n ~ 1.000
_PooledVsNonPooled/NonPooled-4 8.536µ 8.287µ ~ 0.100
_MemoryFootprint/Current_512KB_32concurrent-4 6.720µ 6.836µ ~ 0.700
_MemoryFootprint/Proposed_32KB_32concurrent-4 9.627µ 9.747µ ~ 0.200
_MemoryFootprint/Alternative_64KB_32concurrent-4 9.150µ 9.111µ ~ 1.000
_prepareTxsPerLevel-4 424.7m 424.7m ~ 1.000
_prepareTxsPerLevelOrdered-4 5.159m 4.669m ~ 0.400
_prepareTxsPerLevel_Comparison/Original-4 426.1m 417.5m ~ 0.100
_prepareTxsPerLevel_Comparison/Optimized-4 5.407m 4.283m ~ 0.100
SubtreeSizes/10k_tx_4_per_subtree-4 1.451m 1.446m ~ 1.000
SubtreeSizes/10k_tx_16_per_subtree-4 328.0µ 329.9µ ~ 1.000
SubtreeSizes/10k_tx_64_per_subtree-4 80.31µ 80.21µ ~ 0.700
SubtreeSizes/10k_tx_256_per_subtree-4 19.94µ 20.01µ ~ 0.700
SubtreeSizes/10k_tx_512_per_subtree-4 9.872µ 9.890µ ~ 1.000
SubtreeSizes/10k_tx_1024_per_subtree-4 4.881µ 4.951µ ~ 0.200
SubtreeSizes/10k_tx_2k_per_subtree-4 2.448µ 2.466µ ~ 0.200
BlockSizeScaling/10k_tx_64_per_subtree-4 77.66µ 78.68µ ~ 0.200
BlockSizeScaling/10k_tx_256_per_subtree-4 19.88µ 19.67µ ~ 0.400
BlockSizeScaling/10k_tx_1024_per_subtree-4 4.945µ 4.862µ ~ 0.200
BlockSizeScaling/50k_tx_64_per_subtree-4 394.3µ 397.2µ ~ 1.000
BlockSizeScaling/50k_tx_256_per_subtree-4 97.16µ 97.03µ ~ 1.000
BlockSizeScaling/50k_tx_1024_per_subtree-4 24.54µ 24.79µ ~ 0.400
SubtreeAllocations/small_subtrees_exists_check-4 159.4µ 159.1µ ~ 1.000
SubtreeAllocations/small_subtrees_data_fetch-4 166.8µ 166.9µ ~ 1.000
SubtreeAllocations/small_subtrees_full_validation-4 326.2µ 333.9µ ~ 0.700
SubtreeAllocations/medium_subtrees_exists_check-4 9.769µ 9.737µ ~ 1.000
SubtreeAllocations/medium_subtrees_data_fetch-4 10.33µ 10.37µ ~ 0.700
SubtreeAllocations/medium_subtrees_full_validation-4 20.23µ 20.05µ ~ 0.200
SubtreeAllocations/large_subtrees_exists_check-4 2.384µ 2.364µ ~ 0.700
SubtreeAllocations/large_subtrees_data_fetch-4 2.581µ 2.565µ ~ 0.400
SubtreeAllocations/large_subtrees_full_validation-4 5.024µ 5.069µ ~ 0.200
StoreBlock_Sequential/BelowCSVHeight-4 317.5µ 328.9µ ~ 0.400
StoreBlock_Sequential/AboveCSVHeight-4 320.9µ 319.0µ ~ 1.000
GetUtxoHashes-4 255.8n 257.6n ~ 0.400
GetUtxoHashes_ManyOutputs-4 49.61µ 49.41µ ~ 0.700
_NewMetaDataFromBytes-4 233.5n 234.1n ~ 0.400
_Bytes-4 419.0n 417.6n ~ 0.100
_MetaBytes-4 138.3n 138.3n ~ 1.000

Threshold: >10% with p < 0.05 | Generated: 2026-06-11 08:20 UTC

…eeze), 09 (asset isolation control)

Three new split-topology chaos scenarios that broaden coverage beyond the
validation pipeline (04/05/06) to the relay control plane and the off-path
services, plus an idempotency fix to chaos_unpause that scenario 08 turns
out to depend on (and scenario 06 in bsv-blockchain#1069 transparently benefits from).

Scenario 07 (KillService, expected STALL): kills teranode3-p2p and asserts
node 3 stays at baseline because the inter-node block-relay control plane
in the test stack is routed exclusively through teranodeN-p2p. Legacy lives
in core but is configured for SV bridging, not inter-teranode relay, so
killing native p2p genuinely severs node 3 from the mesh. Restart, catch-up,
3-node convergence asserted as usual.

Scenario 08 (PauseService, asymmetric expected NO STALL): the suite's first
asymmetric scenario. Pauses teranode2-propagation and asserts node 2 still
catches up via p2p + blockvalidation, because propagation is the tx-ingress
path (RPC sendrawtransaction -> propagation -> validator -> blockassembly),
not the block-relay path. Inbound peer blocks reach node 2 via p2p ->
blockvalidation -> subtreevalidation; they never touch propagation. Uses
PauseService (SIGSTOP) rather than KillService so any accidental gRPC call
from a block-path site would hang the caller (frozen-dependency failure
mode), making the coupling visible. Has a defer UnpauseService so a failed
assertion still leaves the shared stack healthy for the next scenario.

Scenario 09 (KillService on asset, control / null hypothesis): kills
teranode3-asset and asserts node 3 still catches up. Asset is the HTTP
server that exposes a node's subtree/block data outbound to peers and
external clients; it is not on the receive-side consensus path. Without
a control like this, the suite cannot distinguish "validator/p2p/etc kills
stalled node 3 because they're on the consensus path" from "any KillService
stalls node 3 because the harness or compose graph is broken." This
scenario rules out the latter. If it surprisingly fails - node 3 does
stall after asset is killed - that documents a real coupling
(subtreevalidation reaching localhost asset rather than the originating
peer's asset) worth chasing.

Bash fix: chaos_unpause is now idempotent across all three branches. The
single-service and bulk-aio branches were previously running a bare
'docker unpause', which exits non-zero on an already-running container.
Scenarios 06 and 08 both run a defensive defer-unpause alongside an
explicit one on the happy path (defer is the safety net for assertion
failure; explicit is needed for the catch-up assertion in scenario 06),
and the bare docker call was turning the second one into t.Fatalf, failing
the test even when all assertions passed. The bulk-split branch already
had the '|| true' pattern; this extends it for consistency. Raised in PR
review on bsv-blockchain#1070. Idempotent semantics are what 'chaos cleanup' actually
wants anyway.

All three scenarios are gated on //go:build network_chaos like the rest of
the suite.
@liam liam force-pushed the liam/multinode-split-scenarios-07-09 branch from f5807a8 to 21ec215 Compare June 11, 2026 08:04
@liam

liam commented Jun 11, 2026

Copy link
Copy Markdown
Collaborator Author

Confirmed. Walked it through:

  • compose/multinode.sh:657 (single-service unpause) and :674 (bulk-aio unpause) were both bare docker unpause "$ctr". Only the bulk-split branch at :665 had the 2>/dev/null || true idempotency guard.
  • Reproduced the docker behaviour: docker unpause <already-running-container> exits 1 with Error response from daemon: Container ... is not paused.
  • So on scenario 08's happy path: explicit unpause succeeds, WaitForConverged passes, then defer s.UnpauseService fires, the single-service docker unpause exits 1, MustRun turns it into t.Fatalf. Test fails despite all assertions passing - exactly as you described.

Fix in 21ec215a4 makes chaos_unpause idempotent across all three branches (extending the existing bulk-split pattern). Idempotent is what "chaos cleanup" actually wants anyway - a defer unpause that happens to also be called explicitly should not be a failure mode.

Note: scenario 06 in PR #1069 has the same defer + explicit pattern (and is required to keep it - the explicit unpause unblocks the catch-up assertion), so it benefits transparently from this fix. PR #1069 was already amended for the separate useLocalValidator finding; not amending it again for the chaos_unpause fix to avoid duplicate diffs across the two PRs. As long as both land, both scenarios are correct.

@sonarqubecloud

Copy link
Copy Markdown

@liam liam requested review from ordishs and sugh01 June 11, 2026 08:36
liam added a commit to liam/teranode that referenced this pull request Jun 11, 2026
…6 (subtreevalidation pause)

Two new split-topology chaos scenarios building on the harness landed in bsv-blockchain#958
and the un-skip work in bsv-blockchain#995, plus the split-mode useLocalValidator override
that scenario 05 depends on and a chaos_unpause idempotency fix that
scenario 06 depends on.

Scenario 05: kills teranode3-validator, mines 5 blocks on node 1, asserts
node 3 stalls at baseline (block-tx validation walks blockvalidation ->
subtreevalidation -> validator), restarts validator, asserts catch-up and
3-node convergence.

This only holds when subtreevalidation calls the standalone validator
container over gRPC, NOT when it embeds an in-process validator.
settings.conf ships useLocalValidator=true (the right default for all-in-one),
so a vanilla docker.teranode{N}.test context would build the in-process
validator and ignore the validator container entirely - making scenario 05
a silent no-op (raised in PR review on bsv-blockchain#1069). The split-mode overlay
generated by compose/cmd/gennodes/templates/settings.conf.tmpl now flips
useLocalValidator=false per node so the kill is actually observable. The
override is gated on {{if not $.AllInOne}}, so all-in-one mode is unchanged.

Scenario 06: PAUSES teranode3-subtreevalidation via docker pause (SIGSTOP)
rather than killing, so the gRPC call from blockvalidation hangs (the frozen
dependency failure mode, distinct from a process that has exited). First
scenario to exercise the pause/unpause verbs. Uses defer UnpauseService so a
failed assertion leaves the shared stack healthy for the next scenario's Reset.

Bash fix: chaos_unpause is now idempotent across all three branches. The
single-service and bulk-aio branches were previously running a bare
'docker unpause', which exits non-zero on an already-running container.
Scenario 06 runs a defensive defer-unpause alongside an explicit one (defer
is the safety net for assertion failure; explicit unblocks the catch-up
assertion), and the bare docker call turned the second one into t.Fatalf,
failing the test on a green run. The bulk-split branch already had the
'|| true' pattern; this extends it for consistency. Same fix surfaced
independently in PR review on bsv-blockchain#1070 against scenario 08. Idempotent
semantics are what 'chaos cleanup' actually wants anyway.

Both scenarios are gated on //go:build network_chaos like the rest of the
split-topology suite.

@ordishs ordishs left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approve with minor follow-ups.

Clean, idiomatic, low-risk additive test code. Conventions match scenario 04 precisely (build tag, package, harness API, ctx pattern, short() reuse). The multinode.sh fix is correct: scenario 08 intentionally calls UnpauseService both explicitly and via defer, and a bare docker unpause on an already-running container exits non-zero — which MustRun would turn into a spurious t.Fatalf. Suppressing it makes cleanup idempotent and brings all three chaos_unpause branches in line. The PauseService (SIGSTOP) vs KillService choice in 08 is well-reasoned: a frozen dependency hangs the caller rather than returning connection-refused, so a hidden block-path coupling surfaces as a real stall. Scenario 09 as an explicit null-hypothesis control is genuinely valuable.

Verified locally:

  • go vet -tags network_chaos ./test/multinode_split/... — clean (exit 0)

Follow-ups (none blocking the code itself):

  1. E2E run still pending. Only the build is confirmed; the asymmetric assertions (07 stalls; 08/09 catch up) encode hypotheses about the dependency graph that a compile can't validate. Recommend gating merge on an actual go test -tags network_chaos -run 'TestP2PIsolation|TestPropagationFreeze|TestAssetIsolation' against a live 3-node split stack (CI or local) — that's the whole value of these scenarios.

  2. Numbering / merge order. Scenarios 05 and 06 aren't in main or this PR (they're on a separate branch), so the body's "validation pipeline tests (04/05/06)" reference and the "first asymmetric scenario" framing in 08 assume a merge order. Against main the files jump 04 → 07/08/09. Worth aligning the merge sequence or noting the dependency.

  3. Nit: 07/08/09 depend on short() defined in scenario 04 — fine today (same package/tag), just a silent compile prerequisite to be aware of.

@liam liam merged commit 38dab9c into bsv-blockchain:main Jun 11, 2026
34 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants