test(multinode): split-per-service chaos harness + scenario_04 (skipped pending teranode fixes) by liam · Pull Request #958 · bsv-blockchain/teranode

liam · 2026-05-27T14:24:53Z

Summary

Extends the existing all-in-one network-chaos harness (test/multinode/) with a sibling split-per-service variant (test/multinode_split/) and ships the first scenario that targets it.

The harness work and the ulimits fix are independent infrastructure; the scenario itself is skipped with t.Skip pending two teranode robustness issues that surfaced while developing it (details below).

What's in this PR

fix(compose): bump aerospike nofile to 65536 in both templates (ea8f1eb41)

aerospike.conf.tmpl requests proto-fd-max = 15000 but the aerospike compose service definition inherits the docker daemon's 1024 nofile default, so aerospike aborts at startup with CRITICAL: 1024 system file descriptors not enough on any host whose /etc/docker/daemon.json doesn't override default-ulimits. Adds an explicit ulimits.nofile: 65536 in both topologies so the generated stack starts cleanly out of the box. This silently fixes the existing all-in-one make network-chaos-test for affected hosts too.
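The failure reduces to a single comparison: the configured proto-fd-max must fit within the container's nofile limit. A minimal Go sketch of that comparison follows. checkFDLimit is a hypothetical helper, not part of aerospike or the harness, and its error text only mirrors the message quoted above.

```go
package main

import "fmt"

// checkFDLimit is a hypothetical helper illustrating the startup check:
// it fails when the process's nofile limit is below the configured
// proto-fd-max. The real check lives inside aerospike; this is only a model.
func checkFDLimit(nofile, protoFDMax uint64) error {
	if nofile < protoFDMax {
		return fmt.Errorf("%d system file descriptors not enough, config specified %d",
			nofile, protoFDMax)
	}
	return nil
}

func main() {
	// Docker daemon default of 1024 against proto-fd-max = 15000: fails.
	fmt.Println(checkFDLimit(1024, 15000))
	// With ulimits.nofile: 65536 on the service: passes and prints <nil>.
	fmt.Println(checkFDLimit(65536, 15000))
}
```

Any nofile value of at least 15000 would satisfy the comparison; 65536 simply leaves headroom.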

test(multinode): add split-per-service chaos suite + scenario_04 (ab52826ec)

Harness extensions in test/multinode/harness/:

Stack.splitMode flag; new ProvisionSplit() constructor passes -allinone=0 to multinode.sh up.
New KillService / StartService / PauseService / UnpauseService methods delegate to multinode.sh chaos <verb> <n> <svc> and refuse to run unless the stack is in split mode.
Container helpers (Reset, waitNodeReady, dumpDiagnostics) refactored to enumerate every per-service container for a node via nodeContainers(n) rather than assuming the monolithic teranodeN-multinode name.

New package test/multinode_split/:

TestMain provisions a 3-node split stack (~32 containers, the smallest mesh that gives one isolation target plus two survivors).
scenario_04_block_assembly_isolation documents the blockvalidation → blockassembly runtime dependency: blockvalidation's WaitForBlockAssemblyReady gate fires on every inbound block, so killing blockassembly stalls validation even though the blockvalidation container is healthy. Visibility into that coupling is the point of split-mode chaos and is unreachable from the all-in-one suite.

New make network-chaos-test-split target (separate from network-chaos-test because the two stacks cannot coexist and split bring-up is materially slower).

test(multinode): split-aware diagnostics + IPv4 RPC; skip scenario_04 pending teranode fixes (8caa1e71f)

rpc.go: pin BaseURL to 127.0.0.1 instead of localhost so the harness's polling loops don't trip on docker's IPv4-only proxy (::1 ECONNREFUSED races).
wait.go: dumpNodeLogs was hardcoded to the monolith container name and silently failed under split mode; now enumerates via docker ps.
scenario_04 flagged t.Skip() with a comment block pinning the two teranode bugs that block reliable execution (see below).
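The rpc.go change amounts to building the base URL from an IPv4 literal, so no name resolution happens and ::1 is never tried. A minimal sketch, assuming a hypothetical rpcBaseURL helper and an arbitrary port number (neither is taken from the harness):

```go
package main

import (
	"fmt"
	"net"
	"strconv"
)

// rpcBaseURL (hypothetical) pins the host to the IPv4 loopback literal
// rather than "localhost", which may resolve to ::1 first on dual-stack hosts.
func rpcBaseURL(port int) string {
	return "http://" + net.JoinHostPort("127.0.0.1", strconv.Itoa(port))
}

func main() {
	// 19292 is an illustrative port, not necessarily one the stack exposes.
	fmt.Println(rpcBaseURL(19292)) // http://127.0.0.1:19292
}
```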

Skipped scenario: the teranode bugs blocking it

Surfaced by running the scenario; both are out of scope for this PR but documented in code so they're not lost.

utxopersister.CreateUTXOSet nil-pointer panic on startup (services/utxopersister/UTXOSet.go:527). The trigger says "Processing block height 1" but processNextBlock then logs "Processing block height 0" and CreateUTXOSet SIGSEGVs. Brings down the core sidecar before tests can run, so TestMain's mesh probe fails with heights=map[N:-1]. Non-deterministic but triggers often enough to make the scenario unrunnable.
legacy peer-protocol "unknown magic" crash on the receiving node when a peer broadcasts a block produced by a freshly-restarted blockassembly. ServiceManager treats it as fatal and gracefully exits the entire core sidecar, manifesting as RPC connection refused on healthy-looking peers during convergence.

Once both are fixed, deleting one t.Skip(...) line re-enables the scenario.

Test plan

go build -tags network_chaos ./test/... clean
go vet -tags network_chaos ./test/... clean
go test ./compose/cmd/gennodes/ still passes (template change)
compose/multinode.sh up 3 -allinone=0 brings up a healthy 3-node split stack with the new ulimits
go test -tags network_chaos -run TestBlockAssemblyIsolation ./test/multinode_split/ skips cleanly (1m mesh setup + immediate skip)
For reviewers: existing make network-chaos-test (all-in-one) still passes locally — please confirm in your environment
For follow-up: verify scenario_04 passes after the two teranode bugs land

aerospike.conf.tmpl requests proto-fd-max=15000 but the aerospike service definition inherited the docker daemon's 1024 nofile default, so aerospike aborted at startup with:

    CRITICAL (config): 1024 system file descriptors not enough, config specified 15000

This affected both topologies; the all-in-one network-chaos suite was silently broken on any host whose /etc/docker/daemon.json doesn't set default-ulimits. Set ulimits.nofile on the aerospike service so the generated compose ships a working stack regardless of host config.

Extend the harness with split-mode awareness so tests can target individual service containers, then ship the first scenario that showcases what split-mode chaos buys you beyond the all-in-one suite.

Harness changes (test/multinode/harness/):

- Stack gains a splitMode flag; new ProvisionSplit() constructor passes -allinone=0 to multinode.sh up.
- Container helpers (Reset, waitNodeReady, dumpDiagnostics) now enumerate every per-service container for a node rather than assuming the single monolithic teranodeN-multinode name.
- chaos.go: KillService / StartService / PauseService / UnpauseService delegate to multinode.sh chaos <verb> <n> <svc> and refuse to run unless the stack was provisioned in split mode.

New package test/multinode_split/:

- TestMain provisions a 3-node split stack and shares it across scenarios (mirrors the all-in-one pattern in test/multinode/).
- scenario_04_block_assembly_isolation pins the real failure mode observed when blockassembly is killed: blockvalidation gates on WaitForBlockAssemblyReady for every inbound block, so node 3's chain stalls at baseline even though every other service is up. Restarting blockassembly clears the gate and the node catches up. This dependency is invisible from the all-in-one suite because you can't kill blockassembly there without taking the whole node down with it. Surfacing that hidden coupling is the point.

New make target:

- network-chaos-test-split runs the split suite separately from network-chaos-test (the two stacks can't coexist; split takes materially longer to start).

Verified locally on an -allinone=0 stack: scenario passes in ~2m end-to-end. Stack teardown is clean.

… pending teranode fixes

Harness improvements that stand on their own merit:

- rpc.go: pin BaseURL to 127.0.0.1 instead of localhost. Docker's per-port proxy only listens on IPv4 by default, so a localhost dial that Happy-Eyeballs to ::1 first occasionally surfaces ECONNREFUSED in polling loops even though the IPv4 listener is fine. Pinning to 127.0.0.1 side-steps the dual-stack race.
- wait.go: dumpNodeLogs was still hardcoded to teranodeN-multinode, which doesn't exist in split mode. Enumerate via docker ps with the same regex pattern Stack.nodeContainers uses so failure diagnostics work under either topology.

Skip scenario_04 with t.Skip until two teranode robustness issues are fixed:

1. utxopersister.CreateUTXOSet nil-pointer panic on startup when processing the height-1 probe block ("Processing block <nil> height 0" → SIGSEGV in UTXOSet.go:527). Crashes core sidecars before the test starts; TestMain reports "waitForMesh: probe block ... did not propagate" with heights map[N:-1] (RPC unreachable because core exited). Non-deterministic but triggers often enough to make the test unrunnable.
2. legacy peer-protocol parser returns "unknown magic: [...]" when receiving a block from a peer whose blockassembly was killed and restarted; ServiceManager treats it as fatal and bails the whole core sidecar on the *receiving* node, so the failure manifests as RPC connection-refused on healthy-looking nodes during the converge wait.

The scenario's assertion structure is preserved (and trimmed to stop after catch-up rather than continuing through the buggy mining-after-restart path). Once both teranode bugs land, removing the t.Skip re-enables the test.

The harness extension itself (ProvisionSplit, KillService/StartService/PauseService/UnpauseService, split-aware Reset / nodeContainers, ulimits on aerospike) is independent of these bugs and remains useful infrastructure for future split-mode scenarios.
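The topology-agnostic enumeration described for dumpNodeLogs can be sketched as a name filter over a container listing. The regular expression below is an assumption (the actual pattern used by Stack.nodeContainers is not shown in this PR), as are the helper names. The container names follow the teranodeN-<service> and teranodeN-multinode forms mentioned in the text.

```go
package main

import (
	"fmt"
	"regexp"
	"sort"
)

// nodeContainerPattern (hypothetical) matches every container that belongs
// to node n, whether monolithic (teranodeN-multinode) or per-service.
func nodeContainerPattern(n int) *regexp.Regexp {
	return regexp.MustCompile(fmt.Sprintf(`^teranode%d-[a-z0-9-]+$`, n))
}

// filterNodeContainers keeps only the names belonging to node n, sorted.
func filterNodeContainers(names []string, n int) []string {
	re := nodeContainerPattern(n)
	var out []string
	for _, name := range names {
		if re.MatchString(name) {
			out = append(out, name)
		}
	}
	sort.Strings(out)
	return out
}

func main() {
	// In a real harness these names would come from a container listing.
	names := []string{
		"teranode3-validator",
		"teranode3-blockassembly",
		"teranode1-validator",
		"teranode30-validator", // must not match node 3
	}
	fmt.Println(filterNodeContainers(names, 3))
	// [teranode3-blockassembly teranode3-validator]
}
```

Anchoring the pattern on the hyphen after the node number keeps node 3 from matching node 30, and the same filter returns the single teranode3-multinode container under the all-in-one topology.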

github-actions · 2026-05-27T14:26:16Z

🤖 Claude Code Review

Status: Complete

Summary

This PR extends the network-chaos test harness with split-per-service topology support and infrastructure fixes. The implementation is well-structured and follows good testing practices. One minor documentation inconsistency was found.

Findings

[Minor] Documentation accuracy issue in test/multinode_split/main_test.go:11

The package comment states:

Use make network-chaos-test (with the appropriate split-mode entry point, when added) to run them.

However, this PR already adds the split-mode target make network-chaos-test-split in the Makefile. The comment should be updated to reference the new target:

// Use make network-chaos-test-split to run them.

Code Quality

Strengths:

Well-factored harness refactoring: nodeContainers(), exitedContainers(), and startContainers() cleanly abstract split vs all-in-one topology differences
Proper split-mode guards: requireSplit() prevents misuse of per-service chaos methods in all-in-one mode
Thorough test scenario documentation: TestBlockAssemblyIsolation clearly documents both the test shape and the blocking bugs
Good error handling: container restart failures are accumulated and reported together
Idiomatic Go test patterns: proper use of t.Helper(), t.Skip() with rationale, and testify/require

Infrastructure fixes are sound:

Aerospike ulimits fix addresses a real startup issue (proto-fd-max=15000 vs 1024 default)
IPv4 localhost fix (127.0.0.1 vs localhost) prevents IPv6 connection race conditions

Verification

The PR appropriately skips the actual scenario test (t.Skip) due to two documented teranode bugs, while shipping the harness infrastructure. This is a reasonable engineering tradeoff—the harness work is independent and valuable even before the bugs are fixed.

sonarqubecloud · 2026-05-27T14:37:19Z

Quality Gate passed

Issues
0 New issues
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarQube Cloud

github-actions · 2026-05-27T14:40:13Z

Benchmark Comparison Report

Baseline: main (unknown)

Current: PR-958 (57115c8)

Summary

Regressions: 0
Improvements: 0
Unchanged: 144
Significance level: p < 0.05

All benchmark results (sec/op)

Benchmark	Baseline	Current	Change	p-value
_NewBlockFromBytes-4	1.873µ	1.584µ	~	0.200
SplitSyncedParentMap_SetIfNotExists/256_buckets-4	71.22n	71.37n	~	0.400
SplitSyncedParentMap_SetIfNotExists/16_buckets-4	71.29n	71.30n	~	0.700
SplitSyncedParentMap_SetIfNotExists/1_bucket-4	71.23n	71.24n	~	1.000
SplitSyncedParentMap_ConcurrentSetIfNotExists/256_buckets...	34.04n	32.71n	~	0.100
SplitSyncedParentMap_ConcurrentSetIfNotExists/16_buckets_...	56.94n	54.15n	~	0.100
SplitSyncedParentMap_ConcurrentSetIfNotExists/1_bucket_pa...	139.2n	130.3n	~	0.400
MiningCandidate_Stringify_Short-4	223.8n	228.5n	~	0.500
MiningCandidate_Stringify_Long-4	1.645µ	1.630µ	~	0.400
MiningSolution_Stringify-4	859.6n	843.6n	~	0.100
BlockInfo_MarshalJSON-4	1.829µ	1.729µ	~	0.100
NewFromBytes-4	124.3n	123.5n	~	0.700
AddTxBatchColumnar_Validation-4	2.569µ	2.629µ	~	0.400
OffsetValidationLoop-4	546.2n	544.0n	~	0.700
Mine_EasyDifficulty-4	60.37µ	60.59µ	~	1.000
Mine_WithAddress-4	6.756µ	6.718µ	~	1.000
DirectSubtreeAdd/4_per_subtree-4	55.97n	58.56n	~	0.200
DirectSubtreeAdd/64_per_subtree-4	29.31n	29.17n	~	0.700
DirectSubtreeAdd/256_per_subtree-4	28.16n	28.05n	~	0.200
DirectSubtreeAdd/1024_per_subtree-4	26.74n	26.77n	~	1.000
DirectSubtreeAdd/2048_per_subtree-4	26.38n	26.33n	~	0.800
SubtreeProcessorAdd/4_per_subtree-4	294.2n	293.0n	~	0.700
SubtreeProcessorAdd/64_per_subtree-4	293.5n	285.7n	~	0.200
SubtreeProcessorAdd/256_per_subtree-4	291.1n	285.5n	~	0.400
SubtreeProcessorAdd/1024_per_subtree-4	277.6n	278.2n	~	0.400
SubtreeProcessorAdd/2048_per_subtree-4	277.6n	279.9n	~	0.100
SubtreeProcessorRotate/4_per_subtree-4	282.5n	287.4n	~	0.200
SubtreeProcessorRotate/64_per_subtree-4	280.7n	282.5n	~	0.700
SubtreeProcessorRotate/256_per_subtree-4	280.1n	281.2n	~	0.100
SubtreeProcessorRotate/1024_per_subtree-4	280.4n	283.2n	~	0.100
SubtreeNodeAddOnly/4_per_subtree-4	56.29n	56.06n	~	0.100
SubtreeNodeAddOnly/64_per_subtree-4	36.59n	36.36n	~	0.400
SubtreeNodeAddOnly/256_per_subtree-4	35.49n	35.31n	~	0.200
SubtreeNodeAddOnly/1024_per_subtree-4	34.89n	34.73n	~	0.100
SubtreeCreationOnly/4_per_subtree-4	114.1n	112.7n	~	0.400
SubtreeCreationOnly/64_per_subtree-4	369.5n	363.7n	~	0.300
SubtreeCreationOnly/256_per_subtree-4	1.268µ	1.274µ	~	0.700
SubtreeCreationOnly/1024_per_subtree-4	3.959µ	3.953µ	~	0.400
SubtreeCreationOnly/2048_per_subtree-4	7.270µ	7.255µ	~	1.000
SubtreeProcessorOverheadBreakdown/64_per_subtree-4	281.7n	284.5n	~	0.100
SubtreeProcessorOverheadBreakdown/1024_per_subtree-4	282.5n	283.2n	~	0.100
ParallelGetAndSetIfNotExists/1k_nodes-4	2.044m	2.000m	~	0.100
ParallelGetAndSetIfNotExists/10k_nodes-4	5.249m	5.159m	~	0.700
ParallelGetAndSetIfNotExists/50k_nodes-4	7.374m	7.319m	~	0.700
ParallelGetAndSetIfNotExists/100k_nodes-4	10.01m	10.02m	~	1.000
SequentialGetAndSetIfNotExists/1k_nodes-4	1.786m	1.785m	~	1.000
SequentialGetAndSetIfNotExists/10k_nodes-4	4.512m	4.612m	~	0.700
SequentialGetAndSetIfNotExists/50k_nodes-4	13.44m	13.54m	~	0.200
SequentialGetAndSetIfNotExists/100k_nodes-4	24.84m	24.92m	~	0.100
ProcessOwnBlockSubtreeNodesParallel/1k_nodes-4	2.100m	2.038m	~	0.100
ProcessOwnBlockSubtreeNodesParallel/10k_nodes-4	8.387m	8.266m	~	0.100
ProcessOwnBlockSubtreeNodesParallel/100k_nodes-4	13.51m	13.08m	~	0.100
ProcessOwnBlockSubtreeNodesSequential/1k_nodes-4	1.843m	1.796m	~	0.400
ProcessOwnBlockSubtreeNodesSequential/10k_nodes-4	8.055m	8.048m	~	0.700
ProcessOwnBlockSubtreeNodesSequential/100k_nodes-4	43.54m	43.02m	~	0.200
DiskTxMap_SetIfNotExists-4	3.925µ	3.929µ	~	1.000
DiskTxMap_SetIfNotExists_Parallel-4	3.675µ	3.613µ	~	0.400
DiskTxMap_ExistenceOnly-4	416.2n	376.5n	~	1.000
Queue-4	190.6n	186.9n	~	0.200
AtomicPointer-4	3.282n	3.238n	~	0.100
ReorgOptimizations/DedupFilterPipeline/Old/10K-4	854.6µ	841.0µ	~	0.200
ReorgOptimizations/DedupFilterPipeline/New/10K-4	768.6µ	790.9µ	~	0.100
ReorgOptimizations/AllMarkFalse/Old/10K-4	127.0µ	105.5µ	~	0.100
ReorgOptimizations/AllMarkFalse/New/10K-4	64.38µ	64.36µ	~	0.700
ReorgOptimizations/HashSlicePool/Old/10K-4	53.68µ	52.65µ	~	1.000
ReorgOptimizations/HashSlicePool/New/10K-4	11.19µ	11.22µ	~	0.200
ReorgOptimizations/NodeFlags/Old/10K-4	4.468µ	4.809µ	~	0.100
ReorgOptimizations/NodeFlags/New/10K-4	1.520µ	1.612µ	~	0.100
ReorgOptimizations/DedupFilterPipeline/Old/100K-4	9.657m	9.843m	~	1.000
ReorgOptimizations/DedupFilterPipeline/New/100K-4	10.35m	10.37m	~	0.700
ReorgOptimizations/AllMarkFalse/Old/100K-4	1.078m	1.086m	~	0.400
ReorgOptimizations/AllMarkFalse/New/100K-4	707.0µ	704.0µ	~	0.100
ReorgOptimizations/HashSlicePool/Old/100K-4	645.4µ	650.4µ	~	0.700
ReorgOptimizations/HashSlicePool/New/100K-4	207.1µ	201.2µ	~	0.400
ReorgOptimizations/NodeFlags/Old/100K-4	48.25µ	46.84µ	~	1.000
ReorgOptimizations/NodeFlags/New/100K-4	17.00µ	17.42µ	~	0.700
TxMapSetIfNotExists-4	49.46n	49.50n	~	1.000
TxMapSetIfNotExistsDuplicate-4	41.35n	41.25n	~	0.400
ChannelSendReceive-4	589.0n	633.9n	~	0.100
BlockAssembler_AddTx-4	0.02747n	0.02839n	~	0.700
AddNode-4	11.94	12.65	~	0.100
AddNodeWithMap-4	12.31	13.01	~	0.100
CalcBlockWork-4	516.4n	518.3n	~	1.000
CalculateWork-4	710.8n	735.7n	~	0.700
BuildBlockLocatorString_Helpers/Size_10-4	1.342µ	1.339µ	~	0.800
BuildBlockLocatorString_Helpers/Size_100-4	14.71µ	15.27µ	~	1.000
BuildBlockLocatorString_Helpers/Size_1000-4	127.4µ	127.7µ	~	0.200
CatchupWithHeaderCache-4	104.4m	104.5m	~	0.200
_prepareTxsPerLevel-4	411.0m	415.9m	~	1.000
_prepareTxsPerLevelOrdered-4	4.005m	3.695m	~	0.700
_prepareTxsPerLevel_Comparison/Original-4	413.4m	411.6m	~	0.400
_prepareTxsPerLevel_Comparison/Optimized-4	3.814m	3.665m	~	0.100
SubtreeSizes/10k_tx_4_per_subtree-4	1.347m	1.381m	~	0.100
SubtreeSizes/10k_tx_16_per_subtree-4	323.6µ	325.2µ	~	0.400
SubtreeSizes/10k_tx_64_per_subtree-4	76.63µ	77.20µ	~	0.400
SubtreeSizes/10k_tx_256_per_subtree-4	19.38µ	19.23µ	~	0.200
SubtreeSizes/10k_tx_512_per_subtree-4	9.564µ	9.609µ	~	0.100
SubtreeSizes/10k_tx_1024_per_subtree-4	4.734µ	4.766µ	~	0.400
SubtreeSizes/10k_tx_2k_per_subtree-4	2.347µ	2.354µ	~	1.000
BlockSizeScaling/10k_tx_64_per_subtree-4	75.71µ	75.41µ	~	0.400
BlockSizeScaling/10k_tx_256_per_subtree-4	19.05µ	19.11µ	~	1.000
BlockSizeScaling/10k_tx_1024_per_subtree-4	4.769µ	4.721µ	~	0.700
BlockSizeScaling/50k_tx_64_per_subtree-4	400.4µ	401.0µ	~	0.700
BlockSizeScaling/50k_tx_256_per_subtree-4	94.98µ	95.58µ	~	0.700
BlockSizeScaling/50k_tx_1024_per_subtree-4	23.63µ	23.44µ	~	0.700
SubtreeAllocations/small_subtrees_exists_check-4	163.8µ	160.1µ	~	0.400
SubtreeAllocations/small_subtrees_data_fetch-4	160.7µ	162.0µ	~	0.100
SubtreeAllocations/small_subtrees_full_validation-4	328.8µ	329.7µ	~	1.000
SubtreeAllocations/medium_subtrees_exists_check-4	9.523µ	9.457µ	~	0.100
SubtreeAllocations/medium_subtrees_data_fetch-4	9.660µ	9.506µ	~	0.100
SubtreeAllocations/medium_subtrees_full_validation-4	19.32µ	18.99µ	~	0.200
SubtreeAllocations/large_subtrees_exists_check-4	2.295µ	2.262µ	~	0.200
SubtreeAllocations/large_subtrees_data_fetch-4	2.330µ	2.317µ	~	0.700
SubtreeAllocations/large_subtrees_full_validation-4	4.773µ	4.793µ	~	0.400
_BufferPoolAllocation/16KB-4	4.260µ	5.043µ	~	0.700
_BufferPoolAllocation/32KB-4	8.649µ	8.096µ	~	0.100
_BufferPoolAllocation/64KB-4	19.97µ	16.74µ	~	0.400
_BufferPoolAllocation/128KB-4	30.35µ	27.43µ	~	0.200
_BufferPoolAllocation/512KB-4	123.7µ	113.5µ	~	0.200
_BufferPoolConcurrent/32KB-4	19.12µ	19.44µ	~	0.200
_BufferPoolConcurrent/64KB-4	29.97µ	30.30µ	~	0.200
_BufferPoolConcurrent/512KB-4	147.6µ	144.6µ	~	0.400
_SubtreeDeserializationWithBufferSizes/16KB-4	672.2µ	732.7µ	~	0.100
_SubtreeDeserializationWithBufferSizes/32KB-4	719.2µ	723.5µ	~	0.400
_SubtreeDeserializationWithBufferSizes/64KB-4	709.4µ	697.8µ	~	0.700
_SubtreeDeserializationWithBufferSizes/128KB-4	726.6µ	721.8µ	~	1.000
_SubtreeDeserializationWithBufferSizes/512KB-4	651.1µ	624.3µ	~	0.400
_SubtreeDataDeserializationWithBufferSizes/16KB-4	36.86m	37.07m	~	1.000
_SubtreeDataDeserializationWithBufferSizes/32KB-4	37.06m	36.34m	~	0.100
_SubtreeDataDeserializationWithBufferSizes/64KB-4	36.93m	37.20m	~	0.700
_SubtreeDataDeserializationWithBufferSizes/128KB-4	37.23m	36.09m	~	0.200
_SubtreeDataDeserializationWithBufferSizes/512KB-4	36.73m	37.56m	~	0.400
_PooledVsNonPooled/Pooled-4	833.5n	838.0n	~	0.100
_PooledVsNonPooled/NonPooled-4	7.815µ	8.484µ	~	0.200
_MemoryFootprint/Current_512KB_32concurrent-4	7.247µ	6.748µ	~	0.100
_MemoryFootprint/Proposed_32KB_32concurrent-4	9.565µ	10.645µ	~	0.100
_MemoryFootprint/Alternative_64KB_32concurrent-4	9.299µ	9.201µ	~	0.700
StoreBlock_Sequential/BelowCSVHeight-4	336.8µ	347.6µ	~	0.200
StoreBlock_Sequential/AboveCSVHeight-4	344.4µ	340.7µ	~	0.700
GetUtxoHashes-4	261.7n	264.9n	~	0.400
GetUtxoHashes_ManyOutputs-4	42.20µ	42.21µ	~	1.000
_NewMetaDataFromBytes-4	226.4n	227.1n	~	0.700
_Bytes-4	394.2n	398.8n	~	0.100
_MetaBytes-4	136.8n	137.1n	~	0.100

Threshold: >10% with p < 0.05 | Generated: 2026-05-27 14:39 UTC
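One reading of the threshold line is that a benchmark counts as changed only when both conditions hold: the relative change exceeds 10% and p < 0.05. The sketch below encodes that reading. classify is a hypothetical function, and the report generator's actual logic is not shown in this PR.

```go
package main

import (
	"fmt"
	"math"
)

// classify (hypothetical) labels a sec/op change. A change is significant
// only if |delta| > 10% AND p < 0.05; otherwise it is reported as "~".
func classify(baseline, current, p float64) string {
	delta := (current - baseline) / baseline
	if p >= 0.05 || math.Abs(delta) <= 0.10 {
		return "~"
	}
	if delta > 0 {
		return "regression" // more seconds per op is slower
	}
	return "improvement"
}

func main() {
	// _NewBlockFromBytes-4 from the table: about -15% but p = 0.200.
	fmt.Println(classify(1.873, 1.584, 0.200)) // ~
	// Hypothetical row: +20% with p = 0.01.
	fmt.Println(classify(100, 120, 0.01)) // regression
}
```

This is consistent with the first table row, where a roughly 15% drop is still reported as "~" because its p-value is 0.200.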

ordishs

LGTM. Harness refactor is backward-compatible, ulimits and 127.0.0.1 fixes are solid wins on their own, and the per-service chaos API is cleanly guarded. Findings from the review are non-blocking — leaving them for follow-up at your discretion.

…6 (subtreevalidation pause)

Two new split-topology chaos scenarios building on the harness landed in bsv-blockchain#958 and the un-skip work in bsv-blockchain#995, plus the split-mode settings fix that scenario 05 turns out to depend on.

Scenario 05: kills teranode3-validator, mines 5 blocks on node 1, asserts node 3 stalls at baseline (block-tx validation walks blockvalidation -> subtreevalidation -> validator), restarts validator, asserts catch-up and 3-node convergence. This only holds when subtreevalidation calls the standalone validator container over gRPC, NOT when it embeds an in-process validator. settings.conf:1212 ships useLocalValidator=true (the right default for all-in-one), so a vanilla docker.teranode{N}.test context would build the in-process validator and ignore the validator container entirely - making scenario 05 a no-op (raised in PR review on bsv-blockchain#1069). The split-mode overlay generated by compose/cmd/gennodes/templates/settings.conf.tmpl now flips useLocalValidator=false per node so the kill is actually observable. The override is gated on {{if not $.AllInOne}}, so all-in-one mode is unchanged.

Scenario 06: PAUSES teranode3-subtreevalidation via docker pause (SIGSTOP) rather than killing, so the gRPC call from blockvalidation hangs (the frozen dependency failure mode, distinct from a process that has exited). First scenario to exercise the pause/unpause verbs. Uses defer UnpauseService so a failed assertion leaves the shared stack healthy for the next scenario's Reset.

Both are gated on //go:build network_chaos like the rest of the split-topology suite.

…6 (subtreevalidation pause)

Two new split-topology chaos scenarios building on the harness landed in bsv-blockchain#958 and the un-skip work in bsv-blockchain#995, plus the split-mode useLocalValidator override that scenario 05 depends on and a chaos_unpause idempotency fix that scenario 06 depends on.

Scenario 05: kills teranode3-validator, mines 5 blocks on node 1, asserts node 3 stalls at baseline (block-tx validation walks blockvalidation -> subtreevalidation -> validator), restarts validator, asserts catch-up and 3-node convergence. This only holds when subtreevalidation calls the standalone validator container over gRPC, NOT when it embeds an in-process validator. settings.conf ships useLocalValidator=true (the right default for all-in-one), so a vanilla docker.teranode{N}.test context would build the in-process validator and ignore the validator container entirely - making scenario 05 a silent no-op (raised in PR review on bsv-blockchain#1069). The split-mode overlay generated by compose/cmd/gennodes/templates/settings.conf.tmpl now flips useLocalValidator=false per node so the kill is actually observable. The override is gated on {{if not $.AllInOne}}, so all-in-one mode is unchanged.

Scenario 06: PAUSES teranode3-subtreevalidation via docker pause (SIGSTOP) rather than killing, so the gRPC call from blockvalidation hangs (the frozen dependency failure mode, distinct from a process that has exited). First scenario to exercise the pause/unpause verbs. Uses defer UnpauseService so a failed assertion leaves the shared stack healthy for the next scenario's Reset.

Bash fix: chaos_unpause is now idempotent across all three branches. The single-service and bulk-aio branches were previously running a bare 'docker unpause', which exits non-zero on an already-running container. Scenario 06 runs a defensive defer-unpause alongside an explicit one (defer is the safety net for assertion failure; explicit unblocks the catch-up assertion), and the bare docker call turned the second one into t.Fatalf, failing the test on a green run. The bulk-split branch already had the '|| true' pattern; this extends it for consistency. Same fix surfaced independently in PR review on bsv-blockchain#1070 against scenario 08. Idempotent semantics are what 'chaos cleanup' actually wants anyway.

Both scenarios are gated on //go:build network_chaos like the rest of the split-topology suite.
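The interaction between the explicit unpause and the deferred safety net can be modelled in a few lines of Go. fakeStack and runScenario are hypothetical stand-ins for the harness and the scenario body. The strict flag imitates a bare docker unpause that fails on a container that is not paused, and the non-strict path imitates the idempotent behavior after the fix.

```go
package main

import (
	"errors"
	"fmt"
)

// fakeStack (hypothetical) tracks which containers are paused.
type fakeStack struct {
	paused map[string]bool
	strict bool // true imitates a bare `docker unpause`
}

func (s *fakeStack) Pause(c string) { s.paused[c] = true }

// Unpause fails in strict mode when the container is not paused;
// otherwise it is a no-op for an already-running container.
func (s *fakeStack) Unpause(c string) error {
	if !s.paused[c] {
		if s.strict {
			return errors.New("container " + c + " is not paused")
		}
		return nil // idempotent: nothing to do
	}
	delete(s.paused, c)
	return nil
}

// runScenario pauses, registers a deferred safety-net unpause, then
// unpauses explicitly, so Unpause runs twice on a green run.
func runScenario(s *fakeStack, c string) (err error) {
	s.Pause(c)
	defer func() {
		if uerr := s.Unpause(c); uerr != nil && err == nil {
			err = uerr
		}
	}()
	return s.Unpause(c)
}

func main() {
	const c = "teranode3-subtreevalidation"
	fmt.Println(runScenario(&fakeStack{paused: map[string]bool{}, strict: true}, c))
	fmt.Println(runScenario(&fakeStack{paused: map[string]bool{}}, c)) // <nil>
}
```

With strict semantics the second, deferred call reports an error even though the scenario itself succeeded, which is the green-run failure described above. With idempotent semantics both calls succeed.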

liam added 3 commits May 27, 2026 13:18

liam requested review from ordishs and sugh01 May 27, 2026 14:26

ordishs approved these changes May 28, 2026

sugh01 approved these changes May 28, 2026

liam requested a review from freemans13 May 28, 2026 09:22

liam merged commit 7db6ae6 into bsv-blockchain:main May 28, 2026
25 checks passed

This was referenced May 28, 2026

fix(utxopersister): drop double-read of fileformat magic in verifyLastSet #971

Merged

test(multinode): un-skip scenario_04 block-assembly isolation #995

Merged

liam mentioned this pull request Jun 10, 2026

test(multinode): add scenario 05 (validator isolation) and scenario 06 (subtreevalidation pause) #1069

Merged

2 tasks

liam merged 3 commits into bsv-blockchain:main from liam:liam/multinode-split-chaos-tests

3 participants
