test(multinode): split-per-service chaos harness + scenario_04 (skipped pending teranode fixes)#958

Merged
liam merged 3 commits into
bsv-blockchain:mainfrom
liam:liam/multinode-split-chaos-tests
May 28, 2026

@liam liam commented May 27, 2026

Collaborator

Summary

Extends the existing all-in-one network-chaos harness (test/multinode/) with a sibling split-per-service variant (test/multinode_split/) and ships the first scenario that targets it.

The harness work and the ulimits fix are independent infrastructure; the scenario itself is skipped via t.Skip pending two teranode robustness issues that surfaced while developing it (details below).

What's in this PR

fix(compose): bump aerospike nofile to 65536 in both templates (ea8f1eb41)

aerospike.conf.tmpl requests proto-fd-max = 15000, but the aerospike compose service definition inherits the docker daemon's default nofile limit of 1024. On any host whose /etc/docker/daemon.json doesn't override default-ulimits, aerospike therefore aborts at startup with CRITICAL: 1024 system file descriptors not enough. This commit adds an explicit ulimits.nofile: 65536 in both topologies so the generated stack starts cleanly out of the box. It also fixes the existing all-in-one make network-chaos-test, which was silently broken on affected hosts.
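
The mismatch described above (a requested 15000 descriptors against a 1024 limit) can be checked from Go before a stack is started. The following is a minimal sketch, not part of this PR: fdShortfall and checkNofile are hypothetical helper names, and the check reads the limit of the calling process rather than the limit inside the aerospike container.

```go
package main

import (
	"fmt"
	"syscall"
)

// fdShortfall reports how many descriptors are missing when the limit is
// below what a service requests. It returns 0 when the limit is sufficient.
func fdShortfall(limit, required uint64) uint64 {
	if limit >= required {
		return 0
	}
	return required - limit
}

// checkNofile reads the soft RLIMIT_NOFILE of the current process and returns
// an error when it is lower than required. (Hypothetical preflight helper.)
func checkNofile(required uint64) error {
	var rl syscall.Rlimit
	if err := syscall.Getrlimit(syscall.RLIMIT_NOFILE, &rl); err != nil {
		return fmt.Errorf("getrlimit: %w", err)
	}
	if missing := fdShortfall(uint64(rl.Cur), required); missing > 0 {
		return fmt.Errorf("nofile soft limit %d is %d short of %d", rl.Cur, missing, required)
	}
	return nil
}
```

With the numbers from the aerospike log line, fdShortfall(1024, 15000) is 13976, while the new 65536 limit leaves no shortfall.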

test(multinode): add split-per-service chaos suite + scenario_04 (ab52826ec)

Harness extensions in test/multinode/harness/:

  • Stack.splitMode flag; new ProvisionSplit() constructor passes -allinone=0 to multinode.sh up.
  • New KillService / StartService / PauseService / UnpauseService methods delegate to multinode.sh chaos <verb> <n> <svc> and refuse to run unless the stack is in split mode.
  • Container helpers (Reset, waitNodeReady, dumpDiagnostics) refactored to enumerate every per-service container for a node via nodeContainers(n) rather than assuming the monolithic teranodeN-multinode name.
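
The split-mode guard on the per-service methods could look roughly like the sketch below. It is an illustration only: the stack type, errNotSplit, and the literal verb "kill" are assumptions, and the sketch builds the argument list for multinode.sh chaos <verb> <n> <svc> instead of executing it.

```go
package main

import (
	"errors"
	"strconv"
)

// errNotSplit is returned when a per-service verb is used on an all-in-one stack.
var errNotSplit = errors.New("per-service chaos requires a split-mode stack")

// stack is a hypothetical stand-in for the harness Stack type.
type stack struct {
	splitMode bool
}

// chaosArgs builds the arguments for `multinode.sh chaos <verb> <n> <svc>`
// and refuses to do so unless the stack was provisioned in split mode.
func (s *stack) chaosArgs(verb string, node int, svc string) ([]string, error) {
	if !s.splitMode {
		return nil, errNotSplit
	}
	return []string{"chaos", verb, strconv.Itoa(node), svc}, nil
}

// killServiceArgs mirrors the shape of a KillService call; the verb name
// "kill" is an assumption about the script's CLI.
func (s *stack) killServiceArgs(node int, svc string) ([]string, error) {
	return s.chaosArgs("kill", node, svc)
}
```

Keeping the guard in one place means every per-service verb fails the same way on an all-in-one stack.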

New package test/multinode_split/:

  • TestMain provisions a 3-node split stack (~32 containers, the smallest mesh that gives one isolation target plus two survivors).
  • scenario_04_block_assembly_isolation documents the blockvalidation → blockassembly runtime dependency: blockvalidation's WaitForBlockAssemblyReady gate fires on every inbound block, so killing blockassembly stalls validation even though the blockvalidation container is healthy. Visibility into that coupling is the point of split-mode chaos and is unreachable from the all-in-one suite.
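
The scenario's shape (assert a stall at baseline while blockassembly is down, then assert catch-up after restart) reduces to two polling helpers. This is a sketch under stated assumptions: heightFunc stands in for whatever RPC call the harness uses to read a node's height, and both helper names are hypothetical.

```go
package main

import (
	"fmt"
	"time"
)

// heightFunc returns a node's current best height. It is a hypothetical
// stand-in for an RPC query.
type heightFunc func() (int, error)

// waitForHeight polls until the height reaches target or the timeout expires.
func waitForHeight(get heightFunc, target int, timeout, interval time.Duration) error {
	deadline := time.Now().Add(timeout)
	for {
		h, err := get()
		if err == nil && h >= target {
			return nil
		}
		if time.Now().After(deadline) {
			return fmt.Errorf("height %d not reached before timeout (last=%d, err=%v)", target, h, err)
		}
		time.Sleep(interval)
	}
}

// assertStalled checks that the height stays at baseline for the whole window.
func assertStalled(get heightFunc, baseline int, window, interval time.Duration) error {
	deadline := time.Now().Add(window)
	for time.Now().Before(deadline) {
		h, err := get()
		if err != nil {
			return fmt.Errorf("height query failed: %w", err)
		}
		if h != baseline {
			return fmt.Errorf("expected stall at %d, saw %d", baseline, h)
		}
		time.Sleep(interval)
	}
	return nil
}
```

A scenario would call assertStalled on node 3 after killing blockassembly, and waitForHeight once the service is started again.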

New make network-chaos-test-split target (separate from network-chaos-test because the two stacks cannot coexist and split bring-up is materially slower).

test(multinode): split-aware diagnostics + IPv4 RPC; skip scenario_04 pending teranode fixes (8caa1e71f)

  • rpc.go: pin BaseURL to 127.0.0.1 instead of localhost so the harness's polling loops don't trip on docker's IPv4-only proxy (::1 ECONNREFUSED races).
  • wait.go: dumpNodeLogs was hardcoded to the monolith container name and silently failed under split mode; now enumerates via docker ps.
  • scenario_04 flagged t.Skip() with a comment block pinning the two teranode bugs that block reliable execution (see below).
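
For illustration, pinning a base URL to IPv4 loopback can be done with the standard library alone. The helper name pinLoopbackIPv4 and the port in the usage note are assumptions; the actual rpc.go change may simply hardcode 127.0.0.1.

```go
package main

import (
	"net"
	"net/url"
)

// pinLoopbackIPv4 rewrites a "localhost" host to 127.0.0.1 and keeps the
// scheme, port, and path. Any other host is returned unchanged.
func pinLoopbackIPv4(base string) (string, error) {
	u, err := url.Parse(base)
	if err != nil {
		return "", err
	}
	if u.Hostname() != "localhost" {
		return base, nil
	}
	if port := u.Port(); port != "" {
		u.Host = net.JoinHostPort("127.0.0.1", port)
	} else {
		u.Host = "127.0.0.1"
	}
	return u.String(), nil
}
```

For example, pinLoopbackIPv4("http://localhost:18332") yields "http://127.0.0.1:18332", so the dialer never tries ::1.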

Skipped scenario: the teranode bugs blocking it

Surfaced by running the scenario; both are out of scope for this PR but documented in code so they're not lost.

  1. utxopersister.CreateUTXOSet nil-pointer panic on startup (services/utxopersister/UTXOSet.go:527). The trigger says "Processing block height 1" but processNextBlock then logs "Processing block height 0" and CreateUTXOSet SIGSEGVs. Brings down the core sidecar before tests can run, so TestMain's mesh probe fails with heights=map[N:-1]. Non-deterministic but triggers often enough to make the scenario unrunnable.

  2. legacy peer-protocol "unknown magic" crash on the receiving node when a peer broadcasts a block produced by a freshly-restarted blockassembly. ServiceManager treats it as fatal and gracefully exits the entire core sidecar, manifesting as RPC connection refused on healthy-looking peers during convergence.

Once both are fixed, deleting one t.Skip(...) line re-enables the scenario.

Test plan

  • go build -tags network_chaos ./test/... clean
  • go vet -tags network_chaos ./test/... clean
  • go test ./compose/cmd/gennodes/ still passes (template change)
  • compose/multinode.sh up 3 -allinone=0 brings up a healthy 3-node split stack with the new ulimits
  • go test -tags network_chaos -run TestBlockAssemblyIsolation ./test/multinode_split/ skips cleanly (1m mesh setup + immediate skip)
  • For reviewers: existing make network-chaos-test (all-in-one) still passes locally — please confirm in your environment
  • For follow-up: verify scenario_04 passes after the two teranode bugs land

liam added 3 commits May 27, 2026 13:18
aerospike.conf.tmpl requests proto-fd-max=15000 but the aerospike
service definition inherited the docker daemon's 1024 nofile default,
so aerospike aborted at startup with:

  CRITICAL (config): 1024 system file descriptors not enough,
                     config specified 15000

This affected both topologies; the all-in-one network-chaos suite was
silently broken on any host whose /etc/docker/daemon.json doesn't set
default-ulimits. Set ulimits.nofile on the aerospike service so the
generated compose ships a working stack regardless of host config.

Extend the harness with split-mode awareness so tests can target
individual service containers, then ship the first scenario that
showcases what split-mode chaos buys you beyond the all-in-one suite.

Harness changes (test/multinode/harness/):
  - Stack gains a splitMode flag; new ProvisionSplit() constructor
    passes -allinone=0 to multinode.sh up.
  - Container helpers (Reset, waitNodeReady, dumpDiagnostics) now
    enumerate every per-service container for a node rather than
    assuming the single monolithic teranodeN-multinode name.
  - chaos.go: KillService / StartService / PauseService /
    UnpauseService delegate to multinode.sh chaos <verb> <n> <svc>
    and refuse to run unless the stack was provisioned in split mode.

New package test/multinode_split/:
  - TestMain provisions a 3-node split stack and shares it across
    scenarios (mirrors the all-in-one pattern in test/multinode/).
  - scenario_04_block_assembly_isolation pins the real failure mode
    observed when blockassembly is killed: blockvalidation gates on
    WaitForBlockAssemblyReady for every inbound block, so node 3's
    chain stalls at baseline even though every other service is up.
    Restarting blockassembly clears the gate and the node catches up.
    This dependency is invisible from the all-in-one suite because
    you can't kill blockassembly there without taking the whole node
    down with it. Surfacing that hidden coupling is the point.

New make target:
  - network-chaos-test-split runs the split suite separately from
    network-chaos-test (the two stacks can't coexist; split takes
    materially longer to start).

Verified locally on an -allinone=0 stack: scenario passes in ~2m
end-to-end. Stack teardown is clean.

… pending teranode fixes

Harness improvements that stand on their own merit:

  - rpc.go: pin BaseURL to 127.0.0.1 instead of localhost. Docker's
    per-port proxy only listens on IPv4 by default, so a localhost dial
    that Happy-Eyeballs to ::1 first occasionally surfaces ECONNREFUSED
    in polling loops even though the IPv4 listener is fine. Pinning to
    127.0.0.1 side-steps the dual-stack race.

  - wait.go: dumpNodeLogs was still hardcoded to teranodeN-multinode,
    which doesn't exist in split mode. Enumerate via docker ps with the
    same regex pattern Stack.nodeContainers uses so failure diagnostics
    work under either topology.
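
A topology-agnostic filter over container names might look like the sketch below. The regex is an assumption based on the names that appear in this PR (teranodeN-multinode for the monolith, teranodeN-<service> in split mode); the real pattern in Stack.nodeContainers may differ. The input is assumed to come from something like docker ps --format '{{.Names}}'.

```go
package main

import (
	"fmt"
	"regexp"
)

// nodeContainerPattern matches names of the form teranode<N>-<suffix>.
// The exact pattern is an assumption, not the harness's actual regex.
func nodeContainerPattern(node int) *regexp.Regexp {
	return regexp.MustCompile(fmt.Sprintf(`^teranode%d-[a-z0-9-]+$`, node))
}

// filterNodeContainers keeps the container names that belong to the given node.
func filterNodeContainers(names []string, node int) []string {
	re := nodeContainerPattern(node)
	var out []string
	for _, name := range names {
		if re.MatchString(name) {
			out = append(out, name)
		}
	}
	return out
}
```

Anchoring the pattern on the trailing hyphen keeps node 3 from matching teranode30-validator on larger meshes.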

Skip scenario_04 with t.Skip until two teranode robustness issues are
fixed:

  1. utxopersister.CreateUTXOSet nil-pointer panic on startup when
     processing the height-1 probe block ("Processing block <nil>
     height 0" → SIGSEGV in UTXOSet.go:527). Crashes core sidecars
     before the test starts; TestMain reports
     "waitForMesh: probe block ... did not propagate" with heights
     map[N:-1] (RPC unreachable because core exited). Non-deterministic
     but triggers often enough to make the test unrunnable.

  2. legacy peer-protocol parser returns "unknown magic: [...]" when
     receiving a block from a peer whose blockassembly was killed and
     restarted; ServiceManager treats it as fatal and bails the whole
     core sidecar on the *receiving* node, so the failure manifests as
     RPC connection-refused on healthy-looking nodes during the
     converge wait.

The scenario's assertion structure is preserved (and trimmed to stop
after catch-up rather than continuing through the buggy mining-after-
restart path). Once both teranode bugs land, removing the t.Skip
re-enables the test.

The harness extension itself (ProvisionSplit, KillService/StartService
/PauseService/UnpauseService, split-aware Reset / nodeContainers,
ulimits on aerospike) is independent of these bugs and remains useful
infrastructure for future split-mode scenarios.
@github-actions

github-actions Bot commented May 27, 2026

Contributor

🤖 Claude Code Review

Status: Complete


Summary

This PR extends the network-chaos test harness with split-per-service topology support and infrastructure fixes. The implementation is well-structured and follows good testing practices. One minor documentation inconsistency was found.

Findings

[Minor] Documentation accuracy issue in test/multinode_split/main_test.go:11

The package comment states:

Use make network-chaos-test (with the appropriate split-mode entry point, when added) to run them.

However, this PR already adds the split-mode target make network-chaos-test-split in the Makefile. The comment should be updated to reference the new target:

// Use make network-chaos-test-split to run them.

Code Quality

Strengths:

  • Well-factored harness refactoring: nodeContainers(), exitedContainers(), and startContainers() cleanly abstract split vs all-in-one topology differences
  • Proper split-mode guards: requireSplit() prevents misuse of per-service chaos methods in all-in-one mode
  • Thorough test scenario documentation: TestBlockAssemblyIsolation clearly documents both the test shape and the blocking bugs
  • Good error handling: container restart failures are accumulated and reported together
  • Idiomatic Go test patterns: proper use of t.Helper(), t.Skip() with rationale, and testify/require

Infrastructure fixes are sound:

  • Aerospike ulimits fix addresses a real startup issue (proto-fd-max=15000 vs 1024 default)
  • IPv4 localhost fix (127.0.0.1 vs localhost) prevents IPv6 connection race conditions

Verification

The PR appropriately skips the actual scenario test (t.Skip) due to two documented teranode bugs, while shipping the harness infrastructure. This is a reasonable engineering tradeoff—the harness work is independent and valuable even before the bugs are fixed.

@liam liam requested review from ordishs and sugh01 May 27, 2026 14:26
@github-actions

Contributor

Benchmark Comparison Report

Baseline: main (unknown)

Current: PR-958 (57115c8)

Summary

  • Regressions: 0
  • Improvements: 0
  • Unchanged: 144
  • Significance level: p < 0.05

All benchmark results (sec/op)
Benchmark Baseline Current Change p-value
_NewBlockFromBytes-4 1.873µ 1.584µ ~ 0.200
SplitSyncedParentMap_SetIfNotExists/256_buckets-4 71.22n 71.37n ~ 0.400
SplitSyncedParentMap_SetIfNotExists/16_buckets-4 71.29n 71.30n ~ 0.700
SplitSyncedParentMap_SetIfNotExists/1_bucket-4 71.23n 71.24n ~ 1.000
SplitSyncedParentMap_ConcurrentSetIfNotExists/256_buckets... 34.04n 32.71n ~ 0.100
SplitSyncedParentMap_ConcurrentSetIfNotExists/16_buckets_... 56.94n 54.15n ~ 0.100
SplitSyncedParentMap_ConcurrentSetIfNotExists/1_bucket_pa... 139.2n 130.3n ~ 0.400
MiningCandidate_Stringify_Short-4 223.8n 228.5n ~ 0.500
MiningCandidate_Stringify_Long-4 1.645µ 1.630µ ~ 0.400
MiningSolution_Stringify-4 859.6n 843.6n ~ 0.100
BlockInfo_MarshalJSON-4 1.829µ 1.729µ ~ 0.100
NewFromBytes-4 124.3n 123.5n ~ 0.700
AddTxBatchColumnar_Validation-4 2.569µ 2.629µ ~ 0.400
OffsetValidationLoop-4 546.2n 544.0n ~ 0.700
Mine_EasyDifficulty-4 60.37µ 60.59µ ~ 1.000
Mine_WithAddress-4 6.756µ 6.718µ ~ 1.000
DirectSubtreeAdd/4_per_subtree-4 55.97n 58.56n ~ 0.200
DirectSubtreeAdd/64_per_subtree-4 29.31n 29.17n ~ 0.700
DirectSubtreeAdd/256_per_subtree-4 28.16n 28.05n ~ 0.200
DirectSubtreeAdd/1024_per_subtree-4 26.74n 26.77n ~ 1.000
DirectSubtreeAdd/2048_per_subtree-4 26.38n 26.33n ~ 0.800
SubtreeProcessorAdd/4_per_subtree-4 294.2n 293.0n ~ 0.700
SubtreeProcessorAdd/64_per_subtree-4 293.5n 285.7n ~ 0.200
SubtreeProcessorAdd/256_per_subtree-4 291.1n 285.5n ~ 0.400
SubtreeProcessorAdd/1024_per_subtree-4 277.6n 278.2n ~ 0.400
SubtreeProcessorAdd/2048_per_subtree-4 277.6n 279.9n ~ 0.100
SubtreeProcessorRotate/4_per_subtree-4 282.5n 287.4n ~ 0.200
SubtreeProcessorRotate/64_per_subtree-4 280.7n 282.5n ~ 0.700
SubtreeProcessorRotate/256_per_subtree-4 280.1n 281.2n ~ 0.100
SubtreeProcessorRotate/1024_per_subtree-4 280.4n 283.2n ~ 0.100
SubtreeNodeAddOnly/4_per_subtree-4 56.29n 56.06n ~ 0.100
SubtreeNodeAddOnly/64_per_subtree-4 36.59n 36.36n ~ 0.400
SubtreeNodeAddOnly/256_per_subtree-4 35.49n 35.31n ~ 0.200
SubtreeNodeAddOnly/1024_per_subtree-4 34.89n 34.73n ~ 0.100
SubtreeCreationOnly/4_per_subtree-4 114.1n 112.7n ~ 0.400
SubtreeCreationOnly/64_per_subtree-4 369.5n 363.7n ~ 0.300
SubtreeCreationOnly/256_per_subtree-4 1.268µ 1.274µ ~ 0.700
SubtreeCreationOnly/1024_per_subtree-4 3.959µ 3.953µ ~ 0.400
SubtreeCreationOnly/2048_per_subtree-4 7.270µ 7.255µ ~ 1.000
SubtreeProcessorOverheadBreakdown/64_per_subtree-4 281.7n 284.5n ~ 0.100
SubtreeProcessorOverheadBreakdown/1024_per_subtree-4 282.5n 283.2n ~ 0.100
ParallelGetAndSetIfNotExists/1k_nodes-4 2.044m 2.000m ~ 0.100
ParallelGetAndSetIfNotExists/10k_nodes-4 5.249m 5.159m ~ 0.700
ParallelGetAndSetIfNotExists/50k_nodes-4 7.374m 7.319m ~ 0.700
ParallelGetAndSetIfNotExists/100k_nodes-4 10.01m 10.02m ~ 1.000
SequentialGetAndSetIfNotExists/1k_nodes-4 1.786m 1.785m ~ 1.000
SequentialGetAndSetIfNotExists/10k_nodes-4 4.512m 4.612m ~ 0.700
SequentialGetAndSetIfNotExists/50k_nodes-4 13.44m 13.54m ~ 0.200
SequentialGetAndSetIfNotExists/100k_nodes-4 24.84m 24.92m ~ 0.100
ProcessOwnBlockSubtreeNodesParallel/1k_nodes-4 2.100m 2.038m ~ 0.100
ProcessOwnBlockSubtreeNodesParallel/10k_nodes-4 8.387m 8.266m ~ 0.100
ProcessOwnBlockSubtreeNodesParallel/100k_nodes-4 13.51m 13.08m ~ 0.100
ProcessOwnBlockSubtreeNodesSequential/1k_nodes-4 1.843m 1.796m ~ 0.400
ProcessOwnBlockSubtreeNodesSequential/10k_nodes-4 8.055m 8.048m ~ 0.700
ProcessOwnBlockSubtreeNodesSequential/100k_nodes-4 43.54m 43.02m ~ 0.200
DiskTxMap_SetIfNotExists-4 3.925µ 3.929µ ~ 1.000
DiskTxMap_SetIfNotExists_Parallel-4 3.675µ 3.613µ ~ 0.400
DiskTxMap_ExistenceOnly-4 416.2n 376.5n ~ 1.000
Queue-4 190.6n 186.9n ~ 0.200
AtomicPointer-4 3.282n 3.238n ~ 0.100
ReorgOptimizations/DedupFilterPipeline/Old/10K-4 854.6µ 841.0µ ~ 0.200
ReorgOptimizations/DedupFilterPipeline/New/10K-4 768.6µ 790.9µ ~ 0.100
ReorgOptimizations/AllMarkFalse/Old/10K-4 127.0µ 105.5µ ~ 0.100
ReorgOptimizations/AllMarkFalse/New/10K-4 64.38µ 64.36µ ~ 0.700
ReorgOptimizations/HashSlicePool/Old/10K-4 53.68µ 52.65µ ~ 1.000
ReorgOptimizations/HashSlicePool/New/10K-4 11.19µ 11.22µ ~ 0.200
ReorgOptimizations/NodeFlags/Old/10K-4 4.468µ 4.809µ ~ 0.100
ReorgOptimizations/NodeFlags/New/10K-4 1.520µ 1.612µ ~ 0.100
ReorgOptimizations/DedupFilterPipeline/Old/100K-4 9.657m 9.843m ~ 1.000
ReorgOptimizations/DedupFilterPipeline/New/100K-4 10.35m 10.37m ~ 0.700
ReorgOptimizations/AllMarkFalse/Old/100K-4 1.078m 1.086m ~ 0.400
ReorgOptimizations/AllMarkFalse/New/100K-4 707.0µ 704.0µ ~ 0.100
ReorgOptimizations/HashSlicePool/Old/100K-4 645.4µ 650.4µ ~ 0.700
ReorgOptimizations/HashSlicePool/New/100K-4 207.1µ 201.2µ ~ 0.400
ReorgOptimizations/NodeFlags/Old/100K-4 48.25µ 46.84µ ~ 1.000
ReorgOptimizations/NodeFlags/New/100K-4 17.00µ 17.42µ ~ 0.700
TxMapSetIfNotExists-4 49.46n 49.50n ~ 1.000
TxMapSetIfNotExistsDuplicate-4 41.35n 41.25n ~ 0.400
ChannelSendReceive-4 589.0n 633.9n ~ 0.100
BlockAssembler_AddTx-4 0.02747n 0.02839n ~ 0.700
AddNode-4 11.94 12.65 ~ 0.100
AddNodeWithMap-4 12.31 13.01 ~ 0.100
CalcBlockWork-4 516.4n 518.3n ~ 1.000
CalculateWork-4 710.8n 735.7n ~ 0.700
BuildBlockLocatorString_Helpers/Size_10-4 1.342µ 1.339µ ~ 0.800
BuildBlockLocatorString_Helpers/Size_100-4 14.71µ 15.27µ ~ 1.000
BuildBlockLocatorString_Helpers/Size_1000-4 127.4µ 127.7µ ~ 0.200
CatchupWithHeaderCache-4 104.4m 104.5m ~ 0.200
_prepareTxsPerLevel-4 411.0m 415.9m ~ 1.000
_prepareTxsPerLevelOrdered-4 4.005m 3.695m ~ 0.700
_prepareTxsPerLevel_Comparison/Original-4 413.4m 411.6m ~ 0.400
_prepareTxsPerLevel_Comparison/Optimized-4 3.814m 3.665m ~ 0.100
SubtreeSizes/10k_tx_4_per_subtree-4 1.347m 1.381m ~ 0.100
SubtreeSizes/10k_tx_16_per_subtree-4 323.6µ 325.2µ ~ 0.400
SubtreeSizes/10k_tx_64_per_subtree-4 76.63µ 77.20µ ~ 0.400
SubtreeSizes/10k_tx_256_per_subtree-4 19.38µ 19.23µ ~ 0.200
SubtreeSizes/10k_tx_512_per_subtree-4 9.564µ 9.609µ ~ 0.100
SubtreeSizes/10k_tx_1024_per_subtree-4 4.734µ 4.766µ ~ 0.400
SubtreeSizes/10k_tx_2k_per_subtree-4 2.347µ 2.354µ ~ 1.000
BlockSizeScaling/10k_tx_64_per_subtree-4 75.71µ 75.41µ ~ 0.400
BlockSizeScaling/10k_tx_256_per_subtree-4 19.05µ 19.11µ ~ 1.000
BlockSizeScaling/10k_tx_1024_per_subtree-4 4.769µ 4.721µ ~ 0.700
BlockSizeScaling/50k_tx_64_per_subtree-4 400.4µ 401.0µ ~ 0.700
BlockSizeScaling/50k_tx_256_per_subtree-4 94.98µ 95.58µ ~ 0.700
BlockSizeScaling/50k_tx_1024_per_subtree-4 23.63µ 23.44µ ~ 0.700
SubtreeAllocations/small_subtrees_exists_check-4 163.8µ 160.1µ ~ 0.400
SubtreeAllocations/small_subtrees_data_fetch-4 160.7µ 162.0µ ~ 0.100
SubtreeAllocations/small_subtrees_full_validation-4 328.8µ 329.7µ ~ 1.000
SubtreeAllocations/medium_subtrees_exists_check-4 9.523µ 9.457µ ~ 0.100
SubtreeAllocations/medium_subtrees_data_fetch-4 9.660µ 9.506µ ~ 0.100
SubtreeAllocations/medium_subtrees_full_validation-4 19.32µ 18.99µ ~ 0.200
SubtreeAllocations/large_subtrees_exists_check-4 2.295µ 2.262µ ~ 0.200
SubtreeAllocations/large_subtrees_data_fetch-4 2.330µ 2.317µ ~ 0.700
SubtreeAllocations/large_subtrees_full_validation-4 4.773µ 4.793µ ~ 0.400
_BufferPoolAllocation/16KB-4 4.260µ 5.043µ ~ 0.700
_BufferPoolAllocation/32KB-4 8.649µ 8.096µ ~ 0.100
_BufferPoolAllocation/64KB-4 19.97µ 16.74µ ~ 0.400
_BufferPoolAllocation/128KB-4 30.35µ 27.43µ ~ 0.200
_BufferPoolAllocation/512KB-4 123.7µ 113.5µ ~ 0.200
_BufferPoolConcurrent/32KB-4 19.12µ 19.44µ ~ 0.200
_BufferPoolConcurrent/64KB-4 29.97µ 30.30µ ~ 0.200
_BufferPoolConcurrent/512KB-4 147.6µ 144.6µ ~ 0.400
_SubtreeDeserializationWithBufferSizes/16KB-4 672.2µ 732.7µ ~ 0.100
_SubtreeDeserializationWithBufferSizes/32KB-4 719.2µ 723.5µ ~ 0.400
_SubtreeDeserializationWithBufferSizes/64KB-4 709.4µ 697.8µ ~ 0.700
_SubtreeDeserializationWithBufferSizes/128KB-4 726.6µ 721.8µ ~ 1.000
_SubtreeDeserializationWithBufferSizes/512KB-4 651.1µ 624.3µ ~ 0.400
_SubtreeDataDeserializationWithBufferSizes/16KB-4 36.86m 37.07m ~ 1.000
_SubtreeDataDeserializationWithBufferSizes/32KB-4 37.06m 36.34m ~ 0.100
_SubtreeDataDeserializationWithBufferSizes/64KB-4 36.93m 37.20m ~ 0.700
_SubtreeDataDeserializationWithBufferSizes/128KB-4 37.23m 36.09m ~ 0.200
_SubtreeDataDeserializationWithBufferSizes/512KB-4 36.73m 37.56m ~ 0.400
_PooledVsNonPooled/Pooled-4 833.5n 838.0n ~ 0.100
_PooledVsNonPooled/NonPooled-4 7.815µ 8.484µ ~ 0.200
_MemoryFootprint/Current_512KB_32concurrent-4 7.247µ 6.748µ ~ 0.100
_MemoryFootprint/Proposed_32KB_32concurrent-4 9.565µ 10.645µ ~ 0.100
_MemoryFootprint/Alternative_64KB_32concurrent-4 9.299µ 9.201µ ~ 0.700
StoreBlock_Sequential/BelowCSVHeight-4 336.8µ 347.6µ ~ 0.200
StoreBlock_Sequential/AboveCSVHeight-4 344.4µ 340.7µ ~ 0.700
GetUtxoHashes-4 261.7n 264.9n ~ 0.400
GetUtxoHashes_ManyOutputs-4 42.20µ 42.21µ ~ 1.000
_NewMetaDataFromBytes-4 226.4n 227.1n ~ 0.700
_Bytes-4 394.2n 398.8n ~ 0.100
_MetaBytes-4 136.8n 137.1n ~ 0.100

Threshold: >10% with p < 0.05 | Generated: 2026-05-27 14:39 UTC
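
The stated threshold (a change above 10% with p < 0.05) explains why every row is marked "~". A minimal sketch of that rule, assuming the report classifies rows this way (the bot's real implementation is not shown in this PR, and classify is a hypothetical name):

```go
package main

import "math"

// classify applies the report's stated rule: a row only counts as changed
// when the relative change exceeds 10% AND p < 0.05. For sec/op, a larger
// current value is slower, so a positive change is a regression.
func classify(baseline, current, pValue float64) string {
	if baseline <= 0 {
		return "unchanged"
	}
	change := (current - baseline) / baseline
	if pValue >= 0.05 || math.Abs(change) <= 0.10 {
		return "unchanged"
	}
	if change > 0 {
		return "regression"
	}
	return "improvement"
}
```

Under this rule, ReorgOptimizations/AllMarkFalse/Old/10K-4 (127.0µ to 105.5µ, about 17% faster) is still unchanged because its p-value is 0.100.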

@ordishs ordishs left a comment

Collaborator

LGTM. Harness refactor is backward-compatible, ulimits and 127.0.0.1 fixes are solid wins on their own, and the per-service chaos API is cleanly guarded. Findings from the review are non-blocking — leaving them for follow-up at your discretion.

@liam liam requested a review from freemans13 May 28, 2026 09:22
@liam liam merged commit 7db6ae6 into bsv-blockchain:main May 28, 2026
25 checks passed
liam added a commit to liam/teranode that referenced this pull request Jun 11, 2026
…6 (subtreevalidation pause)

Two new split-topology chaos scenarios building on the harness landed in bsv-blockchain#958
and the un-skip work in bsv-blockchain#995, plus the split-mode settings fix that scenario 05
turns out to depend on.

Scenario 05: kills teranode3-validator, mines 5 blocks on node 1, asserts
node 3 stalls at baseline (block-tx validation walks blockvalidation ->
subtreevalidation -> validator), restarts validator, asserts catch-up and
3-node convergence.

This only holds when subtreevalidation calls the standalone validator
container over gRPC, NOT when it embeds an in-process validator.
settings.conf:1212 ships useLocalValidator=true (the right default for
all-in-one), so a vanilla docker.teranode{N}.test context would build the
in-process validator and ignore the validator container entirely - making
scenario 05 a no-op (raised in PR review on bsv-blockchain#1069). The split-mode overlay
generated by compose/cmd/gennodes/templates/settings.conf.tmpl now flips
useLocalValidator=false per node so the kill is actually observable. The
override is gated on {{if not $.AllInOne}}, so all-in-one mode is unchanged.
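
The {{if not $.AllInOne}} gate can be reproduced with Go's text/template. This is a sketch only: the settings key layout (useLocalValidator.docker.teranode<N>.test) and the overlayData fields are assumptions for illustration, not the contents of settings.conf.tmpl.

```go
package main

import (
	"strings"
	"text/template"
)

// overlayTmpl emits one override line per node, but only in split mode.
// The key layout is hypothetical.
const overlayTmpl = "{{range .Nodes}}{{if not $.AllInOne}}useLocalValidator.docker.teranode{{.}}.test = false\n{{end}}{{end}}"

// overlayData is a hypothetical template input.
type overlayData struct {
	AllInOne bool
	Nodes    []int
}

// renderOverlay executes the template and returns the generated overlay text.
func renderOverlay(d overlayData) (string, error) {
	t, err := template.New("overlay").Parse(overlayTmpl)
	if err != nil {
		return "", err
	}
	var sb strings.Builder
	if err := t.Execute(&sb, d); err != nil {
		return "", err
	}
	return sb.String(), nil
}
```

Inside range, the dot is the current node number, so the root-level flag has to be reached through $, which is why the gate reads $.AllInOne.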

Scenario 06: PAUSES teranode3-subtreevalidation via docker pause (a freeze of all container processes, done on Linux with the cgroup freezer, not a literal SIGSTOP)
rather than killing, so the gRPC call from blockvalidation hangs (the frozen
dependency failure mode, distinct from a process that has exited). First
scenario to exercise the pause/unpause verbs. Uses defer UnpauseService so a
failed assertion leaves the shared stack healthy for the next scenario's Reset.

Both are gated on //go:build network_chaos like the rest of the split-topology
suite.
liam added a commit to liam/teranode that referenced this pull request Jun 11, 2026
…6 (subtreevalidation pause)

Two new split-topology chaos scenarios building on the harness landed in bsv-blockchain#958
and the un-skip work in bsv-blockchain#995, plus the split-mode useLocalValidator override
that scenario 05 depends on and a chaos_unpause idempotency fix that
scenario 06 depends on.

Scenario 05: kills teranode3-validator, mines 5 blocks on node 1, asserts
node 3 stalls at baseline (block-tx validation walks blockvalidation ->
subtreevalidation -> validator), restarts validator, asserts catch-up and
3-node convergence.

This only holds when subtreevalidation calls the standalone validator
container over gRPC, NOT when it embeds an in-process validator.
settings.conf ships useLocalValidator=true (the right default for all-in-one),
so a vanilla docker.teranode{N}.test context would build the in-process
validator and ignore the validator container entirely - making scenario 05
a silent no-op (raised in PR review on bsv-blockchain#1069). The split-mode overlay
generated by compose/cmd/gennodes/templates/settings.conf.tmpl now flips
useLocalValidator=false per node so the kill is actually observable. The
override is gated on {{if not $.AllInOne}}, so all-in-one mode is unchanged.

Scenario 06: PAUSES teranode3-subtreevalidation via docker pause (a freeze of all container processes, done on Linux with the cgroup freezer, not a literal SIGSTOP)
rather than killing, so the gRPC call from blockvalidation hangs (the frozen
dependency failure mode, distinct from a process that has exited). First
scenario to exercise the pause/unpause verbs. Uses defer UnpauseService so a
failed assertion leaves the shared stack healthy for the next scenario's Reset.

Bash fix: chaos_unpause is now idempotent across all three branches. The
single-service and bulk-aio branches were previously running a bare
'docker unpause', which exits non-zero on an already-running container.
Scenario 06 runs a defensive defer-unpause alongside an explicit one (defer
is the safety net for assertion failure; explicit unblocks the catch-up
assertion), and the bare docker call turned the second one into t.Fatalf,
failing the test on a green run. The bulk-split branch already had the
'|| true' pattern; this extends it for consistency. Same fix surfaced
independently in PR review on bsv-blockchain#1070 against scenario 08. Idempotent
semantics are what 'chaos cleanup' actually wants anyway.
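
The double-unpause failure is easy to reproduce in miniature. The sketch below uses an in-memory fakeContainer (hypothetical, no docker involved) to contrast a strict unpause, which errors on an already-running container like a bare docker unpause, with an idempotent one analogous to the '|| true' pattern.

```go
package main

import "errors"

// errNotPaused models the failure of unpausing a container that is running.
var errNotPaused = errors.New("container is not paused")

// fakeContainer is a hypothetical in-memory stand-in for a docker container.
type fakeContainer struct{ paused bool }

func (c *fakeContainer) pause() { c.paused = true }

// unpauseStrict fails when the container is already running.
func (c *fakeContainer) unpauseStrict() error {
	if !c.paused {
		return errNotPaused
	}
	c.paused = false
	return nil
}

// unpauseIdempotent treats "not paused" as success.
func (c *fakeContainer) unpauseIdempotent() error {
	if err := c.unpauseStrict(); err != nil && !errors.Is(err, errNotPaused) {
		return err
	}
	return nil
}

// runScenario shows the double-unpause shape: an explicit unpause plus a
// deferred safety net that runs even if an assertion fails in between.
func runScenario(c *fakeContainer, unpause func() error) (err error) {
	c.pause()
	defer func() {
		if derr := unpause(); derr != nil && err == nil {
			err = derr
		}
	}()
	return unpause() // explicit unpause before the catch-up assertion
}
```

With the strict variant the deferred call fails a run that was otherwise green; with the idempotent variant both calls succeed.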

Both scenarios are gated on //go:build network_chaos like the rest of the
split-topology suite.