test(multinode): split-per-service chaos harness + scenario_04 (skipped pending teranode fixes)#958
Conversation
aerospike.conf.tmpl requests proto-fd-max=15000 but the aerospike
service definition inherited the docker daemon's 1024 nofile default,
so aerospike aborted at startup with:
CRITICAL (config): 1024 system file descriptors not enough,
config specified 15000
This affected both topologies; the all-in-one network-chaos suite was
silently broken on any host whose /etc/docker/daemon.json doesn't set
default-ulimits. Set ulimits.nofile on the aerospike service so the
generated compose ships a working stack regardless of host config.
Extend the harness with split-mode awareness so tests can target
individual service containers, then ship the first scenario that
showcases what split-mode chaos buys you beyond the all-in-one suite.
Harness changes (test/multinode/harness/):
- Stack gains a splitMode flag; new ProvisionSplit() constructor
passes -allinone=0 to multinode.sh up.
- Container helpers (Reset, waitNodeReady, dumpDiagnostics) now
enumerate every per-service container for a node rather than
assuming the single monolithic teranodeN-multinode name.
- chaos.go: KillService / StartService / PauseService /
UnpauseService delegate to multinode.sh chaos <verb> <n> <svc>
and refuse to run unless the stack was provisioned in split mode.
New package test/multinode_split/:
- TestMain provisions a 3-node split stack and shares it across
scenarios (mirrors the all-in-one pattern in test/multinode/).
- scenario_04_block_assembly_isolation pins the real failure mode
observed when blockassembly is killed: blockvalidation gates on
WaitForBlockAssemblyReady for every inbound block, so node 3's
chain stalls at baseline even though every other service is up.
Restarting blockassembly clears the gate and the node catches up.
This dependency is invisible from the all-in-one suite because
you can't kill blockassembly there without taking the whole node
down with it. Surfacing that hidden coupling is the point.
New make target:
- network-chaos-test-split runs the split suite separately from
network-chaos-test (the two stacks can't coexist; split takes
materially longer to start).
Verified locally on an -allinone=0 stack: scenario passes in ~2m
end-to-end. Stack teardown is clean.
… pending teranode fixes
Harness improvements that stand on their own merit:
- rpc.go: pin BaseURL to 127.0.0.1 instead of localhost. Docker's
per-port proxy only listens on IPv4 by default, so a localhost dial
that Happy-Eyeballs to ::1 first occasionally surfaces ECONNREFUSED
in polling loops even though the IPv4 listener is fine. Pinning to
127.0.0.1 side-steps the dual-stack race.
- wait.go: dumpNodeLogs was still hardcoded to teranodeN-multinode,
which doesn't exist in split mode. Enumerate via docker ps with the
same regex pattern Stack.nodeContainers uses so failure diagnostics
work under either topology.
Skip scenario_04 with t.Skip until two teranode robustness issues are
fixed:
1. utxopersister.CreateUTXOSet nil-pointer panic on startup when
processing the height-1 probe block ("Processing block <nil>
height 0" → SIGSEGV in UTXOSet.go:527). Crashes core sidecars
before the test starts; TestMain reports
"waitForMesh: probe block ... did not propagate" with heights
map[N:-1] (RPC unreachable because core exited). Non-deterministic
but triggers often enough to make the test unrunnable.
2. legacy peer-protocol parser returns "unknown magic: [...]" when
receiving a block from a peer whose blockassembly was killed and
restarted; ServiceManager treats it as fatal and bails the whole
core sidecar on the *receiving* node, so the failure manifests as
RPC connection-refused on healthy-looking nodes during the
converge wait.
The scenario's assertion structure is preserved (and trimmed to stop
after catch-up rather than continuing through the buggy mining-after-
restart path). Once both teranode bugs land, removing the t.Skip
re-enables the test.
The harness extension itself (ProvisionSplit, KillService/StartService
/PauseService/UnpauseService, split-aware Reset / nodeContainers,
ulimits on aerospike) is independent of these bugs and remains useful
infrastructure for future split-mode scenarios.
|
🤖 Claude Code Review Status: Complete SummaryThis PR extends the network-chaos test harness with split-per-service topology support and infrastructure fixes. The implementation is well-structured and follows good testing practices. One minor documentation inconsistency was found. Findings[Minor] Documentation accuracy issue in test/multinode_split/main_test.go:11The package comment states:
However, this PR already adds the split-mode target // Use make network-chaos-test-split to run them.Code QualityStrengths:
Infrastructure fixes are sound:
VerificationThe PR appropriately skips the actual scenario test (t.Skip) due to two documented teranode bugs, while shipping the harness infrastructure. This is a reasonable engineering tradeoff—the harness work is independent and valuable even before the bugs are fixed. |
|
Benchmark Comparison ReportBaseline: Current: Summary
All benchmark results (sec/op)
Threshold: >10% with p < 0.05 | Generated: 2026-05-27 14:39 UTC |
ordishs
left a comment
There was a problem hiding this comment.
LGTM. Harness refactor is backward-compatible, ulimits and 127.0.0.1 fixes are solid wins on their own, and the per-service chaos API is cleanly guarded. Findings from the review are non-blocking — leaving them for follow-up at your discretion.
…6 (subtreevalidation pause) Two new split-topology chaos scenarios building on the harness landed in bsv-blockchain#958 and the un-skip work in bsv-blockchain#995, plus the split-mode settings fix that scenario 05 turns out to depend on. Scenario 05: kills teranode3-validator, mines 5 blocks on node 1, asserts node 3 stalls at baseline (block-tx validation walks blockvalidation -> subtreevalidation -> validator), restarts validator, asserts catch-up and 3-node convergence. This only holds when subtreevalidation calls the standalone validator container over gRPC, NOT when it embeds an in-process validator. settings.conf:1212 ships useLocalValidator=true (the right default for all-in-one), so a vanilla docker.teranode{N}.test context would build the in-process validator and ignore the validator container entirely - making scenario 05 a no-op (raised in PR review on bsv-blockchain#1069). The split-mode overlay generated by compose/cmd/gennodes/templates/settings.conf.tmpl now flips useLocalValidator=false per node so the kill is actually observable. The override is gated on {{if not $.AllInOne}}, so all-in-one mode is unchanged. Scenario 06: PAUSES teranode3-subtreevalidation via docker pause (SIGSTOP) rather than killing, so the gRPC call from blockvalidation hangs (the frozen dependency failure mode, distinct from a process that has exited). First scenario to exercise the pause/unpause verbs. Uses defer UnpauseService so a failed assertion leaves the shared stack healthy for the next scenario's Reset. Both are gated on //go:build network_chaos like the rest of the split-topology suite.
…6 (subtreevalidation pause) Two new split-topology chaos scenarios building on the harness landed in bsv-blockchain#958 and the un-skip work in bsv-blockchain#995, plus the split-mode useLocalValidator override that scenario 05 depends on and a chaos_unpause idempotency fix that scenario 06 depends on. Scenario 05: kills teranode3-validator, mines 5 blocks on node 1, asserts node 3 stalls at baseline (block-tx validation walks blockvalidation -> subtreevalidation -> validator), restarts validator, asserts catch-up and 3-node convergence. This only holds when subtreevalidation calls the standalone validator container over gRPC, NOT when it embeds an in-process validator. settings.conf ships useLocalValidator=true (the right default for all-in-one), so a vanilla docker.teranode{N}.test context would build the in-process validator and ignore the validator container entirely - making scenario 05 a silent no-op (raised in PR review on bsv-blockchain#1069). The split-mode overlay generated by compose/cmd/gennodes/templates/settings.conf.tmpl now flips useLocalValidator=false per node so the kill is actually observable. The override is gated on {{if not $.AllInOne}}, so all-in-one mode is unchanged. Scenario 06: PAUSES teranode3-subtreevalidation via docker pause (SIGSTOP) rather than killing, so the gRPC call from blockvalidation hangs (the frozen dependency failure mode, distinct from a process that has exited). First scenario to exercise the pause/unpause verbs. Uses defer UnpauseService so a failed assertion leaves the shared stack healthy for the next scenario's Reset. Bash fix: chaos_unpause is now idempotent across all three branches. The single-service and bulk-aio branches were previously running a bare 'docker unpause', which exits non-zero on an already-running container. Scenario 06 runs a defensive defer-unpause alongside an explicit one (defer is the safety net for assertion failure; explicit unblocks the catch-up assertion), and the bare docker call turned the second one into t.Fatalf, failing the test on a green run. The bulk-split branch already had the '|| true' pattern; this extends it for consistency. Same fix surfaced independently in PR review on bsv-blockchain#1070 against scenario 08. Idempotent semantics are what 'chaos cleanup' actually wants anyway. Both scenarios are gated on //go:build network_chaos like the rest of the split-topology suite.



Summary
Extends the existing all-in-one network-chaos harness (
test/multinode/) with a sibling split-per-service variant (test/multinode_split/) and ships the first scenario that targets it.The harness work + ulimits fix is independent infrastructure; the actual scenario is
t.Skip-ed pending two teranode robustness issues that surfaced while developing it (details below).What's in this PR
fix(compose): bump aerospike nofile to 65536 in both templates(ea8f1eb41)aerospike.conf.tmplrequestsproto-fd-max = 15000but the aerospike compose service definition inherits the docker daemon's 1024 nofile default, so aerospike aborts at startup withCRITICAL: 1024 system file descriptors not enoughon any host whose/etc/docker/daemon.jsondoesn't overridedefault-ulimits. Adds an explicitulimits.nofile: 65536in both topologies so the generated stack starts cleanly out of the box. This silently fixes the existing all-in-onemake network-chaos-testfor affected hosts too.test(multinode): add split-per-service chaos suite + scenario_04(ab52826ec)Harness extensions in
test/multinode/harness/:Stack.splitModeflag; newProvisionSplit()constructor passes-allinone=0tomultinode.sh up.KillService/StartService/PauseService/UnpauseServicemethods delegate tomultinode.sh chaos <verb> <n> <svc>and refuse to run unless the stack is in split mode.Reset,waitNodeReady,dumpDiagnostics) refactored to enumerate every per-service container for a node vianodeContainers(n)rather than assuming the monolithicteranodeN-multinodename.New package
test/multinode_split/:TestMainprovisions a 3-node split stack (~32 containers, the smallest mesh that gives one isolation target plus two survivors).scenario_04_block_assembly_isolationdocuments theblockvalidation → blockassemblyruntime dependency: blockvalidation'sWaitForBlockAssemblyReadygate fires on every inbound block, so killing blockassembly stalls validation even though the blockvalidation container is healthy. Visibility into that coupling is the point of split-mode chaos and is unreachable from the all-in-one suite.New
make network-chaos-test-splittarget (separate fromnetwork-chaos-testbecause the two stacks cannot coexist and split bring-up is materially slower).test(multinode): split-aware diagnostics + IPv4 RPC; skip scenario_04 pending teranode fixes(8caa1e71f)rpc.go: pinBaseURLto127.0.0.1instead oflocalhostso the harness's polling loops don't trip on docker's IPv4-only proxy (::1ECONNREFUSED races).wait.go:dumpNodeLogswas hardcoded to the monolith container name and silently failed under split mode; now enumerates viadocker ps.scenario_04flaggedt.Skip()with a comment block pinning the two teranode bugs that block reliable execution (see below).Skipped scenario: the teranode bugs blocking it
Surfaced by running the scenario; both are out of scope for this PR but documented in code so they're not lost.
utxopersister.CreateUTXOSetnil-pointer panic on startup (services/utxopersister/UTXOSet.go:527). The trigger says "Processing block height 1" butprocessNextBlockthen logs "Processing block height 0" andCreateUTXOSetSIGSEGVs. Brings down thecoresidecar before tests can run, soTestMain's mesh probe fails withheights=map[N:-1]. Non-deterministic but triggers often enough to make the scenario unrunnable.legacy peer-protocol "unknown magic" crash on the receiving node when a peer broadcasts a block produced by a freshly-restarted blockassembly.
ServiceManagertreats it as fatal and gracefully exits the entire core sidecar, manifesting as RPCconnection refusedon healthy-looking peers during convergence.Once both are fixed, deleting one
t.Skip(...)line re-enables the scenario.Test plan
go build -tags network_chaos ./test/...cleango vet -tags network_chaos ./test/...cleango test ./compose/cmd/gennodes/still passes (template change)compose/multinode.sh up 3 -allinone=0brings up a healthy 3-node split stack with the new ulimitsgo test -tags network_chaos -run TestBlockAssemblyIsolation ./test/multinode_split/skips cleanly (1m mesh setup + immediate skip)make network-chaos-test(all-in-one) still passes locally — please confirm in your environment