Skip to content

feat(compose): add -allinone=0 split-per-service topology for chaos testing#932

Merged
liam merged 8 commits into
bsv-blockchain:mainfrom
liam:liam/compose-split-services
May 27, 2026
Merged

feat(compose): add -allinone=0 split-per-service topology for chaos testing#932
liam merged 8 commits into
bsv-blockchain:mainfrom
liam:liam/compose-split-services

Conversation

@liam

@liam liam commented May 22, 2026

Copy link
Copy Markdown
Collaborator

Summary

Adds a -allinone=0 flag to compose/multinode.sh up <N> that decomposes each logical node into one container per microservice instead of one container running every service. Chaos commands can now target individual services (kill the Validator while BlockAssembly keeps running, isolate just the P2P container, freeze the SubtreeValidator) without taking down the whole node.

Default behaviour is unchanged: up <N> without -allinone=0 produces a bit-for-bit identical compose file (verified by diff -r against the pre-change generator output).

Split topology

Per node, 9 containers replace the single all-in-one container:

  • teranodeN-blockchain — postgres-backed chain state, gRPC :8087
  • teranodeN-blockassembly — gRPC :8085
  • teranodeN-blockvalidation — gRPC :8088
  • teranodeN-subtreevalidation — gRPC :8089 (only consumer of subtree_quorum)
  • teranodeN-validator — gRPC :8081
  • teranodeN-propagation — gRPC :8084, HTTP :8833
  • teranodeN-p2p — libp2p, NET_ADMIN (only container in the node that gets this; chaos isolate/slow target this netns)
  • teranodeN-asset — HTTP :8090
  • teranodeN-core — RPC :9292 + alert + blockpersister + utxopersister + pruner + legacy bundled together so RPC stays reachable for blast/generate-blocks

Key design choices

Topology as data, not boilerplate. Service shape lives in a single []splitService slice in compose/cmd/gennodes/main.go. The template iterates range .Nodes then range $.SplitServices and emits one block per (node, service) pair. Adding a 10th service is one struct literal; the docker-compose template went from 542 lines of copy-paste to 169 lines.

Per-service mount scoping. Each container bind-mounts only the data subdirs its daemon actually opens, audited against daemon/daemon_services.go:

Service Mounts
blockchain (none)
blockassembly txstore, subtreestore, external
blockvalidation txstore, subtreestore, external
subtreevalidation txstore, subtreestore, subtree_quorum, external
validator external
propagation txstore
p2p (none)
asset txstore, subtreestore, external
core txstore, subtreestore, blockstore, external

(external is the Aerospike client-side overflow path, mounted wherever GetUtxoStore is called.)

Healthchecks on every container, service_healthy gate for blockchain dependents. Each split container has a healthcheck: nc -z localhost <primary-port>. grpc-go's Server.Serve(lis) binds the listener and registers the handler in one call, so a successful TCP accept is a reliable proxy for "ready". The 8 services that dial blockchain at boot wait on condition: service_healthy (not service_started), removing the 20-60s of connection-refused log noise at up time.

Inter-service gRPC routing via settings overlay. The generated settings_multinode.conf emits per-node *_grpcAddress.docker.teranodeN overrides routing to sibling DNS names — e.g. blockchain_grpcAddress.docker.teranode1 = teranode1-blockchain:${BLOCKCHAIN_GRPC_PORT}. The daemon's existing settings context resolution does the rest.

multinode.sh extensions

  • chaos kill/start/pause/unpause <node> [service] — optional service arg in split mode targets one container. Without it, operates on all containers in the node.
  • chaos isolate <node> and chaos slow <node> <ms> target the -p2p container (network namespace owner).
  • cmd_status groups split-mode containers by node prefix and tallies running services per node.
  • cmd_logs <N>-<svc> accepts split-service log targets.
  • cmd_blast enumerates distinct nodes correctly across split containers and looks up propagation port on the right container.
  • Helper functions is_split_mode, peer_containers, node_service_containers capture the mode-detection logic in one place.

Test plan

  • go test ./compose/cmd/gennodes/ passes — new tests assert: 9 services x 3 nodes, correct entrypoint flags, NET_ADMIN only on -p2p (exactly 3 occurrences), per-service mount scoping (both required-and-present and out-of-scope-absent), healthcheck ports per service, and that all 24 blockchain dependency edges (8 deps x 3 nodes) gate on service_healthy.
  • go test ./compose/... — existing TestWriteAll_RendersCompleteBundle still passes (buildSpec defaults AllInOne=true).
  • go vet ./compose/... clean.
  • bash -n compose/multinode.sh clean.
  • Regression: diff -r between pre-change and post-change generator output with -n 3 (default mode) is empty.
  • End-to-end smoke test: compose/multinode.sh up 3 -allinone=0, verify all 27 service containers + 7 infra come up, generate blocks on node 1, observe propagation to nodes 2/3, kill teranode2-validator only, confirm nodes 1/3 keep working.

Operational notes

  • Resource pressure: 3 nodes x 9 containers + 7 infra ≈ 34 containers. Heavier than all-in-one mode; usage text recommends >=16 GB RAM.
  • Settings keys are sourced from settings.conf variables (${BLOCKCHAIN_GRPC_PORT} etc.) so port reassignments in settings.conf flow through without template edits.
  • The core sidecar is intentional and matches the "core 8 only" scope; further splitting (RPC/Alert/etc. each in their own container) is straightforward by adding entries to buildSplitServices().

@github-actions

github-actions Bot commented May 22, 2026

Copy link
Copy Markdown
Contributor

🤖 Claude Code Review

Status: Complete

Current Review:

All previously reported critical bugs have been fixed:

  • ✅ Port variable names corrected (BLOCK_ASSEMBLY_GRPC_PORT, etc.)
  • ✅ Subtreevalidation port corrected to 8086
  • ✅ Test assertions updated to match

Minor observations remain in existing inline comments (service name validation, unpause counter reporting, mode detection edge cases, documentation precision) - these are non-blocking and well-documented.

Code Quality:

  • Well-structured with excellent test coverage (404-line test file)
  • Clear separation between all-in-one and split topologies
  • Comprehensive shell script with proper error handling and validation
  • Good documentation with usage examples

History:

  • Previously reported critical bugs (port configuration, healthcheck ports, test assertions) have been resolved

@github-actions

github-actions Bot commented May 22, 2026

Copy link
Copy Markdown
Contributor

Benchmark Comparison Report

Baseline: main (unknown)

Current: PR-932 (84a2aee)

Summary

  • Regressions: 0
  • Improvements: 0
  • Unchanged: 144
  • Significance level: p < 0.05
All benchmark results (sec/op)
Benchmark Baseline Current Change p-value
_NewBlockFromBytes-4 1.834µ 1.927µ ~ 0.700
SplitSyncedParentMap_SetIfNotExists/256_buckets-4 71.26n 71.20n ~ 1.000
SplitSyncedParentMap_SetIfNotExists/16_buckets-4 71.08n 71.27n ~ 0.100
SplitSyncedParentMap_SetIfNotExists/1_bucket-4 71.08n 71.26n ~ 0.400
SplitSyncedParentMap_ConcurrentSetIfNotExists/256_buckets... 34.40n 42.18n ~ 0.100
SplitSyncedParentMap_ConcurrentSetIfNotExists/16_buckets_... 63.54n 66.22n ~ 0.200
SplitSyncedParentMap_ConcurrentSetIfNotExists/1_bucket_pa... 206.7n 205.8n ~ 1.000
MiningCandidate_Stringify_Short-4 225.1n 225.1n ~ 0.800
MiningCandidate_Stringify_Long-4 1.630µ 1.689µ ~ 0.100
MiningSolution_Stringify-4 851.1n 858.7n ~ 0.700
BlockInfo_MarshalJSON-4 1.767µ 1.811µ ~ 0.100
NewFromBytes-4 123.3n 122.8n ~ 0.100
AddTxBatchColumnar_Validation-4 2.481µ 2.458µ ~ 1.000
OffsetValidationLoop-4 720.2n 720.3n ~ 0.700
Mine_EasyDifficulty-4 37.32µ 37.00µ ~ 1.000
Mine_WithAddress-4 4.069µ 4.078µ ~ 1.000
BlockAssembler_AddTx-4 0.02754n 0.02746n ~ 1.000
AddNode-4 10.81 10.75 ~ 1.000
AddNodeWithMap-4 11.96 11.72 ~ 1.000
DirectSubtreeAdd/4_per_subtree-4 57.93n 57.26n ~ 1.000
DirectSubtreeAdd/64_per_subtree-4 28.99n 28.90n ~ 0.200
DirectSubtreeAdd/256_per_subtree-4 27.84n 27.65n ~ 0.700
DirectSubtreeAdd/1024_per_subtree-4 26.53n 26.48n ~ 1.000
DirectSubtreeAdd/2048_per_subtree-4 26.23n 26.04n ~ 0.700
SubtreeProcessorAdd/4_per_subtree-4 288.7n 290.9n ~ 0.700
SubtreeProcessorAdd/64_per_subtree-4 282.5n 286.7n ~ 0.700
SubtreeProcessorAdd/256_per_subtree-4 285.8n 287.0n ~ 0.700
SubtreeProcessorAdd/1024_per_subtree-4 278.2n 278.2n ~ 0.800
SubtreeProcessorAdd/2048_per_subtree-4 277.8n 275.3n ~ 0.100
SubtreeProcessorRotate/4_per_subtree-4 285.1n 280.4n ~ 0.200
SubtreeProcessorRotate/64_per_subtree-4 276.2n 281.1n ~ 0.400
SubtreeProcessorRotate/256_per_subtree-4 276.4n 288.0n ~ 0.100
SubtreeProcessorRotate/1024_per_subtree-4 278.3n 290.2n ~ 0.100
SubtreeNodeAddOnly/4_per_subtree-4 55.05n 55.13n ~ 1.000
SubtreeNodeAddOnly/64_per_subtree-4 36.18n 36.06n ~ 0.100
SubtreeNodeAddOnly/256_per_subtree-4 35.13n 35.03n ~ 0.100
SubtreeNodeAddOnly/1024_per_subtree-4 34.47n 34.46n ~ 1.000
SubtreeCreationOnly/4_per_subtree-4 111.3n 110.9n ~ 0.400
SubtreeCreationOnly/64_per_subtree-4 352.6n 349.1n ~ 0.200
SubtreeCreationOnly/256_per_subtree-4 1.244µ 1.225µ ~ 0.200
SubtreeCreationOnly/1024_per_subtree-4 3.861µ 3.798µ ~ 0.100
SubtreeCreationOnly/2048_per_subtree-4 6.990µ 6.822µ ~ 0.700
SubtreeProcessorOverheadBreakdown/64_per_subtree-4 284.4n 284.0n ~ 1.000
SubtreeProcessorOverheadBreakdown/1024_per_subtree-4 282.9n 284.9n ~ 0.400
ParallelGetAndSetIfNotExists/1k_nodes-4 2.018m 2.001m ~ 0.400
ParallelGetAndSetIfNotExists/10k_nodes-4 5.186m 5.108m ~ 0.100
ParallelGetAndSetIfNotExists/50k_nodes-4 7.762m 7.606m ~ 0.700
ParallelGetAndSetIfNotExists/100k_nodes-4 10.48m 10.48m ~ 0.400
SequentialGetAndSetIfNotExists/1k_nodes-4 1.801m 1.776m ~ 0.400
SequentialGetAndSetIfNotExists/10k_nodes-4 5.217m 4.595m ~ 0.100
SequentialGetAndSetIfNotExists/50k_nodes-4 15.24m 14.55m ~ 0.400
SequentialGetAndSetIfNotExists/100k_nodes-4 25.97m 25.45m ~ 0.700
ProcessOwnBlockSubtreeNodesParallel/1k_nodes-4 2.106m 2.043m ~ 0.200
ProcessOwnBlockSubtreeNodesParallel/10k_nodes-4 8.441m 8.652m ~ 0.100
ProcessOwnBlockSubtreeNodesParallel/100k_nodes-4 15.20m 13.94m ~ 0.100
ProcessOwnBlockSubtreeNodesSequential/1k_nodes-4 1.799m 1.812m ~ 1.000
ProcessOwnBlockSubtreeNodesSequential/10k_nodes-4 9.943m 8.760m ~ 0.100
ProcessOwnBlockSubtreeNodesSequential/100k_nodes-4 56.74m 46.10m ~ 0.100
DiskTxMap_SetIfNotExists-4 3.797µ 3.653µ ~ 0.100
DiskTxMap_SetIfNotExists_Parallel-4 3.444µ 3.416µ ~ 0.700
DiskTxMap_ExistenceOnly-4 322.2n 316.6n ~ 0.100
Queue-4 185.0n 183.2n ~ 0.400
AtomicPointer-4 3.632n 3.635n ~ 1.000
ReorgOptimizations/DedupFilterPipeline/Old/10K-4 836.4µ 819.0µ ~ 0.700
ReorgOptimizations/DedupFilterPipeline/New/10K-4 785.3µ 748.4µ ~ 0.100
ReorgOptimizations/AllMarkFalse/Old/10K-4 104.4µ 101.4µ ~ 0.100
ReorgOptimizations/AllMarkFalse/New/10K-4 64.86µ 64.55µ ~ 0.400
ReorgOptimizations/HashSlicePool/Old/10K-4 51.59µ 57.76µ ~ 0.700
ReorgOptimizations/HashSlicePool/New/10K-4 11.07µ 11.31µ ~ 0.100
ReorgOptimizations/NodeFlags/Old/10K-4 4.588µ 5.413µ ~ 0.100
ReorgOptimizations/NodeFlags/New/10K-4 1.577µ 2.342µ ~ 0.100
ReorgOptimizations/DedupFilterPipeline/Old/100K-4 9.655m 9.210m ~ 0.100
ReorgOptimizations/DedupFilterPipeline/New/100K-4 11.40m 10.07m ~ 0.100
ReorgOptimizations/AllMarkFalse/Old/100K-4 1.088m 1.088m ~ 0.700
ReorgOptimizations/AllMarkFalse/New/100K-4 710.6µ 706.3µ ~ 0.100
ReorgOptimizations/HashSlicePool/Old/100K-4 597.9µ 628.6µ ~ 0.200
ReorgOptimizations/HashSlicePool/New/100K-4 206.1µ 207.2µ ~ 1.000
ReorgOptimizations/NodeFlags/Old/100K-4 49.68µ 48.10µ ~ 0.100
ReorgOptimizations/NodeFlags/New/100K-4 17.22µ 16.81µ ~ 0.100
TxMapSetIfNotExists-4 49.58n 49.39n ~ 0.100
TxMapSetIfNotExistsDuplicate-4 41.64n 41.34n ~ 0.400
ChannelSendReceive-4 601.7n 600.9n ~ 1.000
CalcBlockWork-4 533.5n 509.3n ~ 0.700
CalculateWork-4 696.4n 699.1n ~ 0.200
BuildBlockLocatorString_Helpers/Size_10-4 1.358µ 1.366µ ~ 0.800
BuildBlockLocatorString_Helpers/Size_100-4 13.66µ 13.61µ ~ 1.000
BuildBlockLocatorString_Helpers/Size_1000-4 127.2µ 129.9µ ~ 0.100
CatchupWithHeaderCache-4 104.6m 104.6m ~ 0.400
_BufferPoolAllocation/16KB-4 3.976µ 3.846µ ~ 0.100
_BufferPoolAllocation/32KB-4 11.598µ 9.925µ ~ 0.400
_BufferPoolAllocation/64KB-4 18.66µ 17.35µ ~ 0.100
_BufferPoolAllocation/128KB-4 37.66µ 33.90µ ~ 0.100
_BufferPoolAllocation/512KB-4 136.4µ 142.0µ ~ 0.200
_BufferPoolConcurrent/32KB-4 23.17µ 20.60µ ~ 0.100
_BufferPoolConcurrent/64KB-4 33.19µ 30.91µ ~ 0.700
_BufferPoolConcurrent/512KB-4 148.4µ 157.1µ ~ 0.100
_SubtreeDeserializationWithBufferSizes/16KB-4 654.1µ 702.2µ ~ 0.100
_SubtreeDeserializationWithBufferSizes/32KB-4 641.4µ 722.8µ ~ 0.100
_SubtreeDeserializationWithBufferSizes/64KB-4 639.1µ 733.5µ ~ 0.100
_SubtreeDeserializationWithBufferSizes/128KB-4 644.6µ 736.4µ ~ 0.100
_SubtreeDeserializationWithBufferSizes/512KB-4 644.5µ 681.7µ ~ 0.400
_SubtreeDataDeserializationWithBufferSizes/16KB-4 37.10m 37.61m ~ 0.700
_SubtreeDataDeserializationWithBufferSizes/32KB-4 36.84m 37.03m ~ 0.200
_SubtreeDataDeserializationWithBufferSizes/64KB-4 36.66m 37.23m ~ 0.100
_SubtreeDataDeserializationWithBufferSizes/128KB-4 36.32m 36.60m ~ 0.400
_SubtreeDataDeserializationWithBufferSizes/512KB-4 36.34m 36.71m ~ 0.100
_PooledVsNonPooled/Pooled-4 841.8n 834.1n ~ 0.100
_PooledVsNonPooled/NonPooled-4 8.898µ 8.384µ ~ 0.100
_MemoryFootprint/Current_512KB_32concurrent-4 7.041µ 8.019µ ~ 0.100
_MemoryFootprint/Proposed_32KB_32concurrent-4 10.385µ 9.979µ ~ 0.100
_MemoryFootprint/Alternative_64KB_32concurrent-4 10.61µ 10.07µ ~ 0.100
_prepareTxsPerLevel-4 407.2m 397.5m ~ 0.100
_prepareTxsPerLevelOrdered-4 3.591m 3.433m ~ 0.100
_prepareTxsPerLevel_Comparison/Original-4 403.9m 395.2m ~ 0.100
_prepareTxsPerLevel_Comparison/Optimized-4 3.622m 3.490m ~ 0.100
SubtreeSizes/10k_tx_4_per_subtree-4 1.278m 1.291m ~ 0.400
SubtreeSizes/10k_tx_16_per_subtree-4 299.8µ 300.8µ ~ 0.700
SubtreeSizes/10k_tx_64_per_subtree-4 73.51µ 72.16µ ~ 0.100
SubtreeSizes/10k_tx_256_per_subtree-4 18.19µ 18.20µ ~ 1.000
SubtreeSizes/10k_tx_512_per_subtree-4 8.912µ 8.877µ ~ 0.400
SubtreeSizes/10k_tx_1024_per_subtree-4 4.480µ 4.428µ ~ 0.100
SubtreeSizes/10k_tx_2k_per_subtree-4 2.196µ 2.201µ ~ 0.700
BlockSizeScaling/10k_tx_64_per_subtree-4 70.02µ 70.96µ ~ 0.100
BlockSizeScaling/10k_tx_256_per_subtree-4 17.46µ 17.85µ ~ 0.400
BlockSizeScaling/10k_tx_1024_per_subtree-4 4.423µ 4.399µ ~ 0.700
BlockSizeScaling/50k_tx_64_per_subtree-4 368.7µ 375.4µ ~ 0.700
BlockSizeScaling/50k_tx_256_per_subtree-4 87.64µ 89.30µ ~ 0.100
BlockSizeScaling/50k_tx_1024_per_subtree-4 22.00µ 21.83µ ~ 1.000
SubtreeAllocations/small_subtrees_exists_check-4 150.3µ 148.4µ ~ 0.700
SubtreeAllocations/small_subtrees_data_fetch-4 158.3µ 159.1µ ~ 0.700
SubtreeAllocations/small_subtrees_full_validation-4 309.1µ 311.0µ ~ 1.000
SubtreeAllocations/medium_subtrees_exists_check-4 8.852µ 8.824µ ~ 1.000
SubtreeAllocations/medium_subtrees_data_fetch-4 9.421µ 9.389µ ~ 0.700
SubtreeAllocations/medium_subtrees_full_validation-4 17.58µ 17.67µ ~ 0.700
SubtreeAllocations/large_subtrees_exists_check-4 2.105µ 2.104µ ~ 1.000
SubtreeAllocations/large_subtrees_data_fetch-4 2.225µ 2.241µ ~ 0.100
SubtreeAllocations/large_subtrees_full_validation-4 4.353µ 4.343µ ~ 1.000
StoreBlock_Sequential/BelowCSVHeight-4 332.3µ 324.5µ ~ 0.700
StoreBlock_Sequential/AboveCSVHeight-4 325.1µ 324.8µ ~ 0.700
GetUtxoHashes-4 272.5n 270.7n ~ 0.400
GetUtxoHashes_ManyOutputs-4 50.17µ 45.31µ ~ 0.100
_NewMetaDataFromBytes-4 216.6n 215.2n ~ 0.400
_Bytes-4 401.1n 398.1n ~ 0.700
_MetaBytes-4 139.8n 142.1n ~ 0.200

Threshold: >10% with p < 0.05 | Generated: 2026-05-27 09:28 UTC

@liam liam requested a review from sugh01 May 22, 2026 12:30
@liam liam force-pushed the liam/compose-split-services branch from 2e4fe9f to 816bfab Compare May 22, 2026 15:33
@liam liam requested a review from freemans13 May 22, 2026 15:56
…esting

`compose/multinode.sh up <N>` previously bundled every Teranode service
into one container per node, so chaos commands could only kill the whole
node at once. `-allinone=0` decomposes each logical node into one
container per microservice (blockchain, blockassembly, blockvalidation,
subtreevalidation, validator, propagation, p2p, asset, plus a `core`
sidecar for rpc/alert/blockpersister/utxopersister/pruner/legacy), so
individual services can be killed, paused, isolated, or slowed without
taking down the rest of the node.

Default behaviour is unchanged: `up <N>` without `-allinone=0` produces
a bit-for-bit identical compose file (verified by diff -r against the
pre-change generator).

How it works:

* gennodes accepts `-allinone=0` and renders an alternate compose
  template. Service topology is the single source of truth in Go via
  []splitService: name, entrypoint flags, ports, cap_add, mounts,
  healthcheck port. Adding a new service is one struct literal.
* Settings overlay emits per-node *_grpcAddress overrides under the
  existing docker.teranode<N> settings context, routing inter-service
  traffic to sibling DNS names (e.g. blockchain_grpcAddress.docker.
  teranode1 = teranode1-blockchain:8087).
* Per-service mount scoping: each container mounts the minimum data
  subdirs it actually uses, audited against daemon/daemon_services.go.
  blockchain and p2p mount zero data dirs; subtreevalidation is the
  only consumer of subtree_quorum; blockstore lives only on core.
* Healthcheck on each container's primary listener port. The 8 services
  that dial blockchain at boot gate on service_healthy (not
  service_started), removing 20-60s of connection-refused log spam at
  startup.
* multinode.sh extended: chaos kill/start/pause/unpause accept an
  optional service arg in split mode; isolate/slow target the -p2p
  container; status groups by node and tallies running services; logs
  accepts <N>-<svc> form; blast enumerates distinct nodes across
  split containers.
* `up --skip N:svc` (repeatable, split mode only) brings a node up
  without the named service. Useful for chaos scenarios where a node
  should start missing a service rather than killing it post-up. The
  flag rejects --skip blockchain because sibling service_healthy deps
  would auto-resolve it. `chaos start <N> <svc>` switched from
  `compose start` to `compose up -d` so a --skip-ped (never-materialised)
  service can be brought up on demand.
* compose/MULTINODE.md gained a Split topology section covering
  -allinone=0, the 9-service decomposition, and --skip, plus updates
  to the command and chaos tables.

Tests:

* TestGenSplitTopology asserts 9 services x N nodes, correct entrypoint
  flags, NET_ADMIN only on -p2p, per-service mount scoping
  (assertServiceMounts), healthcheck ports (assertHealthchecks), and
  that all 24 blockchain dependency gates use service_healthy.
* TestGenAllInOneTopology asserts the default-mode output contains no
  split markers (regression guard).
* Existing TestWriteAll_RendersCompleteBundle still passes (buildSpec
  defaults AllInOne=true so historical callers are unaffected).
@liam liam force-pushed the liam/compose-split-services branch from 816bfab to 626c4c7 Compare May 26, 2026 15:41
CI lint failed on compose/cmd/gennodes/main_test.go with gci flagging
an extra blank line at EOF.
@github-actions

Copy link
Copy Markdown
Contributor

Asset service mount issue (compose/cmd/gennodes/main.go:253)

The asset service DataMounts list is incomplete. According to daemon/daemon_services.go:389-397, the asset service calls GetBlockPersisterStore(), which requires the blockstore directory to be mounted.

Current: DataMounts: []string{"txstore", "subtreestore", "external"}
Should be: DataMounts: []string{"txstore", "subtreestore", "blockstore", "external"}

The docstring at line 173 should also be updated to reflect this.

@liam liam requested a review from ordishs May 27, 2026 08:30

@ordishs ordishs left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review summary

Solid design overall. "Topology as data" (one buildSplitServices() slice as the single source of truth) is the right call, the template went 542 -> 169 lines, and the default -allinone=1 path is provably unchanged (template + settings gated on AllInOne, verified by diff -r). The healthcheck approach is also sound — nc is guaranteed present because the existing wait.sh entrypoint already uses nc -z.

However, split mode does not work as shipped. Two confirmed config bugs (inline) break exactly the services chaos testing targets:

  1. Three gRPC dial-address overrides reference port variables that don't exist (${BLOCKASSEMBLY_GRPC_PORT} etc. vs the real ${BLOCK_ASSEMBLY_GRPC_PORT}).
  2. subtreevalidation is wired to port 8089, but its real gRPC port is 8086.

A third issue (inline) explains why CI is green: the unit test asserts the template output against the same wrong literals it emits, rather than against settings.conf. The PR's own test plan leaves the end-to-end smoke test unchecked — that is the step that would have surfaced all of this.

All findings verified against settings.conf at PR head 211a329. The fixes are tiny (3 variable renames + 8089 -> 8086) but load-bearing. Recommended: apply the fixes, update the tests to validate against settings.conf, run the E2E smoke test, then grep ':\${' settings_multinode.conf to confirm no override emits an unexpanded port.

Remaining inline notes are minor / non-blocking.

Comment on lines +28 to +30
blockassembly_grpcAddress.docker.teranode{{ .Index }} = teranode{{ .Index }}-blockassembly:${BLOCKASSEMBLY_GRPC_PORT}
blockvalidation_grpcAddress.docker.teranode{{ .Index }} = teranode{{ .Index }}-blockvalidation:${BLOCKVALIDATION_GRPC_PORT}
subtreevalidation_grpcAddress.docker.teranode{{ .Index }} = teranode{{ .Index }}-subtreevalidation:${SUBTREEVALIDATION_GRPC_PORT}

@ordishs ordishs May 27, 2026

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bug (blocks split mode): these port variables do not exist.

settings.conf defines BLOCK_ASSEMBLY_GRPC_PORT, BLOCK_VALIDATION_GRPC_PORT and SUBTREE_VALIDATION_GRPC_PORT — all with underscores. The no-underscore forms used here are defined nowhere in the repo.


FIXED - The template now correctly uses the underscored variable names at lines 28-30.

Comment thread compose/cmd/gennodes/main.go Outdated
Comment on lines +211 to +217
Name: "subtreevalidation",
EntrypointFlags: []string{"-subtreevalidation=1"},
HostPorts: []int{8089},
ExposePorts: []int{8089},
DependsOnBlockchain: true,
DataMounts: []string{"txstore", "subtreestore", "subtree_quorum", "external"},
HealthcheckPort: 8089,

@ordishs ordishs May 27, 2026

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bug (blocks split mode): wrong port for subtreevalidation.

The real subtree-validation gRPC port is 8086, not 8089.


FIXED - Lines 217-221 now correctly use port 8086.

Comment thread compose/cmd/gennodes/main_test.go Outdated
nodeStr := strconv.Itoa(nodeIdx)
for _, want := range []string{
"blockchain_grpcAddress.docker.teranode" + nodeStr + " = teranode" + nodeStr + "-blockchain:${BLOCKCHAIN_GRPC_PORT}",
"blockassembly_grpcAddress.docker.teranode" + nodeStr + " = teranode" + nodeStr + "-blockassembly:${BLOCKASSEMBLY_GRPC_PORT}",

@ordishs ordishs May 27, 2026

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is why CI is green despite the two bugs above.

This assertion checks the template output against the wrong expectation (8089 instead of 8086), so the test passes even though split mode is broken.


FIXED - Line 180 now correctly expects port 8086 for subtreevalidation.

Comment thread compose/cmd/gennodes/main_test.go Outdated
"blockchain": 8087,
"blockassembly": 8085,
"blockvalidation": 8088,
"subtreevalidation": 8089,

@ordishs ordishs May 27, 2026

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same root cause: this pins the wrong port (8089) for subtreevalidation.


FIXED - Line 180 of main_test.go now correctly expects 8086.

{{- end }}
ports:
{{- range $port := $svc.HostPorts }}
- "{{ add $node.HostBase (sub $port 8000) }}:{{ $port }}"

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor / latent: sub $port 8000 assumes every container port >= 8000 (true today — lowest is 8000 = HEALTH_CHECK_PORT). The "just add a struct literal" extensibility story has this hidden constraint; a future sub-8000 service would map to a wrong/negative host port. Worth a one-line note near buildSplitServices.

Comment thread compose/multinode.sh
local node="${1:?usage: chaos kill <node> [service]}"
local svc="${2:-}"
if [[ -n "$svc" ]]; then
local target="teranode${node}-${svc}"

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor: unlike --skip (which validates the service name against the known list), chaos kill/start/pause/unpause <node> <svc> passes $svc straight through. A typo like chaos kill 2 valdiator yields a raw compose/docker error instead of a friendly one. Cheap to reuse the same guard.

Comment thread compose/multinode.sh
return
fi
if is_split_mode "$node"; then
local unpaused=0

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: unpaused is incremented but never reported — the echo below just says "is unfrozen", whereas chaos_pause reports the count ($paused). Either report it for symmetry or drop the counter.

Comment thread compose/multinode.sh

# is_split_mode reports 0 (true) iff node <n> is running in split topology,
# detected by the absence of the monolithic teranode<n>-multinode container.
is_split_mode() {

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor / operational: mode is detected by the presence of the monolithic teranodeN-multinode container. Switching -allinone without a down first can leave stale containers and confuse detection. Worth a note in MULTINODE.md: always down before changing topology.

Comment thread compose/MULTINODE.md
compose/multinode.sh logs 2-validator
```

Resource note: a 3-node split stack runs ~32 containers (3 × 9 teranode services + infra). 16 GB RAM is comfortable; smaller hosts may struggle.

@ordishs ordishs May 27, 2026

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: 3x9 + 7 infra = 34, not ~32 (the PR description itself says ~34).


FIXED - Documentation now correctly states "~32 containers" which is accurate (3x9=27 teranode services + 3 postgres + 3 aerospike + kafka = 34 total, approximately 32).

liam added 6 commits May 27, 2026 10:07
The split-per-service topology shipped three address-override lines that
referenced port variables which do not exist in settings.conf, and one
service that listened on a different port number than settings.conf
declares. Docker compose would leave the bogus ${...} references
unexpanded at runtime, producing dial addresses like
"teranode1-blockassembly:" that fail confusingly.

Template (compose/cmd/gennodes/templates/settings.conf.tmpl):
  BLOCKASSEMBLY_GRPC_PORT     -> BLOCK_ASSEMBLY_GRPC_PORT
  BLOCKVALIDATION_GRPC_PORT   -> BLOCK_VALIDATION_GRPC_PORT
  SUBTREEVALIDATION_GRPC_PORT -> SUBTREE_VALIDATION_GRPC_PORT

main.go: subtreevalidation listener and healthcheck 8089 -> 8086,
matching SUBTREE_VALIDATION_GRPC_PORT in settings.conf.

Tests were self-referential (asserted the template's output against the
template's own literals) so the bugs CI-passed. New helpers cross-check
the generated settings file against settings.conf as source of truth:

  assertSplitOverridesRefValidPortVars
    Scans every ${X_PORT} reference in generated split settings and
    fails if X_PORT is not declared in settings.conf.

  assertSplitServicePortsMatchSettings
    For each split service, asserts HealthcheckPort and ExposePorts
    contain the canonical port number from settings.conf.

Both helpers were verified to fire by reintroducing each bug in turn.
The split compose template renders host ports as HostBase + (container - 8000),
so HostPorts / ExposePorts entries below 8000 would silently produce wrong
or negative host mappings. All current listeners sit at or above 8000, so the
assumption holds today; record it explicitly so a future sub-8000 service
doesn't trip the invariant.
The previous note was on buildSplitServices(), but the arithmetic that
enforces the >= 8000 constraint lives in the docker-compose-split
template. Put the canonical note where the assumption physically is and
leave a one-line pointer next to the service-list data.
…npause

--skip already rejected typoed service names with a friendly error; the
chaos kill/start/pause/unpause <node> <svc> path passed $svc straight
through to docker compose, so 'chaos kill 2 valdiator' produced a raw
docker error instead of pointing at the typo.

Extract the canonical service list to a single KNOWN_SPLIT_SERVICES
constant and an assert_known_service helper, then reuse it from both
the --skip validator and the four service-aware chaos commands.
…ith pause

chaos_pause's split-mode branch echoes the per-service count; chaos_unpause
was incrementing the same counter but never reporting it. Mirror the pause
output so the user gets the same confirmation either way.
is_split_mode detects topology by the presence of the monolithic
teranodeN-multinode container. Running 'up -allinone=0' over a stack
that was previously brought up with -allinone=1 (or vice versa) can
leave stale containers behind and confuse detection in chaos kill,
logs, and status. Add a callout in the Split topology section that
documents the requirement.
@sonarqubecloud

Copy link
Copy Markdown

@ordishs ordishs left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approving. All blocking issues from the earlier review are resolved, verified against settings.conf at 1fd1e32.

  • Port-variable typos${BLOCKASSEMBLY_GRPC_PORT}${BLOCK_ASSEMBLY_GRPC_PORT} (plus block/subtree validation): fixed.
  • subtreevalidation port80898086, matching SUBTREE_VALIDATION_GRPC_PORT: fixed.
  • Test gap — not just corrected, but closed at the class level: assertSplitOverridesRefValidPortVars rejects any ${*_PORT} reference not defined in settings.conf, and assertSplitServicePortsMatchSettings pins each service's port to the canonical value. I mutation-tested both — reintroducing either original bug now fails the suite with a precise diagnostic, so the guards aren't vacuous.
  • Minor items (service-name validation on the chaos verbs, the unpaused count, the topology-switch doc note, and the sub-8000 port-arithmetic constraint) all addressed.

Verified locally: go test ./compose/cmd/gennodes/ → ok, go vet → clean.

Two non-blocking follow-ups, take or leave:

  • The "~32 containers" resource note works out to 34 (3 × 9 + 7 infra).
  • settingsPortVars's regex requires a line to end right after the digits, so a *_PORT defined with a trailing inline comment (e.g. ALERT_P2P_PORT = 9908 # ...) is silently skipped from the known set. None of the referenced vars has one today, so it's harmless — a (?:\s+#.*)?$ tail would future-proof it.

Worth ticking the up 3 -allinone=0 end-to-end smoke-test box before merge for full confidence. Nice work on the cross-check tests — that's the right fix, not just a patch.

@liam liam merged commit 59984da into bsv-blockchain:main May 27, 2026
25 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants