Skip to content

fix(blockassembly): periodic reconcile watchdog to recover from silent subscription drift#874

Closed
oskarszoon wants to merge 3 commits into
bsv-blockchain:mainfrom
oskarszoon:fix/872-blockassembly-notification-recovery
Closed

fix(blockassembly): periodic reconcile watchdog to recover from silent subscription drift#874
oskarszoon wants to merge 3 commits into
bsv-blockchain:mainfrom
oskarszoon:fix/872-blockassembly-notification-recovery

Conversation

@oskarszoon

Copy link
Copy Markdown
Contributor

Closes #872.

Summary

Block assembly's listener goroutine sits idle indefinitely after a reorg coincides with a blockchain-client gRPC EOF reconnect. Observed on teratestnet: BA stuck at one height for 15+ hours while the blockchain advanced 20+ blocks, only exposed by WaitForBlockAssemblyReady from legacy/blockvalidation. Restarting the BA container clears the stall because Start() runs a one-shot triggerReconcile().

The root cause sits in services/blockchain/Client.go subscription lifecycle (likely a per-service ctx cancellation killing the subscriber goroutine permanently — multiple Client instances per container, only some share the long-lived daemon ctx). That investigation is out of scope here.

Fix

Defense-in-depth: add a ticker in startChannelListeners that calls triggerReconcile() on a bounded cadence (default 30s). Each tick is idempotent — if BA's tip matches the chain tip, processNewBlockAnnouncement returns early with no work. If they differ (because notifications were missed), the existing reconcile path runs and BA catches up.

Setting blockassembly_periodicReconcileInterval:

  • 30s (default) — recovers a silent subscription within 30s instead of 15+ hours.
  • 0 — disables the ticker entirely (escape hatch for tests / dev).

New Prometheus counter teranode_blockassembly_periodic_reconcile_total so operators can observe ticker firings (sustained high rate paired with missing notification-driven reconciles signals subscription drift).

Out of scope (follow-ups)

  • Root-cause fix in services/blockchain/Client.go subscription lifecycle. The early-return at Client.go:1311 (if !c.running.Load() || ctx.Err() != nil { return }) exits the subscriber goroutine permanently if either condition fires. Investigation needs instrumentation first (structured log + counter on the early-return) to identify which side fires before changing semantics.
  • gRPC keepalive is already enabled via util/grpc_helper.go:154-158 with KeepaliveTime=30s, KeepaliveTimeout=20s, PermitWithoutStream=true. The bug fired despite this — so either it's hypothesis 1 from blockassembly: listener stops processing block notifications after reorg + blockchain-client EOF #872 (per-service ctx cancellation, not half-open stream) or the keepalive isn't catching it for some other reason.

Test plan

  • TestBlockAssembler_PeriodicReconcile_FiresWhenIntervalSet — 50ms interval, asserts ≥2 ticker firings in 220ms past startup baseline.
  • TestBlockAssembler_PeriodicReconcile_DisabledWhenIntervalZero — interval=0, asserts zero additional GetBestBlockHeader calls beyond startup.
  • TestBlockAssembler_PeriodicReconcile_NotificationStillProcessed — 1h interval (effectively disabled), notification arrives on blockchainSubscriptionCh, asserts notification path still drives reconcile.
  • go test -race -count=1 -tags testtxmetacache ./services/blockassembly/ → 378 passed (no regressions).
  • go test -race -count=1 ./settings/ → 131 passed.
  • go vet ./services/blockassembly/ ./settings/ clean.
  • go build ./... clean.
  • golangci-lint run ./services/blockassembly/... ./settings/ clean (6 pre-existing prealloc warnings in unrelated files, not from this change).

…t subscription drift (refs bsv-blockchain#872)

## Problem

Block assembly's listener goroutine sits idle indefinitely after a reorg
coincides with a blockchain-client gRPC EOF reconnect. Observed on
teratestnet: BA stuck at one height for 15+ hours while the blockchain
advanced 20+ blocks, only exposed by `WaitForBlockAssemblyReady` from
legacy/blockvalidation. Restarting the BA container clears the stall
because Start() runs a one-shot triggerReconcile().

The root cause sits in services/blockchain/Client.go subscription
lifecycle (likely a per-service ctx cancellation killing the subscriber
goroutine permanently — multiple Client instances per container, only
some share the long-lived daemon ctx). That investigation is out of
scope here.

## Fix

Defense-in-depth: add a ticker in `startChannelListeners` that calls
`triggerReconcile()` on a bounded cadence (default 30s). Each tick is
idempotent — if BA's tip matches the chain tip, processNewBlockAnnouncement
returns early with no work. If they differ (because notifications were
missed), the existing reconcile path runs and BA catches up.

Setting `blockassembly_periodicReconcileInterval`:
- Default 30s — recovers a silent subscription within 30s instead of
  15+ hours.
- 0 disables the ticker entirely (escape hatch for tests / dev).

New Prometheus counter `teranode_blockassembly_periodic_reconcile_total`
so operators can observe ticker firings (sustained high rate paired
with missing notification-driven reconciles signals subscription drift).

## Out of scope

- Root-cause fix in services/blockchain/Client.go subscription lifecycle.
  Needs instrumentation first (logging + counter on the
  line 1311 early-return) to identify which side fires (ctx.Err vs
  c.running) before changing semantics. Worth a follow-up issue.
- gRPC keepalive is already enabled and didn't catch this. Either the
  bug is hypothesis 1 (per-service ctx cancellation) or the half-open
  stream isn't actually what happened.

## Test plan

- New TestBlockAssembler_PeriodicReconcile_FiresWhenIntervalSet:
  50ms interval, asserts ≥2 ticker firings in a 220ms window past
  the startup baseline.
- New TestBlockAssembler_PeriodicReconcile_DisabledWhenIntervalZero:
  interval=0, asserts zero additional GetBestBlockHeader calls
  beyond startup.
- New TestBlockAssembler_PeriodicReconcile_NotificationStillProcessed:
  1h interval (effectively disabled within the test window),
  notification arrives on blockchainSubscriptionCh, asserts the
  notification path still drives a reconcile.
- `go test -race -count=1 -tags testtxmetacache ./services/blockassembly/`
  → 378 passed (no regressions).
- `go test -race -count=1 ./settings/` → 131 passed.
- `go vet ./services/blockassembly/ ./settings/` clean.
- `go build ./...` clean.
@github-actions

github-actions Bot commented May 15, 2026

Copy link
Copy Markdown
Contributor

🤖 Claude Code Review

Status: Complete

Summary

This PR implements a defense-in-depth watchdog ticker to prevent block assembly from stalling indefinitely when blockchain notifications go silent. The implementation is sound and well-tested.

Reviewed:

  • Core implementation: periodic reconcile ticker with configurable interval
  • Test coverage: 3 comprehensive tests verifying ticker behavior, disable path, and notification coexistence
  • Metrics: Prometheus counter for observability
  • Documentation: thorough setting description with operational guidance

Findings: No issues found.

The change follows the repository's careful engineering standards (AGENTS.md):

  • Minimal diff focused on the specific issue
  • Test-first approach with regression coverage
  • Comprehensive verification (all tests passing per PR description)
  • Clear tradeoff documentation in settings

The periodic reconcile is idempotent and negligible overhead when healthy (30s default hits GetBestBlockHeader only if already being called during normal operation). This bounds recovery from silent subscription failure to 30 seconds instead of 15+ hours observed in production.

@github-actions

github-actions Bot commented May 15, 2026

Copy link
Copy Markdown
Contributor

Benchmark Comparison Report

Baseline: main (unknown)

Current: PR-874 (6a00738)

Summary

  • Regressions: 0
  • Improvements: 0
  • Unchanged: 142
  • Significance level: p < 0.05
All benchmark results (sec/op)
Benchmark Baseline Current Change p-value
_NewBlockFromBytes-4 1.731µ 1.665µ ~ 0.700
SplitSyncedParentMap_SetIfNotExists/256_buckets-4 61.65n 61.58n ~ 1.000
SplitSyncedParentMap_SetIfNotExists/16_buckets-4 61.62n 61.77n ~ 0.700
SplitSyncedParentMap_SetIfNotExists/1_bucket-4 61.52n 61.60n ~ 0.200
SplitSyncedParentMap_ConcurrentSetIfNotExists/256_buckets... 28.92n 28.84n ~ 1.000
SplitSyncedParentMap_ConcurrentSetIfNotExists/16_buckets_... 50.78n 50.35n ~ 0.700
SplitSyncedParentMap_ConcurrentSetIfNotExists/1_bucket_pa... 105.1n 104.6n ~ 0.400
MiningCandidate_Stringify_Short-4 265.0n 261.6n ~ 0.100
MiningCandidate_Stringify_Long-4 1.914µ 1.883µ ~ 0.100
MiningSolution_Stringify-4 992.6n 980.7n ~ 0.100
BlockInfo_MarshalJSON-4 1.762µ 1.761µ ~ 0.400
NewFromBytes-4 122.0n 122.9n ~ 0.100
Mine_EasyDifficulty-4 65.25µ 65.66µ ~ 0.400
Mine_WithAddress-4 6.973µ 7.025µ ~ 0.700
DiskTxMap_SetIfNotExists-4 3.513µ 3.504µ ~ 1.000
DiskTxMap_SetIfNotExists_Parallel-4 3.451µ 3.406µ ~ 1.000
DiskTxMap_ExistenceOnly-4 318.3n 296.5n ~ 0.100
Queue-4 190.7n 187.3n ~ 0.100
AtomicPointer-4 4.795n 5.024n ~ 0.200
ReorgOptimizations/DedupFilterPipeline/Old/10K-4 842.5µ 840.2µ ~ 0.400
ReorgOptimizations/DedupFilterPipeline/New/10K-4 810.5µ 788.5µ ~ 0.100
ReorgOptimizations/AllMarkFalse/Old/10K-4 103.8µ 102.7µ ~ 0.100
ReorgOptimizations/AllMarkFalse/New/10K-4 62.69µ 62.34µ ~ 0.400
ReorgOptimizations/HashSlicePool/Old/10K-4 55.52µ 55.16µ ~ 1.000
ReorgOptimizations/HashSlicePool/New/10K-4 11.98µ 11.86µ ~ 0.700
ReorgOptimizations/NodeFlags/Old/10K-4 4.615µ 4.745µ ~ 0.100
ReorgOptimizations/NodeFlags/New/10K-4 1.548µ 1.641µ ~ 0.100
ReorgOptimizations/DedupFilterPipeline/Old/100K-4 9.543m 10.249m ~ 0.700
ReorgOptimizations/DedupFilterPipeline/New/100K-4 9.830m 10.163m ~ 1.000
ReorgOptimizations/AllMarkFalse/Old/100K-4 1.085m 1.082m ~ 1.000
ReorgOptimizations/AllMarkFalse/New/100K-4 688.1µ 685.8µ ~ 0.700
ReorgOptimizations/HashSlicePool/Old/100K-4 537.7µ 510.7µ ~ 0.100
ReorgOptimizations/HashSlicePool/New/100K-4 293.9µ 318.6µ ~ 0.700
ReorgOptimizations/NodeFlags/Old/100K-4 46.75µ 50.34µ ~ 0.100
ReorgOptimizations/NodeFlags/New/100K-4 15.87µ 17.30µ ~ 0.100
TxMapSetIfNotExists-4 51.39n 51.79n ~ 0.700
TxMapSetIfNotExistsDuplicate-4 38.09n 38.05n ~ 1.000
ChannelSendReceive-4 594.4n 605.7n ~ 0.100
DirectSubtreeAdd/4_per_subtree-4 58.07n 57.59n ~ 1.000
DirectSubtreeAdd/64_per_subtree-4 29.24n 29.17n ~ 0.500
DirectSubtreeAdd/256_per_subtree-4 28.02n 28.22n ~ 1.000
DirectSubtreeAdd/1024_per_subtree-4 26.50n 26.53n ~ 0.700
DirectSubtreeAdd/2048_per_subtree-4 26.05n 26.06n ~ 1.000
SubtreeProcessorAdd/4_per_subtree-4 288.3n 282.7n ~ 0.200
SubtreeProcessorAdd/64_per_subtree-4 281.6n 273.8n ~ 0.100
SubtreeProcessorAdd/256_per_subtree-4 280.8n 275.7n ~ 0.100
SubtreeProcessorAdd/1024_per_subtree-4 271.9n 272.0n ~ 1.000
SubtreeProcessorAdd/2048_per_subtree-4 272.1n 268.6n ~ 0.500
SubtreeProcessorRotate/4_per_subtree-4 276.6n 273.9n ~ 0.100
SubtreeProcessorRotate/64_per_subtree-4 274.7n 270.2n ~ 0.100
SubtreeProcessorRotate/256_per_subtree-4 278.8n 271.3n ~ 0.100
SubtreeProcessorRotate/1024_per_subtree-4 276.2n 270.3n ~ 0.100
SubtreeNodeAddOnly/4_per_subtree-4 55.10n 55.13n ~ 0.700
SubtreeNodeAddOnly/64_per_subtree-4 36.07n 36.21n ~ 0.400
SubtreeNodeAddOnly/256_per_subtree-4 35.15n 35.15n ~ 0.800
SubtreeNodeAddOnly/1024_per_subtree-4 34.52n 34.50n ~ 1.000
SubtreeCreationOnly/4_per_subtree-4 110.5n 111.1n ~ 1.000
SubtreeCreationOnly/64_per_subtree-4 354.0n 351.1n ~ 0.100
SubtreeCreationOnly/256_per_subtree-4 1.233µ 1.226µ ~ 0.100
SubtreeCreationOnly/1024_per_subtree-4 3.774µ 3.800µ ~ 0.700
SubtreeCreationOnly/2048_per_subtree-4 6.736µ 6.731µ ~ 1.000
SubtreeProcessorOverheadBreakdown/64_per_subtree-4 275.6n 269.4n ~ 0.100
SubtreeProcessorOverheadBreakdown/1024_per_subtree-4 274.4n 270.1n ~ 0.100
ParallelGetAndSetIfNotExists/1k_nodes-4 547.1µ 546.7µ ~ 0.700
ParallelGetAndSetIfNotExists/10k_nodes-4 1.345m 1.340m ~ 0.700
ParallelGetAndSetIfNotExists/50k_nodes-4 6.746m 6.669m ~ 0.100
ParallelGetAndSetIfNotExists/100k_nodes-4 13.61m 13.27m ~ 0.100
SequentialGetAndSetIfNotExists/1k_nodes-4 631.7µ 622.7µ ~ 0.400
SequentialGetAndSetIfNotExists/10k_nodes-4 2.935m 2.995m ~ 0.200
SequentialGetAndSetIfNotExists/50k_nodes-4 11.40m 11.20m ~ 0.200
SequentialGetAndSetIfNotExists/100k_nodes-4 21.62m 21.76m ~ 0.100
ProcessOwnBlockSubtreeNodesParallel/1k_nodes-4 609.9µ 594.4µ ~ 0.200
ProcessOwnBlockSubtreeNodesParallel/10k_nodes-4 4.656m 4.627m ~ 0.700
ProcessOwnBlockSubtreeNodesParallel/100k_nodes-4 17.12m 17.00m ~ 0.200
ProcessOwnBlockSubtreeNodesSequential/1k_nodes-4 684.3µ 665.9µ ~ 0.100
ProcessOwnBlockSubtreeNodesSequential/10k_nodes-4 6.400m 6.404m ~ 0.700
ProcessOwnBlockSubtreeNodesSequential/100k_nodes-4 40.54m 39.41m ~ 0.100
BlockAssembler_AddTx-4 0.03303n 0.03323n ~ 1.000
AddNode-4 12.52 12.74 ~ 1.000
AddNodeWithMap-4 13.06 13.32 ~ 1.000
CalcBlockWork-4 367.1n 365.7n ~ 0.400
CalculateWork-4 498.6n 487.4n ~ 0.700
BuildBlockLocatorString_Helpers/Size_10-4 1.311µ 1.316µ ~ 1.000
BuildBlockLocatorString_Helpers/Size_100-4 16.11µ 15.48µ ~ 1.000
BuildBlockLocatorString_Helpers/Size_1000-4 122.3µ 124.0µ ~ 0.100
CatchupWithHeaderCache-4 104.4m 104.4m ~ 0.100
_BufferPoolAllocation/16KB-4 3.351µ 3.334µ ~ 1.000
_BufferPoolAllocation/32KB-4 6.514µ 9.054µ ~ 0.100
_BufferPoolAllocation/64KB-4 17.22µ 15.50µ ~ 0.200
_BufferPoolAllocation/128KB-4 25.33µ 25.20µ ~ 1.000
_BufferPoolAllocation/512KB-4 99.53µ 121.71µ ~ 0.100
_BufferPoolConcurrent/32KB-4 16.81µ 16.60µ ~ 1.000
_BufferPoolConcurrent/64KB-4 26.18µ 26.20µ ~ 1.000
_BufferPoolConcurrent/512KB-4 137.3µ 139.9µ ~ 0.700
_SubtreeDeserializationWithBufferSizes/16KB-4 606.5µ 618.9µ ~ 0.400
_SubtreeDeserializationWithBufferSizes/32KB-4 605.8µ 625.3µ ~ 0.100
_SubtreeDeserializationWithBufferSizes/64KB-4 606.9µ 628.1µ ~ 0.100
_SubtreeDeserializationWithBufferSizes/128KB-4 613.4µ 633.2µ ~ 0.100
_SubtreeDeserializationWithBufferSizes/512KB-4 632.8µ 612.0µ ~ 0.400
_SubtreeDataDeserializationWithBufferSizes/16KB-4 35.92m 36.01m ~ 1.000
_SubtreeDataDeserializationWithBufferSizes/32KB-4 35.43m 35.66m ~ 1.000
_SubtreeDataDeserializationWithBufferSizes/64KB-4 35.38m 35.70m ~ 0.400
_SubtreeDataDeserializationWithBufferSizes/128KB-4 35.14m 35.57m ~ 0.100
_SubtreeDataDeserializationWithBufferSizes/512KB-4 35.53m 35.25m ~ 0.400
_PooledVsNonPooled/Pooled-4 739.3n 737.4n ~ 0.400
_PooledVsNonPooled/NonPooled-4 6.242µ 6.743µ ~ 0.100
_MemoryFootprint/Current_512KB_32concurrent-4 6.423µ 6.529µ ~ 0.700
_MemoryFootprint/Proposed_32KB_32concurrent-4 9.058µ 8.960µ ~ 1.000
_MemoryFootprint/Alternative_64KB_32concurrent-4 8.736µ 8.689µ ~ 0.700
_prepareTxsPerLevel-4 407.5m 401.4m ~ 0.700
_prepareTxsPerLevelOrdered-4 3.501m 3.430m ~ 0.100
_prepareTxsPerLevel_Comparison/Original-4 396.8m 403.7m ~ 0.400
_prepareTxsPerLevel_Comparison/Optimized-4 3.744m 3.535m ~ 0.100
SubtreeSizes/10k_tx_4_per_subtree-4 1.256m 1.267m ~ 0.700
SubtreeSizes/10k_tx_16_per_subtree-4 294.8µ 294.7µ ~ 1.000
SubtreeSizes/10k_tx_64_per_subtree-4 70.52µ 70.53µ ~ 1.000
SubtreeSizes/10k_tx_256_per_subtree-4 17.61µ 17.42µ ~ 0.700
SubtreeSizes/10k_tx_512_per_subtree-4 8.599µ 8.721µ ~ 0.200
SubtreeSizes/10k_tx_1024_per_subtree-4 4.291µ 4.324µ ~ 0.100
SubtreeSizes/10k_tx_2k_per_subtree-4 2.132µ 2.156µ ~ 0.400
BlockSizeScaling/10k_tx_64_per_subtree-4 69.25µ 69.51µ ~ 0.400
BlockSizeScaling/10k_tx_256_per_subtree-4 17.25µ 17.36µ ~ 0.700
BlockSizeScaling/10k_tx_1024_per_subtree-4 4.270µ 4.313µ ~ 0.700
BlockSizeScaling/50k_tx_64_per_subtree-4 371.3µ 367.2µ ~ 0.700
BlockSizeScaling/50k_tx_256_per_subtree-4 87.60µ 86.95µ ~ 0.700
BlockSizeScaling/50k_tx_1024_per_subtree-4 21.15µ 21.46µ ~ 0.200
SubtreeAllocations/small_subtrees_exists_check-4 149.5µ 149.6µ ~ 0.700
SubtreeAllocations/small_subtrees_data_fetch-4 158.8µ 157.4µ ~ 0.400
SubtreeAllocations/small_subtrees_full_validation-4 308.9µ 308.6µ ~ 1.000
SubtreeAllocations/medium_subtrees_exists_check-4 8.861µ 8.839µ ~ 0.700
SubtreeAllocations/medium_subtrees_data_fetch-4 9.246µ 9.320µ ~ 1.000
SubtreeAllocations/medium_subtrees_full_validation-4 17.28µ 17.38µ ~ 1.000
SubtreeAllocations/large_subtrees_exists_check-4 2.088µ 2.077µ ~ 1.000
SubtreeAllocations/large_subtrees_data_fetch-4 2.227µ 2.239µ ~ 1.000
SubtreeAllocations/large_subtrees_full_validation-4 4.363µ 4.331µ ~ 0.600
StoreBlock_Sequential/BelowCSVHeight-4 329.0µ 328.6µ ~ 1.000
StoreBlock_Sequential/AboveCSVHeight-4 329.7µ 321.4µ ~ 1.000
GetUtxoHashes-4 274.8n 274.2n ~ 1.000
GetUtxoHashes_ManyOutputs-4 45.90µ 46.26µ ~ 0.100
_NewMetaDataFromBytes-4 230.6n 229.9n ~ 0.300
_Bytes-4 611.1n 601.0n ~ 0.100
_MetaBytes-4 568.2n 553.2n ~ 0.100

Threshold: >10% with p < 0.05 | Generated: 2026-05-15 12:06 UTC

…tests

Address claude[bot] feedback on PR bsv-blockchain#874: the three new tests created
BlockAssembler instances but didn't explicitly tear down the listener
goroutine. BlockAssembler doesn't expose a Stop method (Stop lives on
the BlockAssembly server wrapper), so ctx cancellation is the supported
shutdown signal — see BlockAssembler.go:317.

Add a shutdownListenerGoroutine helper that cancels the test ctx in
t.Cleanup and gives the listener a 50ms grace window to exit. Pass the
cancellable ctx to utxostoresql.New, NewBlockAssembler, and ba.Start
so the entire chain shuts down deterministically.
@sonarqubecloud

Copy link
Copy Markdown

// it's created by the blockchain client's Subscribe method
return

case <-periodicReconcileC:

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't we also restart the blockchain connection, if it's not working?

@oskarszoon oskarszoon self-assigned this May 15, 2026
@oskarszoon

Copy link
Copy Markdown
Contributor Author

Closing in favor of root-cause fix in blockchain service.

Log evidence from teratestnet (Notifications channel full, skipping heartbeat) points to the real fault: in services/blockchain/Server.go startSubscriptions iterates all subscribers synchronously and calls Send() sequentially under subscribersMu.RLock(). One slow subscriber blocks delivery to all the others, and the heartbeat warnings + filling b.notifications channel (cap 100) are the visible symptoms. BA going silent is then just one subscriber happening to be downstream of the stall.

A per-BA watchdog is defense-in-depth but doesn't address the systemic problem — every other subscriber (alerts, persister, etc.) is exposed to the same fault. Real fix is per-subscriber buffered channel + per-subscriber goroutine + bounded buffer + per-Send timeout in the blockchain service. Opening a separate PR for that.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

blockassembly: listener stops processing block notifications after reorg + blockchain-client EOF

2 participants