fix(blockassembly): periodic reconcile watchdog to recover from silent subscription drift#874
Conversation
…t subscription drift (refs bsv-blockchain#872) ## Problem Block assembly's listener goroutine sits idle indefinitely after a reorg coincides with a blockchain-client gRPC EOF reconnect. Observed on teratestnet: BA stuck at one height for 15+ hours while the blockchain advanced 20+ blocks, only exposed by `WaitForBlockAssemblyReady` from legacy/blockvalidation. Restarting the BA container clears the stall because Start() runs a one-shot triggerReconcile(). The root cause sits in services/blockchain/Client.go subscription lifecycle (likely a per-service ctx cancellation killing the subscriber goroutine permanently — multiple Client instances per container, only some share the long-lived daemon ctx). That investigation is out of scope here. ## Fix Defense-in-depth: add a ticker in `startChannelListeners` that calls `triggerReconcile()` on a bounded cadence (default 30s). Each tick is idempotent — if BA's tip matches the chain tip, processNewBlockAnnouncement returns early with no work. If they differ (because notifications were missed), the existing reconcile path runs and BA catches up. Setting `blockassembly_periodicReconcileInterval`: - Default 30s — recovers a silent subscription within 30s instead of 15+ hours. - 0 disables the ticker entirely (escape hatch for tests / dev). New Prometheus counter `teranode_blockassembly_periodic_reconcile_total` so operators can observe ticker firings (sustained high rate paired with missing notification-driven reconciles signals subscription drift). ## Out of scope - Root-cause fix in services/blockchain/Client.go subscription lifecycle. Needs instrumentation first (logging + counter on the line 1311 early-return) to identify which side fires (ctx.Err vs c.running) before changing semantics. Worth a follow-up issue. - gRPC keepalive is already enabled and didn't catch this. Either the bug is hypothesis 1 (per-service ctx cancellation) or the half-open stream isn't actually what happened. ## Test plan - New TestBlockAssembler_PeriodicReconcile_FiresWhenIntervalSet: 50ms interval, asserts ≥2 ticker firings in a 220ms window past the startup baseline. - New TestBlockAssembler_PeriodicReconcile_DisabledWhenIntervalZero: interval=0, asserts zero additional GetBestBlockHeader calls beyond startup. - New TestBlockAssembler_PeriodicReconcile_NotificationStillProcessed: 1h interval (effectively disabled within the test window), notification arrives on blockchainSubscriptionCh, asserts the notification path still drives a reconcile. - `go test -race -count=1 -tags testtxmetacache ./services/blockassembly/` → 378 passed (no regressions). - `go test -race -count=1 ./settings/` → 131 passed. - `go vet ./services/blockassembly/ ./settings/` clean. - `go build ./...` clean.
|
🤖 Claude Code Review Status: Complete SummaryThis PR implements a defense-in-depth watchdog ticker to prevent block assembly from stalling indefinitely when blockchain notifications go silent. The implementation is sound and well-tested. Reviewed:
Findings: No issues found. The change follows the repository's careful engineering standards (AGENTS.md):
The periodic reconcile is idempotent and negligible overhead when healthy (30s default hits |
Benchmark Comparison ReportBaseline: Current: Summary
All benchmark results (sec/op)
Threshold: >10% with p < 0.05 | Generated: 2026-05-15 12:06 UTC |
…tests Address claude[bot] feedback on PR bsv-blockchain#874: the three new tests created BlockAssembler instances but didn't explicitly tear down the listener goroutine. BlockAssembler doesn't expose a Stop method (Stop lives on the BlockAssembly server wrapper), so ctx cancellation is the supported shutdown signal — see BlockAssembler.go:317. Add a shutdownListenerGoroutine helper that cancels the test ctx in t.Cleanup and gives the listener a 50ms grace window to exit. Pass the cancellable ctx to utxostoresql.New, NewBlockAssembler, and ba.Start so the entire chain shuts down deterministically.
|
| // it's created by the blockchain client's Subscribe method | ||
| return | ||
|
|
||
| case <-periodicReconcileC: |
There was a problem hiding this comment.
Shouldn't we also restart the blockchain connection, if it's not working?
|
Closing in favor of root-cause fix in blockchain service. Log evidence from teratestnet ( A per-BA watchdog is defense-in-depth but doesn't address the systemic problem — every other subscriber (alerts, persister, etc.) is exposed to the same fault. Real fix is per-subscriber buffered channel + per-subscriber goroutine + bounded buffer + per-Send timeout in the blockchain service. Opening a separate PR for that. |



Closes #872.
Summary
Block assembly's listener goroutine sits idle indefinitely after a reorg coincides with a blockchain-client gRPC
EOFreconnect. Observed on teratestnet: BA stuck at one height for 15+ hours while the blockchain advanced 20+ blocks, only exposed byWaitForBlockAssemblyReadyfrom legacy/blockvalidation. Restarting the BA container clears the stall becauseStart()runs a one-shottriggerReconcile().The root cause sits in
services/blockchain/Client.gosubscription lifecycle (likely a per-service ctx cancellation killing the subscriber goroutine permanently — multipleClientinstances per container, only some share the long-lived daemon ctx). That investigation is out of scope here.Fix
Defense-in-depth: add a ticker in
startChannelListenersthat callstriggerReconcile()on a bounded cadence (default 30s). Each tick is idempotent — if BA's tip matches the chain tip,processNewBlockAnnouncementreturns early with no work. If they differ (because notifications were missed), the existing reconcile path runs and BA catches up.Setting
blockassembly_periodicReconcileInterval:New Prometheus counter
teranode_blockassembly_periodic_reconcile_totalso operators can observe ticker firings (sustained high rate paired with missing notification-driven reconciles signals subscription drift).Out of scope (follow-ups)
services/blockchain/Client.gosubscription lifecycle. The early-return atClient.go:1311(if !c.running.Load() || ctx.Err() != nil { return }) exits the subscriber goroutine permanently if either condition fires. Investigation needs instrumentation first (structured log + counter on the early-return) to identify which side fires before changing semantics.util/grpc_helper.go:154-158withKeepaliveTime=30s, KeepaliveTimeout=20s, PermitWithoutStream=true. The bug fired despite this — so either it's hypothesis 1 from blockassembly: listener stops processing block notifications after reorg + blockchain-client EOF #872 (per-service ctx cancellation, not half-open stream) or the keepalive isn't catching it for some other reason.Test plan
TestBlockAssembler_PeriodicReconcile_FiresWhenIntervalSet— 50ms interval, asserts ≥2 ticker firings in 220ms past startup baseline.TestBlockAssembler_PeriodicReconcile_DisabledWhenIntervalZero— interval=0, asserts zero additionalGetBestBlockHeadercalls beyond startup.TestBlockAssembler_PeriodicReconcile_NotificationStillProcessed— 1h interval (effectively disabled), notification arrives onblockchainSubscriptionCh, asserts notification path still drives reconcile.go test -race -count=1 -tags testtxmetacache ./services/blockassembly/→ 378 passed (no regressions).go test -race -count=1 ./settings/→ 131 passed.go vet ./services/blockassembly/ ./settings/clean.go build ./...clean.golangci-lint run ./services/blockassembly/... ./settings/clean (6 pre-existing prealloc warnings in unrelated files, not from this change).