fix(blockchain): root-cause fix for block-assembly stall after silent subscription drift (refs #872)#878
Conversation
…s (refs bsv-blockchain#872) Previously startSubscriptions called sub.subscription.Send synchronously inside the broadcast loop. One slow or flow-control-stalled subscriber (e.g. blockassembly during a long reorgBlocks) would park the loop, starving all other subscribers of notifications and heartbeats. That left the client-side staleness check permanently unreachable (it only fires when stream.Recv() returns an error; a parked Send never produces one), causing the 15.5-hour BA stall on teratestnet. Fix: each subscriber gets a bounded pending channel (cap 64). The broadcast loop does a non-blocking send into that channel; subscribers whose buffer is full are evicted immediately. One drain goroutine per subscriber calls sub.subscription.Send, preserving the gRPC no-concurrent-Send invariant while isolating slow consumers. sendInitialNotification is updated to enqueue via pending (non- blocking) rather than calling Send directly, keeping the newSubscriptions case non-blocking. Key safety points: - Double-close guard in deadSubscriptions cleanup: the map-exists check ensures pending is closed exactly once, whether the sub is evicted via buffer-full or via Send error from the drain goroutine. - Subscribe handler now constructs the subscriber once and reuses the same value for both newSubscriptions and deadSubscriptions pushes, so the struct keys match in the map. - Drain goroutines exit cleanly on AppCtx cancel, sub.done close, or Send error. Tests: three new tests in subscription_fanout_test.go pin the contract: - SlowSubscriberDoesNotBlockFastOnes: regression test for bsv-blockchain#872 - FullBufferMarksSubscriberDead: eviction path - PerSubscriberOrderPreserved: FIFO delivery guarantee
…erver (refs bsv-blockchain#872) The heartbeat staleness check in SubscribeToServer only fires when stream.Recv() returns an error. A half-zombie stream where Send() on the server side is flow-control-stalled never produces a Recv error on the client — Recv blocks indefinitely, the staleness check is unreachable, and the subscription stays dark for hours. Fix: each stream iteration creates a per-stream context (context.WithCancel). A zombie watchdog goroutine ticks at HeartbeatInterval/2 and checks time-since-last-Recv-return. If no Recv returns within 2×HeartbeatInterval (zombieTimeout), the watchdog calls cancelStream(), which causes Recv to return context.Canceled, and the existing reconnect path takes over. lastRecvAt is initialised to now() so a fresh stream is given the full zombieTimeout grace period before the watchdog fires. All exit paths from the inner recv loop call shutdown() (cancelStream + <-watchdogDone) before re-entering the outer reconnect loop or returning, preventing watchdog goroutine leaks. Default zombieTimeout: 2×10s = 20s (configurable via HeartbeatInterval). Tests in client_zombie_watchdog_test.go: - WatchdogClosesZombieStream: zombie server that never sends; asserts the server's Subscribe is called at least twice (initial + reconnect) within a tight deadline, proving the watchdog fired. - WatchdogDoesNotFireWhenStreamProgresses: healthy server sending Block notifications at HeartbeatInterval/2; asserts local subscribers keep receiving notifications across 5×zombieTimeout with no interruption.
|
🤖 Claude Code Review Status: Complete Current Review: This PR implements a well-structured fix for the block-assembly stall issue (#872). The changes introduce per-subscriber buffered fan-out on the server side and a client-side stream watchdog to detect zombie connections. Observations:
History:
|
Benchmark Comparison ReportBaseline: Current: Summary
All benchmark results (sec/op)
Threshold: >10% with p < 0.05 | Generated: 2026-05-15 15:17 UTC |
…bscriptions cap (refs bsv-blockchain#872) runSubscriberDrain previously had no bound on how long a single Send call could block. A truly zombie stream (server parked waiting for client WINDOW_UPDATE that never arrives) would hold the drain goroutine indefinitely — one leak per stuck stream. With the per-subscriber goroutine design this is bounded by subscriber count, but still worth closing. Fix: race Send against a 5s timer. On deadline, log warn, increment the send-errors metric, and push the subscriber to deadSubscriptions. The Send helper goroutine is a residual leak until Send eventually returns, but the drain goroutine itself exits immediately, preventing it from blocking deadSubscriptions cleanup for the rest of the subscriber lifetime. Also bump deadSubscriptions channel cap from 10 to 1000. The original cap made drain-goroutine dead-pushes a bottleneck during a simultaneous burst of subscriber deaths (e.g. the 18-EOF incident). 1000 matches the scale of realistic worst-case bursts without unbounded growth. subscriber_pending_full_total counter added for the buffer-full eviction path in the broadcast loop.
…argins (refs bsv-blockchain#872) Watchdog tick rate: zombieTimeout/2 → zombieTimeout/4. Worst-case detection latency drops from 1.5× to 1.25× zombieTimeout (30s → 25s at production HeartbeatInterval=10s) at negligible CPU cost. Prometheus metrics for fan-out health: - teranode_blockchain_subscriber_pending_full_total{source}: evictions from the broadcast loop due to full pending buffer (backpressure indicator — precursor to subscriber loss) - teranode_blockchain_subscriber_send_errors_total{source}: evictions from the drain goroutine due to Send errors or deadline exceeded - teranode_blockchain_watchdog_fires_total{source}: client-side zombie watchdog firings (counts half-zombie stream reconnects) initPrometheusMetrics called from NewClientWithAddress so metrics are initialised when the client is created without a server (needed in watchdog tests and standalone client deployments). Test: bump per-iteration timeout in WatchdogDoesNotFireWhenStreamProgresses from zombieTimeout+50ms to zombieTimeout+200ms to reduce flakiness on loaded CI runners.
… complexity (refs bsv-blockchain#872) Sonar rule go:S3776 flagged TestStartSubscriptions_PerSubscriberOrderPreserved at complexity 16. Extracting the hash-index-extraction heuristic into extractTestIndex drops it below the 15 threshold.
…refs bsv-blockchain#872) Comment on lines 1273-1276 still said "ticks at half the timeout" after commit fc8f802 changed the tick to zombieTimeout/4. Update to match.
|



Refs #872. Supersedes the closed band-aid PR #874.
Summary
Two structural fixes for the 15.5-hour block-assembly stall observed on teratestnet 2026-05-14. Both layers of the bug class are addressed:
services/blockchain/Server.gostartSubscriptionsis now non-blocking per-subscriber. One slow / flow-control-stalled subscriber can no longer block delivery to the others.services/blockchain/Client.goSubscribeToServercatches half-zombie streams wherestream.Recv()parks forever without returning an error.Root cause
startSubscriptionscalledsub.subscription.Sendsynchronously inside the broadcast loop (Server.go around line 692). One slow consumer parked the loop, starving all other subscribers of notifications and heartbeats. The client-side staleness check atClient.go:1316-1330only fires whenstream.Recv()returns an error — a parked Send produces no error on the client, so the check is unreachable.Combined with a 1m36s
SubtreeProcessor.reorgBlockswindow (during which BA's listener wasn't drainingblockchainSubscriptionCh), the resulting HTTP/2 receive-window starvation parked the stream indefinitely. ServerSend()returnednilevery time for the queued frames; clientRecv()never returned; nothing on either side produced an error or a keepalive trip.See issue #872 for the full timeline and the in-depth analysis docs under
reviews/issue-872-rethink-*.md.Fix 1 — Server: per-subscriber buffered fan-out
services/blockchain/Server.go:pendingchannel (cap 64).deadSubscriptions(Subscriber X pending buffer full, marking deadlog).sub.subscription.Send. This preserves the gRPC no-concurrent-Send invariant on the stream while isolating slow consumers — a stalled Send parks only that subscriber's goroutine, not the broadcast loop.sendInitialNotificationnow enqueues tosub.pendinginstead of calling Send directly, so a slow new-subscriber's initial send can't block other new subscriptions.deadSubscriptionscleanup guards against double-close onpending(broadcast eviction + drain-goroutine Send-error can both mark the same sub dead).Fix 2 — Client: stream-progress watchdog
services/blockchain/Client.goSubscribeToServer:context.WithCancelplus a watchdog goroutine.HeartbeatInterval/2and checkstime.Since(lastRecvAt). If noRecv()returns within2 * HeartbeatInterval(defaults: 10s/2 tick, 20s threshold), it callscancelStream().Recv()then returnscontext.Canceledand the existing reconnect path kicks in.lastRecvAtis updated on every successfulRecv()return (atomic.Int64 of UnixNano).shutdown()helper cancels the stream and waits for the watchdog goroutine to exit. Called on all exit paths to prevent goroutine leaks.What this does NOT include (separate follow-ups)
block_assembly_tip_lag_blocks). Penetration-tester analysis flagged this as the highest-leverage detection change; covered by a follow-up issue.Send()deadline server-side (faster than buffer-overflow eviction). Considered but not included — the client-side watchdog already kills the zombie stream in ~20s, which propagates back to the server as aSenderror. Could be added as belt-and-suspenders in a follow-up.Test plan
TestStartSubscriptions_SlowSubscriberDoesNotBlockFastOnes— slow subscriber's Send blocks indefinitely; fast subscriber still receives all N notifications within deadline.TestStartSubscriptions_FullBufferMarksSubscriberDead— stuck subscriber'spendingfills; subscriber is evicted from the map.TestStartSubscriptions_PerSubscriberOrderPreserved— per-subscriber FIFO preserved.TestSubscribeToServer_WatchdogClosesZombieStream— fake server never sends; client cancels stream context within2 * heartbeat + tick.TestSubscribeToServer_WatchdogDoesNotFireWhenStreamProgresses— fake server sends heartbeats; watchdog does not fire.go test -race -count=1 -tags testtxmetacache ./services/blockchain/→ 621 passed (no existing tests needed updating).go vet ./services/blockchain/clean.