Skip to content

fix(blockchain): root-cause fix for block-assembly stall after silent subscription drift (refs #872)#878

Merged
oskarszoon merged 6 commits into
bsv-blockchain:mainfrom
oskarszoon:fix/872-subscription-fanout
May 15, 2026
Merged

fix(blockchain): root-cause fix for block-assembly stall after silent subscription drift (refs #872)#878
oskarszoon merged 6 commits into
bsv-blockchain:mainfrom
oskarszoon:fix/872-subscription-fanout

Conversation

@oskarszoon

Copy link
Copy Markdown
Contributor

Refs #872. Supersedes the closed band-aid PR #874.

Summary

Two structural fixes for the 15.5-hour block-assembly stall observed on teratestnet 2026-05-14. Both layers of the bug class are addressed:

  • Server-side fan-out in services/blockchain/Server.go startSubscriptions is now non-blocking per-subscriber. One slow / flow-control-stalled subscriber can no longer block delivery to the others.
  • Client-side stream-progress watchdog in services/blockchain/Client.go SubscribeToServer catches half-zombie streams where stream.Recv() parks forever without returning an error.

Root cause

startSubscriptions called sub.subscription.Send synchronously inside the broadcast loop (Server.go around line 692). One slow consumer parked the loop, starving all other subscribers of notifications and heartbeats. The client-side staleness check at Client.go:1316-1330 only fires when stream.Recv() returns an error — a parked Send produces no error on the client, so the check is unreachable.

Combined with a 1m36s SubtreeProcessor.reorgBlocks window (during which BA's listener wasn't draining blockchainSubscriptionCh), the resulting HTTP/2 receive-window starvation parked the stream indefinitely. Server Send() returned nil every time for the queued frames; client Recv() never returned; nothing on either side produced an error or a keepalive trip.

See issue #872 for the full timeline and the in-depth analysis docs under reviews/issue-872-rethink-*.md.

Fix 1 — Server: per-subscriber buffered fan-out

services/blockchain/Server.go:

  • Each subscriber gets a bounded pending channel (cap 64).
  • The broadcast loop does a non-blocking send into that channel. Subscribers whose buffer is full are evicted immediately via deadSubscriptions (Subscriber X pending buffer full, marking dead log).
  • One drain goroutine per subscriber owns sub.subscription.Send. This preserves the gRPC no-concurrent-Send invariant on the stream while isolating slow consumers — a stalled Send parks only that subscriber's goroutine, not the broadcast loop.
  • sendInitialNotification now enqueues to sub.pending instead of calling Send directly, so a slow new-subscriber's initial send can't block other new subscriptions.
  • deadSubscriptions cleanup guards against double-close on pending (broadcast eviction + drain-goroutine Send-error can both mark the same sub dead).

Fix 2 — Client: stream-progress watchdog

services/blockchain/Client.go SubscribeToServer:

  • Per-stream context.WithCancel plus a watchdog goroutine.
  • The watchdog ticks at HeartbeatInterval/2 and checks time.Since(lastRecvAt). If no Recv() returns within 2 * HeartbeatInterval (defaults: 10s/2 tick, 20s threshold), it calls cancelStream(). Recv() then returns context.Canceled and the existing reconnect path kicks in.
  • lastRecvAt is updated on every successful Recv() return (atomic.Int64 of UnixNano).
  • shutdown() helper cancels the stream and waits for the watchdog goroutine to exit. Called on all exit paths to prevent goroutine leaks.

What this does NOT include (separate follow-ups)

  • Tip-lag Prometheus gauge / alert (block_assembly_tip_lag_blocks). Penetration-tester analysis flagged this as the highest-leverage detection change; covered by a follow-up issue.
  • Root-cause investigation of the 18-EOF burst that originally triggered the cascade. Separate investigation; the fixes here address the bug class regardless of what triggers it.
  • Per-subscriber Send() deadline server-side (faster than buffer-overflow eviction). Considered but not included — the client-side watchdog already kills the zombie stream in ~20s, which propagates back to the server as a Send error. Could be added as belt-and-suspenders in a follow-up.

Test plan

  • TestStartSubscriptions_SlowSubscriberDoesNotBlockFastOnes — slow subscriber's Send blocks indefinitely; fast subscriber still receives all N notifications within deadline.
  • TestStartSubscriptions_FullBufferMarksSubscriberDead — stuck subscriber's pending fills; subscriber is evicted from the map.
  • TestStartSubscriptions_PerSubscriberOrderPreserved — per-subscriber FIFO preserved.
  • TestSubscribeToServer_WatchdogClosesZombieStream — fake server never sends; client cancels stream context within 2 * heartbeat + tick.
  • TestSubscribeToServer_WatchdogDoesNotFireWhenStreamProgresses — fake server sends heartbeats; watchdog does not fire.
  • go test -race -count=1 -tags testtxmetacache ./services/blockchain/ → 621 passed (no existing tests needed updating).
  • go vet ./services/blockchain/ clean.

…s (refs bsv-blockchain#872)

Previously startSubscriptions called sub.subscription.Send synchronously
inside the broadcast loop. One slow or flow-control-stalled subscriber
(e.g. blockassembly during a long reorgBlocks) would park the loop,
starving all other subscribers of notifications and heartbeats. That
left the client-side staleness check permanently unreachable (it only
fires when stream.Recv() returns an error; a parked Send never produces
one), causing the 15.5-hour BA stall on teratestnet.

Fix: each subscriber gets a bounded pending channel (cap 64). The
broadcast loop does a non-blocking send into that channel; subscribers
whose buffer is full are evicted immediately. One drain goroutine per
subscriber calls sub.subscription.Send, preserving the gRPC
no-concurrent-Send invariant while isolating slow consumers.

sendInitialNotification is updated to enqueue via pending (non-
blocking) rather than calling Send directly, keeping the
newSubscriptions case non-blocking.

Key safety points:
- Double-close guard in deadSubscriptions cleanup: the map-exists
  check ensures pending is closed exactly once, whether the sub is
  evicted via buffer-full or via Send error from the drain goroutine.
- Subscribe handler now constructs the subscriber once and reuses the
  same value for both newSubscriptions and deadSubscriptions pushes, so
  the struct keys match in the map.
- Drain goroutines exit cleanly on AppCtx cancel, sub.done close, or
  Send error.

Tests: three new tests in subscription_fanout_test.go pin the contract:
- SlowSubscriberDoesNotBlockFastOnes: regression test for bsv-blockchain#872
- FullBufferMarksSubscriberDead: eviction path
- PerSubscriberOrderPreserved: FIFO delivery guarantee
…erver (refs bsv-blockchain#872)

The heartbeat staleness check in SubscribeToServer only fires when
stream.Recv() returns an error. A half-zombie stream where Send() on the
server side is flow-control-stalled never produces a Recv error on the
client — Recv blocks indefinitely, the staleness check is unreachable,
and the subscription stays dark for hours.

Fix: each stream iteration creates a per-stream context
(context.WithCancel). A zombie watchdog goroutine ticks at
HeartbeatInterval/2 and checks time-since-last-Recv-return. If no Recv
returns within 2×HeartbeatInterval (zombieTimeout), the watchdog calls
cancelStream(), which causes Recv to return context.Canceled, and the
existing reconnect path takes over.

lastRecvAt is initialised to now() so a fresh stream is given the full
zombieTimeout grace period before the watchdog fires.

All exit paths from the inner recv loop call shutdown() (cancelStream +
<-watchdogDone) before re-entering the outer reconnect loop or returning,
preventing watchdog goroutine leaks.

Default zombieTimeout: 2×10s = 20s (configurable via HeartbeatInterval).

Tests in client_zombie_watchdog_test.go:
- WatchdogClosesZombieStream: zombie server that never sends; asserts
  the server's Subscribe is called at least twice (initial + reconnect)
  within a tight deadline, proving the watchdog fired.
- WatchdogDoesNotFireWhenStreamProgresses: healthy server sending Block
  notifications at HeartbeatInterval/2; asserts local subscribers keep
  receiving notifications across 5×zombieTimeout with no interruption.
@github-actions

github-actions Bot commented May 15, 2026

Copy link
Copy Markdown
Contributor

🤖 Claude Code Review

Status: Complete


Current Review:

This PR implements a well-structured fix for the block-assembly stall issue (#872). The changes introduce per-subscriber buffered fan-out on the server side and a client-side stream watchdog to detect zombie connections.

Observations:

  1. Architecture: The dual-layer fix (server non-blocking fan-out + client watchdog) addresses both symptoms and root cause effectively.

  2. Server-side bounded goroutine leak: In runSubscriberDrain (Server.go:793-838), when sendDeadline fires, a helper goroutine spawned at line 810 continues running until the blocked Send eventually returns. This is acknowledged in comments as "bounded to one per stuck stream" and is an acceptable tradeoff given gRPC's lack of context-aware Send. The subscriber is evicted before the goroutine completes, which correctly prioritizes system health over cleanup perfection.

  3. Test coverage: The new tests comprehensively verify the three key contracts (head-of-line isolation, buffer-overflow eviction, per-subscriber ordering). The zombie watchdog tests validate both positive (fires on stalled stream) and negative (doesn't fire on healthy stream) cases.

  4. Metrics: The three new Prometheus counters (subscriber_pending_full_total, subscriber_send_errors_total, watchdog_fires_total) provide appropriate observability for the new failure modes.

  5. Documentation accuracy: Inline comments accurately describe the implementation behavior, including precise timing calculations for the watchdog (quarter-tick interval, 1.25× worst-case detection latency).

History:

  • ✅ Fixed: Documentation mismatch in watchdog comment (now correctly states "quarter of the timeout")

@github-actions

github-actions Bot commented May 15, 2026

Copy link
Copy Markdown
Contributor

Benchmark Comparison Report

Baseline: main (unknown)

Current: PR-878 (6bc0d05)

Summary

  • Regressions: 0
  • Improvements: 0
  • Unchanged: 142
  • Significance level: p < 0.05
All benchmark results (sec/op)
Benchmark Baseline Current Change p-value
_NewBlockFromBytes-4 1.552µ 1.559µ ~ 1.000
SplitSyncedParentMap_SetIfNotExists/256_buckets-4 71.18n 71.10n ~ 0.600
SplitSyncedParentMap_SetIfNotExists/16_buckets-4 71.32n 71.36n ~ 0.600
SplitSyncedParentMap_SetIfNotExists/1_bucket-4 70.99n 71.25n ~ 0.600
SplitSyncedParentMap_ConcurrentSetIfNotExists/256_buckets... 31.66n 32.38n ~ 0.700
SplitSyncedParentMap_ConcurrentSetIfNotExists/16_buckets_... 52.97n 53.61n ~ 0.400
SplitSyncedParentMap_ConcurrentSetIfNotExists/1_bucket_pa... 127.2n 131.5n ~ 0.200
MiningCandidate_Stringify_Short-4 223.5n 228.0n ~ 0.100
MiningCandidate_Stringify_Long-4 1.603µ 1.637µ ~ 0.100
MiningSolution_Stringify-4 834.2n 845.0n ~ 0.100
BlockInfo_MarshalJSON-4 1.704µ 1.720µ ~ 0.100
NewFromBytes-4 123.2n 143.2n ~ 0.100
Mine_EasyDifficulty-4 61.12µ 60.75µ ~ 1.000
Mine_WithAddress-4 6.682µ 6.693µ ~ 0.400
BlockAssembler_AddTx-4 0.02982n 0.02990n ~ 1.000
AddNode-4 10.76 11.04 ~ 0.400
AddNodeWithMap-4 11.13 11.44 ~ 0.200
DirectSubtreeAdd/4_per_subtree-4 58.29n 57.00n ~ 0.700
DirectSubtreeAdd/64_per_subtree-4 29.57n 29.64n ~ 1.000
DirectSubtreeAdd/256_per_subtree-4 27.64n 27.75n ~ 0.200
DirectSubtreeAdd/1024_per_subtree-4 26.40n 26.38n ~ 1.000
DirectSubtreeAdd/2048_per_subtree-4 26.02n 26.09n ~ 0.400
SubtreeProcessorAdd/4_per_subtree-4 290.8n 290.5n ~ 1.000
SubtreeProcessorAdd/64_per_subtree-4 288.8n 279.9n ~ 0.200
SubtreeProcessorAdd/256_per_subtree-4 281.3n 278.2n ~ 0.200
SubtreeProcessorAdd/1024_per_subtree-4 275.7n 272.3n ~ 0.100
SubtreeProcessorAdd/2048_per_subtree-4 277.3n 275.3n ~ 0.400
SubtreeProcessorRotate/4_per_subtree-4 280.0n 279.1n ~ 1.000
SubtreeProcessorRotate/64_per_subtree-4 275.4n 273.8n ~ 0.400
SubtreeProcessorRotate/256_per_subtree-4 274.1n 275.0n ~ 0.100
SubtreeProcessorRotate/1024_per_subtree-4 273.6n 275.6n ~ 0.100
SubtreeNodeAddOnly/4_per_subtree-4 55.42n 55.17n ~ 1.000
SubtreeNodeAddOnly/64_per_subtree-4 36.10n 36.15n ~ 1.000
SubtreeNodeAddOnly/256_per_subtree-4 35.14n 35.08n ~ 0.800
SubtreeNodeAddOnly/1024_per_subtree-4 34.50n 34.51n ~ 0.700
SubtreeCreationOnly/4_per_subtree-4 112.5n 110.5n ~ 0.300
SubtreeCreationOnly/64_per_subtree-4 352.7n 354.1n ~ 1.000
SubtreeCreationOnly/256_per_subtree-4 1.224µ 1.230µ ~ 0.700
SubtreeCreationOnly/1024_per_subtree-4 3.769µ 3.772µ ~ 0.700
SubtreeCreationOnly/2048_per_subtree-4 6.785µ 6.849µ ~ 0.100
SubtreeProcessorOverheadBreakdown/64_per_subtree-4 274.8n 273.7n ~ 0.700
SubtreeProcessorOverheadBreakdown/1024_per_subtree-4 274.6n 272.0n ~ 0.400
ParallelGetAndSetIfNotExists/1k_nodes-4 546.4µ 549.8µ ~ 0.700
ParallelGetAndSetIfNotExists/10k_nodes-4 1.315m 1.334m ~ 0.100
ParallelGetAndSetIfNotExists/50k_nodes-4 6.704m 6.964m ~ 0.100
ParallelGetAndSetIfNotExists/100k_nodes-4 13.72m 13.94m ~ 0.100
SequentialGetAndSetIfNotExists/1k_nodes-4 630.7µ 636.1µ ~ 0.400
SequentialGetAndSetIfNotExists/10k_nodes-4 2.923m 2.952m ~ 0.400
SequentialGetAndSetIfNotExists/50k_nodes-4 11.26m 11.51m ~ 0.400
SequentialGetAndSetIfNotExists/100k_nodes-4 21.61m 22.08m ~ 0.200
ProcessOwnBlockSubtreeNodesParallel/1k_nodes-4 605.1µ 596.7µ ~ 0.200
ProcessOwnBlockSubtreeNodesParallel/10k_nodes-4 4.583m 4.568m ~ 1.000
ProcessOwnBlockSubtreeNodesParallel/100k_nodes-4 17.20m 17.30m ~ 0.700
ProcessOwnBlockSubtreeNodesSequential/1k_nodes-4 665.6µ 671.2µ ~ 0.700
ProcessOwnBlockSubtreeNodesSequential/10k_nodes-4 6.281m 6.415m ~ 0.100
ProcessOwnBlockSubtreeNodesSequential/100k_nodes-4 39.75m 43.77m ~ 0.100
DiskTxMap_SetIfNotExists-4 3.697µ 3.713µ ~ 1.000
DiskTxMap_SetIfNotExists_Parallel-4 3.803µ 3.519µ ~ 0.200
DiskTxMap_ExistenceOnly-4 281.3n 283.4n ~ 0.400
Queue-4 193.6n 193.8n ~ 1.000
AtomicPointer-4 8.108n 8.126n ~ 0.700
ReorgOptimizations/DedupFilterPipeline/Old/10K-4 749.0µ 747.9µ ~ 0.700
ReorgOptimizations/DedupFilterPipeline/New/10K-4 699.4µ 705.1µ ~ 0.200
ReorgOptimizations/AllMarkFalse/Old/10K-4 104.2µ 103.8µ ~ 0.400
ReorgOptimizations/AllMarkFalse/New/10K-4 58.20µ 58.04µ ~ 0.700
ReorgOptimizations/HashSlicePool/Old/10K-4 52.26µ 54.73µ ~ 0.100
ReorgOptimizations/HashSlicePool/New/10K-4 12.03µ 11.76µ ~ 0.100
ReorgOptimizations/NodeFlags/Old/10K-4 4.621µ 4.701µ ~ 0.700
ReorgOptimizations/NodeFlags/New/10K-4 1.544µ 1.595µ ~ 0.100
ReorgOptimizations/DedupFilterPipeline/Old/100K-4 9.154m 8.810m ~ 0.200
ReorgOptimizations/DedupFilterPipeline/New/100K-4 8.463m 8.583m ~ 0.400
ReorgOptimizations/AllMarkFalse/Old/100K-4 1.074m 1.077m ~ 0.700
ReorgOptimizations/AllMarkFalse/New/100K-4 730.0µ 732.1µ ~ 0.400
ReorgOptimizations/HashSlicePool/Old/100K-4 604.3µ 553.5µ ~ 0.100
ReorgOptimizations/HashSlicePool/New/100K-4 315.7µ 328.0µ ~ 0.100
ReorgOptimizations/NodeFlags/Old/100K-4 45.76µ 47.56µ ~ 0.400
ReorgOptimizations/NodeFlags/New/100K-4 16.48µ 16.09µ ~ 0.100
TxMapSetIfNotExists-4 51.33n 51.86n ~ 0.100
TxMapSetIfNotExistsDuplicate-4 43.45n 43.55n ~ 0.700
ChannelSendReceive-4 646.8n 676.0n ~ 0.100
CalcBlockWork-4 459.6n 465.1n ~ 0.100
CalculateWork-4 618.1n 614.8n ~ 0.400
BuildBlockLocatorString_Helpers/Size_10-4 1.295µ 1.289µ ~ 0.700
BuildBlockLocatorString_Helpers/Size_100-4 15.19µ 12.39µ ~ 0.400
BuildBlockLocatorString_Helpers/Size_1000-4 123.5µ 127.6µ ~ 0.700
CatchupWithHeaderCache-4 104.3m 104.4m ~ 1.000
_prepareTxsPerLevel-4 419.4m 416.2m ~ 1.000
_prepareTxsPerLevelOrdered-4 3.749m 3.803m ~ 0.400
_prepareTxsPerLevel_Comparison/Original-4 426.4m 426.7m ~ 1.000
_prepareTxsPerLevel_Comparison/Optimized-4 3.768m 3.779m ~ 1.000
SubtreeSizes/10k_tx_4_per_subtree-4 1.350m 1.363m ~ 1.000
SubtreeSizes/10k_tx_16_per_subtree-4 326.3µ 322.5µ ~ 0.100
SubtreeSizes/10k_tx_64_per_subtree-4 76.72µ 77.18µ ~ 0.100
SubtreeSizes/10k_tx_256_per_subtree-4 19.11µ 18.91µ ~ 0.400
SubtreeSizes/10k_tx_512_per_subtree-4 9.450µ 9.458µ ~ 0.700
SubtreeSizes/10k_tx_1024_per_subtree-4 4.694µ 4.703µ ~ 1.000
SubtreeSizes/10k_tx_2k_per_subtree-4 2.354µ 2.356µ ~ 1.000
BlockSizeScaling/10k_tx_64_per_subtree-4 74.81µ 74.94µ ~ 0.700
BlockSizeScaling/10k_tx_256_per_subtree-4 18.77µ 18.78µ ~ 1.000
BlockSizeScaling/10k_tx_1024_per_subtree-4 4.728µ 4.699µ ~ 1.000
BlockSizeScaling/50k_tx_64_per_subtree-4 394.0µ 395.7µ ~ 0.700
BlockSizeScaling/50k_tx_256_per_subtree-4 95.42µ 93.44µ ~ 0.100
BlockSizeScaling/50k_tx_1024_per_subtree-4 23.33µ 23.16µ ~ 0.700
SubtreeAllocations/small_subtrees_exists_check-4 157.2µ 156.2µ ~ 1.000
SubtreeAllocations/small_subtrees_data_fetch-4 165.2µ 163.4µ ~ 0.400
SubtreeAllocations/small_subtrees_full_validation-4 321.8µ 321.8µ ~ 1.000
SubtreeAllocations/medium_subtrees_exists_check-4 9.251µ 9.171µ ~ 0.400
SubtreeAllocations/medium_subtrees_data_fetch-4 9.792µ 9.651µ ~ 0.100
SubtreeAllocations/medium_subtrees_full_validation-4 18.95µ 19.08µ ~ 0.700
SubtreeAllocations/large_subtrees_exists_check-4 2.187µ 2.186µ ~ 1.000
SubtreeAllocations/large_subtrees_data_fetch-4 2.392µ 2.353µ ~ 0.100
SubtreeAllocations/large_subtrees_full_validation-4 4.718µ 4.750µ ~ 0.200
_BufferPoolAllocation/16KB-4 3.292µ 3.338µ ~ 0.700
_BufferPoolAllocation/32KB-4 9.062µ 6.748µ ~ 0.100
_BufferPoolAllocation/64KB-4 16.37µ 13.26µ ~ 0.700
_BufferPoolAllocation/128KB-4 27.20µ 30.18µ ~ 0.100
_BufferPoolAllocation/512KB-4 98.86µ 115.67µ ~ 0.100
_BufferPoolConcurrent/32KB-4 17.21µ 16.10µ ~ 0.700
_BufferPoolConcurrent/64KB-4 26.67µ 25.36µ ~ 0.700
_BufferPoolConcurrent/512KB-4 146.1µ 146.8µ ~ 0.700
_SubtreeDeserializationWithBufferSizes/16KB-4 629.8µ 643.6µ ~ 0.100
_SubtreeDeserializationWithBufferSizes/32KB-4 627.3µ 634.6µ ~ 0.100
_SubtreeDeserializationWithBufferSizes/64KB-4 622.6µ 629.3µ ~ 0.100
_SubtreeDeserializationWithBufferSizes/128KB-4 623.2µ 627.4µ ~ 0.400
_SubtreeDeserializationWithBufferSizes/512KB-4 635.7µ 640.9µ ~ 0.700
_SubtreeDataDeserializationWithBufferSizes/16KB-4 37.30m 37.45m ~ 0.700
_SubtreeDataDeserializationWithBufferSizes/32KB-4 37.14m 36.95m ~ 0.200
_SubtreeDataDeserializationWithBufferSizes/64KB-4 37.30m 37.41m ~ 1.000
_SubtreeDataDeserializationWithBufferSizes/128KB-4 37.24m 37.31m ~ 1.000
_SubtreeDataDeserializationWithBufferSizes/512KB-4 37.28m 37.25m ~ 1.000
_PooledVsNonPooled/Pooled-4 234.4n 819.2n ~ 0.100
_PooledVsNonPooled/NonPooled-4 5.822µ 6.491µ ~ 0.100
_MemoryFootprint/Current_512KB_32concurrent-4 8.015µ 7.340µ ~ 0.200
_MemoryFootprint/Proposed_32KB_32concurrent-4 9.928µ 11.080µ ~ 0.100
_MemoryFootprint/Alternative_64KB_32concurrent-4 9.515µ 9.899µ ~ 0.100
StoreBlock_Sequential/BelowCSVHeight-4 331.3µ 316.8µ ~ 0.200
StoreBlock_Sequential/AboveCSVHeight-4 330.6µ 333.7µ ~ 0.100
GetUtxoHashes-4 257.2n 262.7n ~ 0.700
GetUtxoHashes_ManyOutputs-4 43.22µ 46.54µ ~ 0.100
_NewMetaDataFromBytes-4 251.6n 252.5n ~ 0.700
_Bytes-4 641.3n 636.8n ~ 0.700
_MetaBytes-4 705.3n 566.5n ~ 0.100

Threshold: >10% with p < 0.05 | Generated: 2026-05-15 15:17 UTC

…bscriptions cap (refs bsv-blockchain#872)

runSubscriberDrain previously had no bound on how long a single Send
call could block. A truly zombie stream (server parked waiting for
client WINDOW_UPDATE that never arrives) would hold the drain goroutine
indefinitely — one leak per stuck stream. With the per-subscriber
goroutine design this is bounded by subscriber count, but still worth
closing.

Fix: race Send against a 5s timer. On deadline, log warn, increment
the send-errors metric, and push the subscriber to deadSubscriptions.
The Send helper goroutine is a residual leak until Send eventually
returns, but the drain goroutine itself exits immediately, preventing
it from blocking deadSubscriptions cleanup for the rest of the
subscriber lifetime.

Also bump deadSubscriptions channel cap from 10 to 1000. The original
cap made drain-goroutine dead-pushes a bottleneck during a simultaneous
burst of subscriber deaths (e.g. the 18-EOF incident). 1000 matches the
scale of realistic worst-case bursts without unbounded growth.

subscriber_pending_full_total counter added for the buffer-full eviction
path in the broadcast loop.
…argins (refs bsv-blockchain#872)

Watchdog tick rate: zombieTimeout/2 → zombieTimeout/4. Worst-case
detection latency drops from 1.5× to 1.25× zombieTimeout (30s → 25s at
production HeartbeatInterval=10s) at negligible CPU cost.

Prometheus metrics for fan-out health:
  - teranode_blockchain_subscriber_pending_full_total{source}: evictions
    from the broadcast loop due to full pending buffer (backpressure
    indicator — precursor to subscriber loss)
  - teranode_blockchain_subscriber_send_errors_total{source}: evictions
    from the drain goroutine due to Send errors or deadline exceeded
  - teranode_blockchain_watchdog_fires_total{source}: client-side zombie
    watchdog firings (counts half-zombie stream reconnects)

initPrometheusMetrics called from NewClientWithAddress so metrics are
initialised when the client is created without a server (needed in
watchdog tests and standalone client deployments).

Test: bump per-iteration timeout in WatchdogDoesNotFireWhenStreamProgresses
from zombieTimeout+50ms to zombieTimeout+200ms to reduce flakiness on
loaded CI runners.
Comment thread services/blockchain/Client.go Outdated
… complexity (refs bsv-blockchain#872)

Sonar rule go:S3776 flagged TestStartSubscriptions_PerSubscriberOrderPreserved at
complexity 16. Extracting the hash-index-extraction heuristic into extractTestIndex
drops it below the 15 threshold.
…refs bsv-blockchain#872)

Comment on lines 1273-1276 still said "ticks at half the timeout" after
commit fc8f802 changed the tick to zombieTimeout/4. Update to match.
@sonarqubecloud

Copy link
Copy Markdown

@oskarszoon oskarszoon merged commit 3090446 into bsv-blockchain:main May 15, 2026
25 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants