fix(subtree_data): stop cancel cascade + segregate client-gone errors by icellan · Pull Request #947 · bsv-blockchain/teranode

icellan · 2026-05-26T12:17:00Z

Summary

Under catchup load — block-validator pulling many subtrees from a peer in parallel — the asset server logs an avalanche of `io: read/write on closed pipe` warnings. Measured rate on the scale-1 cluster: 4832 `write_failed` vs 82 `on_demand_created_locked` successes per ~1h window across 4 asset pods (~98% failure). Two coupled bugs, both with the same shape.

Bug 1 (consumer): cascading cancellation in both fan-outs

Two places fan out subtree_data fetches under `errgroup.WithContext`:

`services/subtreevalidation/check_block_subtrees.go` (validation path)
`services/blockvalidation/get_blocks.go` (catchup path)

When any one goroutine returns an error, `gCtx` is cancelled and every in-flight HTTP body `Read` returns `ctx.Err()`. Each client then closes its TCP connection. nginx (default `proxy_ignore_client_abort off`) propagates that close upstream. On the peer:

`c.Response().Write` fails → `c.Stream` returns → `defer r.Close()` closes the pipe reader
The producer goroutine's next write into `io.MultiWriter` returns `io.ErrClosedPipe`
`storer.Abort(err)` discards already-paid Aerospike work

One failure → tens of cancelled streams across the pod fleet, every failed attempt wastes the chunked tx-meta fetch it already completed.

Fix: keep the errgroup but detach the heavy subtree_data work from `gCtx`. The existence check still respects the original context (so a pre-cancelled call exits fast); the HTTP fetch + parse + store does not. Sibling failures no longer cancel in-flight peers; each fetch completes naturally or hits its own `http_streaming_timeout`. The peer finishes writing its subtreeData file, which becomes a cache hit on the next retry. Trade-off: batch failure detection waits for in-flight peers to drain instead of bailing early — acceptable because the per-fetch streaming timeout still bounds it, and the alternative is throwing away pipeline work on every batch failure.

Bug 2 (producer): `write_failed` conflates client disconnects with server faults

`services/asset/repository/GetSubtreeData.go` logged every producer-side error at WARN under metric reason `write_failed`. `io.ErrClosedPipe` and context cancellation are caller-side conditions, not server faults — they should not pollute the ops signal that ought to flag genuine write failures.

Fix: detect `errors.IsContextError(err) || errors.Is(err, io.ErrClosedPipe)` and log at debug under a new metric reason `client_gone`. WARN + `write_failed` is now reserved for actual server-side write failures.

Files

`services/subtreevalidation/check_block_subtrees.go` — pass `ctx` (not `gCtx`) to the subtree_data fetch + parser
`services/blockvalidation/get_blocks.go` — detach `ctx` via `context.WithoutCancel` after the existence check in `fetchAndStoreSubtreeData`
`services/asset/repository/GetSubtreeData.go` — classify `client_gone` separately from `write_failed`
`services/asset/repository/metrics.go` — document the new `client_gone` reason
`services/{asset/repository,subtreevalidation,blockvalidation}/*_test.go` — TDD regression tests

Tests

Written TDD-style — confirmed each test FAILS on pre-fix code and PASSES post-fix.

The cascade regression tests use a custom streaming response body that respects `req.Context()` between halves and waits for cancellation to propagate. httpmock's default buffered bodies cannot exercise the race because the full body is in memory by the time the reader runs; we needed an actual gated stream to prove that mid-stream cancellation truncates the body pre-fix and does not post-fix.

`TestCheckBlockSubtrees_SiblingFailureDoesNotCancelInFlight` — subtree A succeeds streaming, sibling B fails fast; A's `FileTypeSubtreeData` file must be in the store afterwards.
`TestFetchSubtreeDataForBlock_SiblingFailureDoesNotCancelInFlight` — same shape, catchup path.
`TestGetSubtreeDataWithReader/client disconnect during stream is classified as client_gone, not write_failed` — producer-side classification.
`TestGetSubtreeDataWithReader/genuine server-side write error still records write_failed` — companion test confirming WARN+`write_failed` is preserved for genuine faults (e.g. missing tx meta).

Out of scope (follow-ups)

Settings-level mitigations (`asset_concurrency_subtree_data_create`, `asset_subtreeDataStreamingChunkSize`, nginx `proxy_ignore_client_abort on`) are operational tuning, separate concerns.
The asset-side `storer.Abort` on `client_gone` still throws away the file in-progress. Decoupling the storer write from the response pipe (so a client disconnect lets the file finish to disk) is a bigger refactor and a separate PR.

Test plan

`go test ./services/asset/repository/...` — pass
`go test ./services/subtreevalidation/...` — pass
`go test ./services/blockvalidation/...` — pass (incl. catchup subpackage)
`go test -race ./services/asset/repository/...` — pass
`go vet` on all three packages — clean
`golangci-lint run --new-from-rev=origin/main` — clean
Verify on scale-1 cluster: `teranode_asset_repository_subtree_data_created_total{result="error",reason="write_failed"}` rate drops to near-zero; `{reason="client_gone"}` absorbs the volume; `io: read/write on closed pipe` WARN lines disappear from the asset logs

Copilot

Pull request overview

This PR addresses subtree_data streaming failures under catchup load by preventing cancellation cascades during parallel subtree fetch/processing, and by improving observability so client/proxy disconnects are not counted as server write faults.

Changes:

Prevent errgroup sibling failures from cancelling in-flight /subtree_data HTTP streams by using the parent ctx (not gCtx) for the streaming fetch + processing.
Classify mid-stream disconnect conditions (context cancellation / io.ErrClosedPipe) as client_gone instead of write_failed, and reduce log severity to debug for those cases.
Add a regression test asserting client_gone increments and write_failed does not for an early reader close.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated no comments.

File	Description
services/subtreevalidation/check_block_subtrees.go	Detach subtree_data streaming work from errgroup cancellation to avoid cancel cascades during batch parallelism.
services/asset/repository/GetSubtreeData.go	Split “client disconnect” errors into `client_gone` metric + debug logging; reserve `write_failed` for real server-side faults.
services/asset/repository/metrics.go	Document the new `client_gone` label value for the subtree_data_created metric.
services/asset/repository/GetSubtreeData_test.go	Add regression coverage for `client_gone` classification on early stream close.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Under catchup load — block-validator pulling many subtrees from a peer in parallel — the asset server logs an avalanche of "io: read/write on closed pipe" warnings. Measured at ~98% of all on-demand subtreeData creations in scale-1 (4832 write_failed vs 82 successes per ~1h window across 4 asset pods). Two coupled bugs, both with the same shape: CONSUMER SIDE (cascade-cancel) services/subtreevalidation/check_block_subtrees.go and services/blockvalidation/get_blocks.go both fan out subtree_data fetches under errgroup.WithContext. When any one goroutine errors, gCtx is cancelled and every in-flight HTTP body Read returns ctx.Err(). Each client then closes its TCP connection to the peer's asset-cache. nginx (default proxy_ignore_client_abort off) propagates the close upstream. The peer's c.Response().Write fails, c.Stream unwinds, defer r.Close() closes the response pipe, and the producer goroutine's next write into io.MultiWriter returns io.ErrClosedPipe — at which point storer.Abort discards already-paid Aerospike work. One failure cascades to tens of in-flight streams across the pod fleet. Fix (both files): keep the errgroup but detach the heavy subtree_data work from gCtx. In check_block_subtrees.go this means passing the parent ctx (not gCtx) to DoHTTPRequestBodyReaderWithRetry and processSubtreeDataStream. In get_blocks.go the call paths through fetchAndStoreSubtreeData; once the existence check has passed (still respecting gCtx so a pre-cancelled call exits fast), the function context.WithoutCancel(ctx)s itself so the subsequent HTTP fetch + parse + store run to completion. Sibling failures no longer cancel in-flight peers; each fetch completes naturally or hits its own http_streaming_timeout. The peer finishes writing its subtreeData file, which becomes a cache hit on the next retry. Trade-off: batch failure detection waits for in-flight peers instead of cancelling early — acceptable because the per-fetch streaming timeout still bounds it, and the alternative is throwing away pipeline work on every batch failure. PRODUCER SIDE (error classification) services/asset/repository/GetSubtreeData.go logged every producer error at WARN with metric reason "write_failed", conflating client disconnects with genuine server-side faults. io.ErrClosedPipe and ctx cancellation are caller-side conditions, not server faults; they should not pollute the ops signal that ought to flag genuine write failures. Fix: detect errors.IsContextError(err) || errors.Is(err, io.ErrClosedPipe) and log at debug under a new metric reason "client_gone". WARN + reason "write_failed" is now reserved for actual server-side write failures. TESTS TDD regression tests with a custom streaming response body that respects req.Context() between halves, so cancellation can actually truncate the body mid-stream (httpmock's default buffered bodies cannot exercise the race). Confirmed each test fails on pre-fix code and passes post-fix: - TestCheckBlockSubtrees_SiblingFailureDoesNotCancelInFlight - TestFetchSubtreeDataForBlock_SiblingFailureDoesNotCancelInFlight Plus producer-side classification tests: - TestGetSubtreeDataWithReader/"client disconnect ... is classified as client_gone, not write_failed" - TestGetSubtreeDataWithReader/"genuine server-side write error still records write_failed" No behaviour change for the success path or for genuine fetch/storage failures.

github-actions · 2026-05-26T12:53:13Z

🤖 Claude Code Review

Status: Complete

Current Review:

This PR addresses two coupled bugs causing the "catchup avalanche" under load. The fix is well-designed with strong test coverage. No critical issues found.

Minor observation: In check_block_subtrees.go:341, the code still uses gCtx for DecoupleTracingSpan, while the comment mentions ensuring tracking completes even if parent context is cancelled. The existing DecoupleTracingSpan likely already handles this, but worth verifying that the decoupling intent is preserved despite passing gCtx rather than ctx.

Overall, this is high-quality defensive engineering with thorough TDD-style tests that fail pre-fix and pass post-fix.

github-actions · 2026-05-26T13:07:48Z

Benchmark Comparison Report

Baseline: main (unknown)

Current: PR-947 (c964edc)

Summary

Regressions: 0
Improvements: 0
Unchanged: 144
Significance level: p < 0.05

All benchmark results (sec/op)

Benchmark	Baseline	Current	Change	p-value
_NewBlockFromBytes-4	1.742µ	1.741µ	~	1.000
SplitSyncedParentMap_SetIfNotExists/256_buckets-4	61.61n	61.67n	~	0.600
SplitSyncedParentMap_SetIfNotExists/16_buckets-4	61.69n	61.84n	~	0.400
SplitSyncedParentMap_SetIfNotExists/1_bucket-4	61.70n	61.70n	~	0.800
SplitSyncedParentMap_ConcurrentSetIfNotExists/256_buckets...	28.95n	29.70n	~	0.100
SplitSyncedParentMap_ConcurrentSetIfNotExists/16_buckets_...	49.86n	51.15n	~	0.100
SplitSyncedParentMap_ConcurrentSetIfNotExists/1_bucket_pa...	107.6n	107.2n	~	0.400
MiningCandidate_Stringify_Short-4	268.2n	271.5n	~	0.100
MiningCandidate_Stringify_Long-4	1.887µ	1.895µ	~	0.700
MiningSolution_Stringify-4	963.1n	967.5n	~	0.100
BlockInfo_MarshalJSON-4	1.786µ	1.788µ	~	0.300
NewFromBytes-4	124.8n	125.3n	~	0.300
AddTxBatchColumnar_Validation-4	2.416µ	2.441µ	~	1.000
OffsetValidationLoop-4	719.1n	719.6n	~	1.000
Mine_EasyDifficulty-4	34.13µ	34.05µ	~	0.700
Mine_WithAddress-4	3.776µ	3.734µ	~	0.100
BlockAssembler_AddTx-4	0.02889n	0.02797n	~	1.000
AddNode-4	12.08	11.47	~	0.100
AddNodeWithMap-4	11.55	12.45	~	0.700
DirectSubtreeAdd/4_per_subtree-4	55.15n	58.36n	~	0.200
DirectSubtreeAdd/64_per_subtree-4	29.28n	29.02n	~	0.700
DirectSubtreeAdd/256_per_subtree-4	27.76n	28.30n	~	0.400
DirectSubtreeAdd/1024_per_subtree-4	26.46n	26.48n	~	0.800
DirectSubtreeAdd/2048_per_subtree-4	26.08n	26.19n	~	0.700
SubtreeProcessorAdd/4_per_subtree-4	296.2n	296.9n	~	1.000
SubtreeProcessorAdd/64_per_subtree-4	292.4n	303.3n	~	0.400
SubtreeProcessorAdd/256_per_subtree-4	286.1n	291.6n	~	0.100
SubtreeProcessorAdd/1024_per_subtree-4	278.8n	285.2n	~	0.100
SubtreeProcessorAdd/2048_per_subtree-4	280.8n	285.0n	~	0.100
SubtreeProcessorRotate/4_per_subtree-4	287.4n	294.3n	~	0.200
SubtreeProcessorRotate/64_per_subtree-4	279.1n	290.1n	~	0.100
SubtreeProcessorRotate/256_per_subtree-4	281.3n	283.6n	~	1.000
SubtreeProcessorRotate/1024_per_subtree-4	281.6n	282.2n	~	1.000
SubtreeNodeAddOnly/4_per_subtree-4	55.31n	54.98n	~	0.700
SubtreeNodeAddOnly/64_per_subtree-4	35.98n	35.98n	~	1.000
SubtreeNodeAddOnly/256_per_subtree-4	35.19n	34.96n	~	0.100
SubtreeNodeAddOnly/1024_per_subtree-4	34.52n	34.33n	~	0.700
SubtreeCreationOnly/4_per_subtree-4	110.2n	110.4n	~	1.000
SubtreeCreationOnly/64_per_subtree-4	352.0n	350.6n	~	0.400
SubtreeCreationOnly/256_per_subtree-4	1.238µ	1.242µ	~	0.400
SubtreeCreationOnly/1024_per_subtree-4	3.874µ	3.797µ	~	0.200
SubtreeCreationOnly/2048_per_subtree-4	6.904µ	6.944µ	~	0.700
SubtreeProcessorOverheadBreakdown/64_per_subtree-4	281.0n	284.9n	~	0.100
SubtreeProcessorOverheadBreakdown/1024_per_subtree-4	280.5n	293.1n	~	0.100
ParallelGetAndSetIfNotExists/1k_nodes-4	2.014m	2.024m	~	0.700
ParallelGetAndSetIfNotExists/10k_nodes-4	5.136m	5.174m	~	0.100
ParallelGetAndSetIfNotExists/50k_nodes-4	7.433m	7.426m	~	0.700
ParallelGetAndSetIfNotExists/100k_nodes-4	10.31m	10.08m	~	0.100
SequentialGetAndSetIfNotExists/1k_nodes-4	1.785m	1.796m	~	1.000
SequentialGetAndSetIfNotExists/10k_nodes-4	4.585m	4.552m	~	0.700
SequentialGetAndSetIfNotExists/50k_nodes-4	14.55m	13.95m	~	0.100
SequentialGetAndSetIfNotExists/100k_nodes-4	27.81m	25.32m	~	0.100
ProcessOwnBlockSubtreeNodesParallel/1k_nodes-4	2.089m	2.071m	~	1.000
ProcessOwnBlockSubtreeNodesParallel/10k_nodes-4	8.446m	8.292m	~	0.100
ProcessOwnBlockSubtreeNodesParallel/100k_nodes-4	13.82m	13.51m	~	0.100
ProcessOwnBlockSubtreeNodesSequential/1k_nodes-4	1.824m	1.801m	~	0.400
ProcessOwnBlockSubtreeNodesSequential/10k_nodes-4	8.605m	8.236m	~	0.200
ProcessOwnBlockSubtreeNodesSequential/100k_nodes-4	50.53m	48.48m	~	0.700
DiskTxMap_SetIfNotExists-4	3.975µ	4.023µ	~	1.000
DiskTxMap_SetIfNotExists_Parallel-4	3.666µ	3.601µ	~	0.100
DiskTxMap_ExistenceOnly-4	544.4n	419.6n	~	1.000
Queue-4	187.2n	187.0n	~	0.500
AtomicPointer-4	3.640n	3.637n	~	1.000
ReorgOptimizations/DedupFilterPipeline/Old/10K-4	835.5µ	856.8µ	~	0.200
ReorgOptimizations/DedupFilterPipeline/New/10K-4	764.7µ	777.2µ	~	0.100
ReorgOptimizations/AllMarkFalse/Old/10K-4	105.3µ	106.4µ	~	0.400
ReorgOptimizations/AllMarkFalse/New/10K-4	65.50µ	64.43µ	~	0.200
ReorgOptimizations/HashSlicePool/Old/10K-4	51.81µ	52.05µ	~	0.700
ReorgOptimizations/HashSlicePool/New/10K-4	11.03µ	11.33µ	~	0.200
ReorgOptimizations/NodeFlags/Old/10K-4	4.570µ	4.816µ	~	0.100
ReorgOptimizations/NodeFlags/New/10K-4	1.595µ	2.489µ	~	0.100
ReorgOptimizations/DedupFilterPipeline/Old/100K-4	10.03m	10.22m	~	0.100
ReorgOptimizations/DedupFilterPipeline/New/100K-4	11.44m	10.67m	~	0.100
ReorgOptimizations/AllMarkFalse/Old/100K-4	1.099m	1.099m	~	1.000
ReorgOptimizations/AllMarkFalse/New/100K-4	705.0µ	709.6µ	~	0.200
ReorgOptimizations/HashSlicePool/Old/100K-4	607.1µ	522.6µ	~	0.100
ReorgOptimizations/HashSlicePool/New/100K-4	205.9µ	215.2µ	~	0.200
ReorgOptimizations/NodeFlags/Old/100K-4	50.06µ	48.62µ	~	0.400
ReorgOptimizations/NodeFlags/New/100K-4	17.44µ	16.44µ	~	0.400
TxMapSetIfNotExists-4	49.68n	49.81n	~	1.000
TxMapSetIfNotExistsDuplicate-4	41.42n	41.82n	~	0.400
ChannelSendReceive-4	609.2n	624.6n	~	0.200
CalcBlockWork-4	366.6n	363.5n	~	0.100
CalculateWork-4	498.8n	492.5n	~	0.700
BuildBlockLocatorString_Helpers/Size_10-4	1.337µ	1.350µ	~	0.100
BuildBlockLocatorString_Helpers/Size_100-4	12.84µ	13.02µ	~	0.100
BuildBlockLocatorString_Helpers/Size_1000-4	145.6µ	154.1µ	~	0.700
CatchupWithHeaderCache-4	104.6m	104.5m	~	1.000
_BufferPoolAllocation/16KB-4	3.917µ	3.942µ	~	0.700
_BufferPoolAllocation/32KB-4	7.801µ	8.275µ	~	0.700
_BufferPoolAllocation/64KB-4	16.62µ	16.03µ	~	0.700
_BufferPoolAllocation/128KB-4	25.74µ	26.66µ	~	0.700
_BufferPoolAllocation/512KB-4	110.2µ	123.0µ	~	0.200
_BufferPoolConcurrent/32KB-4	19.58µ	18.44µ	~	0.400
_BufferPoolConcurrent/64KB-4	30.29µ	30.22µ	~	0.700
_BufferPoolConcurrent/512KB-4	145.8µ	147.8µ	~	1.000
_SubtreeDeserializationWithBufferSizes/16KB-4	656.5µ	625.9µ	~	0.100
_SubtreeDeserializationWithBufferSizes/32KB-4	674.2µ	621.0µ	~	0.100
_SubtreeDeserializationWithBufferSizes/64KB-4	666.2µ	614.1µ	~	0.100
_SubtreeDeserializationWithBufferSizes/128KB-4	660.2µ	617.7µ	~	0.100
_SubtreeDeserializationWithBufferSizes/512KB-4	632.5µ	613.9µ	~	0.100
_SubtreeDataDeserializationWithBufferSizes/16KB-4	36.49m	36.31m	~	0.100
_SubtreeDataDeserializationWithBufferSizes/32KB-4	36.26m	36.33m	~	1.000
_SubtreeDataDeserializationWithBufferSizes/64KB-4	36.18m	36.31m	~	0.700
_SubtreeDataDeserializationWithBufferSizes/128KB-4	36.18m	36.02m	~	1.000
_SubtreeDataDeserializationWithBufferSizes/512KB-4	35.82m	35.84m	~	1.000
_PooledVsNonPooled/Pooled-4	740.8n	651.2n	~	0.100
_PooledVsNonPooled/NonPooled-4	7.761µ	7.662µ	~	0.100
_MemoryFootprint/Current_512KB_32concurrent-4	6.449µ	6.406µ	~	0.100
_MemoryFootprint/Proposed_32KB_32concurrent-4	9.771µ	9.427µ	~	0.100
_MemoryFootprint/Alternative_64KB_32concurrent-4	9.262µ	8.926µ	~	0.100
_prepareTxsPerLevel-4	410.9m	412.9m	~	0.100
_prepareTxsPerLevelOrdered-4	3.577m	3.840m	~	0.200
_prepareTxsPerLevel_Comparison/Original-4	408.8m	413.9m	~	0.700
_prepareTxsPerLevel_Comparison/Optimized-4	3.607m	3.856m	~	0.200
SubtreeSizes/10k_tx_4_per_subtree-4	1.345m	1.359m	~	0.700
SubtreeSizes/10k_tx_16_per_subtree-4	326.3µ	321.4µ	~	0.700
SubtreeSizes/10k_tx_64_per_subtree-4	76.82µ	76.68µ	~	1.000
SubtreeSizes/10k_tx_256_per_subtree-4	19.22µ	19.25µ	~	0.700
SubtreeSizes/10k_tx_512_per_subtree-4	9.562µ	9.407µ	~	0.100
SubtreeSizes/10k_tx_1024_per_subtree-4	4.785µ	4.682µ	~	0.100
SubtreeSizes/10k_tx_2k_per_subtree-4	2.375µ	2.344µ	~	0.700
BlockSizeScaling/10k_tx_64_per_subtree-4	74.91µ	74.28µ	~	0.400
BlockSizeScaling/10k_tx_256_per_subtree-4	19.15µ	18.81µ	~	0.100
BlockSizeScaling/10k_tx_1024_per_subtree-4	4.726µ	4.658µ	~	0.100
BlockSizeScaling/50k_tx_64_per_subtree-4	396.3µ	394.4µ	~	0.400
BlockSizeScaling/50k_tx_256_per_subtree-4	95.57µ	94.09µ	~	0.200
BlockSizeScaling/50k_tx_1024_per_subtree-4	23.52µ	22.86µ	~	0.100
SubtreeAllocations/small_subtrees_exists_check-4	154.9µ	154.5µ	~	0.700
SubtreeAllocations/small_subtrees_data_fetch-4	170.7µ	166.6µ	~	0.400
SubtreeAllocations/small_subtrees_full_validation-4	331.0µ	323.8µ	~	0.400
SubtreeAllocations/medium_subtrees_exists_check-4	9.176µ	9.296µ	~	0.700
SubtreeAllocations/medium_subtrees_data_fetch-4	10.09µ	10.06µ	~	1.000
SubtreeAllocations/medium_subtrees_full_validation-4	19.25µ	19.06µ	~	0.100
SubtreeAllocations/large_subtrees_exists_check-4	2.232µ	2.191µ	~	0.200
SubtreeAllocations/large_subtrees_data_fetch-4	2.447µ	2.484µ	~	0.700
SubtreeAllocations/large_subtrees_full_validation-4	4.856µ	4.706µ	~	0.200
StoreBlock_Sequential/BelowCSVHeight-4	332.9µ	334.2µ	~	0.100
StoreBlock_Sequential/AboveCSVHeight-4	334.4µ	335.4µ	~	0.700
GetUtxoHashes-4	269.0n	263.6n	~	0.700
GetUtxoHashes_ManyOutputs-4	42.66µ	43.55µ	~	0.100
_NewMetaDataFromBytes-4	229.7n	233.7n	~	0.100
_Bytes-4	398.3n	408.2n	~	0.100
_MetaBytes-4	137.8n	140.4n	~	0.100

Threshold: >10% with p < 0.05 | Generated: 2026-05-26 13:33 UTC

oskarszoon

Approve. Code quality is unusually good — WithoutCancel rationale lives in code not just the PR body, tests are real regressions (gated streaming body genuinely fails pre-fix), all -race clean. Three coordination items below; none are blockers.

Coordinate before merge

1. Two callsites use different isolation mechanisms; comment implies equivalence.

services/subtreevalidation/check_block_subtrees.go:321 passes the outer ctx (still cancellable by the RPC caller).
services/blockvalidation/get_blocks.go:523 uses context.WithoutCancel(ctx) (never cancellable).

Both are correct for their context but the "See companion fix" comment implies they're equivalent. Either tighten the comments to spell out the difference, or unify to WithoutCancel in both.

2. Dashboard / alert compat for write_failed.
Existing alerts thresholded on rate(prometheusAssetSubtreeDataCreated{reason="write_failed"}[5m]) will see ~98% of catchup-time volume drop after deploy. Either re-tune alerts in the change-window, or keep write_failed bumping AND add a separate error_type dimension so existing dashboards stay valid. Worth a release-notes line.

3. DeadlineExceeded now buckets with Canceled as client_gone.
errors.IsContextError(err) matches both context.Canceled and context.DeadlineExceeded. Server-side deadline exhaustion on slow disk writes is now debug-level instead of Warn. Probably fine here (work is bounded by http_streaming_timeout) but the operator-facing semantics shift. Confirm you intended both, not just Canceled.

Nits (non-blocking)

Pre-existing (not this PR): net.ErrClosed on some platforms doesn't match IsContextError or io.ErrClosedPipe and lands in write_failed.
gatedStreamingBody / gatedStreamingBodyGB — ~35 lines duplicated across the two test files. Not worth lifting until a third caller arrives.
storer.Close(gCtx) at GetSubtreeData.go:274 doesn't classify client_gone (only the write path does). Asymmetric but rare.
Cascade regression tests don't assert the failing sibling's error name — ErrorContains(t, err, subtreeB.RootHash().String()) would lock regression coverage harder.

Confirmations worth recording

WithoutCancel placement is correct: existence check at :502 still uses original ctx. No goroutine leak.
Streaming timeout bound: 5 min (http_streaming_timeout) or pre-existing deadline if one was set — doHTTPRequestForStreamingWithRetryAfter only adds the timeout when no deadline exists.
g.Wait() still collects errors correctly after the decouple; only delays error return by up to httpStreamingTimeout when a subtree is mid-stream at failure time.
Worst-case batch latency: subtreeConcurrency (default 8) fetches each running up to 5 min after a block-level failure before fetchSubtreeDataForBlock returns. Acceptable for catch-up.
client_gone classification is comprehensive for the documented cases; IsContextError handles wrapped variants.
Custom gated-stream test mechanism is justified — httpmock's buffered bodies can't reproduce the mid-stream cancellation race because the full body is in memory by the time the reader runs.

sonarqubecloud · 2026-05-26T13:31:48Z

Quality Gate passed

Issues
0 New issues
0 Accepted issues

Measures
0 Security Hotspots
100.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarQube Cloud

icellan requested review from Copilot, freemans13 and oskarszoon May 26, 2026 12:24

Copilot started reviewing on behalf of icellan May 26, 2026 12:24 View session

Copilot AI reviewed May 26, 2026

View reviewed changes

icellan force-pushed the fix/subtreedata-fetch-no-cascade branch from 91fe6e6 to f397f89 Compare May 26, 2026 12:52

icellan self-assigned this May 26, 2026

Merge branch 'main' into fix/subtreedata-fetch-no-cascade

82ec4a6

oskarszoon approved these changes May 26, 2026

View reviewed changes

freemans13 approved these changes May 26, 2026

View reviewed changes

icellan merged commit 599f644 into bsv-blockchain:main May 26, 2026
25 checks passed

icellan mentioned this pull request Jun 12, 2026

fix(asset): admission control for subtree_data + 503 retry in peer callers #830

Closed

13 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(subtree_data): stop cancel cascade + segregate client-gone errors#947

fix(subtree_data): stop cancel cascade + segregate client-gone errors#947
icellan merged 2 commits into
bsv-blockchain:mainfrom
icellan:fix/subtreedata-fetch-no-cascade

icellan commented May 26, 2026 •

edited

Loading

Uh oh!

Copilot AI left a comment

Uh oh!

github-actions Bot commented May 26, 2026 •

edited

Loading

Uh oh!

github-actions Bot commented May 26, 2026 •

edited

Loading

Uh oh!

oskarszoon left a comment

Uh oh!

sonarqubecloud Bot commented May 26, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

icellan commented May 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Bug 1 (consumer): cascading cancellation in both fan-outs

Bug 2 (producer): `write_failed` conflates client disconnects with server faults

Files

Tests

Out of scope (follow-ups)

Test plan

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

github-actions Bot commented May 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

github-actions Bot commented May 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Benchmark Comparison Report

Summary

Uh oh!

oskarszoon left a comment

Choose a reason for hiding this comment

Coordinate before merge

Nits (non-blocking)

Confirmations worth recording

Uh oh!

sonarqubecloud Bot commented May 26, 2026

Quality Gate passed

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

icellan commented May 26, 2026 •

edited

Loading

github-actions Bot commented May 26, 2026 •

edited

Loading

github-actions Bot commented May 26, 2026 •

edited

Loading