Skip to content

fix(subtree_data): stop cancel cascade + segregate client-gone errors#947

Merged
icellan merged 2 commits into
bsv-blockchain:mainfrom
icellan:fix/subtreedata-fetch-no-cascade
May 26, 2026
Merged

fix(subtree_data): stop cancel cascade + segregate client-gone errors#947
icellan merged 2 commits into
bsv-blockchain:mainfrom
icellan:fix/subtreedata-fetch-no-cascade

Conversation

@icellan

@icellan icellan commented May 26, 2026

Copy link
Copy Markdown
Contributor

Summary

Under catchup load — block-validator pulling many subtrees from a peer in parallel — the asset server logs an avalanche of `io: read/write on closed pipe` warnings. Measured rate on the scale-1 cluster: 4832 `write_failed` vs 82 `on_demand_created_locked` successes per ~1h window across 4 asset pods (~98% failure). Two coupled bugs, both with the same shape.

Bug 1 (consumer): cascading cancellation in both fan-outs

Two places fan out subtree_data fetches under `errgroup.WithContext`:

  • `services/subtreevalidation/check_block_subtrees.go` (validation path)
  • `services/blockvalidation/get_blocks.go` (catchup path)

When any one goroutine returns an error, `gCtx` is cancelled and every in-flight HTTP body `Read` returns `ctx.Err()`. Each client then closes its TCP connection. nginx (default `proxy_ignore_client_abort off`) propagates that close upstream. On the peer:

  • `c.Response().Write` fails → `c.Stream` returns → `defer r.Close()` closes the pipe reader
  • The producer goroutine's next write into `io.MultiWriter` returns `io.ErrClosedPipe`
  • `storer.Abort(err)` discards already-paid Aerospike work

One failure → tens of cancelled streams across the pod fleet, every failed attempt wastes the chunked tx-meta fetch it already completed.

Fix: keep the errgroup but detach the heavy subtree_data work from `gCtx`. The existence check still respects the original context (so a pre-cancelled call exits fast); the HTTP fetch + parse + store does not. Sibling failures no longer cancel in-flight peers; each fetch completes naturally or hits its own `http_streaming_timeout`. The peer finishes writing its subtreeData file, which becomes a cache hit on the next retry. Trade-off: batch failure detection waits for in-flight peers to drain instead of bailing early — acceptable because the per-fetch streaming timeout still bounds it, and the alternative is throwing away pipeline work on every batch failure.

Bug 2 (producer): `write_failed` conflates client disconnects with server faults

`services/asset/repository/GetSubtreeData.go` logged every producer-side error at WARN under metric reason `write_failed`. `io.ErrClosedPipe` and context cancellation are caller-side conditions, not server faults — they should not pollute the ops signal that ought to flag genuine write failures.

Fix: detect `errors.IsContextError(err) || errors.Is(err, io.ErrClosedPipe)` and log at debug under a new metric reason `client_gone`. WARN + `write_failed` is now reserved for actual server-side write failures.

Files

  • `services/subtreevalidation/check_block_subtrees.go` — pass `ctx` (not `gCtx`) to the subtree_data fetch + parser
  • `services/blockvalidation/get_blocks.go` — detach `ctx` via `context.WithoutCancel` after the existence check in `fetchAndStoreSubtreeData`
  • `services/asset/repository/GetSubtreeData.go` — classify `client_gone` separately from `write_failed`
  • `services/asset/repository/metrics.go` — document the new `client_gone` reason
  • `services/{asset/repository,subtreevalidation,blockvalidation}/*_test.go` — TDD regression tests

Tests

Written TDD-style — confirmed each test FAILS on pre-fix code and PASSES post-fix.

The cascade regression tests use a custom streaming response body that respects `req.Context()` between halves and waits for cancellation to propagate. httpmock's default buffered bodies cannot exercise the race because the full body is in memory by the time the reader runs; we needed an actual gated stream to prove that mid-stream cancellation truncates the body pre-fix and does not post-fix.

  • `TestCheckBlockSubtrees_SiblingFailureDoesNotCancelInFlight` — subtree A succeeds streaming, sibling B fails fast; A's `FileTypeSubtreeData` file must be in the store afterwards.
  • `TestFetchSubtreeDataForBlock_SiblingFailureDoesNotCancelInFlight` — same shape, catchup path.
  • `TestGetSubtreeDataWithReader/client disconnect during stream is classified as client_gone, not write_failed` — producer-side classification.
  • `TestGetSubtreeDataWithReader/genuine server-side write error still records write_failed` — companion test confirming WARN+`write_failed` is preserved for genuine faults (e.g. missing tx meta).

Out of scope (follow-ups)

  • Settings-level mitigations (`asset_concurrency_subtree_data_create`, `asset_subtreeDataStreamingChunkSize`, nginx `proxy_ignore_client_abort on`) are operational tuning, separate concerns.
  • The asset-side `storer.Abort` on `client_gone` still throws away the file in-progress. Decoupling the storer write from the response pipe (so a client disconnect lets the file finish to disk) is a bigger refactor and a separate PR.

Test plan

  • `go test ./services/asset/repository/...` — pass
  • `go test ./services/subtreevalidation/...` — pass
  • `go test ./services/blockvalidation/...` — pass (incl. catchup subpackage)
  • `go test -race ./services/asset/repository/...` — pass
  • `go vet` on all three packages — clean
  • `golangci-lint run --new-from-rev=origin/main` — clean
  • Verify on scale-1 cluster: `teranode_asset_repository_subtree_data_created_total{result="error",reason="write_failed"}` rate drops to near-zero; `{reason="client_gone"}` absorbs the volume; `io: read/write on closed pipe` WARN lines disappear from the asset logs

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR addresses subtree_data streaming failures under catchup load by preventing cancellation cascades during parallel subtree fetch/processing, and by improving observability so client/proxy disconnects are not counted as server write faults.

Changes:

  • Prevent errgroup sibling failures from cancelling in-flight /subtree_data HTTP streams by using the parent ctx (not gCtx) for the streaming fetch + processing.
  • Classify mid-stream disconnect conditions (context cancellation / io.ErrClosedPipe) as client_gone instead of write_failed, and reduce log severity to debug for those cases.
  • Add a regression test asserting client_gone increments and write_failed does not for an early reader close.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated no comments.

File Description
services/subtreevalidation/check_block_subtrees.go Detach subtree_data streaming work from errgroup cancellation to avoid cancel cascades during batch parallelism.
services/asset/repository/GetSubtreeData.go Split “client disconnect” errors into client_gone metric + debug logging; reserve write_failed for real server-side faults.
services/asset/repository/metrics.go Document the new client_gone label value for the subtree_data_created metric.
services/asset/repository/GetSubtreeData_test.go Add regression coverage for client_gone classification on early stream close.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Under catchup load — block-validator pulling many subtrees from a peer in
parallel — the asset server logs an avalanche of
"io: read/write on closed pipe" warnings. Measured at ~98% of all on-demand
subtreeData creations in scale-1 (4832 write_failed vs 82 successes per
~1h window across 4 asset pods). Two coupled bugs, both with the same
shape:

CONSUMER SIDE (cascade-cancel)

services/subtreevalidation/check_block_subtrees.go and
services/blockvalidation/get_blocks.go both fan out subtree_data fetches
under errgroup.WithContext. When any one goroutine errors, gCtx is
cancelled and every in-flight HTTP body Read returns ctx.Err(). Each
client then closes its TCP connection to the peer's asset-cache. nginx
(default proxy_ignore_client_abort off) propagates the close upstream.
The peer's c.Response().Write fails, c.Stream unwinds, defer r.Close()
closes the response pipe, and the producer goroutine's next write into
io.MultiWriter returns io.ErrClosedPipe — at which point storer.Abort
discards already-paid Aerospike work. One failure cascades to tens of
in-flight streams across the pod fleet.

Fix (both files): keep the errgroup but detach the heavy subtree_data
work from gCtx. In check_block_subtrees.go this means passing the parent
ctx (not gCtx) to DoHTTPRequestBodyReaderWithRetry and
processSubtreeDataStream. In get_blocks.go the call paths through
fetchAndStoreSubtreeData; once the existence check has passed (still
respecting gCtx so a pre-cancelled call exits fast), the function
context.WithoutCancel(ctx)s itself so the subsequent HTTP fetch + parse +
store run to completion. Sibling failures no longer cancel in-flight
peers; each fetch completes naturally or hits its own
http_streaming_timeout. The peer finishes writing its subtreeData file,
which becomes a cache hit on the next retry. Trade-off: batch failure
detection waits for in-flight peers instead of cancelling early —
acceptable because the per-fetch streaming timeout still bounds it, and
the alternative is throwing away pipeline work on every batch failure.

PRODUCER SIDE (error classification)

services/asset/repository/GetSubtreeData.go logged every producer error
at WARN with metric reason "write_failed", conflating client disconnects
with genuine server-side faults. io.ErrClosedPipe and ctx cancellation
are caller-side conditions, not server faults; they should not pollute
the ops signal that ought to flag genuine write failures.

Fix: detect errors.IsContextError(err) || errors.Is(err, io.ErrClosedPipe)
and log at debug under a new metric reason "client_gone". WARN + reason
"write_failed" is now reserved for actual server-side write failures.

TESTS

TDD regression tests with a custom streaming response body that respects
req.Context() between halves, so cancellation can actually truncate the
body mid-stream (httpmock's default buffered bodies cannot exercise the
race). Confirmed each test fails on pre-fix code and passes post-fix:
  - TestCheckBlockSubtrees_SiblingFailureDoesNotCancelInFlight
  - TestFetchSubtreeDataForBlock_SiblingFailureDoesNotCancelInFlight
Plus producer-side classification tests:
  - TestGetSubtreeDataWithReader/"client disconnect ... is classified as
    client_gone, not write_failed"
  - TestGetSubtreeDataWithReader/"genuine server-side write error still
    records write_failed"

No behaviour change for the success path or for genuine fetch/storage
failures.
@icellan icellan force-pushed the fix/subtreedata-fetch-no-cascade branch from 91fe6e6 to f397f89 Compare May 26, 2026 12:52
@github-actions

github-actions Bot commented May 26, 2026

Copy link
Copy Markdown
Contributor

🤖 Claude Code Review

Status: Complete


Current Review:

This PR addresses two coupled bugs causing the "catchup avalanche" under load. The fix is well-designed with strong test coverage. No critical issues found.

Minor observation: In check_block_subtrees.go:341, the code still uses gCtx for DecoupleTracingSpan, while the comment mentions ensuring tracking completes even if parent context is cancelled. The existing DecoupleTracingSpan likely already handles this, but worth verifying that the decoupling intent is preserved despite passing gCtx rather than ctx.

Overall, this is high-quality defensive engineering with thorough TDD-style tests that fail pre-fix and pass post-fix.

@github-actions

github-actions Bot commented May 26, 2026

Copy link
Copy Markdown
Contributor

Benchmark Comparison Report

Baseline: main (unknown)

Current: PR-947 (c964edc)

Summary

  • Regressions: 0
  • Improvements: 0
  • Unchanged: 144
  • Significance level: p < 0.05
All benchmark results (sec/op)
Benchmark Baseline Current Change p-value
_NewBlockFromBytes-4 1.742µ 1.741µ ~ 1.000
SplitSyncedParentMap_SetIfNotExists/256_buckets-4 61.61n 61.67n ~ 0.600
SplitSyncedParentMap_SetIfNotExists/16_buckets-4 61.69n 61.84n ~ 0.400
SplitSyncedParentMap_SetIfNotExists/1_bucket-4 61.70n 61.70n ~ 0.800
SplitSyncedParentMap_ConcurrentSetIfNotExists/256_buckets... 28.95n 29.70n ~ 0.100
SplitSyncedParentMap_ConcurrentSetIfNotExists/16_buckets_... 49.86n 51.15n ~ 0.100
SplitSyncedParentMap_ConcurrentSetIfNotExists/1_bucket_pa... 107.6n 107.2n ~ 0.400
MiningCandidate_Stringify_Short-4 268.2n 271.5n ~ 0.100
MiningCandidate_Stringify_Long-4 1.887µ 1.895µ ~ 0.700
MiningSolution_Stringify-4 963.1n 967.5n ~ 0.100
BlockInfo_MarshalJSON-4 1.786µ 1.788µ ~ 0.300
NewFromBytes-4 124.8n 125.3n ~ 0.300
AddTxBatchColumnar_Validation-4 2.416µ 2.441µ ~ 1.000
OffsetValidationLoop-4 719.1n 719.6n ~ 1.000
Mine_EasyDifficulty-4 34.13µ 34.05µ ~ 0.700
Mine_WithAddress-4 3.776µ 3.734µ ~ 0.100
BlockAssembler_AddTx-4 0.02889n 0.02797n ~ 1.000
AddNode-4 12.08 11.47 ~ 0.100
AddNodeWithMap-4 11.55 12.45 ~ 0.700
DirectSubtreeAdd/4_per_subtree-4 55.15n 58.36n ~ 0.200
DirectSubtreeAdd/64_per_subtree-4 29.28n 29.02n ~ 0.700
DirectSubtreeAdd/256_per_subtree-4 27.76n 28.30n ~ 0.400
DirectSubtreeAdd/1024_per_subtree-4 26.46n 26.48n ~ 0.800
DirectSubtreeAdd/2048_per_subtree-4 26.08n 26.19n ~ 0.700
SubtreeProcessorAdd/4_per_subtree-4 296.2n 296.9n ~ 1.000
SubtreeProcessorAdd/64_per_subtree-4 292.4n 303.3n ~ 0.400
SubtreeProcessorAdd/256_per_subtree-4 286.1n 291.6n ~ 0.100
SubtreeProcessorAdd/1024_per_subtree-4 278.8n 285.2n ~ 0.100
SubtreeProcessorAdd/2048_per_subtree-4 280.8n 285.0n ~ 0.100
SubtreeProcessorRotate/4_per_subtree-4 287.4n 294.3n ~ 0.200
SubtreeProcessorRotate/64_per_subtree-4 279.1n 290.1n ~ 0.100
SubtreeProcessorRotate/256_per_subtree-4 281.3n 283.6n ~ 1.000
SubtreeProcessorRotate/1024_per_subtree-4 281.6n 282.2n ~ 1.000
SubtreeNodeAddOnly/4_per_subtree-4 55.31n 54.98n ~ 0.700
SubtreeNodeAddOnly/64_per_subtree-4 35.98n 35.98n ~ 1.000
SubtreeNodeAddOnly/256_per_subtree-4 35.19n 34.96n ~ 0.100
SubtreeNodeAddOnly/1024_per_subtree-4 34.52n 34.33n ~ 0.700
SubtreeCreationOnly/4_per_subtree-4 110.2n 110.4n ~ 1.000
SubtreeCreationOnly/64_per_subtree-4 352.0n 350.6n ~ 0.400
SubtreeCreationOnly/256_per_subtree-4 1.238µ 1.242µ ~ 0.400
SubtreeCreationOnly/1024_per_subtree-4 3.874µ 3.797µ ~ 0.200
SubtreeCreationOnly/2048_per_subtree-4 6.904µ 6.944µ ~ 0.700
SubtreeProcessorOverheadBreakdown/64_per_subtree-4 281.0n 284.9n ~ 0.100
SubtreeProcessorOverheadBreakdown/1024_per_subtree-4 280.5n 293.1n ~ 0.100
ParallelGetAndSetIfNotExists/1k_nodes-4 2.014m 2.024m ~ 0.700
ParallelGetAndSetIfNotExists/10k_nodes-4 5.136m 5.174m ~ 0.100
ParallelGetAndSetIfNotExists/50k_nodes-4 7.433m 7.426m ~ 0.700
ParallelGetAndSetIfNotExists/100k_nodes-4 10.31m 10.08m ~ 0.100
SequentialGetAndSetIfNotExists/1k_nodes-4 1.785m 1.796m ~ 1.000
SequentialGetAndSetIfNotExists/10k_nodes-4 4.585m 4.552m ~ 0.700
SequentialGetAndSetIfNotExists/50k_nodes-4 14.55m 13.95m ~ 0.100
SequentialGetAndSetIfNotExists/100k_nodes-4 27.81m 25.32m ~ 0.100
ProcessOwnBlockSubtreeNodesParallel/1k_nodes-4 2.089m 2.071m ~ 1.000
ProcessOwnBlockSubtreeNodesParallel/10k_nodes-4 8.446m 8.292m ~ 0.100
ProcessOwnBlockSubtreeNodesParallel/100k_nodes-4 13.82m 13.51m ~ 0.100
ProcessOwnBlockSubtreeNodesSequential/1k_nodes-4 1.824m 1.801m ~ 0.400
ProcessOwnBlockSubtreeNodesSequential/10k_nodes-4 8.605m 8.236m ~ 0.200
ProcessOwnBlockSubtreeNodesSequential/100k_nodes-4 50.53m 48.48m ~ 0.700
DiskTxMap_SetIfNotExists-4 3.975µ 4.023µ ~ 1.000
DiskTxMap_SetIfNotExists_Parallel-4 3.666µ 3.601µ ~ 0.100
DiskTxMap_ExistenceOnly-4 544.4n 419.6n ~ 1.000
Queue-4 187.2n 187.0n ~ 0.500
AtomicPointer-4 3.640n 3.637n ~ 1.000
ReorgOptimizations/DedupFilterPipeline/Old/10K-4 835.5µ 856.8µ ~ 0.200
ReorgOptimizations/DedupFilterPipeline/New/10K-4 764.7µ 777.2µ ~ 0.100
ReorgOptimizations/AllMarkFalse/Old/10K-4 105.3µ 106.4µ ~ 0.400
ReorgOptimizations/AllMarkFalse/New/10K-4 65.50µ 64.43µ ~ 0.200
ReorgOptimizations/HashSlicePool/Old/10K-4 51.81µ 52.05µ ~ 0.700
ReorgOptimizations/HashSlicePool/New/10K-4 11.03µ 11.33µ ~ 0.200
ReorgOptimizations/NodeFlags/Old/10K-4 4.570µ 4.816µ ~ 0.100
ReorgOptimizations/NodeFlags/New/10K-4 1.595µ 2.489µ ~ 0.100
ReorgOptimizations/DedupFilterPipeline/Old/100K-4 10.03m 10.22m ~ 0.100
ReorgOptimizations/DedupFilterPipeline/New/100K-4 11.44m 10.67m ~ 0.100
ReorgOptimizations/AllMarkFalse/Old/100K-4 1.099m 1.099m ~ 1.000
ReorgOptimizations/AllMarkFalse/New/100K-4 705.0µ 709.6µ ~ 0.200
ReorgOptimizations/HashSlicePool/Old/100K-4 607.1µ 522.6µ ~ 0.100
ReorgOptimizations/HashSlicePool/New/100K-4 205.9µ 215.2µ ~ 0.200
ReorgOptimizations/NodeFlags/Old/100K-4 50.06µ 48.62µ ~ 0.400
ReorgOptimizations/NodeFlags/New/100K-4 17.44µ 16.44µ ~ 0.400
TxMapSetIfNotExists-4 49.68n 49.81n ~ 1.000
TxMapSetIfNotExistsDuplicate-4 41.42n 41.82n ~ 0.400
ChannelSendReceive-4 609.2n 624.6n ~ 0.200
CalcBlockWork-4 366.6n 363.5n ~ 0.100
CalculateWork-4 498.8n 492.5n ~ 0.700
BuildBlockLocatorString_Helpers/Size_10-4 1.337µ 1.350µ ~ 0.100
BuildBlockLocatorString_Helpers/Size_100-4 12.84µ 13.02µ ~ 0.100
BuildBlockLocatorString_Helpers/Size_1000-4 145.6µ 154.1µ ~ 0.700
CatchupWithHeaderCache-4 104.6m 104.5m ~ 1.000
_BufferPoolAllocation/16KB-4 3.917µ 3.942µ ~ 0.700
_BufferPoolAllocation/32KB-4 7.801µ 8.275µ ~ 0.700
_BufferPoolAllocation/64KB-4 16.62µ 16.03µ ~ 0.700
_BufferPoolAllocation/128KB-4 25.74µ 26.66µ ~ 0.700
_BufferPoolAllocation/512KB-4 110.2µ 123.0µ ~ 0.200
_BufferPoolConcurrent/32KB-4 19.58µ 18.44µ ~ 0.400
_BufferPoolConcurrent/64KB-4 30.29µ 30.22µ ~ 0.700
_BufferPoolConcurrent/512KB-4 145.8µ 147.8µ ~ 1.000
_SubtreeDeserializationWithBufferSizes/16KB-4 656.5µ 625.9µ ~ 0.100
_SubtreeDeserializationWithBufferSizes/32KB-4 674.2µ 621.0µ ~ 0.100
_SubtreeDeserializationWithBufferSizes/64KB-4 666.2µ 614.1µ ~ 0.100
_SubtreeDeserializationWithBufferSizes/128KB-4 660.2µ 617.7µ ~ 0.100
_SubtreeDeserializationWithBufferSizes/512KB-4 632.5µ 613.9µ ~ 0.100
_SubtreeDataDeserializationWithBufferSizes/16KB-4 36.49m 36.31m ~ 0.100
_SubtreeDataDeserializationWithBufferSizes/32KB-4 36.26m 36.33m ~ 1.000
_SubtreeDataDeserializationWithBufferSizes/64KB-4 36.18m 36.31m ~ 0.700
_SubtreeDataDeserializationWithBufferSizes/128KB-4 36.18m 36.02m ~ 1.000
_SubtreeDataDeserializationWithBufferSizes/512KB-4 35.82m 35.84m ~ 1.000
_PooledVsNonPooled/Pooled-4 740.8n 651.2n ~ 0.100
_PooledVsNonPooled/NonPooled-4 7.761µ 7.662µ ~ 0.100
_MemoryFootprint/Current_512KB_32concurrent-4 6.449µ 6.406µ ~ 0.100
_MemoryFootprint/Proposed_32KB_32concurrent-4 9.771µ 9.427µ ~ 0.100
_MemoryFootprint/Alternative_64KB_32concurrent-4 9.262µ 8.926µ ~ 0.100
_prepareTxsPerLevel-4 410.9m 412.9m ~ 0.100
_prepareTxsPerLevelOrdered-4 3.577m 3.840m ~ 0.200
_prepareTxsPerLevel_Comparison/Original-4 408.8m 413.9m ~ 0.700
_prepareTxsPerLevel_Comparison/Optimized-4 3.607m 3.856m ~ 0.200
SubtreeSizes/10k_tx_4_per_subtree-4 1.345m 1.359m ~ 0.700
SubtreeSizes/10k_tx_16_per_subtree-4 326.3µ 321.4µ ~ 0.700
SubtreeSizes/10k_tx_64_per_subtree-4 76.82µ 76.68µ ~ 1.000
SubtreeSizes/10k_tx_256_per_subtree-4 19.22µ 19.25µ ~ 0.700
SubtreeSizes/10k_tx_512_per_subtree-4 9.562µ 9.407µ ~ 0.100
SubtreeSizes/10k_tx_1024_per_subtree-4 4.785µ 4.682µ ~ 0.100
SubtreeSizes/10k_tx_2k_per_subtree-4 2.375µ 2.344µ ~ 0.700
BlockSizeScaling/10k_tx_64_per_subtree-4 74.91µ 74.28µ ~ 0.400
BlockSizeScaling/10k_tx_256_per_subtree-4 19.15µ 18.81µ ~ 0.100
BlockSizeScaling/10k_tx_1024_per_subtree-4 4.726µ 4.658µ ~ 0.100
BlockSizeScaling/50k_tx_64_per_subtree-4 396.3µ 394.4µ ~ 0.400
BlockSizeScaling/50k_tx_256_per_subtree-4 95.57µ 94.09µ ~ 0.200
BlockSizeScaling/50k_tx_1024_per_subtree-4 23.52µ 22.86µ ~ 0.100
SubtreeAllocations/small_subtrees_exists_check-4 154.9µ 154.5µ ~ 0.700
SubtreeAllocations/small_subtrees_data_fetch-4 170.7µ 166.6µ ~ 0.400
SubtreeAllocations/small_subtrees_full_validation-4 331.0µ 323.8µ ~ 0.400
SubtreeAllocations/medium_subtrees_exists_check-4 9.176µ 9.296µ ~ 0.700
SubtreeAllocations/medium_subtrees_data_fetch-4 10.09µ 10.06µ ~ 1.000
SubtreeAllocations/medium_subtrees_full_validation-4 19.25µ 19.06µ ~ 0.100
SubtreeAllocations/large_subtrees_exists_check-4 2.232µ 2.191µ ~ 0.200
SubtreeAllocations/large_subtrees_data_fetch-4 2.447µ 2.484µ ~ 0.700
SubtreeAllocations/large_subtrees_full_validation-4 4.856µ 4.706µ ~ 0.200
StoreBlock_Sequential/BelowCSVHeight-4 332.9µ 334.2µ ~ 0.100
StoreBlock_Sequential/AboveCSVHeight-4 334.4µ 335.4µ ~ 0.700
GetUtxoHashes-4 269.0n 263.6n ~ 0.700
GetUtxoHashes_ManyOutputs-4 42.66µ 43.55µ ~ 0.100
_NewMetaDataFromBytes-4 229.7n 233.7n ~ 0.100
_Bytes-4 398.3n 408.2n ~ 0.100
_MetaBytes-4 137.8n 140.4n ~ 0.100

Threshold: >10% with p < 0.05 | Generated: 2026-05-26 13:33 UTC

@icellan icellan self-assigned this May 26, 2026

@oskarszoon oskarszoon left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approve. Code quality is unusually good — WithoutCancel rationale lives in code not just the PR body, tests are real regressions (gated streaming body genuinely fails pre-fix), all -race clean. Three coordination items below; none are blockers.

Coordinate before merge

1. Two callsites use different isolation mechanisms; comment implies equivalence.

  • services/subtreevalidation/check_block_subtrees.go:321 passes the outer ctx (still cancellable by the RPC caller).
  • services/blockvalidation/get_blocks.go:523 uses context.WithoutCancel(ctx) (never cancellable).

Both are correct for their context but the "See companion fix" comment implies they're equivalent. Either tighten the comments to spell out the difference, or unify to WithoutCancel in both.

2. Dashboard / alert compat for write_failed.
Existing alerts thresholded on rate(prometheusAssetSubtreeDataCreated{reason="write_failed"}[5m]) will see ~98% of catchup-time volume drop after deploy. Either re-tune alerts in the change-window, or keep write_failed bumping AND add a separate error_type dimension so existing dashboards stay valid. Worth a release-notes line.

3. DeadlineExceeded now buckets with Canceled as client_gone.
errors.IsContextError(err) matches both context.Canceled and context.DeadlineExceeded. Server-side deadline exhaustion on slow disk writes is now debug-level instead of Warn. Probably fine here (work is bounded by http_streaming_timeout) but the operator-facing semantics shift. Confirm you intended both, not just Canceled.

Nits (non-blocking)

  • Pre-existing (not this PR): net.ErrClosed on some platforms doesn't match IsContextError or io.ErrClosedPipe and lands in write_failed.
  • gatedStreamingBody / gatedStreamingBodyGB — ~35 lines duplicated across the two test files. Not worth lifting until a third caller arrives.
  • storer.Close(gCtx) at GetSubtreeData.go:274 doesn't classify client_gone (only the write path does). Asymmetric but rare.
  • Cascade regression tests don't assert the failing sibling's error name — ErrorContains(t, err, subtreeB.RootHash().String()) would lock regression coverage harder.

Confirmations worth recording

  • WithoutCancel placement is correct: existence check at :502 still uses original ctx. No goroutine leak.
  • Streaming timeout bound: 5 min (http_streaming_timeout) or pre-existing deadline if one was set — doHTTPRequestForStreamingWithRetryAfter only adds the timeout when no deadline exists.
  • g.Wait() still collects errors correctly after the decouple; only delays error return by up to httpStreamingTimeout when a subtree is mid-stream at failure time.
  • Worst-case batch latency: subtreeConcurrency (default 8) fetches each running up to 5 min after a block-level failure before fetchSubtreeDataForBlock returns. Acceptable for catch-up.
  • client_gone classification is comprehensive for the documented cases; IsContextError handles wrapped variants.
  • Custom gated-stream test mechanism is justified — httpmock's buffered bodies can't reproduce the mid-stream cancellation race because the full body is in memory by the time the reader runs.

@sonarqubecloud

Copy link
Copy Markdown

@icellan icellan merged commit 599f644 into bsv-blockchain:main May 26, 2026
25 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants