fix(subtree_data): stop cancel cascade + segregate client-gone errors#947
Conversation
There was a problem hiding this comment.
Pull request overview
This PR addresses subtree_data streaming failures under catchup load by preventing cancellation cascades during parallel subtree fetch/processing, and by improving observability so client/proxy disconnects are not counted as server write faults.
Changes:
- Prevent errgroup sibling failures from cancelling in-flight
/subtree_dataHTTP streams by using the parentctx(notgCtx) for the streaming fetch + processing. - Classify mid-stream disconnect conditions (
contextcancellation /io.ErrClosedPipe) asclient_goneinstead ofwrite_failed, and reduce log severity to debug for those cases. - Add a regression test asserting
client_goneincrements andwrite_faileddoes not for an early reader close.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| services/subtreevalidation/check_block_subtrees.go | Detach subtree_data streaming work from errgroup cancellation to avoid cancel cascades during batch parallelism. |
| services/asset/repository/GetSubtreeData.go | Split “client disconnect” errors into client_gone metric + debug logging; reserve write_failed for real server-side faults. |
| services/asset/repository/metrics.go | Document the new client_gone label value for the subtree_data_created metric. |
| services/asset/repository/GetSubtreeData_test.go | Add regression coverage for client_gone classification on early stream close. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Under catchup load — block-validator pulling many subtrees from a peer in
parallel — the asset server logs an avalanche of
"io: read/write on closed pipe" warnings. Measured at ~98% of all on-demand
subtreeData creations in scale-1 (4832 write_failed vs 82 successes per
~1h window across 4 asset pods). Two coupled bugs, both with the same
shape:
CONSUMER SIDE (cascade-cancel)
services/subtreevalidation/check_block_subtrees.go and
services/blockvalidation/get_blocks.go both fan out subtree_data fetches
under errgroup.WithContext. When any one goroutine errors, gCtx is
cancelled and every in-flight HTTP body Read returns ctx.Err(). Each
client then closes its TCP connection to the peer's asset-cache. nginx
(default proxy_ignore_client_abort off) propagates the close upstream.
The peer's c.Response().Write fails, c.Stream unwinds, defer r.Close()
closes the response pipe, and the producer goroutine's next write into
io.MultiWriter returns io.ErrClosedPipe — at which point storer.Abort
discards already-paid Aerospike work. One failure cascades to tens of
in-flight streams across the pod fleet.
Fix (both files): keep the errgroup but detach the heavy subtree_data
work from gCtx. In check_block_subtrees.go this means passing the parent
ctx (not gCtx) to DoHTTPRequestBodyReaderWithRetry and
processSubtreeDataStream. In get_blocks.go the call paths through
fetchAndStoreSubtreeData; once the existence check has passed (still
respecting gCtx so a pre-cancelled call exits fast), the function
context.WithoutCancel(ctx)s itself so the subsequent HTTP fetch + parse +
store run to completion. Sibling failures no longer cancel in-flight
peers; each fetch completes naturally or hits its own
http_streaming_timeout. The peer finishes writing its subtreeData file,
which becomes a cache hit on the next retry. Trade-off: batch failure
detection waits for in-flight peers instead of cancelling early —
acceptable because the per-fetch streaming timeout still bounds it, and
the alternative is throwing away pipeline work on every batch failure.
PRODUCER SIDE (error classification)
services/asset/repository/GetSubtreeData.go logged every producer error
at WARN with metric reason "write_failed", conflating client disconnects
with genuine server-side faults. io.ErrClosedPipe and ctx cancellation
are caller-side conditions, not server faults; they should not pollute
the ops signal that ought to flag genuine write failures.
Fix: detect errors.IsContextError(err) || errors.Is(err, io.ErrClosedPipe)
and log at debug under a new metric reason "client_gone". WARN + reason
"write_failed" is now reserved for actual server-side write failures.
TESTS
TDD regression tests with a custom streaming response body that respects
req.Context() between halves, so cancellation can actually truncate the
body mid-stream (httpmock's default buffered bodies cannot exercise the
race). Confirmed each test fails on pre-fix code and passes post-fix:
- TestCheckBlockSubtrees_SiblingFailureDoesNotCancelInFlight
- TestFetchSubtreeDataForBlock_SiblingFailureDoesNotCancelInFlight
Plus producer-side classification tests:
- TestGetSubtreeDataWithReader/"client disconnect ... is classified as
client_gone, not write_failed"
- TestGetSubtreeDataWithReader/"genuine server-side write error still
records write_failed"
No behaviour change for the success path or for genuine fetch/storage
failures.
91fe6e6 to
f397f89
Compare
|
🤖 Claude Code Review Status: Complete Current Review: This PR addresses two coupled bugs causing the "catchup avalanche" under load. The fix is well-designed with strong test coverage. No critical issues found. Minor observation: In Overall, this is high-quality defensive engineering with thorough TDD-style tests that fail pre-fix and pass post-fix. |
Benchmark Comparison ReportBaseline: Current: Summary
All benchmark results (sec/op)
Threshold: >10% with p < 0.05 | Generated: 2026-05-26 13:33 UTC |
oskarszoon
left a comment
There was a problem hiding this comment.
Approve. Code quality is unusually good — WithoutCancel rationale lives in code not just the PR body, tests are real regressions (gated streaming body genuinely fails pre-fix), all -race clean. Three coordination items below; none are blockers.
Coordinate before merge
1. Two callsites use different isolation mechanisms; comment implies equivalence.
services/subtreevalidation/check_block_subtrees.go:321passes the outerctx(still cancellable by the RPC caller).services/blockvalidation/get_blocks.go:523usescontext.WithoutCancel(ctx)(never cancellable).
Both are correct for their context but the "See companion fix" comment implies they're equivalent. Either tighten the comments to spell out the difference, or unify to WithoutCancel in both.
2. Dashboard / alert compat for write_failed.
Existing alerts thresholded on rate(prometheusAssetSubtreeDataCreated{reason="write_failed"}[5m]) will see ~98% of catchup-time volume drop after deploy. Either re-tune alerts in the change-window, or keep write_failed bumping AND add a separate error_type dimension so existing dashboards stay valid. Worth a release-notes line.
3. DeadlineExceeded now buckets with Canceled as client_gone.
errors.IsContextError(err) matches both context.Canceled and context.DeadlineExceeded. Server-side deadline exhaustion on slow disk writes is now debug-level instead of Warn. Probably fine here (work is bounded by http_streaming_timeout) but the operator-facing semantics shift. Confirm you intended both, not just Canceled.
Nits (non-blocking)
- Pre-existing (not this PR):
net.ErrClosedon some platforms doesn't matchIsContextErrororio.ErrClosedPipeand lands inwrite_failed. gatedStreamingBody/gatedStreamingBodyGB— ~35 lines duplicated across the two test files. Not worth lifting until a third caller arrives.storer.Close(gCtx)atGetSubtreeData.go:274doesn't classifyclient_gone(only the write path does). Asymmetric but rare.- Cascade regression tests don't assert the failing sibling's error name —
ErrorContains(t, err, subtreeB.RootHash().String())would lock regression coverage harder.
Confirmations worth recording
WithoutCancelplacement is correct: existence check at:502still uses original ctx. No goroutine leak.- Streaming timeout bound: 5 min (
http_streaming_timeout) or pre-existing deadline if one was set —doHTTPRequestForStreamingWithRetryAfteronly adds the timeout when no deadline exists. g.Wait()still collects errors correctly after the decouple; only delays error return by up tohttpStreamingTimeoutwhen a subtree is mid-stream at failure time.- Worst-case batch latency:
subtreeConcurrency(default 8) fetches each running up to 5 min after a block-level failure beforefetchSubtreeDataForBlockreturns. Acceptable for catch-up. client_goneclassification is comprehensive for the documented cases;IsContextErrorhandles wrapped variants.- Custom gated-stream test mechanism is justified — httpmock's buffered bodies can't reproduce the mid-stream cancellation race because the full body is in memory by the time the reader runs.
|



Summary
Under catchup load — block-validator pulling many subtrees from a peer in parallel — the asset server logs an avalanche of `io: read/write on closed pipe` warnings. Measured rate on the scale-1 cluster: 4832 `write_failed` vs 82 `on_demand_created_locked` successes per ~1h window across 4 asset pods (~98% failure). Two coupled bugs, both with the same shape.
Bug 1 (consumer): cascading cancellation in both fan-outs
Two places fan out subtree_data fetches under `errgroup.WithContext`:
When any one goroutine returns an error, `gCtx` is cancelled and every in-flight HTTP body `Read` returns `ctx.Err()`. Each client then closes its TCP connection. nginx (default `proxy_ignore_client_abort off`) propagates that close upstream. On the peer:
One failure → tens of cancelled streams across the pod fleet, every failed attempt wastes the chunked tx-meta fetch it already completed.
Fix: keep the errgroup but detach the heavy subtree_data work from `gCtx`. The existence check still respects the original context (so a pre-cancelled call exits fast); the HTTP fetch + parse + store does not. Sibling failures no longer cancel in-flight peers; each fetch completes naturally or hits its own `http_streaming_timeout`. The peer finishes writing its subtreeData file, which becomes a cache hit on the next retry. Trade-off: batch failure detection waits for in-flight peers to drain instead of bailing early — acceptable because the per-fetch streaming timeout still bounds it, and the alternative is throwing away pipeline work on every batch failure.
Bug 2 (producer): `write_failed` conflates client disconnects with server faults
`services/asset/repository/GetSubtreeData.go` logged every producer-side error at WARN under metric reason `write_failed`. `io.ErrClosedPipe` and context cancellation are caller-side conditions, not server faults — they should not pollute the ops signal that ought to flag genuine write failures.
Fix: detect `errors.IsContextError(err) || errors.Is(err, io.ErrClosedPipe)` and log at debug under a new metric reason `client_gone`. WARN + `write_failed` is now reserved for actual server-side write failures.
Files
Tests
Written TDD-style — confirmed each test FAILS on pre-fix code and PASSES post-fix.
The cascade regression tests use a custom streaming response body that respects `req.Context()` between halves and waits for cancellation to propagate. httpmock's default buffered bodies cannot exercise the race because the full body is in memory by the time the reader runs; we needed an actual gated stream to prove that mid-stream cancellation truncates the body pre-fix and does not post-fix.
Out of scope (follow-ups)
Test plan