fix(asset): admission control for subtree_data + 503 retry in peer callers#830
fix(asset): admission control for subtree_data + 503 retry in peer callers#830icellan wants to merge 1 commit into
Conversation
Production asset pods were OOMKilling under load on /api/v1/subtree_data, manifesting downstream as nginx "upstream prematurely closed connection while reading response header from upstream". Goroutine profiles showed 60K+ goroutines accumulating in the chunk-fetch fan-out before SIGKILL. The earlier streaming work bounded per-request memory but did nothing to cap concurrent on-demand subtreeData file creations: each one holds chunkSize tx-metadata batches in memory, multiplied by however many clients arrive at once. With large transactions and slow clients the process trivially exceeds even a 64Gi limit. Asset side - admission control: - New asset_concurrency_subtree_data_create setting (default 4) gates the dualStreamWithFileCreation path with non-blocking TryAcquire. When the cap is reached, requests get HTTP 503 with Retry-After: 1 instead of waiting up to 30s for a permit. - Restructured GetSubtreeDataReader to check Exists first without holding the reader semaphore. File-exists fast path uses the existing reader sem; on-demand creation uses the new create sem. - Defensive cap on the pending chunk map in writeTransactionsViaSubtreeStoreStreaming (2 * concurrency); aborts with a clear error if a future scheduler regression grows it. - ctx check every 256 txs in writeChunkToWriter so a client disconnect releases chunkMetaSlice promptly instead of waiting for the next pipe write to fail. HTTP utility - typed 503 + retry helper: - buildHTTPError now produces errors.ErrServiceUnavailable on 503 so callers can errors.Is on it. - New DoHTTPRequestBodyReaderWithRetry: exponential backoff (250ms -> 5s, max 6 attempts), honors Retry-After header, retries only on 503. Non-503 errors and ctx cancellation return immediately. Callers - retry on peer 503: - subtreevalidation/SubtreeValidation.go (getSubtreeMissingTxs) - subtreevalidation/check_block_subtrees.go (CheckBlockSubtrees) - blockvalidation/get_blocks.go (fetchSubtreeDataFromPeer) Tests: - 7 new unit tests for the retry helper covering success, retry-then- succeed, attempt exhaustion, Retry-After honoring, no-retry on non- 503, ctx cancellation, and parseRetryAfter parsing. Race-clean. Verified: go build, go vet, go test ./util/ -race all clean.
|
🤖 Claude Code Review Status: Complete Review SummaryThis PR implements a critical production hotfix for OOMKilled crashes in the asset service by adding admission control for on-demand subtree data creation and retry logic for 503 responses. Current Review: No issues found The implementation is solid:
The PR description provides excellent production evidence, clear risk analysis, and detailed rollout notes. |
Benchmark Comparison ReportBaseline: Current: Summary
All benchmark results (sec/op)
Threshold: >10% with p < 0.05 | Generated: 2026-05-08 08:21 UTC |
oskarszoon
left a comment
There was a problem hiding this comment.
Approve. Prod-driven design (66k goroutines at OOM, getTxs fan-out amplifier identified precisely), layered fix — non-blocking TryAcquire on a new dedicated semSubtreeDataCreate separate from the reader sem, 503+Retry-After on saturation, typed ErrServiceUnavailable for caller errors.Is, 503-only retry with exponential backoff + Retry-After honoring, pending chunk-map cap, ctx-every-256-txs in the writer. Each layer has a clear purpose.
Three confirmations before merge:
testCI job is failing on the latest run while everything else (smoketest, sequential-{sqlite,postgres,aerospike}, lint, 14 benches) is green. The PR adds 7 new tests inutil/http_test.goso worth confirming this is the same flake hitting other PRs this week vs a real failure.- "Deploy to dev-scale-1 and verify peer-side recovers from saturation" is unchecked in the PR's verification list. Given retry-storm is correctly identified in the Risks section, the load test is the proof-point.
- This targets
feat/teranode-native-ops, not main. Worth a one-line note in the PR description so the next reader doesn't expect it on main and so the eventual main-port stays tracked.
|
Closing as already integrated. This PR's single commit (
A rebase onto the current base produces zero unique commits (the base is a strict superset of this PR), so there is nothing left to merge. |
Summary
Hotfix for production OOMKilled crashloops on the asset service caused by unbounded concurrent on-demand
/subtree_datafile creation.The streaming work in #827 capped per-request memory but did not add admission control. Each
dualStreamWithFileCreationcall holds chunk-sized batches of full transaction metadata in memory; multiplied by however many clients arrive at once and by per-tx size variance (data carriers, multi-output txs), the asset process trivially exceeds even a 64Gi limit. This PR adds the missing admission cap, surfaces it as an HTTP 503 withRetry-After, and teaches the three known peer-side callers (subtreevalidation,blockvalidationcatchup) to back off and retry instead of failing through.Targets
feat/teranode-native-opsbecause thedualStreamWithFileCreationpath being protected lives on that branch (added via #826's cherry-pick chain).Production evidence
The trigger for this work was the nginx-cache error stream:
Investigation on
dev-scale-1-scale-1:kubectl describe podon all 4 asset replicasReason: OOMKilled,Exit Code: 137, restart counts 18-22pprof goroutineon a freshly-started pod (~6 min uptime)go-batcher.SetMaxConcurrent, 1,898 inerrgroup.Group.Go, 103 activeRepository.getTxs*.lockfiles left by previous crashesThe "upstream prematurely closed connection" line is the only externally visible symptom because
SIGKILLdrops the TCP connection before any response headers can flush. Fixing the OOM eliminates it.Root cause
Per-request memory was bounded by #827, but four amplifiers compound:
meta.Data.Txis the full parsedbt.Tx. For OP_RETURN data carriers / multi-output txs it can run 10KB-500KB+. A 10000-tx chunk is 100MB-5GB, not the documented "~5MB".dualStreamWithFileCreationwrites throughio.MultiWriter(storer, httpWriter). Both must accept each byte before the next chunk can write. Slow client OR slow file storer (Ensure atomic file store publication #826's atomic-rename path competing for global write semaphore) makes the producer block, and chunks pile up in flight +resultsChan+pending.writeTransactionsViaSubtreeStoreStreamingis shared byGetSubtreeData,GetLegacyBlock, andmining_candidate_legacy_block— they fan out into the samegetTxsmachinery, multiplying goroutine pressure on Aerospike.ConcurrencyGetSubtreeDataReaderonly caps the reader slot — it doesn't distinguish the cheap file-exists path from the memory-heavy on-demand creation path. With four expensive creations in flight you've already scheduled 50+ chunks worth of tx data in heap.Changes
Asset — admission control
settings/asset_settings.go+settings/settings.goasset_concurrency_subtree_data_create(default 4).services/asset/repository/repository.gosemSubtreeDataCreatesemaphore +tryAcquireSemaphorePermithelper.services/asset/repository/GetSubtreeData.goGetSubtreeDataReader:Existscheck first (no permit held), file-exists fast path usessemGetSubtreeDataReader(bounded by FD count, can be raised), on-demand creation uses non-blockingTryAcquireonsemSubtreeDataCreateand returnsErrServiceUnavailableimmediately when at capacity. NewfileAppearedReadbackhelper handles the "file appeared during setup" race cleanly.services/asset/httpimpl/GetSubtreeData.goErrServiceUnavailable→ HTTP 503 withRetry-After: 1. 404 still 404; everything else still 500.services/asset/repository/GetLegacyBlock.gopendingchunk map (2 × concurrency); aborts with a clear error if a future scheduler regression grows it. ctx check every 256 txs inwriteChunkToWriterso client disconnect releaseschunkMetaSlicepromptly.Util — typed 503 + retry helper
util/http.gobuildHTTPErrornow produces typederrors.ErrServiceUnavailableon 503 (callers canerrors.Is). NewDoHTTPRequestBodyReaderWithRetry: exponential backoff (250ms → 5s, max 6 attempts), honors serverRetry-Afterheader, retries only on 503. Non-503 errors and ctx cancellation return immediately.util/http_test.goCallers — retry on peer 503
Three call sites switched from
DoHTTPRequestBodyReadertoDoHTTPRequestBodyReaderWithRetry:services/subtreevalidation/SubtreeValidation.go—getSubtreeMissingTxsservices/subtreevalidation/check_block_subtrees.go—CheckBlockSubtreesservices/blockvalidation/get_blocks.go—fetchSubtreeDataFromPeer(catchup)Behavior changes — what clients should expect
Asset server
Retry-After: 1instead of waiting up to 30s for a permit (then 503'ing anyway via timeout). The pod stays up and serves all already-created files via the fast path.Peer validation services
Test plan
Unit tests (in this PR)
TestDoHTTPRequestBodyReaderWithRetry_SuccessOnFirstTry— no retry overhead when healthyTestDoHTTPRequestBodyReaderWithRetry_RetriesOn503ThenSucceeds— returns body of successful attempt, not 503 bodyTestDoHTTPRequestBodyReaderWithRetry_ExhaustsAttemptsOnPersistent503— final error is typedErrServiceUnavailableTestDoHTTPRequestBodyReaderWithRetry_HonorsRetryAfter— serverRetry-After: 1overrides much-smallerinitialDelayTestDoHTTPRequestBodyReaderWithRetry_NoRetryOnNon503(4 sub-cases) — 500/502/504/404 fail in 1 attemptTestDoHTTPRequestBodyReaderWithRetry_ContextCancelAbortsRetries— ctx cancel short-circuits the loopTestParseRetryAfter— empty/negative/non-numeric inputs return 0Local verification done
go buildclean acrossutil/,services/asset/,services/subtreevalidation/,services/blockvalidation/go vetclean (only pre-existing warnings intest/utils/unrelated to this change)go test ./util/ -racepasses in ~7sPending verification (cannot do locally)
dev-scale-1and confirm:Risks and rollout notes
asset_concurrency_subtree_data_create=0reverts to unlimited (the prior behavior). Keep a knob in case the cap turns out to be too aggressive./subtree_datathat don't go through the new retry helper will see 503s they didn't see before. The three known internal callers are updated; any external/unknown caller falls back to existing peer-failure handling, which already treats network errors as transient.Retry-Afterhonoring + the 503-only filter (we don't retry on 5xx in general).feat/teranode-native-opsbranch returns to the prior (crashlooping) behavior — only do this if the new behavior is worse than the OOM, which would be surprising.Companion / follow-up work (not in this PR)
asset_subtreeDataStreamingChunkSize(currently 10000) — config-only change, can ship via Helm without a code change.processTxMetaUsingStoreConcurrencyreview —getTxsfan-out is the largest goroutine multiplier; we may want a global cap rather than per-call.