fix(asset): admission control for subtree_data + 503 retry in peer callers by icellan · Pull Request #830 · bsv-blockchain/teranode

icellan · 2026-05-08T08:05:47Z

Summary

Hotfix for production OOMKilled crashloops on the asset service caused by unbounded concurrent on-demand /subtree_data file creation.

The streaming work in #827 capped per-request memory but did not add admission control. Each dualStreamWithFileCreation call holds chunk-sized batches of full transaction metadata in memory; multiplied by however many clients arrive at once and by per-tx size variance (data carriers, multi-output txs), the asset process easily exceeds even a 64Gi limit. This PR adds the missing admission cap, surfaces it as an HTTP 503 with Retry-After, and teaches the three known peer-side call sites (two in subtreevalidation, one in blockvalidation catchup) to back off and retry instead of failing through.

Targets feat/teranode-native-ops because the dualStreamWithFileCreation path being protected lives on that branch (added via #826's cherry-pick chain).

Production evidence

The trigger for this work was the nginx-cache error stream:

upstream prematurely closed connection while reading response header from upstream,
client: 10.224.5.49, server: , request: "GET /api/v1/subtree_data/746a4271...",
upstream: "http://10.96.115.8:8090/api/v1/subtree_data/746a4271..."

Investigation on dev-scale-1-scale-1:

Signal	Observation
`kubectl describe pod` on all 4 asset replicas	`Reason: OOMKilled`, `Exit Code: 137`, restart counts 18-22
Pod lifespan	60-90s between restart and next OOM
Memory limit	64Gi (already huge)
`pprof goroutine` on a freshly-started pod (~6 min uptime)	66,355 goroutines — top groups: 3,616 in `go-batcher.SetMaxConcurrent`, 1,898 in `errgroup.Group.Go`, 103 active `Repository.getTxs`
Quorum lock dir	hundreds of stale `*.lock` files left by previous crashes

The "upstream prematurely closed connection" line is the only externally visible symptom because SIGKILL drops the TCP connection before any response headers can flush. Fixing the OOM eliminates it.

Root cause

Per-request memory was bounded by #827, but four amplifiers compound:

meta.Data.Tx is the full parsed bt.Tx. For OP_RETURN data carriers / multi-output txs it can run 10KB-500KB+. A 10000-tx chunk is 100MB-5GB, not the documented "~5MB".
dualStreamWithFileCreation writes through io.MultiWriter(storer, httpWriter). Both must accept each byte before the next chunk can write. A slow client or a slow file storer (the atomic-rename path from #826, "Ensure atomic file store publication", competing for the global write semaphore) makes the producer block, and chunks pile up in flight, in resultsChan, and in pending.
writeTransactionsViaSubtreeStoreStreaming is shared by GetSubtreeData, GetLegacyBlock, and mining_candidate_legacy_block — they fan out into the same getTxs machinery, multiplying goroutine pressure on Aerospike.
ConcurrencyGetSubtreeDataReader only caps the reader slot; it doesn't distinguish the cheap file-exists path from the memory-heavy on-demand creation path. With four expensive creations in flight you've already scheduled 50+ chunks' worth of tx data on the heap.
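The chunk-size range quoted in the first amplifier can be checked with a few lines of Go. This is a back-of-envelope sketch only: chunkBytes is a hypothetical helper (not a teranode function), it uses decimal units (1KB = 1,000 bytes), and it ignores per-object overhead.

```go
package main

import "fmt"

// chunkBytes is a hypothetical helper: it estimates the heap held by one
// in-flight chunk as txCount * perTxBytes, ignoring allocator overhead.
func chunkBytes(txCount, perTxBytes int64) int64 {
	return txCount * perTxBytes
}

func main() {
	const chunkSize = 10_000 // txs per chunk, the figure quoted above

	low := chunkBytes(chunkSize, 10_000)   // 10KB per tx
	high := chunkBytes(chunkSize, 500_000) // 500KB per tx

	fmt.Printf("low=%dMB high=%dGB\n", low/1_000_000, high/1_000_000_000)
}
```

Multiplying either bound by the number of chunks scheduled per request, and again by the number of concurrent requests, gives the total the admission cap is meant to bound.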

Changes

Asset — admission control

File	What
`settings/asset_settings.go` + `settings/settings.go`	New setting `asset_concurrency_subtree_data_create` (default 4).
`services/asset/repository/repository.go`	New `semSubtreeDataCreate` semaphore + `tryAcquireSemaphorePermit` helper.
`services/asset/repository/GetSubtreeData.go`	Restructured `GetSubtreeDataReader`: `Exists` check first (no permit held), file-exists fast path uses `semGetSubtreeDataReader` (bounded by FD count, can be raised), on-demand creation uses non-blocking `TryAcquire` on `semSubtreeDataCreate` and returns `ErrServiceUnavailable` immediately when at capacity. New `fileAppearedReadback` helper handles the "file appeared during setup" race cleanly.
`services/asset/httpimpl/GetSubtreeData.go`	Maps `ErrServiceUnavailable` → HTTP 503 with `Retry-After: 1`. 404 still 404; everything else still 500.
`services/asset/repository/GetLegacyBlock.go`	Defensive cap on `pending` chunk map (`2 × concurrency`); aborts with a clear error if a future scheduler regression grows it. ctx check every 256 txs in `writeChunkToWriter` so client disconnect releases `chunkMetaSlice` promptly.
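The non-blocking admission pattern described in the table can be sketched with a buffered channel. This is a minimal illustration, not the code in repository.go: createLimiter, newCreateLimiter and the local errServiceUnavailable are hypothetical names, and treating a limit of 0 as "unlimited" follows the rollout note for asset_concurrency_subtree_data_create.

```go
package main

import (
	"errors"
	"fmt"
)

// errServiceUnavailable stands in for the typed error named in the PR; the
// real one lives in teranode's errors package.
var errServiceUnavailable = errors.New("service unavailable")

// createLimiter is a hypothetical channel-backed semaphore. A nil slots
// channel means "no limit", mirroring a setting value of 0.
type createLimiter struct {
	slots chan struct{}
}

func newCreateLimiter(limit int) *createLimiter {
	if limit <= 0 {
		return &createLimiter{}
	}
	return &createLimiter{slots: make(chan struct{}, limit)}
}

// tryAcquire never blocks. It returns a release func on success, or
// errServiceUnavailable when every permit is already taken.
func (l *createLimiter) tryAcquire() (release func(), err error) {
	if l.slots == nil {
		return func() {}, nil
	}
	select {
	case l.slots <- struct{}{}:
		return func() { <-l.slots }, nil
	default:
		return nil, errServiceUnavailable
	}
}

func main() {
	lim := newCreateLimiter(4)
	var releases []func()

	// Five simultaneous creations against a cap of four: the fifth is rejected.
	for i := 1; i <= 5; i++ {
		release, err := lim.tryAcquire()
		if err != nil {
			fmt.Printf("creation %d rejected: %v\n", i, err)
			continue
		}
		releases = append(releases, release)
	}
	for _, release := range releases {
		release()
	}
}
```

Each release func must be called exactly once, typically via defer on the creation path. Failing fast here, rather than queueing, is what keeps rejected requests from holding any chunk memory while they wait.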

Util — typed 503 + retry helper

File	What
`util/http.go`	`buildHTTPError` now produces typed `errors.ErrServiceUnavailable` on 503 (callers can `errors.Is`). New `DoHTTPRequestBodyReaderWithRetry`: exponential backoff (250ms → 5s, max 6 attempts), honors server `Retry-After` header, retries only on 503. Non-503 errors and ctx cancellation return immediately.
`util/http_test.go`	7 unit tests / 15 sub-cases. Race-clean.
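The retry behavior described for DoHTTPRequestBodyReaderWithRetry can be sketched as a small loop. This is not the helper's actual code: doWithRetry, retryConfig, attemptFunc and runScripted are hypothetical names, the HTTP call is abstracted behind a callback, and treating Retry-After as a floor on the computed backoff is an assumption about how the two interact.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// Stand-ins for typed errors; the real 503 error lives in teranode's errors package.
var (
	errServiceUnavailable = errors.New("service unavailable")
	errNotFound           = errors.New("not found")
)

// retryConfig is hypothetical; the PR quotes 250ms, 5s and 6 for these fields.
type retryConfig struct {
	initialDelay time.Duration
	maxDelay     time.Duration
	maxAttempts  int
}

// attemptFunc performs one request and reports the server's Retry-After hint
// (0 when absent) alongside any error.
type attemptFunc func(ctx context.Context) (retryAfter time.Duration, err error)

// doWithRetry retries only on errServiceUnavailable and returns the number of
// attempts made together with the final error.
func doWithRetry(ctx context.Context, cfg retryConfig, attempt attemptFunc) (int, error) {
	delay := cfg.initialDelay
	for n := 1; ; n++ {
		retryAfter, err := attempt(ctx)
		if err == nil || !errors.Is(err, errServiceUnavailable) || n >= cfg.maxAttempts {
			return n, err
		}

		// Assumption: the server hint acts as a floor on the computed backoff.
		wait := delay
		if retryAfter > wait {
			wait = retryAfter
		}

		timer := time.NewTimer(wait)
		select {
		case <-ctx.Done():
			timer.Stop()
			return n, ctx.Err()
		case <-timer.C:
		}

		// Double the delay for the next round, capped at maxDelay.
		delay *= 2
		if delay > cfg.maxDelay {
			delay = cfg.maxDelay
		}
	}
}

// runScripted fails the first `failures` attempts with failWith, then succeeds.
// Delays are shortened so the demo finishes quickly.
func runScripted(failures int, failWith error) (int, error) {
	cfg := retryConfig{initialDelay: time.Millisecond, maxDelay: 4 * time.Millisecond, maxAttempts: 6}
	calls := 0
	return doWithRetry(context.Background(), cfg, func(context.Context) (time.Duration, error) {
		calls++
		if calls <= failures {
			return 0, failWith
		}
		return 0, nil
	})
}

func main() {
	fmt.Println(runScripted(2, errServiceUnavailable))  // recovers on the third attempt
	fmt.Println(runScripted(10, errServiceUnavailable)) // gives up after six attempts
	fmt.Println(runScripted(1, errNotFound))            // non-503: no retry
}
```

Returning the final error unwrapped keeps errors.Is checks working for callers that want to distinguish an exhausted 503 from other failures.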

Callers — retry on peer 503

Three call sites switched from DoHTTPRequestBodyReader to DoHTTPRequestBodyReaderWithRetry:

services/subtreevalidation/SubtreeValidation.go — getSubtreeMissingTxs
services/subtreevalidation/check_block_subtrees.go — CheckBlockSubtrees
services/blockvalidation/get_blocks.go — fetchSubtreeDataFromPeer (catchup)

Behavior changes — what clients should expect

Asset server

Under nominal load: no observable change.
Under create-path saturation: the 5th simultaneous on-demand creation gets HTTP 503 with Retry-After: 1 instead of waiting up to 30s for a permit (then 503'ing anyway via timeout). The pod stays up and serves all already-created files via the fast path.
For an unhealthy cluster: a client that hammers the server during a stuck Aerospike batch gets 503s and is expected to retry. Far better failure mode than crashing.
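On the wire, the saturation case above amounts to a status mapping plus one header. The sketch below uses plain net/http and httptest as stand-ins; it is not the handler in services/asset/httpimpl, and writeError, respond and the three error values are hypothetical names chosen for illustration.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// Stand-ins for the typed errors; names and values here are illustrative only.
var (
	errServiceUnavailable = errors.New("service unavailable")
	errNotFound           = errors.New("not found")
	errInternal           = errors.New("internal failure")
)

// writeError maps a repository error onto an HTTP status. Only the 503 branch
// sets Retry-After, telling the client how many seconds to wait.
func writeError(w http.ResponseWriter, err error) {
	status := http.StatusInternalServerError
	switch {
	case errors.Is(err, errServiceUnavailable):
		status = http.StatusServiceUnavailable
		w.Header().Set("Retry-After", "1")
	case errors.Is(err, errNotFound):
		status = http.StatusNotFound
	}
	w.WriteHeader(status)
	fmt.Fprintln(w, err.Error())
}

// respond runs writeError against an in-memory recorder so the mapping can be
// inspected without starting a server.
func respond(err error) (status int, retryAfter string) {
	rec := httptest.NewRecorder()
	writeError(rec, err)
	return rec.Code, rec.Header().Get("Retry-After")
}

func main() {
	for _, err := range []error{errServiceUnavailable, errNotFound, errInternal} {
		status, retryAfter := respond(err)
		fmt.Printf("%-20v -> %d Retry-After=%q\n", err, status, retryAfter)
	}
}
```

The header has to be set before WriteHeader is called, since headers written afterwards are not sent.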

Peer validation services

subtreevalidation / blockvalidation: a peer's transient 503 is now retried (up to ~7.75s total worst case) instead of immediately falling through to "peer cannot provide subtree data" / next-peer attempt.
This may slightly slow down detection of genuinely broken peers (a peer that always 503s now takes ~8s to give up), but eliminates the spurious "peer is broken" classification when the peer is just temporarily admission-throttled. Net positive for sync stability.
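The ~7.75s figure follows from the backoff parameters: six attempts leave five sleeps, which double from 250ms and never reach the 5s cap. A small sketch of that arithmetic, where worstCaseWait is a hypothetical helper and the result ignores request latency and any larger Retry-After hint from the server:

```go
package main

import (
	"fmt"
	"time"
)

// worstCaseWait sums the sleeps between attempts for a doubling backoff that
// starts at initial and is capped at maxDelay. With n attempts there are n-1 sleeps.
func worstCaseWait(initial, maxDelay time.Duration, attempts int) time.Duration {
	var total time.Duration
	delay := initial
	for i := 1; i < attempts; i++ {
		total += delay
		delay *= 2
		if delay > maxDelay {
			delay = maxDelay
		}
	}
	return total
}

func main() {
	// 250ms + 500ms + 1s + 2s + 4s
	fmt.Println(worstCaseWait(250*time.Millisecond, 5*time.Second, 6)) // prints 7.75s
}
```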

Test plan

Unit tests (in this PR)

TestDoHTTPRequestBodyReaderWithRetry_SuccessOnFirstTry — no retry overhead when healthy
TestDoHTTPRequestBodyReaderWithRetry_RetriesOn503ThenSucceeds — returns body of successful attempt, not 503 body
TestDoHTTPRequestBodyReaderWithRetry_ExhaustsAttemptsOnPersistent503 — final error is typed ErrServiceUnavailable
TestDoHTTPRequestBodyReaderWithRetry_HonorsRetryAfter — server Retry-After: 1 overrides much-smaller initialDelay
TestDoHTTPRequestBodyReaderWithRetry_NoRetryOnNon503 (4 sub-cases) — 500/502/504/404 fail in 1 attempt
TestDoHTTPRequestBodyReaderWithRetry_ContextCancelAbortsRetries — ctx cancel short-circuits the loop
TestParseRetryAfter — empty/negative/non-numeric inputs return 0
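For reference, the contract that TestParseRetryAfter describes (empty, negative and non-numeric inputs yield 0) can be met by a parser along these lines. parseRetryAfterSeconds is a hypothetical stand-in rather than the function in util/http.go, and it only handles the delay-seconds form of the header; an HTTP-date value is treated as non-numeric and returns 0.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
	"time"
)

// parseRetryAfterSeconds converts a Retry-After header value given in whole
// seconds into a duration. Anything it cannot use safely becomes 0, which a
// caller can read as "no hint, fall back to the computed backoff".
func parseRetryAfterSeconds(header string) time.Duration {
	secs, err := strconv.Atoi(strings.TrimSpace(header))
	if err != nil || secs < 0 {
		return 0
	}
	return time.Duration(secs) * time.Second
}

func main() {
	for _, in := range []string{"1", " 3 ", "", "-5", "soon"} {
		fmt.Printf("%q -> %v\n", in, parseRetryAfterSeconds(in))
	}
}
```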

Local verification done

go build clean across util/, services/asset/, services/subtreevalidation/, services/blockvalidation/
go vet clean (only pre-existing warnings in test/utils/ unrelated to this change)
go test ./util/ -race passes in ~7s
Pre-commit hooks: gci, gofmt, golangci-lint, etc. all green

Pending verification (cannot do locally)

CI test suite (deferring to CI run on this PR)
Deploy to dev-scale-1 and confirm:
- asset pods stay up (no OOMKilled events)
- 503s appear in metrics/logs under load instead of crashes
- subtreevalidation / blockvalidation peers tolerate the 503s and complete catchup

Risks and rollout notes

Feature flag-able via setting: asset_concurrency_subtree_data_create=0 reverts to unlimited (the prior behavior). Keep a knob in case the cap turns out to be too aggressive.
The 503 path is new: clients of /subtree_data that don't go through the new retry helper will see 503s they didn't see before. The three known internal callers are updated; any external/unknown caller falls back to existing peer-failure handling, which already treats network errors as transient.
Retry storm risk: up to 6 attempts (~7.75s worst case in total) per request × N peers could amplify load on a struggling asset server. Mitigated by the exponential backoff + Retry-After honoring + the 503-only filter (we don't retry on 5xx in general).
No schema/wire changes: pure server-side and client-side error-handling change.
Rollback: revert this commit. The deployed feat/teranode-native-ops branch returns to the prior (crashlooping) behavior — only do this if the new behavior is worse than the OOM, which would be surprising.

Companion / follow-up work (not in this PR)

Lower default asset_subtreeDataStreamingChunkSize (currently 10000) — config-only change, can ship via Helm without a code change.
Background quorum lock cleanup on startup — current per-request lazy expiration is fine but creates surprising latency when many stale locks exist after a crash storm.
processTxMetaUsingStoreConcurrency review — getTxs fan-out is the largest goroutine multiplier; we may want a global cap rather than per-call.

Production asset pods were OOMKilling under load on /api/v1/subtree_data, manifesting downstream as nginx "upstream prematurely closed connection while reading response header from upstream". Goroutine profiles showed 60K+ goroutines accumulating in the chunk-fetch fan-out before SIGKILL.

The earlier streaming work bounded per-request memory but did nothing to cap concurrent on-demand subtreeData file creations: each one holds chunkSize tx-metadata batches in memory, multiplied by however many clients arrive at once. With large transactions and slow clients the process trivially exceeds even a 64Gi limit.

Asset side - admission control:
- New asset_concurrency_subtree_data_create setting (default 4) gates the dualStreamWithFileCreation path with non-blocking TryAcquire. When the cap is reached, requests get HTTP 503 with Retry-After: 1 instead of waiting up to 30s for a permit.
- Restructured GetSubtreeDataReader to check Exists first without holding the reader semaphore. File-exists fast path uses the existing reader sem; on-demand creation uses the new create sem.
- Defensive cap on the pending chunk map in writeTransactionsViaSubtreeStoreStreaming (2 * concurrency); aborts with a clear error if a future scheduler regression grows it.
- ctx check every 256 txs in writeChunkToWriter so a client disconnect releases chunkMetaSlice promptly instead of waiting for the next pipe write to fail.

HTTP utility - typed 503 + retry helper:
- buildHTTPError now produces errors.ErrServiceUnavailable on 503 so callers can errors.Is on it.
- New DoHTTPRequestBodyReaderWithRetry: exponential backoff (250ms -> 5s, max 6 attempts), honors Retry-After header, retries only on 503. Non-503 errors and ctx cancellation return immediately.

Callers - retry on peer 503:
- subtreevalidation/SubtreeValidation.go (getSubtreeMissingTxs)
- subtreevalidation/check_block_subtrees.go (CheckBlockSubtrees)
- blockvalidation/get_blocks.go (fetchSubtreeDataFromPeer)

Tests:
- 7 new unit tests for the retry helper covering success, retry-then-succeed, attempt exhaustion, Retry-After honoring, no-retry on non-503, ctx cancellation, and parseRetryAfter parsing. Race-clean.

Verified: go build, go vet, go test ./util/ -race all clean.

github-actions · 2026-05-08T08:07:00Z

🤖 Claude Code Review

Status: Complete

Review Summary

This PR implements a critical production hotfix for OOMKilled crashes in the asset service by adding admission control for on-demand subtree data creation and retry logic for 503 responses.

Current Review: No issues found

The implementation is solid:

Admission control: Non-blocking TryAcquire pattern correctly fails fast with HTTP 503 when capacity is reached, preventing memory exhaustion
Semaphore lifecycle: All permits are properly acquired and released, including error paths and the fileAppearedReadback helper
Retry logic: Well-designed exponential backoff (250ms→5s, max 6 attempts) with Retry-After header support
Context handling: Proper cancellation checks (ctxCheckEvery=256) in writeChunkToWriter prevent holding memory when clients disconnect
Test coverage: Comprehensive unit tests (7 tests covering success, retry, exhaustion, Retry-After, non-503, context cancel, edge cases)
Documentation: Clear godoc explaining concurrency model, ownership semantics, and behavior changes
Defensive checks: pendingCap prevents scheduler regressions from causing unbounded memory growth

The PR description provides excellent production evidence, clear risk analysis, and detailed rollout notes.

github-actions · 2026-05-08T08:21:39Z

Benchmark Comparison Report

Baseline: main (unknown)

Current: PR-830 (b4b32dc)

Summary

Regressions: 0
Improvements: 0
Unchanged: 142
Significance level: p < 0.05

All benchmark results (sec/op)

Benchmark	Baseline	Current	Change	p-value
_NewBlockFromBytes-4	1.659µ	1.660µ	~	1.000
SplitSyncedParentMap_SetIfNotExists/256_buckets-4	61.56n	61.92n	~	0.100
SplitSyncedParentMap_SetIfNotExists/16_buckets-4	61.60n	62.22n	~	0.700
SplitSyncedParentMap_SetIfNotExists/1_bucket-4	61.66n	61.86n	~	0.700
SplitSyncedParentMap_ConcurrentSetIfNotExists/256_buckets...	30.38n	30.81n	~	1.000
SplitSyncedParentMap_ConcurrentSetIfNotExists/16_buckets_...	51.90n	52.33n	~	0.400
SplitSyncedParentMap_ConcurrentSetIfNotExists/1_bucket_pa...	106.3n	106.5n	~	1.000
MiningCandidate_Stringify_Short-4	263.6n	268.4n	~	0.100
MiningCandidate_Stringify_Long-4	1.884µ	1.877µ	~	0.300
MiningSolution_Stringify-4	969.2n	972.3n	~	0.400
BlockInfo_MarshalJSON-4	1.744µ	1.742µ	~	1.000
NewFromBytes-4	126.9n	126.0n	~	1.000
Mine_EasyDifficulty-4	67.62µ	67.18µ	~	0.700
Mine_WithAddress-4	6.894µ	7.002µ	~	0.100
BlockAssembler_AddTx-4	0.02833n	0.02798n	~	1.000
AddNode-4	11.30	11.85	~	0.400
AddNodeWithMap-4	11.55	11.23	~	0.400
DirectSubtreeAdd/4_per_subtree-4	62.57n	58.13n	~	0.200
DirectSubtreeAdd/64_per_subtree-4	31.64n	28.55n	~	0.100
DirectSubtreeAdd/256_per_subtree-4	30.70n	27.22n	~	0.100
DirectSubtreeAdd/1024_per_subtree-4	29.17n	26.10n	~	0.100
DirectSubtreeAdd/2048_per_subtree-4	28.76n	25.82n	~	0.100
SubtreeProcessorAdd/4_per_subtree-4	288.2n	276.2n	~	0.200
SubtreeProcessorAdd/64_per_subtree-4	272.8n	272.0n	~	0.700
SubtreeProcessorAdd/256_per_subtree-4	275.1n	274.1n	~	0.700
SubtreeProcessorAdd/1024_per_subtree-4	269.8n	267.9n	~	1.000
SubtreeProcessorAdd/2048_per_subtree-4	266.0n	266.8n	~	0.700
SubtreeProcessorRotate/4_per_subtree-4	272.2n	270.9n	~	1.000
SubtreeProcessorRotate/64_per_subtree-4	272.4n	269.8n	~	0.100
SubtreeProcessorRotate/256_per_subtree-4	268.5n	272.6n	~	0.100
SubtreeProcessorRotate/1024_per_subtree-4	268.4n	274.4n	~	0.100
SubtreeNodeAddOnly/4_per_subtree-4	53.78n	53.85n	~	1.000
SubtreeNodeAddOnly/64_per_subtree-4	34.16n	34.33n	~	0.700
SubtreeNodeAddOnly/256_per_subtree-4	33.23n	33.47n	~	0.100
SubtreeNodeAddOnly/1024_per_subtree-4	32.54n	32.61n	~	0.700
SubtreeCreationOnly/4_per_subtree-4	112.6n	112.2n	~	1.000
SubtreeCreationOnly/64_per_subtree-4	393.8n	392.9n	~	0.700
SubtreeCreationOnly/256_per_subtree-4	1.312µ	1.352µ	~	0.200
SubtreeCreationOnly/1024_per_subtree-4	4.374µ	4.482µ	~	0.100
SubtreeCreationOnly/2048_per_subtree-4	7.783µ	8.111µ	~	0.100
SubtreeProcessorOverheadBreakdown/64_per_subtree-4	267.9n	271.4n	~	0.700
SubtreeProcessorOverheadBreakdown/1024_per_subtree-4	269.5n	270.6n	~	0.200
ParallelGetAndSetIfNotExists/1k_nodes-4	789.9µ	815.1µ	~	0.100
ParallelGetAndSetIfNotExists/10k_nodes-4	1.331m	1.572m	~	0.100
ParallelGetAndSetIfNotExists/50k_nodes-4	6.686m	6.687m	~	1.000
ParallelGetAndSetIfNotExists/100k_nodes-4	13.46m	13.40m	~	1.000
SequentialGetAndSetIfNotExists/1k_nodes-4	661.4µ	649.3µ	~	0.100
SequentialGetAndSetIfNotExists/10k_nodes-4	2.804m	2.901m	~	0.200
SequentialGetAndSetIfNotExists/50k_nodes-4	10.37m	10.50m	~	0.100
SequentialGetAndSetIfNotExists/100k_nodes-4	20.03m	19.94m	~	0.100
ProcessOwnBlockSubtreeNodesParallel/1k_nodes-4	632.4µ	841.6µ	~	0.100
ProcessOwnBlockSubtreeNodesParallel/10k_nodes-4	4.151m	4.329m	~	0.100
ProcessOwnBlockSubtreeNodesParallel/100k_nodes-4	16.71m	16.65m	~	1.000
ProcessOwnBlockSubtreeNodesSequential/1k_nodes-4	688.4µ	686.7µ	~	0.400
ProcessOwnBlockSubtreeNodesSequential/10k_nodes-4	5.651m	5.664m	~	0.700
ProcessOwnBlockSubtreeNodesSequential/100k_nodes-4	37.88m	37.46m	~	0.100
DiskTxMap_SetIfNotExists-4	4.167µ	3.935µ	~	0.400
DiskTxMap_SetIfNotExists_Parallel-4	3.768µ	3.593µ	~	0.100
DiskTxMap_ExistenceOnly-4	464.7n	339.7n	~	0.700
Queue-4	194.5n	200.2n	~	0.100
AtomicPointer-4	4.531n	4.581n	~	0.700
ReorgOptimizations/DedupFilterPipeline/Old/10K-4	856.3µ	897.9µ	~	0.100
ReorgOptimizations/DedupFilterPipeline/New/10K-4	833.9µ	860.6µ	~	0.100
ReorgOptimizations/AllMarkFalse/Old/10K-4	111.0µ	112.6µ	~	0.400
ReorgOptimizations/AllMarkFalse/New/10K-4	62.38µ	62.26µ	~	0.100
ReorgOptimizations/HashSlicePool/Old/10K-4	68.88µ	65.66µ	~	0.400
ReorgOptimizations/HashSlicePool/New/10K-4	11.27µ	11.66µ	~	0.400
ReorgOptimizations/NodeFlags/Old/10K-4	5.529µ	5.790µ	~	0.700
ReorgOptimizations/NodeFlags/New/10K-4	1.862µ	1.912µ	~	0.100
ReorgOptimizations/DedupFilterPipeline/Old/100K-4	10.17m	12.23m	~	0.100
ReorgOptimizations/DedupFilterPipeline/New/100K-4	10.27m	10.58m	~	0.100
ReorgOptimizations/AllMarkFalse/Old/100K-4	1.141m	1.143m	~	1.000
ReorgOptimizations/AllMarkFalse/New/100K-4	685.4µ	690.5µ	~	0.700
ReorgOptimizations/HashSlicePool/Old/100K-4	641.1µ	779.6µ	~	0.100
ReorgOptimizations/HashSlicePool/New/100K-4	313.6µ	309.9µ	~	1.000
ReorgOptimizations/NodeFlags/Old/100K-4	56.21µ	61.69µ	~	0.100
ReorgOptimizations/NodeFlags/New/100K-4	19.65µ	20.96µ	~	0.100
TxMapSetIfNotExists-4	51.67n	51.98n	~	0.600
TxMapSetIfNotExistsDuplicate-4	38.16n	38.64n	~	0.700
ChannelSendReceive-4	609.6n	604.4n	~	0.200
CalcBlockWork-4	472.3n	472.8n	~	1.000
CalculateWork-4	624.9n	641.9n	~	0.100
BuildBlockLocatorString_Helpers/Size_10-4	1.456µ	1.329µ	~	0.700
BuildBlockLocatorString_Helpers/Size_100-4	12.36µ	14.90µ	~	0.100
BuildBlockLocatorString_Helpers/Size_1000-4	123.0µ	124.6µ	~	0.400
CatchupWithHeaderCache-4	104.4m	104.4m	~	1.000
_BufferPoolAllocation/16KB-4	4.431µ	3.286µ	~	0.100
_BufferPoolAllocation/32KB-4	8.310µ	8.574µ	~	1.000
_BufferPoolAllocation/64KB-4	15.74µ	17.17µ	~	0.700
_BufferPoolAllocation/128KB-4	32.27µ	27.51µ	~	0.100
_BufferPoolAllocation/512KB-4	108.4µ	115.0µ	~	0.200
_BufferPoolConcurrent/32KB-4	17.72µ	18.71µ	~	0.100
_BufferPoolConcurrent/64KB-4	27.36µ	30.32µ	~	0.100
_BufferPoolConcurrent/512KB-4	140.2µ	147.4µ	~	0.100
_SubtreeDeserializationWithBufferSizes/16KB-4	667.2µ	626.7µ	~	0.100
_SubtreeDeserializationWithBufferSizes/32KB-4	660.6µ	628.2µ	~	0.100
_SubtreeDeserializationWithBufferSizes/64KB-4	650.4µ	631.0µ	~	0.100
_SubtreeDeserializationWithBufferSizes/128KB-4	657.4µ	621.8µ	~	0.100
_SubtreeDeserializationWithBufferSizes/512KB-4	690.9µ	637.9µ	~	0.100
_SubtreeDataDeserializationWithBufferSizes/16KB-4	35.20m	35.70m	~	0.100
_SubtreeDataDeserializationWithBufferSizes/32KB-4	35.29m	35.69m	~	0.200
_SubtreeDataDeserializationWithBufferSizes/64KB-4	35.23m	35.55m	~	0.100
_SubtreeDataDeserializationWithBufferSizes/128KB-4	35.00m	35.14m	~	1.000
_SubtreeDataDeserializationWithBufferSizes/512KB-4	34.57m	34.90m	~	0.100
_PooledVsNonPooled/Pooled-4	738.8n	735.4n	~	0.700
_PooledVsNonPooled/NonPooled-4	7.410µ	6.753µ	~	0.100
_MemoryFootprint/Current_512KB_32concurrent-4	6.759µ	7.161µ	~	0.100
_MemoryFootprint/Proposed_32KB_32concurrent-4	10.63µ	10.36µ	~	0.100
_MemoryFootprint/Alternative_64KB_32concurrent-4	10.27µ	10.53µ	~	0.100
_prepareTxsPerLevel-4	403.8m	400.0m	~	0.400
_prepareTxsPerLevelOrdered-4	4.085m	4.056m	~	1.000
_prepareTxsPerLevel_Comparison/Original-4	405.2m	401.2m	~	1.000
_prepareTxsPerLevel_Comparison/Optimized-4	3.622m	4.006m	~	0.100
SubtreeSizes/10k_tx_4_per_subtree-4	1.253m	1.247m	~	1.000
SubtreeSizes/10k_tx_16_per_subtree-4	295.1µ	296.8µ	~	1.000
SubtreeSizes/10k_tx_64_per_subtree-4	71.18µ	72.08µ	~	0.200
SubtreeSizes/10k_tx_256_per_subtree-4	17.61µ	17.79µ	~	0.600
SubtreeSizes/10k_tx_512_per_subtree-4	8.762µ	8.760µ	~	1.000
SubtreeSizes/10k_tx_1024_per_subtree-4	4.350µ	4.361µ	~	0.600
SubtreeSizes/10k_tx_2k_per_subtree-4	2.147µ	2.160µ	~	0.400
BlockSizeScaling/10k_tx_64_per_subtree-4	69.48µ	69.47µ	~	0.700
BlockSizeScaling/10k_tx_256_per_subtree-4	17.46µ	17.24µ	~	0.700
BlockSizeScaling/10k_tx_1024_per_subtree-4	4.342µ	4.294µ	~	0.700
BlockSizeScaling/50k_tx_64_per_subtree-4	367.8µ	366.2µ	~	1.000
BlockSizeScaling/50k_tx_256_per_subtree-4	86.99µ	88.59µ	~	0.400
BlockSizeScaling/50k_tx_1024_per_subtree-4	21.74µ	21.43µ	~	0.400
SubtreeAllocations/small_subtrees_exists_check-4	149.5µ	148.3µ	~	0.400
SubtreeAllocations/small_subtrees_data_fetch-4	160.4µ	158.2µ	~	0.400
SubtreeAllocations/small_subtrees_full_validation-4	305.3µ	308.6µ	~	0.700
SubtreeAllocations/medium_subtrees_exists_check-4	8.750µ	8.877µ	~	0.100
SubtreeAllocations/medium_subtrees_data_fetch-4	9.260µ	9.310µ	~	0.400
SubtreeAllocations/medium_subtrees_full_validation-4	17.50µ	17.26µ	~	0.200
SubtreeAllocations/large_subtrees_exists_check-4	2.088µ	2.077µ	~	0.700
SubtreeAllocations/large_subtrees_data_fetch-4	2.209µ	2.195µ	~	0.600
SubtreeAllocations/large_subtrees_full_validation-4	4.318µ	4.292µ	~	0.200
StoreBlock_Sequential/BelowCSVHeight-4	335.1µ	325.2µ	~	0.200
StoreBlock_Sequential/AboveCSVHeight-4	335.7µ	327.4µ	~	0.700
GetUtxoHashes-4	255.3n	258.9n	~	0.400
GetUtxoHashes_ManyOutputs-4	44.57µ	44.65µ	~	1.000
_NewMetaDataFromBytes-4	240.1n	238.8n	~	1.000
_Bytes-4	630.9n	624.2n	~	0.100
_MetaBytes-4	576.6n	562.5n	~	0.100

Threshold: >10% with p < 0.05 | Generated: 2026-05-08 08:21 UTC

oskarszoon

Approve. Prod-driven design (66k goroutines at OOM, getTxs fan-out amplifier identified precisely), layered fix — non-blocking TryAcquire on a new dedicated semSubtreeDataCreate separate from the reader sem, 503+Retry-After on saturation, typed ErrServiceUnavailable for caller errors.Is, 503-only retry with exponential backoff + Retry-After honoring, pending chunk-map cap, ctx-every-256-txs in the writer. Each layer has a clear purpose.

Three confirmations before merge:

test CI job is failing on the latest run while everything else (smoketest, sequential-{sqlite,postgres,aerospike}, lint, 14 benches) is green. The PR adds 7 new tests in util/http_test.go so worth confirming this is the same flake hitting other PRs this week vs a real failure.
"Deploy to dev-scale-1 and verify peer-side recovers from saturation" is unchecked in the PR's verification list. Given retry-storm is correctly identified in the Risks section, the load test is the proof-point.
This targets feat/teranode-native-ops, not main. Worth a one-line note in the PR description so the next reader doesn't expect it on main and so the eventual main-port stays tracked.

icellan · 2026-06-12T13:24:05Z

Closing as already integrated.

This PR's single commit (3567cc05d) has the identical patch-id as commit fdaeadb7e on the base branch feat/teranode-native-ops — it is the same change, already merged. The base has since evolved further with superseding fixes on top of it:

8c8907299: address critical+high review findings
79cc8b8f2: drop chunked terminator on mid-stream failure so caches don't store truncated bodies
599f644a4: stop cancel cascade + segregate client-gone errors (#947)

A rebase onto the current base produces zero unique commits (the base is a strict superset of this PR), so there is nothing left to merge.

icellan mentioned this pull request May 8, 2026

fix(blockvalidation): serialize setTxMined via setMinedChan worker #831

Merged


oskarszoon approved these changes May 14, 2026


icellan self-assigned this Jun 10, 2026

icellan closed this Jun 12, 2026

icellan wants to merge 1 commit into feat/teranode-native-ops from fix/asset-subtree-data-admission-control
