blockvalidation: OOM during ttn catch-up sync — 70% of heap in go-bt tx/output decode #920

@oskarszoon

Summary

blockvalidation is OOM-killed during testnet catch-up sync at around block 5,000. A heap profile captured at 13.17 GiB RSS shows ~70% of the in-use heap in go-bt tx/output decode hot paths. The container restarted (RestartCount=1) about two minutes after the peak snapshot, consistent with an OOM kill.

Environment

Evidence

Heap snapshot captured at 2026-05-21T15:07:57Z, RSS = 13.17 GiB at sample time. Container restart timestamp: 2026-05-21T15:09:53Z.

Top of 8.29 GB inuse:

| % | MB | Function |
|---|---|---|
| 27.3% | 2264 | go-bt/v2.(*Output).ReadFrom (output.go:51) |
| 20.9% | 1731 | go-bt/v2/bscript.NewFromBytes (script.go:43) |
| 13.7% | 1138 | go-bt/v2.(*Tx).ReadFrom (tx.go:198) |
| 8.6% | 716 | go-bt/v2.(*Tx).ReadFrom (tx.go:205) |
| 7.8% | 648 | runtime.mallocgc (uncategorized) |
| 7.8% | 642 | bytes.growSlice (bytes/buffer.go:267) |
| 6.3% | 524 | go-bt/v2.(*Tx).toBytesHelper (tx.go:540) |

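Frames are from go-bt/v2@v2.6.3. The four decode frames (Output.ReadFrom, bscript.NewFromBytes, and the two Tx.ReadFrom sites) sum to 70.5% of the in-use heap (5849 MB).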

RSS trajectory (sawtooth)

| UTC | RSS (MiB) | Reason |
|---|---|---|
| 14:08:42 | 2704 | first-cross |
| 14:20:21 | 5620 | cooldown |
| 14:25:49 | 5030 | cooldown |
| 14:31:18 | 3219 | cooldown |
| 14:36:48 | 3164 | cooldown |
| 14:39:41 | 3746 | delta |
| 14:44:39 | 5274 | delta |
| 14:50:07 | 2939 | cooldown |
| 14:54:01 | 3481 | delta |
| 14:59:29 | 3126 | cooldown |
| 15:03:56 | 8022 | delta |
| 15:07:52 | 13168 | delta |
| ~15:09:53 | OOM kill | container restart |

The sawtooth (climb, GC reclaim, climb again) points to allocation churn rather than a leak. The problem is that the peak crosses the cgroup limit before GC can intervene.
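One knob that may be worth trying while the allocation sites are being fixed is the Go runtime's soft memory limit, which makes the collector work harder as the heap approaches a target set below the cgroup limit. A minimal sketch, assuming Go 1.19 or later. The 14 GiB container limit and the 0.9 headroom factor are hypothetical placeholders (the real cgroup limit is not stated in this issue), and `applySoftLimit` / `currentSoftLimit` are illustrative helpers, not Teranode code.

```go
package main

import (
	"fmt"
	"runtime/debug"
)

// applySoftLimit sets the Go runtime's soft memory limit to a fraction of the
// container limit and returns the value that was applied.
// Both arguments are assumptions supplied by the operator.
func applySoftLimit(cgroupLimitBytes int64, fraction float64) int64 {
	limit := int64(float64(cgroupLimitBytes) * fraction)
	debug.SetMemoryLimit(limit)
	return limit
}

// currentSoftLimit reports the limit currently in effect.
// A negative argument to SetMemoryLimit queries without changing anything.
func currentSoftLimit() int64 {
	return debug.SetMemoryLimit(-1)
}

func main() {
	const cgroupLimit = 14 << 30 // hypothetical 14 GiB container limit
	applied := applySoftLimit(cgroupLimit, 0.9)
	fmt.Printf("soft memory limit set to %d MiB\n", applied>>20)
	fmt.Printf("runtime reports %d MiB\n", currentSoftLimit()>>20)
}
```

The same setting is available without a code change through the GOMEMLIMIT environment variable. It only helps if the peak is mostly garbage awaiting collection; if most of the 8.29 GB is live, the limit costs CPU without preventing the kill.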

Suspected cause

The Output.ReadFrom / bscript.NewFromBytes / Tx.ReadFrom triad allocates a fresh []byte for every script of every output of every tx. On historical blocks containing very large outputs (OP_RETURN data, large script bodies), the allocation rate of the decode path outpaces GC.
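To make the pattern concrete, here is a minimal sketch of a decoder that allocates one slice per output. It is a simplified stand-in, not go-bt's code: the `output` type, the helper names, and the fixed 4-byte script length (real transactions use a varint) are all assumptions made for illustration.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
	"io"
	"runtime"
)

// output is a simplified stand-in for a transaction output (NOT go-bt's type).
type output struct {
	Satoshis uint64
	Script   []byte
}

// readOutput decodes one output and allocates a fresh script slice each time.
// Simplified layout: 8-byte LE value, 4-byte LE script length, script bytes.
func readOutput(r io.Reader) (output, error) {
	var hdr [12]byte
	if _, err := io.ReadFull(r, hdr[:]); err != nil {
		return output{}, err
	}
	n := binary.LittleEndian.Uint32(hdr[8:])
	script := make([]byte, n) // one heap allocation per output, sized by the input
	if _, err := io.ReadFull(r, script); err != nil {
		return output{}, err
	}
	return output{Satoshis: binary.LittleEndian.Uint64(hdr[:8]), Script: script}, nil
}

// encodeAll serializes outputs in the simplified layout above.
func encodeAll(outs []output) []byte {
	var buf bytes.Buffer
	for _, o := range outs {
		var hdr [12]byte
		binary.LittleEndian.PutUint64(hdr[:8], o.Satoshis)
		binary.LittleEndian.PutUint32(hdr[8:], uint32(len(o.Script)))
		buf.Write(hdr[:])
		buf.Write(o.Script)
	}
	return buf.Bytes()
}

// decodeAll decodes every output in raw and returns them together with the
// total number of script bytes that were freshly allocated along the way.
func decodeAll(raw []byte) ([]output, int, error) {
	r := bytes.NewReader(raw)
	var outs []output
	total := 0
	for r.Len() > 0 {
		o, err := readOutput(r)
		if err != nil {
			return nil, 0, err
		}
		total += len(o.Script)
		outs = append(outs, o)
	}
	return outs, total, nil
}

func main() {
	in := make([]output, 100)
	for i := range in {
		in[i] = output{Satoshis: uint64(i), Script: make([]byte, 10_000)}
	}
	raw := encodeAll(in)

	var before, after runtime.MemStats
	runtime.ReadMemStats(&before)
	outs, total, err := decodeAll(raw)
	runtime.ReadMemStats(&after)
	if err != nil {
		panic(err)
	}
	fmt.Printf("decoded %d outputs, %d script bytes, TotalAlloc grew by %d bytes\n",
		len(outs), total, after.TotalAlloc-before.TotalAlloc)
}
```

Every decoded script is a new allocation proportional to the on-wire size, so a run of blocks with large script bodies translates directly into heap growth until the next GC cycle.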

Looks structurally similar to the wire-side issue addressed by #885 (per-payload []byte allocation in go-wire.ReadMessageWithEncodingN), but on the validation side via go-bt. The streaming fix in #885 helps legacy ingestion; this hot path is hit during blockvalidation's parse of subtrees / txs received via gRPC + the catchup pipeline, so #885's win does not apply here.

Potential fix directions

Likely needs work in go-bt (similar to the go-wire follow-ups planned):

  1. Pool the per-output script []byte buffers across a block's tx decode.
  2. Replace the per-tx toBytesHelper re-serialization (524 MB at peak — needed only for hashing?) with a streaming hash or a reusable buffer.
  3. Audit bscript.NewFromBytes for unnecessary copies — if the source buffer is already owned, wrap rather than copy.
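Directions 1 and 2 could look roughly like the sketch below. This is not a patch against go-bt: `scriptPool`, `getScript`, `putScript`, `writeOutput`, and the two hash helpers are hypothetical names, and the 12-byte header layout is a simplification of the real output encoding. It assumes Go 1.18 or later.

```go
package main

import (
	"bytes"
	"crypto/sha256"
	"encoding/binary"
	"fmt"
	"io"
	"sync"
)

// scriptPool recycles script buffers so a block's decode can reuse memory
// instead of allocating one slice per output (direction 1).
var scriptPool = sync.Pool{
	New: func() any { b := make([]byte, 0, 4096); return &b },
}

// getScript reads n script bytes from r into a pooled buffer.
// The caller must hand the buffer back with putScript once it is done with it.
func getScript(r io.Reader, n int) (*[]byte, error) {
	bp := scriptPool.Get().(*[]byte)
	if cap(*bp) < n {
		*bp = make([]byte, n) // grow once; the larger buffer is pooled afterwards
	}
	*bp = (*bp)[:n]
	if _, err := io.ReadFull(r, *bp); err != nil {
		putScript(bp)
		return nil, err
	}
	return bp, nil
}

// putScript returns a buffer to the pool for reuse.
func putScript(bp *[]byte) {
	*bp = (*bp)[:0]
	scriptPool.Put(bp)
}

// writeOutput writes one output in a simplified layout:
// 8-byte LE value, 4-byte LE script length, script bytes.
func writeOutput(w io.Writer, value uint64, script []byte) {
	var hdr [12]byte
	binary.LittleEndian.PutUint64(hdr[:8], value)
	binary.LittleEndian.PutUint32(hdr[8:], uint32(len(script)))
	w.Write(hdr[:])
	w.Write(script)
}

// hashBuffered serializes everything into one buffer, then double-SHA-256s it.
func hashBuffered(values []uint64, scripts [][]byte) [32]byte {
	var buf bytes.Buffer
	for i, s := range scripts {
		writeOutput(&buf, values[i], s)
	}
	first := sha256.Sum256(buf.Bytes())
	return sha256.Sum256(first[:])
}

// hashStreaming feeds the same bytes straight into the hasher (direction 2),
// so no full serialized copy is held in memory.
func hashStreaming(values []uint64, scripts [][]byte) [32]byte {
	h := sha256.New()
	for i, s := range scripts {
		writeOutput(h, values[i], s)
	}
	var first [32]byte
	h.Sum(first[:0]) // appends the digest into first's backing array
	return sha256.Sum256(first[:])
}

func main() {
	bp, err := getScript(bytes.NewReader([]byte{0x6a, 0x01, 0x02}), 3)
	if err != nil {
		panic(err)
	}
	values, scripts := []uint64{1}, [][]byte{*bp}
	fmt.Println("hashes match:", hashBuffered(values, scripts) == hashStreaming(values, scripts))
	putScript(bp) // only safe once nothing else references the script bytes
}
```

The hard part of pooling is lifetime: a buffer can only go back to the pool after every consumer of the decoded tx is finished with it, which probably means releasing per block or per subtree rather than per output. The streaming hash has no such constraint, so it may be the easier first step if toBytesHelper is indeed only feeding the hash.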

Repro

  1. Point a Teranode node at testnet, fresh state.
  2. Let it catch up through block ~5,000.
  3. Watch blockvalidation RSS — expect a sawtooth pattern crossing 13 GiB.
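For step 3, RSS can be sampled from /proc on Linux. The sketch below is a hypothetical helper, not the watcher that produced the table in this issue; it assumes it runs in the same PID namespace as blockvalidation (for example inside the container) and takes the PID as an optional first argument.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strconv"
	"strings"
)

// parseVmRSSKiB extracts the VmRSS value (in KiB) from the text of a
// Linux /proc/<pid>/status file.
func parseVmRSSKiB(status string) (int64, error) {
	sc := bufio.NewScanner(strings.NewReader(status))
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) >= 2 && fields[0] == "VmRSS:" {
			return strconv.ParseInt(fields[1], 10, 64)
		}
	}
	return 0, fmt.Errorf("VmRSS not found")
}

func main() {
	// Hypothetical usage: pass the blockvalidation PID as the first argument;
	// without one, the program reports its own RSS.
	pid := "self"
	if len(os.Args) > 1 {
		pid = os.Args[1]
	}
	data, err := os.ReadFile("/proc/" + pid + "/status")
	if err != nil {
		fmt.Println("cannot read status (Linux only):", err)
		return
	}
	kib, err := parseVmRSSKiB(string(data))
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Printf("RSS: %d MiB\n", kib/1024)
}
```

Run on an interval, this gives the same kind of series as the RSS trajectory table; a VmRSS of 13484032 kB corresponds to the 13168 MiB peak recorded there.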

Captured artifacts (local, not attached)

Raw heap, goroutine, and CPU profiles for the 13 GiB peak and 14 other snapshots across the sawtooth are available on request (probe/eu3-bv-2026-05-21/watcher/ in my local checkout).

Related
