Skip to content

perf(legacy): stream-decode incoming block messages off the socket#885

Merged
oskarszoon merged 4 commits into
bsv-blockchain:mainfrom
oskarszoon:feat/legacy-streaming-block-decoder
May 19, 2026
Merged

perf(legacy): stream-decode incoming block messages off the socket#885
oskarszoon merged 4 commits into
bsv-blockchain:mainfrom
oskarszoon:feat/legacy-streaming-block-decoder

Conversation

@oskarszoon

Copy link
Copy Markdown
Contributor

Summary

  • Register a wire.SetExternalHandler for the block command in services/legacy/peer that calls MsgBlock.Bsvdecode directly against the network reader instead of letting ReadMessageWithEncodingN allocate a full-payload []byte first.
  • Wire it up in legacy.Server.Init right after wire.SetLimits.
  • Drop payload []byte from the OnBlock listener for block messages; the legacy netsync path never reads it.

Why

Heap profile pulled from a testnet host during a 25.4 GiB RSS peak on 2026-05-16:

% MB Function
46% 4080 go-wire.(*MsgTx).bsvdecodeWithScratch
32% 2857 go-wire.ReadMessageWithEncodingN payload buffer
13% 1142 go-wire.(*MsgBlock).Bsvdecode per-tx contiguous scripts

The 2.86 GB buffer is the lowest-effort win in that profile and it is caller-side — go-wire already exposes SetExternalHandler for exactly this case ("useful, for instance, for very large blocks that may not fit in memory and need to be processed differently" — go-wire/message.go:42).

Tradeoffs

  • Checksum verification skipped for block messages. The external-handler API gives only the payload reader (header consumed by ReadMessageWithEncodingN), so the DoubleHash comparison the default path performs cannot run here. Matches the existing extended-format path which already skips it at multi-GB sizes.
  • OnBlock buf is now nil for block messages. Only consumer (peer_server.OnBlockbsvutil.NewBlockFromBlockAndBytes) stores the slice but reads it only if block.Bytes() is later called. No legacy netsync code path calls Bytes() on an incoming block (grep -rn 'block.Bytes()' services/legacy/), and bsvutil.Block.Bytes() re-serializes from MsgBlock if anything ever does.

Safety

  • io.LimitedReader caps the inner decoder so a malformed varint cannot read past the declared payload boundary and desync the next ReadMessage.
  • Drains any unread suffix before returning so the stream stays aligned on every error path.
  • Registration is sync.Once; safe to call multiple times.

Test plan

  • go build ./services/legacy/...
  • go vet ./services/legacy/...
  • go test -run 'TestStreamingBlockHandler|TestRegisterStreamingBlockHandler' ./services/legacy/peer/ — 3 pass
    • Roundtrip equality vs buffered path (block hash + every tx hash match)
    • Drain-on-error boundary check
    • Idempotent registration
  • Smoke on testnet host, observe legacy RSS during fat-block receive vs the captured 25.4 GiB peak

Follow-ups (go-wire-side, separate PRs)

  • sync.Pool the per-block scratch buffer in MsgBlock.Bsvdecode.
  • Replace per-tx scripts := make([]byte, totalScriptSize) final-copy buffer with a per-block arena.
  • Null the payload []byte reference inside ReadMessageWithEncodingN after Bsvdecode returns.

Register a go-wire SetExternalHandler for the "block" command in
services/legacy/peer that calls MsgBlock.Bsvdecode directly against
the network reader, replacing the default ReadMessageWithEncodingN
path which allocates a fresh []byte the full size of the declared
payload before parsing.

Heap profile pulled from bsva-ovh-teranode-ttn-eu-4 during a 25.4 GiB
RSS peak on 2026-05-16 showed 2.86 GB of inuse heap held by
ReadMessageWithEncodingN's payload buffer alone, the second-largest
contributor after per-tx scratch in go-wire (probe bundle archived
at probe/eu4-rss-peak-2026-05-16). On multi-GB blocks that buffer is
allocated once, copied from once, and then released, but it dominates
peak inuse during the decode window and aggravates GC pressure.

Tradeoffs:

- Payload checksum is no longer verified for the "block" command.
  The external-handler API gives the handler only the payload reader
  (the header is consumed before dispatch), so the DoubleHash
  comparison the default path performs cannot run here. This matches
  the existing extended-format path which already skips checksum
  verification because the cost outweighs the benefit at multi-GB
  sizes.
- The third return value of ReadMessageWithEncodingN (the raw payload
  bytes) is now nil for block messages. Only consumer in the legacy
  path is peer_server.OnBlock -> bsvutil.NewBlockFromBlockAndBytes,
  which stores the slice but reads it only if block.Bytes() is later
  called; no legacy netsync code path calls Bytes() on a block coming
  in over the wire (verified by grep), and bsvutil falls back to
  re-serializing from MsgBlock if it ever is, so correctness is
  preserved.

The handler wraps the reader in an io.LimitedReader so a malformed
varint cannot read past the declared payload boundary and desync the
next ReadMessage call, and drains any unread suffix before returning
so the stream stays aligned regardless of which error path
Bsvdecode takes.

Tests added: roundtrip equality vs the buffered path, drain-on-error
boundary check, idempotent registration.
@github-actions

github-actions Bot commented May 18, 2026

Copy link
Copy Markdown
Contributor

🤖 Claude Code Review

Status: Complete


Current Review:

No critical issues found. The implementation is well-designed and addresses the documented memory optimization goal effectively.

Observations:

  1. Memory optimization verified: The streaming approach eliminates the ~2.86 GB buffer allocation documented in the heap profile by decoding blocks directly from the network reader.

  2. Safety mechanisms in place:

    • io.LimitedReader prevents overread and stream desynchronization
    • Short-stream detection (wire_streaming.go:48-49) prevents silent failures
    • Comprehensive test coverage including boundary conditions
  3. Documentation accuracy: Claims about downstream validation (HandleBlockDirect enforcing PoW via HasMetTargetDifficulty, merkle reconstruction in prepareSubtrees, per-tx parsing) are correct per services/legacy/netsync/handle_block.go:172, :153, :316.

  4. Previous review feedback addressed: All Copilot and human reviewer concerns from earlier commits have been resolved in the current code.

Minor note: The trade-off of skipping wire-level checksum verification is well-documented and justified given downstream validation guarantees.

Comment thread services/legacy/peer/wire_streaming.go Outdated

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR reduces legacy peer memory usage when receiving large block messages by stream-decoding block payloads directly from the socket instead of buffering the full payload first.

Changes:

  • Registers a legacy peer streaming handler during server initialization after wire limits are configured.
  • Adds a block external handler that uses io.LimitedReader, drains unread payload bytes, and returns no retained raw payload.
  • Adds tests for direct handler round-trip decoding, drain-on-error behavior, and repeated registration calls.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

File Description
services/legacy/Server.go Wires the streaming block handler into legacy server startup.
services/legacy/peer/wire_streaming.go Implements and registers the go-wire external block handler.
services/legacy/peer/wire_streaming_test.go Adds tests for streaming decode behavior and registration idempotency.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread services/legacy/peer/wire_streaming_test.go
- Reframe checksum doc: not a tradeoff. Downstream HandleBlockDirect
  enforces PoW, merkle reconstruction, and per-tx parse — any wire-
  level corruption a DoubleHash would have caught fails one of those.
  Per Siggi review on bsv-blockchain#885.
- Strengthen registration test: assert wire.ReadMessageWithEncodingN
  takes the streaming path (raw == nil, block decodes correctly) post-
  registration, and survives repeat Register calls. Replaces the
  prior no-op idempotent test. Per Copilot review on bsv-blockchain#885.
@oskarszoon oskarszoon requested a review from Copilot May 18, 2026 10:00
@github-actions

github-actions Bot commented May 18, 2026

Copy link
Copy Markdown
Contributor

Benchmark Comparison Report

Baseline: main (unknown)

Current: PR-885 (fe96372)

Summary

  • Regressions: 0
  • Improvements: 0
  • Unchanged: 142
  • Significance level: p < 0.05
All benchmark results (sec/op)
Benchmark Baseline Current Change p-value
_NewBlockFromBytes-4 1.667µ 1.681µ ~ 0.200
SplitSyncedParentMap_SetIfNotExists/256_buckets-4 61.43n 61.52n ~ 1.000
SplitSyncedParentMap_SetIfNotExists/16_buckets-4 61.44n 61.58n ~ 0.300
SplitSyncedParentMap_SetIfNotExists/1_bucket-4 61.57n 61.44n ~ 0.700
SplitSyncedParentMap_ConcurrentSetIfNotExists/256_buckets... 29.34n 29.61n ~ 0.700
SplitSyncedParentMap_ConcurrentSetIfNotExists/16_buckets_... 50.23n 50.93n ~ 0.400
SplitSyncedParentMap_ConcurrentSetIfNotExists/1_bucket_pa... 107.0n 118.3n ~ 0.100
MiningCandidate_Stringify_Short-4 264.7n 265.7n ~ 0.200
MiningCandidate_Stringify_Long-4 1.923µ 1.964µ ~ 0.100
MiningSolution_Stringify-4 976.5n 985.4n ~ 0.100
BlockInfo_MarshalJSON-4 1.769µ 1.805µ ~ 0.200
NewFromBytes-4 127.7n 127.2n ~ 0.700
Mine_EasyDifficulty-4 60.72µ 61.06µ ~ 0.100
Mine_WithAddress-4 6.780µ 6.900µ ~ 0.700
BlockAssembler_AddTx-4 0.02934n 0.03508n ~ 0.400
AddNode-4 10.77 10.95 ~ 0.400
AddNodeWithMap-4 10.52 11.01 ~ 0.100
DirectSubtreeAdd/4_per_subtree-4 57.70n 61.55n ~ 0.400
DirectSubtreeAdd/64_per_subtree-4 28.72n 28.67n ~ 1.000
DirectSubtreeAdd/256_per_subtree-4 27.52n 27.48n ~ 0.400
DirectSubtreeAdd/1024_per_subtree-4 26.39n 26.34n ~ 0.400
DirectSubtreeAdd/2048_per_subtree-4 26.13n 26.03n ~ 0.700
SubtreeProcessorAdd/4_per_subtree-4 290.5n 283.2n ~ 0.400
SubtreeProcessorAdd/64_per_subtree-4 278.9n 277.5n ~ 0.700
SubtreeProcessorAdd/256_per_subtree-4 280.5n 281.0n ~ 1.000
SubtreeProcessorAdd/1024_per_subtree-4 268.5n 273.2n ~ 0.100
SubtreeProcessorAdd/2048_per_subtree-4 270.0n 272.1n ~ 0.400
SubtreeProcessorRotate/4_per_subtree-4 278.0n 279.6n ~ 0.700
SubtreeProcessorRotate/64_per_subtree-4 275.3n 276.8n ~ 0.700
SubtreeProcessorRotate/256_per_subtree-4 274.2n 275.3n ~ 0.400
SubtreeProcessorRotate/1024_per_subtree-4 274.2n 274.9n ~ 0.700
SubtreeNodeAddOnly/4_per_subtree-4 54.72n 54.62n ~ 0.400
SubtreeNodeAddOnly/64_per_subtree-4 34.30n 34.54n ~ 0.200
SubtreeNodeAddOnly/256_per_subtree-4 33.45n 33.46n ~ 0.800
SubtreeNodeAddOnly/1024_per_subtree-4 32.80n 32.70n ~ 0.700
SubtreeCreationOnly/4_per_subtree-4 115.6n 114.9n ~ 0.400
SubtreeCreationOnly/64_per_subtree-4 402.6n 400.8n ~ 0.400
SubtreeCreationOnly/256_per_subtree-4 1.357µ 1.319µ ~ 0.100
SubtreeCreationOnly/1024_per_subtree-4 4.368µ 4.431µ ~ 0.100
SubtreeCreationOnly/2048_per_subtree-4 8.141µ 8.182µ ~ 0.100
SubtreeProcessorOverheadBreakdown/64_per_subtree-4 277.9n 275.1n ~ 0.300
SubtreeProcessorOverheadBreakdown/1024_per_subtree-4 278.3n 273.7n ~ 0.100
ParallelGetAndSetIfNotExists/1k_nodes-4 810.4µ 598.6µ ~ 0.100
ParallelGetAndSetIfNotExists/10k_nodes-4 1.583m 1.366m ~ 0.100
ParallelGetAndSetIfNotExists/50k_nodes-4 6.935m 6.905m ~ 1.000
ParallelGetAndSetIfNotExists/100k_nodes-4 13.73m 13.68m ~ 0.400
SequentialGetAndSetIfNotExists/1k_nodes-4 673.0µ 657.2µ ~ 0.200
SequentialGetAndSetIfNotExists/10k_nodes-4 2.834m 2.840m ~ 0.700
SequentialGetAndSetIfNotExists/50k_nodes-4 10.94m 10.93m ~ 1.000
SequentialGetAndSetIfNotExists/100k_nodes-4 20.63m 20.83m ~ 0.400
ProcessOwnBlockSubtreeNodesParallel/1k_nodes-4 863.3µ 640.6µ ~ 0.100
ProcessOwnBlockSubtreeNodesParallel/10k_nodes-4 4.224m 4.222m ~ 0.700
ProcessOwnBlockSubtreeNodesParallel/100k_nodes-4 16.94m 17.09m ~ 0.700
ProcessOwnBlockSubtreeNodesSequential/1k_nodes-4 700.7µ 695.5µ ~ 0.100
ProcessOwnBlockSubtreeNodesSequential/10k_nodes-4 5.993m 5.881m ~ 0.100
ProcessOwnBlockSubtreeNodesSequential/100k_nodes-4 38.71m 38.49m ~ 1.000
DiskTxMap_SetIfNotExists-4 3.567µ 3.721µ ~ 0.100
DiskTxMap_SetIfNotExists_Parallel-4 3.464µ 3.452µ ~ 1.000
DiskTxMap_ExistenceOnly-4 307.8n 325.9n ~ 0.100
Queue-4 186.0n 188.1n ~ 0.100
AtomicPointer-4 5.152n 4.984n ~ 0.400
ReorgOptimizations/DedupFilterPipeline/Old/10K-4 840.9µ 858.5µ ~ 0.100
ReorgOptimizations/DedupFilterPipeline/New/10K-4 794.4µ 795.6µ ~ 0.700
ReorgOptimizations/AllMarkFalse/Old/10K-4 102.0µ 105.6µ ~ 0.100
ReorgOptimizations/AllMarkFalse/New/10K-4 61.96µ 62.29µ ~ 0.100
ReorgOptimizations/HashSlicePool/Old/10K-4 52.57µ 54.44µ ~ 0.200
ReorgOptimizations/HashSlicePool/New/10K-4 11.93µ 12.14µ ~ 0.700
ReorgOptimizations/NodeFlags/Old/10K-4 4.516µ 5.167µ ~ 0.100
ReorgOptimizations/NodeFlags/New/10K-4 1.528µ 1.642µ ~ 0.700
ReorgOptimizations/DedupFilterPipeline/Old/100K-4 9.792m 9.338m ~ 0.700
ReorgOptimizations/DedupFilterPipeline/New/100K-4 10.221m 9.758m ~ 0.400
ReorgOptimizations/AllMarkFalse/Old/100K-4 1.093m 1.070m ~ 0.100
ReorgOptimizations/AllMarkFalse/New/100K-4 685.9µ 682.6µ ~ 0.700
ReorgOptimizations/HashSlicePool/Old/100K-4 524.9µ 550.6µ ~ 0.100
ReorgOptimizations/HashSlicePool/New/100K-4 313.8µ 309.7µ ~ 0.700
ReorgOptimizations/NodeFlags/Old/100K-4 50.10µ 49.78µ ~ 0.400
ReorgOptimizations/NodeFlags/New/100K-4 16.93µ 15.75µ ~ 0.700
TxMapSetIfNotExists-4 51.10n 51.65n ~ 0.200
TxMapSetIfNotExistsDuplicate-4 37.85n 38.07n ~ 0.200
ChannelSendReceive-4 588.6n 567.1n ~ 0.100
CalcBlockWork-4 496.3n 499.2n ~ 0.700
CalculateWork-4 674.9n 674.7n ~ 0.700
BuildBlockLocatorString_Helpers/Size_10-4 1.348µ 1.341µ ~ 0.100
BuildBlockLocatorString_Helpers/Size_100-4 12.92µ 12.97µ ~ 1.000
BuildBlockLocatorString_Helpers/Size_1000-4 156.0µ 154.1µ ~ 1.000
CatchupWithHeaderCache-4 104.3m 104.4m ~ 1.000
_prepareTxsPerLevel-4 424.2m 426.6m ~ 0.400
_prepareTxsPerLevelOrdered-4 3.921m 3.909m ~ 0.400
_prepareTxsPerLevel_Comparison/Original-4 424.2m 432.7m ~ 0.100
_prepareTxsPerLevel_Comparison/Optimized-4 3.865m 3.883m ~ 1.000
SubtreeSizes/10k_tx_4_per_subtree-4 1.349m 1.374m ~ 0.700
SubtreeSizes/10k_tx_16_per_subtree-4 319.8µ 325.1µ ~ 0.700
SubtreeSizes/10k_tx_64_per_subtree-4 76.71µ 79.43µ ~ 0.100
SubtreeSizes/10k_tx_256_per_subtree-4 19.21µ 19.15µ ~ 1.000
SubtreeSizes/10k_tx_512_per_subtree-4 9.602µ 9.610µ ~ 0.700
SubtreeSizes/10k_tx_1024_per_subtree-4 4.853µ 4.720µ ~ 0.100
SubtreeSizes/10k_tx_2k_per_subtree-4 2.379µ 2.363µ ~ 0.100
BlockSizeScaling/10k_tx_64_per_subtree-4 75.06µ 75.18µ ~ 1.000
BlockSizeScaling/10k_tx_256_per_subtree-4 19.44µ 19.09µ ~ 0.100
BlockSizeScaling/10k_tx_1024_per_subtree-4 4.738µ 4.702µ ~ 0.700
BlockSizeScaling/50k_tx_64_per_subtree-4 392.7µ 396.4µ ~ 0.400
BlockSizeScaling/50k_tx_256_per_subtree-4 94.28µ 95.38µ ~ 0.100
BlockSizeScaling/50k_tx_1024_per_subtree-4 22.93µ 23.65µ ~ 0.100
SubtreeAllocations/small_subtrees_exists_check-4 161.3µ 170.7µ ~ 0.200
SubtreeAllocations/small_subtrees_data_fetch-4 162.8µ 174.1µ ~ 0.100
SubtreeAllocations/small_subtrees_full_validation-4 325.9µ 339.2µ ~ 0.100
SubtreeAllocations/medium_subtrees_exists_check-4 9.304µ 9.728µ ~ 0.100
SubtreeAllocations/medium_subtrees_data_fetch-4 9.715µ 10.031µ ~ 0.200
SubtreeAllocations/medium_subtrees_full_validation-4 18.75µ 19.44µ ~ 0.100
SubtreeAllocations/large_subtrees_exists_check-4 2.277µ 2.289µ ~ 0.400
SubtreeAllocations/large_subtrees_data_fetch-4 2.332µ 2.463µ ~ 0.100
SubtreeAllocations/large_subtrees_full_validation-4 4.855µ 4.985µ ~ 0.100
_BufferPoolAllocation/16KB-4 3.750µ 3.675µ ~ 0.100
_BufferPoolAllocation/32KB-4 9.207µ 8.678µ ~ 0.700
_BufferPoolAllocation/64KB-4 15.90µ 19.57µ ~ 0.100
_BufferPoolAllocation/128KB-4 23.13µ 34.61µ ~ 0.100
_BufferPoolAllocation/512KB-4 113.3µ 104.5µ ~ 0.700
_BufferPoolConcurrent/32KB-4 17.67µ 18.29µ ~ 0.700
_BufferPoolConcurrent/64KB-4 27.03µ 29.51µ ~ 0.100
_BufferPoolConcurrent/512KB-4 141.8µ 143.2µ ~ 0.700
_SubtreeDeserializationWithBufferSizes/16KB-4 651.1µ 617.2µ ~ 0.100
_SubtreeDeserializationWithBufferSizes/32KB-4 647.4µ 610.0µ ~ 0.100
_SubtreeDeserializationWithBufferSizes/64KB-4 655.6µ 611.9µ ~ 0.200
_SubtreeDeserializationWithBufferSizes/128KB-4 645.4µ 610.4µ ~ 0.100
_SubtreeDeserializationWithBufferSizes/512KB-4 638.0µ 632.0µ ~ 0.200
_SubtreeDataDeserializationWithBufferSizes/16KB-4 35.61m 35.91m ~ 0.200
_SubtreeDataDeserializationWithBufferSizes/32KB-4 35.72m 35.84m ~ 0.700
_SubtreeDataDeserializationWithBufferSizes/64KB-4 35.67m 35.71m ~ 1.000
_SubtreeDataDeserializationWithBufferSizes/128KB-4 35.58m 35.76m ~ 0.200
_SubtreeDataDeserializationWithBufferSizes/512KB-4 35.74m 35.31m ~ 0.200
_PooledVsNonPooled/Pooled-4 829.5n 835.6n ~ 0.100
_PooledVsNonPooled/NonPooled-4 7.007µ 8.410µ ~ 0.100
_MemoryFootprint/Current_512KB_32concurrent-4 6.415µ 6.525µ ~ 0.100
_MemoryFootprint/Proposed_32KB_32concurrent-4 9.124µ 10.581µ ~ 0.100
_MemoryFootprint/Alternative_64KB_32concurrent-4 8.697µ 10.943µ ~ 0.100
StoreBlock_Sequential/BelowCSVHeight-4 319.2µ 330.9µ ~ 0.200
StoreBlock_Sequential/AboveCSVHeight-4 332.5µ 331.8µ ~ 1.000
GetUtxoHashes-4 255.2n 259.7n ~ 1.000
GetUtxoHashes_ManyOutputs-4 43.45µ 43.48µ ~ 1.000
_NewMetaDataFromBytes-4 236.6n 242.0n ~ 0.100
_Bytes-4 632.2n 627.0n ~ 0.100
_MetaBytes-4 567.2n 566.0n ~ 0.200

Threshold: >10% with p < 0.05 | Generated: 2026-05-18 10:30 UTC

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

Comment thread services/legacy/peer/wire_streaming.go
Comment thread services/legacy/peer/wire_streaming.go Outdated
- Detect undersized payload: io.Copy on a LimitedReader returns nil
  if the underlying reader EOFs before N reaches 0, so a successful
  Bsvdecode followed by a truncated stream would silently report
  success and desync the next ReadMessage. Check lr.N after drain
  and return an error when the declared payload was not fully
  consumed. Per Copilot review on bsv-blockchain#885.
- Add TestStreamingBlockHandler_ShortStreamReturnsError covering it.
- Drop incorrect TCP CRC claim from the rationale comment (TCP uses
  a 16-bit additive checksum, not a CRC). Reframe what is actually
  given up: the early-rejection signal, not integrity itself. Per
  Copilot review on bsv-blockchain#885.
@sonarqubecloud

Copy link
Copy Markdown

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants