fix(blockchain): assign block ids idempotently per hash to prevent phantom-id sync wedge#1043
Conversation
…up quick-validate
…lock-id reuse assumption
…warn on ambiguous block-id reuse Addresses consensus-review findings: reservation entries now expire a fixed time after creation regardless of lookups (M1), and reuseBlockIDFromUTXO logs when a tx carries multiple mined-in ids so an invariant break surfaces (L2).
…on_main_chain Closes the toggled/untoggled consensus split (Option C): the in-memory CheckBlockIsInCurrentChain off-chain-set filter is a fast NEGATIVE only; a non-existent (phantom) id <= maxBlockID survived it and was wrongly treated as on-chain, disagreeing with the authoritative SQL route. Now survivors are confirmed against on_main_chain (indexed fast path), so a dangling id is rejected identically on toggled and untoggled nodes.
…t (Option C residual) The store-level CheckBlockIsInCurrentChain is fixed to confirm survivors against on_main_chain. The toggle-on off-chain-prefetch fast path still treats an absent-from-off-chain id as on-chain, so it retains the latent split for a dangling id. Making it sound removes its positive short-circuit (obsoleting a profiled optimization) — left as a documented team decision. Phantom creation is already prevented by the idempotent AssignBlockID fix, so this is latent.
|
🤖 Claude Code Review Status: Complete Current Review: No issues found. This PR implements a robust solution to the phantom block ID problem with comprehensive coverage: Core Changes:
Test Coverage:
Integration:
Quality:
|
Benchmark Comparison ReportBaseline: Current: Summary
All benchmark results (sec/op)
Threshold: >10% with p < 0.05 | Generated: 2026-06-08 14:44 UTC |
…over DB-error path Follow the documented snake_case file convention (codingConventions.md:118); old CamelCase siblings are tolerated legacy. Add a DB-error test covering AssignBlockID's storage-error return (AssignBlockID 76.5%->82.4%, blockIDByHash->100%).
…cal client) Server.AssignBlockID 0%->88.9% (success/idempotent/invalid-hash), Client.AssignBlockID 0%->100%, LocalClient.AssignBlockID 0%->100% — the new-code coverage gap flagged by Sonar on the gRPC surface.
ordishs
left a comment
There was a problem hiding this comment.
Approve with follow-ups.
The central fix is correct and verified locally: go build ./... clean, and go test -race -tags testtxmetacache green across stores/blockchain/..., services/blockchain/..., services/legacy/netsync/..., and services/blockvalidation/.... The core invariant holds — StoreBlock commits under WithID(block.ID) with no second nextval, and block.ID is sourced from AssignBlockID/UTXO-recovery, so the UTXO mined-info id and the committed row id are provably the same. The mutex correctly serializes check-then-reserve with a post-lock committed re-check, and the reservation is dropped only after commit. New tests are meaningful (two-path race → single committed id, concurrent convergence, cleared-on-commit, DB-error, phantom-below-maxID rejection).
Non-blocking follow-ups:
- (HIGH, latent) off-chain-prefetch fastPath split.
checkOldBlockIDsOffChainPrefetchstill treatsid <= maxBlockID && id not-in off-chain-setas on-chain (the documented "Option C residual"). An id-sequence gap (reserved-but-never-committed → TTL expiry/crash) is reachable independently of the double-mint path this PR closes; if a tx'sBlockIDsever references a gap id, aUseInMemoryChainCheck=onnode would accept what an=offnode rejects. Phantom creation is closed, but this residual is the one place the fix is left incomplete. - (MEDIUM) reservation durability. Convergence holds only while the reservation is live or committed. If single-block processing exceeds the 10-min TTL, a racing caller can take the
nextvalbranch before the first attempt writes UTXO mined-info (assign precedes createUtxos), reintroducing the split — rarer, not eliminated. Size the TTL above worst-case single-block processing, or land theblock_id_reservationstable already noted as a follow-up. - (MEDIUM) transient-error retry loop. Reclassifying
ErrTxMissingParent/ErrTxNotFoundasBlockIncompleteis the right call (never persists invalid), but a genuinely-invalid block spending a never-existed output now retries indefinitely rather than fast-rejecting. Consider a bounded retry/backoff so a crafted block can't pin a catchup worker. - (LOW) uint64→uint32 truncation at the
block.IDcall sites has nomath.MaxUint32guard; cheap insurance against a silent wrap breaking idempotency. - (LOW)
reuseBlockIDFromUTXOtrustsBlockIDs[0]on multi-id (reorg/re-mine) txs — guarded and logged, but the one spot the one-id-per-hash invariant is taken on faith.
Process note: the PR description says "correctness fix only ... no other changes," but the diff also includes the #1031 transient-error reclassification, a sync-peer updateLastBlockTime() refresh, and a sizable checkOldBlockIDs refactor. All look correct and related — worth aligning the description so the scope is accurate.
Replace the 6 unchecked uint32(id) // nolint:gosec narrowings of the AssignBlockID result with safeconversion.Uint64ToUint32 (via a small blockIDToUint32 helper in quick_validate.go), erroring on overflow rather than silently wrapping — a wrap would alias another block's id and break the one-id-per-hash idempotency. Addresses review note from bsv-blockchain#1043.
|
Addressed @ordishs's review ( In this PR:
Filed as tracked follow-ups (non-blocking, per the review):
The |
…tom-block-id # Conflicts: # stores/blockchain/sql/sql.go
|
…-legacysyncing Conflict in the generated services/blockchain/blockchain_api/blockchain_api.pb.go (upstream bsv-blockchain#1043 added the AssignBlockID RPC; this branch removed LegacySync). Resolved by regenerating from the cleanly-merged .proto via `make gen` — both AssignBlockID and the LegacySync/LEGACYSYNCING removal (reserved 3) are present.


Summary
AssignBlockIDgRPC call, instead of independently callingGetNextBlockIDand minting two ids for the same block.CheckBlockIsInCurrentChain's in-memory fast path now treats the off-chain set as a negative filter only and confirms survivors against the authoritativeon_main_chainflag, so a phantom / id-gap id can't be accepted as on-chain by auseInMemoryChainCheck=onnode when an always-SQL node would reject it.Why
A fresh testnet node wedged during IBD around height 1038100: a block was marked invalid by
checkOldBlockIDsbecause a transaction spent an output whose UTXO mined-in block id had no row in theblockstable — a phantom id. Once a block is invalid it is cached, so the tip can never advance.Root cause: below the highest checkpoint, two ingestion paths process the same block concurrently and each allocated an independent id from the Postgres sequence — legacy netsync wrote the UTXO mined-info under one id before commit, blockvalidation catchup committed the block row under a different id. The UTXO mined-info id and the committed block-row id diverged; the orphaned id is the phantom that later fails the on-chain check and wedges the tip.
AssignBlockID(ctx, hash)is the single idempotent authority both paths now use (committed-id lookup → in-memory reservation → nextval). Same hash always returns the same id; concurrent callers converge. The UTXO mined-info id and the committed row can no longer diverge. The chain-check change makes the toggled-on in-memory path agree with the authoritative SQL route on dangling ids.Scope
The core change is the phantom-block-id fix (idempotent
AssignBlockID+ the Option CCheckBlockIsInCurrentChainconsistency fix). The diff also carries a few changes that travelled with this work — calling them out for accurate scope (all on the block-ingestion / catchup path):ErrTxMissingParent/ErrTxNotFound→BlockIncomplete, retried via alternative peers);updateLastBlockTime()refresh at block receipt, so a long validation isn't mistaken for a stall and the sync peer isn't rotated mid-processing;checkOldBlockIDsrefactor (introduces the off-chain-prefetch split, with the documented Option C residual);uint64→uint32narrowing of assigned block ids (replaces thenolint:goseccasts withsafeconversion.Uint64ToUint32).No peer-registry / FSM-state / catchup-orchestration changes — the sync-lease (legacy-priority coordination) was deliberately dropped.
Not in this PR
checkOldBlockIDsOffChainPrefetchkeeps the same latent survivor-accept residual (documented inline); fixing it obsoletes the prefetch optimization, a separate perf/design call. Phantom creation is already prevented here. Tracked as blockvalidation: off-chain-prefetch fast path can accept an id-gap block id the SQL route rejects #1055.Follow-ups (tracked, non-blocking)
replicas: 1assumption; ablock_id_reservationstable covers both).GetNextBlockID— no ingestion caller remains after this PR.Test plan
go build ./...clean;go test -race -tags testtxmetacachegreen acrossstores/blockchain/...,services/blockchain/...,services/legacy/netsync/...,services/blockvalidation/...AssignBlockID_test.go)CheckBlockIsInCurrentChain: in-memory vs SQL agree on a phantom / id-gap id (CheckBlockIsInCurrentChain_test.go)