db/state: merge schedule non-determinism across nodes — impacts decentralized snapshot distribution #20531

@mh0lt

Problem

The aggregator merge schedule is not fully deterministic across nodes. Two nodes at the same step can have different file layouts due to:

  1. Preverified legacy accommodation: getMergeLimit() in snap_repo.go adjusts merge limits based on preverified files, which differ across binary releases
  2. Cross-domain coordination: the commitment domain must wait for accounts/storage to merge first, and interrupted merges leave divergent intermediate states
  3. Two merge code paths: aggregator.go (old, active) vs forkable_agg.go/snap_repo.go (new) with different merge logic

Why this matters

Decentralized snapshot distribution (#19660) requires nodes to compare what files they have. If the merge schedule is deterministic, two honest nodes at the same step produce identical files with identical torrent hashes — enabling direct hash comparison and threshold consensus (#19658).

If the schedule is non-deterministic, nodes must compare structural range coverage instead of hashes, which is significantly more complex and weakens the trust model.

Current merge schedule

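The underlying bit-manipulation schedule is inherently deterministic. The expression endStep & -endStep isolates the lowest set bit of endStep, which in the examples below is the size of the file ending at that step: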

Step  8: [0-8)
Step 12: [0-8) [8-12)
Step 16: [0-16)
Step 24: [0-16) [16-24)
Step 32: [0-32)

But it's perturbed by:

  • maxSpan cap from stepsInFrozenFile (configurable via erigondb.toml)
  • getMergeLimit() returning larger sizes when preverified files exist
  • Commitment domain holding merges until accounts/storage catch up
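
Ignoring the perturbations above, the unperturbed schedule can be sketched as a binary decomposition of the step number. This is only an illustrative model derived from the listing, not the aggregator's actual code: StepRange and canonicalLayout are hypothetical names, and the maxSpan cap, getMergeLimit() and cross-domain waiting are deliberately left out.

```go
package main

// StepRange is a half-open step interval [From, To).
// Hypothetical type for illustration; not an Erigon type.
type StepRange struct {
	From, To uint64
}

// canonicalLayout returns the file ranges implied by the unperturbed
// schedule at endStep. It repeatedly peels off the lowest set bit
// (end & -end), which is the size of the file that ends at `end`.
// Assumption: no maxSpan cap and no preverified-file adjustments.
func canonicalLayout(endStep uint64) []StepRange {
	var out []StepRange
	for end := endStep; end > 0; {
		span := end & -end // lowest set bit of end
		out = append(out, StepRange{From: end - span, To: end})
		end -= span
	}
	// Ranges were collected from the newest file back; reverse to ascending order.
	for i, j := 0, len(out)-1; i < j; i, j = i+1, j-1 {
		out[i], out[j] = out[j], out[i]
	}
	return out
}
```

Under this model, canonicalLayout(12) yields [0-8) [8-12) and canonicalLayout(24) yields [0-16) [16-24), matching the listing.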

Example of divergence

Node A (clean run):

v1.0-accounts.0-4096.kv    (deep merge)
v1.0-accounts.4096-4224.kv (128 steps)
v1.0-accounts.4224-4232.kv (8 steps)

Node B (interrupted merge, restarted):

v1.0-accounts.0-2048.kv    (partial deep merge)
v1.0-accounts.2048-4096.kv
v1.0-accounts.4096-4224.kv
v1.0-accounts.4224-4232.kv

Both layouts cover steps 0-4232, but the file boundaries differ, so the torrent hashes differ too.

Proposed resolution

Make the merge schedule purely deterministic: given (step, MergeStages config), the expected file layout is computable. On startup, identify missing merges and execute them to converge to the canonical layout.
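
The startup convergence step could look roughly like the sketch below: compute the canonical layout for the current step, then find every canonical range that exists on disk only as several smaller files. All names here (StepRange, MergeTask, canonicalLayout, planMerges) are hypothetical, and the canonical layout is assumed to be the plain lowest-set-bit decomposition with no MergeStages config applied.

```go
package main

import (
	"fmt"
	"sort"
)

// StepRange is a half-open step interval [From, To). Hypothetical type.
type StepRange struct{ From, To uint64 }

// MergeTask says: merge Inputs into one file covering Target.
type MergeTask struct {
	Target StepRange
	Inputs []StepRange
}

// canonicalLayout models the unperturbed schedule (assumption: no caps).
func canonicalLayout(endStep uint64) []StepRange {
	var out []StepRange
	for end := endStep; end > 0; {
		span := end & -end // lowest set bit = size of the file ending at end
		out = append([]StepRange{{end - span, end}}, out...)
		end -= span
	}
	return out
}

// planMerges lists the merges needed to turn `existing` into the canonical
// layout at endStep. It fails if the files have a gap or if a file crosses
// a canonical boundary (that case would need a split, not a merge).
func planMerges(existing []StepRange, endStep uint64) ([]MergeTask, error) {
	sorted := append([]StepRange(nil), existing...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i].From < sorted[j].From })

	var tasks []MergeTask
	i := 0
	for _, target := range canonicalLayout(endStep) {
		var inputs []StepRange
		next := target.From
		// Collect consecutive files that fit entirely inside the target.
		for i < len(sorted) && sorted[i].To <= target.To {
			if sorted[i].From != next {
				return nil, fmt.Errorf("gap or overlap at step %d", next)
			}
			inputs = append(inputs, sorted[i])
			next = sorted[i].To
			i++
		}
		if next != target.To {
			return nil, fmt.Errorf("cannot cover [%d-%d) by merging", target.From, target.To)
		}
		if len(inputs) > 1 { // a single input already is the canonical file
			tasks = append(tasks, MergeTask{Target: target, Inputs: inputs})
		}
	}
	if i != len(sorted) {
		return nil, fmt.Errorf("files extend beyond step %d", endStep)
	}
	return tasks, nil
}
```

For the Node B layout in the divergence example, this sketch returns a single task that merges 0-2048 and 2048-4096 into 0-4096; for Node A it returns no tasks.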

This would mean:

  • All nodes at the same step produce identical files
  • chain.toml can use hash comparison (simple, torrent-compatible)
  • Threshold consensus works (same hash = agreement)
  • UCAN can sign (chain, step) tuples — layout is implied

Alternative

Accept non-determinism and use structural range-coverage comparison in chain.toml. Files are compared by step range, not by hash. This is more complex and gives a weaker trust model.
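
To illustrate the difference between the two comparison modes, here is a minimal sketch of a coverage predicate next to a strict layout-equality check. The names are hypothetical and nothing here reflects the chain.toml format, which this issue leaves undecided. Equal ranges are only a stand-in for equal hashes: identical boundaries are necessary for identical torrent hashes, not sufficient.

```go
package main

import "sort"

// StepRange is a half-open step interval [From, To). Hypothetical type.
type StepRange struct{ From, To uint64 }

// coveredUpTo returns the step E such that the files cover [0, E) with no
// gaps or overlaps, and false if the layout is not contiguous from step 0.
func coveredUpTo(files []StepRange) (uint64, bool) {
	sorted := append([]StepRange(nil), files...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i].From < sorted[j].From })
	var end uint64
	for _, f := range sorted {
		if f.From != end || f.To <= f.From {
			return end, false
		}
		end = f.To
	}
	return end, true
}

// sameCoverage is the structural comparison: both layouts are contiguous
// and end at the same step, regardless of where the file boundaries fall.
func sameCoverage(a, b []StepRange) bool {
	ea, okA := coveredUpTo(a)
	eb, okB := coveredUpTo(b)
	return okA && okB && ea == eb
}

// sameLayout is the strict comparison that hash equality would imply:
// the same ranges in the same order.
func sameLayout(a, b []StepRange) bool {
	if len(a) != len(b) {
		return false
	}
	for i := range a {
		if a[i] != b[i] {
			return false
		}
	}
	return true
}
```

With the Node A and Node B layouts from the divergence example, sameCoverage reports true while sameLayout reports false, which is the gap a structural comparison would have to bridge.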

Blocking

This decision blocks the chain.toml V2 format design (#19660) — the format differs depending on whether we can assume deterministic layouts.
