blocksync: define the exact conditions for a node to attempt block syncing #3415

@cason

Description

Block sync is a protocol that allows a node joining an existing network to catch up faster to the head (last block) of the blockchain. The same result can be achieved, at a substantially slower speed, by joining the consensus protocol and receiving from peers the blocks decided at each height of consensus, each accompanied by a set of Precommit votes forming a commit for that block.

The conditions for a node to leave the Block Sync protocol and join the consensus protocol are encoded in the IsCaughtUp() method. They are somewhat complex, but the rationale is to compare the local height, i.e., the height of the latest block in the local copy of the blockchain, with the latest (highest) height reported by the node's peers as part of the Block Sync protocol. If the highest height reported by peers is not more than 1 unit higher than the local height, the node is ready to join the consensus protocol.
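To make the rationale concrete, here is a minimal Go sketch of the height comparison alone. The function name `caughtUp` and its parameters are hypothetical, chosen for illustration; this is not the actual IsCaughtUp() method, which considers additional conditions.

```go
package main

import "fmt"

// caughtUp is a hypothetical helper, NOT the real IsCaughtUp() method.
// It reports whether the highest height reported by peers is at most
// one unit above the local height.
func caughtUp(localHeight, maxPeerHeight int64) bool {
	return maxPeerHeight <= localHeight+1
}

func main() {
	fmt.Println(caughtUp(100, 100)) // peers at the same height
	fmt.Println(caughtUp(100, 101)) // peers one block ahead
	fmt.Println(caughtUp(100, 102)) // peers two blocks ahead
}
```

With these inputs the first two calls report that the node is ready to join consensus, while the third reports that it should keep syncing.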

There are some corner cases to consider, though. Essentially, the node should not block forever on Block Sync if it fails to retrieve blocks from its peers (after some reasonable amount of time). However, the current code requires the node to have peers that respond to its requests in a timely manner, which gives rise to other corner cases to be considered.

Of particular interest is the case in which the node running Block Sync is an active validator that holds a substantial portion of the total validator voting power. In this scenario, most likely to be observed in testnets, the node might not need to receive any block from peers (i.e., to catch up) because the network cannot commit blocks in the absence of the node and its voting power as a validator. This particular scenario has been reported by users and is addressed by #3406.
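One way to express this scenario as a check is sketched below. It assumes that committing a block requires strictly more than 2/3 of the total voting power, so the network stalls whenever the remaining validators hold 2/3 or less. The function name `networkNeedsNode` is hypothetical, and this is not necessarily the approach taken in #3406.

```go
package main

import "fmt"

// networkNeedsNode is a hypothetical helper for illustration only.
// It reports whether the other validators, without the local node,
// fall short of the more-than-2/3 voting power assumed to be required
// to commit a block. Values are assumed small enough that the
// multiplications below do not overflow int64.
func networkNeedsNode(localPower, totalPower int64) bool {
	if localPower <= 0 || totalPower <= 0 || localPower > totalPower {
		return false // not an active validator, or inconsistent input
	}
	remaining := totalPower - localPower
	// The others can commit alone only if 3*remaining > 2*totalPower.
	return 3*remaining <= 2*totalPower
}

func main() {
	fmt.Println(networkNeedsNode(40, 100)) // others hold 60/100
	fmt.Println(networkNeedsNode(30, 100)) // others hold 70/100
	fmt.Println(networkNeedsNode(0, 100))  // node is not a validator
}
```

Under these assumptions, a validator holding 40 of 100 units of voting power cannot be bypassed by the rest of the network, whereas one holding 30 of 100 can.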

The goal of this issue is to better formalize the conditions under which a node should keep attempting to retrieve blocks from its peers via the Block Sync protocol, as we are likely to find further corner cases not covered by the current implementation.

Notice that this issue is particularly relevant for networks from release v0.38.x onwards, where block sync has become mandatory (see tendermint/tendermint#8433 for additional context). Up to release v0.37.x, node operators could just disable Block Sync in the configuration file if they realized that a node was getting stuck in this phase of the node's bootstrap/recovery procedure.

Definition of Done

  • Define, validate and formalize the conditions for a node to stay in the Block Sync protocol
    • Stretch goal: write a specification and update/extend the documentation for this protocol
  • Adapt the implementation (internal/blocksync package) and the node setup (node package) accordingly
  • Produce test components (test units, e2e testbeds) to validate the behavior under the identified corner cases
