Description
Block sync is a protocol that allows a node joining an existing network to catch up faster to the head (last block) of the blockchain. The same result can be achieved, at a substantially slower speed, by joining the consensus protocol and receiving from peers the block decided at each height of consensus, accompanied by a set of Precommit votes forming a commit for that block.
The conditions for a node to leave the Block Sync protocol and join the consensus protocol are encoded in the IsCaughtUp() method. They are somewhat complex, but the rationale is to compare the local height, that is, the height of the latest block in the local copy of the blockchain, with the latest (highest) height reported by the node's peers as part of the Block Sync protocol. If the highest height reported by peers is at most 1 unit higher than the local height, the node is ready to join the consensus protocol.
There are some corner cases to consider, though. Essentially, the node should not block forever on Block Sync if it fails to retrieve blocks from its peers after some reasonable amount of time. But the code at the moment requires the node to have peers that respond to its requests in a timely manner, which gives rise to other corner cases to be considered.
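The height comparison described above can be illustrated with a minimal sketch. It is not the actual IsCaughtUp() implementation: the type syncStatus and the function caughtUp are hypothetical names introduced here, and the sketch deliberately leaves out the peer-availability and timing conditions that make the real method more complex.

```go
package main

import "fmt"

// syncStatus is a hypothetical snapshot of the heights a node knows about
// while running Block Sync. It is an illustration, not a real API type.
type syncStatus struct {
	localHeight   int64 // height of the latest block in the local blockchain copy
	maxPeerHeight int64 // highest height reported by any peer
}

// caughtUp applies only the height rule: the node is considered caught up
// when the highest height reported by peers is at most 1 above the local height.
func caughtUp(s syncStatus) bool {
	return s.maxPeerHeight <= s.localHeight+1
}

func main() {
	// Peers are one block ahead: ready to join consensus.
	fmt.Println(caughtUp(syncStatus{localHeight: 99, maxPeerHeight: 100}))
	// Peers are two blocks ahead: keep syncing.
	fmt.Println(caughtUp(syncStatus{localHeight: 98, maxPeerHeight: 100}))
}
```

Taken alone, this rule says nothing about what happens when no peer reports a height or when peers stop responding, which is where the corner cases discussed in this issue come from.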
Of particular interest is the case in which the node running Block Sync is an active validator and it holds a substantial portion of the total validator voting power. In this scenario, most likely to be observed in testnets, the node might not need to receive any block from peers (i.e., to catch up) because the network cannot commit blocks without the node and its voting power as a validator. This particular scenario has been reported by users and is addressed by #3406.
The goal of this PR is to better formalize the conditions under which a node should keep attempting to retrieve blocks from its peers via the Block Sync protocol, as we are likely to find further corner cases not covered by the current implementation.
Notice that this issue is particularly relevant for networks from release v0.38.x, where block sync has become mandatory (see tendermint/tendermint#8433 for additional context). Up to release v0.37.x, node operators could just disable Block Sync in the configuration file if they realized that a node was getting stuck in this phase of the node's bootstrap/recovery procedure.
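To make the voting power argument concrete, here is a small sketch of the arithmetic, assuming the usual rule that a block is committed only with Precommit votes from strictly more than 2/3 of the total voting power. The function canCommitWithout is a hypothetical helper written for this illustration and is neither part of the code base nor the check implemented in #3406.

```go
package main

import "fmt"

// canCommitWithout reports whether the remaining validators can still gather
// strictly more than 2/3 of the total voting power when a validator holding
// `own` units of power is absent. Integer arithmetic avoids rounding issues.
func canCommitWithout(own, total int64) bool {
	return 3*(total-own) > 2*total
}

func main() {
	// With 100 units of total power, a validator holding 33 can be absent,
	// but one holding 34 or more stalls the network while it is offline.
	for _, own := range []int64{10, 33, 34, 40} {
		fmt.Printf("own=%d canCommitWithout=%v\n", own, canCommitWithout(own, 100))
	}
}
```

When canCommitWithout returns false for the local validator, no new blocks can be committed while the node is still in Block Sync, so waiting for peers to report higher heights cannot succeed.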
Definition of Done
- Define, validate and formalize the conditions for a node to stay in the Block Sync protocol
- Stretch goal: write a specification and update/extend the documentation for this protocol
- Adapt the implementation (internal/blocksync package) and the node setup (node package) accordingly
- Produce test components (test units, e2e testbeds) to validate the behavior under the identified corner cases