blocksync: define the exact conditions for a node to attempt block syncing

[Block sync](https://github.com/cometbft/cometbft/blob/main/docs/explanation/core/block-sync.md) is a protocol that allows a node joining an existing network to catch up _faster_ to the head (last block) of the blockchain. The same result can be achieved, at a substantially slower speed, by joining the consensus protocol and receiving from peers the blocks decided at each height of consensus, each accompanied by a set of `Precommit` votes forming a commit for that block.

The conditions for a node to leave the Block Sync protocol and join the consensus protocol are encoded in the [`IsCaughtUp()` method](https://github.com/cometbft/cometbft/blob/4a077a6888c94aacb3c19212a3e2aafea9e8c000/internal/blocksync/pool.go#L182). They are somewhat complex, but the rationale is to compare the local height, i.e., the height of the latest block in the local copy of the blockchain, with the latest (highest) height reported by the node's peers as part of the Block Sync protocol. If the highest height reported by peers is at most 1 above the local height, the node is ready to join the consensus protocol.
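The core comparison described above can be sketched as follows. This is a simplified illustration only: `isCaughtUp` is a hypothetical helper, not the actual `IsCaughtUp()` method, and it deliberately ignores the additional conditions the real implementation takes into account.

```go
package main

import "fmt"

// isCaughtUp is a hypothetical, simplified stand-in for the height
// comparison described in this issue. It reports whether the highest
// height announced by peers is at most one above the local height.
// The real IsCaughtUp() method considers further conditions that are
// intentionally left out of this sketch.
func isCaughtUp(localHeight, maxPeerHeight int64) bool {
	return maxPeerHeight <= localHeight+1
}

func main() {
	// Peers are one block ahead: ready to join consensus.
	fmt.Println(isCaughtUp(100, 101))
	// Peers are five blocks ahead: keep block syncing.
	fmt.Println(isCaughtUp(100, 105))
}
```

Note that this sketch says nothing about what happens when no peer has reported a height yet, which is exactly the kind of corner case this issue is about.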

There are some corner cases to consider, though. Essentially, the node should not block forever on Block Sync if it fails to retrieve blocks from its peers (after some reasonable amount of time). But the code at the moment requires the node to have peers that respond to its requests in a timely manner, which gives rise to further corner cases.

Of particular interest is the case in which the node running Block Sync is an active validator and holds a substantial portion of the total validator voting power. In this scenario, most likely to be observed in testnets, the node might not need to receive any block from peers (i.e., to catch up) because the network cannot commit blocks without the node and its voting power. This particular scenario has been reported by users and is addressed by https://github.com/cometbft/cometbft/pull/3406.
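To make "a substantial portion" concrete, here is a minimal sketch of one way to express this scenario. It assumes the standard rule that a commit requires more than two thirds of the total voting power. The function name `networkStalledWithout` is hypothetical, and this is not a description of what the linked PR implements.

```go
package main

import "fmt"

// networkStalledWithout is a hypothetical helper. It reports whether the
// remaining validators, taken together, cannot reach the more-than-2/3
// voting power needed to commit a block when this node is absent.
// Assumption: powers are non-negative and small enough that the
// multiplications below do not overflow int64.
func networkStalledWithout(ownPower, totalPower int64) bool {
	othersPower := totalPower - ownPower
	// The others can commit only if othersPower > 2/3 * totalPower,
	// i.e. 3*othersPower > 2*totalPower. The network is stalled otherwise.
	return 3*othersPower <= 2*totalPower
}

func main() {
	// With 34 of 100 units, the others hold 66, which is not more than 2/3.
	fmt.Println(networkStalledWithout(34, 100))
	// With 33 of 100 units, the others hold 67 and can still commit.
	fmt.Println(networkStalledWithout(33, 100))
}
```

A check of this kind would let such a validator conclude that its peers cannot be ahead of it, without waiting for them to respond.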

The goal of this PR is to better formalize the conditions under which a node should keep attempting to retrieve blocks from its peers via the Block Sync protocol, as we are likely to find further corner cases not covered by the current implementation.

Notice that this issue is particularly relevant for networks from release v0.38.x, where block sync became mandatory (see https://github.com/tendermint/tendermint/pull/8433 for additional context). Up to release v0.37.x, node operators could just disable Block Sync in the configuration file if they realized that a node was getting stuck in this phase of the node's bootstrap/recovery procedure.

### Definition of Done

- [ ] Define, validate and formalize the conditions for a node to stay in the Block Sync protocol
   - [ ] Stretch goal: write a specification and update/extend the documentation for this protocol
- [ ] Adapt the implementation (`internal/blocksync` package) and the node setup (`node` package) accordingly
- [ ] Produce test components (unit tests, e2e testbeds) to validate the behavior under the identified corner cases

blocksync: define the exact conditions for a node to attempt block syncing #3415

Definition of Done

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

blocksync: define the exact conditions for a node to attempt block syncing #3415

Description

Definition of Done

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions