Render the consensus `needProofBlock` method more robust

`needProofBlock` is a method from the [consensus module](https://github.com/cometbft/cometbft/blob/45a08f3f90a170bd862fa5615aeb9bba12e7f55b/consensus/state.go#L1050) which method assumes that the meta information (`types.BlockMeta`) of the block committed at the previous height (`cs.Height - 1`) is available in the `state.BlockStore` used by the consensus implementation. If this is not the case, a **panic error** is produced and interrupts the execution of CometBFT.

This issue concerns the behavior of this method and situations in which this panic error can be produced. We present here a summary of discussions that took place in Slack and in synchronous calls.

## Context

The `needProofBlock` method from [`consensus/state.go`](https://github.com/cometbft/cometbft/blob/45a08f3f90a170bd862fa5615aeb9bba12e7f55b/consensus/state.go#L1050) informs whether the execution of the last committed block has caused the application hash to change. If the state of the application has changed, which is indicated by a distinct application hash, a new block is needed as it serves as a proof of the new state of the application, which is the reason for the name of the method.

The information provided by the `needProofBlock` method is relevant when nodes are starting the first round (`round == 0`) of a height. If the application state has changed from the execution of the last block, which can be considered the common case, the node should immediately proceed to the `propose` round step to propose a block or wait for a proposed block. The alternative flow at this point is to wait for transactions being available in the mempool before entering the propose step, as a way to prevent the proposal of empty blocks (with no transactions). This alternative flow is not taken by default, it has to be configured (see this [line](https://github.com/cometbft/cometbft/blob/45a08f3f90a170bd862fa5615aeb9bba12e7f55b/consensus/state.go#L1037)),  and it can only be taken when the `needProofBlock` method returns `false`. The method returns false when the application state (hash) has not changed with the previous block, for instance because the previous block carried no transactions.

The `needProofBlock` method not does perform a comparison between states (hashes) and returns `true` when it is invoked for height 1, meaning that the previous block was the genesis block. The same applies for height `cs.state.InitialHeight`, which plays the role of height 1 for chains that do not start from height 0. The reason for this behavior, as attested in comments in the code, is to make sure that "the genesis app hash is signed right away" with the immediate production of a new block.

For other heights, the `needProofBlock` method assumes that the meta information (`types.BlockMeta`) of the block committed at the previous height (`cs.Height - 1`) is available in the `state.BlockStore` used by the consensus implementation. If this is not the case, a panic error is produced and interrupts the execution of CometBFT.

This assumption is reasonable, as for a node to run consensus at height `H` the network should have committed and the node must have executed the block at height `H-1`. This block, or at least its meta information should therefore be available in the block store. In fact, the votes that allowed the decision of block `H-1` are available at height `H`, in the `cs.LastCommit` field. When a node switches to consensus, the content of this field is produced from the block store (see [`reconstructLastCommit`](https://github.com/cometbft/cometbft/blob/45a08f3f90a170bd862fa5615aeb9bba12e7f55b/consensus/state.go#L566) method).

However, as attested in the following scenario, it is possible for a node to be at height `H` of consensus without having in its block store the block committed at height `H-1`. The whole point of this issue is to discuss if panicing and interrupting the operation of CometBFT in this case is the right approach or the assumption reflected in the implementation is unnecessarily strong.

## Critical scenario

Steps to produce a scenario in which the `needProofBlock` method will panic:

1. A node is bootstrapped with a snapshot from height `H-1`, using state sync for instance
2. The node does not run block sync or block sync fails to retrieve blocks from height `H-1` onwards
3. The node switches to consensus operation from height `H`

The scenario is valid as the node joins the consensus instances at height `H` with an application state that reflects the execution of the block committed at height `H-1`. However, the node will not have the block `H-1` in its block store.

## Short-term solution

Given the above analysis and providing an initial conclusion, it is **safe** for the `needProofBlock` method to return `true` when the meta information of the previous block is not available, instead of panicing and interrupting the execution of CometBFT.

The main reason is the fact that the return value of this method is only relevant when the node is configured to not produce empty blocks. This is a non-standard configuration, which may explain the fact that this problem has not been previously reported.

In nodes configured to not produce empty blocks, having this method returning `true` in this scenario has the impact of possibly causing the node to produce an empty block when it would not need to produced one. While this could be considered undesired, it **would only affect a single height**, such as the height `H` from the above presented scenario. In fact, from height `H+1`, the block committed in the previous height will necessarily be present in the block store as the node is on regular consensus operation.

The above observation suggests that logging this unexpected behavior instead of panicing is a reasonable short-term solution.

- [x] #402

## Definitive solution

It is important to observe that the `needProofBlock` method can be invoked (see this [line](https://github.com/cometbft/cometbft/blob/45a08f3f90a170bd862fa5615aeb9bba12e7f55b/consensus/state.go#L962)) in nodes that were not configured to not produce empty blocks. This is includes nodes with a standard configuration, which may be affected by this issue with no reason.

Moreover, it is important to understand, at first, why the consensus implementation assumes that the meta information of the last committed block will always be available, in particular on the height (`H` in the above scenario) at which the node switches to the regular consensus operation. The assumption, in this case, is that this information will be set by either:

1. Replaying information from disk (consensus WAL, block store DB): doesn't apply for fresh nodes
2. The state sync protocol 
3. The block sync protocol

Starting from the last alternative, the block sync protocol will store the information of the last committed block (say, at height `H-1`) before switching to consensus (at height `H`) if it syncs (restore from disk or downloads) at least one block. Rendering block sync mandatory (currently it can be disabled) and requiring it to sync at least one block could be a definitive solution.

> Rendering the sync of at least a block has liveness implications that need to be considered. In summary, there may be no committed blocks to sync and the node in question, which will not switch to consensus while a block is not synced, can be a validator which participation in consensus is required for committing the next block. However, this scenario has been discussed in the context of ABCI++ and vote extensions and a solution has been proposed [citation needed].

Regarding the state sync protocol, it has been observed that the installation of a snapshot from height `H-1` requires the node to know at least the header of the block committed at height `H`, for the sake of state validation. It is not yet clear whether this means that the node must know, considering the above mentioned critical scenario, the meta information of block `H-1`. But it is clear that state sync does not write meta block information to the block store, provided that it has this information.

> We should investigate whether state sync could solve the problem illustrated by this issue.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Render the consensus `needProofBlock` method more robust #370

Context

Critical scenario

Short-term solution

Definitive solution

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Render the consensus needProofBlock method more robust #370

Description

Context

Critical scenario

Short-term solution

Definitive solution

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions

Render the consensus `needProofBlock` method more robust #370