Skip to content

Upon receiving +2/3 nil precommits, the consensus state-machine does not proceed to the step Propose(H,R+1) as specified in the documentation, instead it waits for a timeout #1431

@BrendanChou

Description

@BrendanChou

Bug Report

Setup

CometBFT version:v0.37.0 but also on main at time of writing. Code links are to v0.37.x in order to illustrate the issue.

Have you tried the latest version: yes

ABCI app N/A

Environment: N/A

node command runtime flags: N/A

What happened?

A validator took a long time to sync to a high round number.

This has downstream effects that I haven't formally proven. For example I suspect that it could be impossible (or at least very slow, only able to catch-up during the precommit step) for a validator to catch up to the latest round number in the case that +2/3 of other validators are online and incrementing the round number.

While there are theoretically ways (in the algorithm) for the validator to skip to future rounds once they see +2/3 prevotes or precommits for those future rounds, in-reality the validator may be reading serially from the consensus WAL which means they need to process all prior messages before processing newer (read: higher-round) messages. Due to the incorrect forcing of waiting for the timeout period, the serialization in slowed considerably and the validator cannot process the latest messages in a timely manner. Furthermore, this extends to the chain resuming increasing in height. If processing a long WAL backlog, the validator would be slow to catch up to the latest height because it would be waiting for timeouts on rounds from previous heights.

What did you expect to happen?

The validator should sync to the latest round pretty fast after being gossiped (or reading from WAL) +2/3 nil precommits on every round before the latest.

How to reproduce it

Reproduced by halting a chain with several completed rounds of +2/3 nil precommits. This could be done by, for example, hacking the code to reject all block proposals.

Then, restart a validator node at the latest height and round 0. It will take a long time to sync to the highest round because it will always timeout in the Precommit step even after seeing +2/3 nil precommits. Instead it should proceed directly to the next round.

This is further exacerbated by the increasing timeout periods that increase linearly per round, leading to the consumption of round^2 when trying to catch up to the latest round.

Where is the code

The incorrect code seems to be here

The spec can be found here

PR where this code was introduced. Some comment on that PR that this should be changed.

Metadata

Metadata

Assignees

No one assigned

    Labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions