-
Notifications
You must be signed in to change notification settings - Fork 780
Description
Bug Report
Setup
CometBFT version:v0.37.0 but also on main at time of writing. Code links are to v0.37.x in order to illustrate the issue.
Have you tried the latest version: yes
ABCI app N/A
Environment: N/A
node command runtime flags: N/A
What happened?
A validator took a long time to sync to a high round number.
This has downstream effects that I haven't formally proven. For example I suspect that it could be impossible (or at least very slow, only able to catch-up during the precommit step) for a validator to catch up to the latest round number in the case that +2/3 of other validators are online and incrementing the round number.
While there are theoretically ways (in the algorithm) for the validator to skip to future rounds once they see +2/3 prevotes or precommits for those future rounds, in-reality the validator may be reading serially from the consensus WAL which means they need to process all prior messages before processing newer (read: higher-round) messages. Due to the incorrect forcing of waiting for the timeout period, the serialization in slowed considerably and the validator cannot process the latest messages in a timely manner. Furthermore, this extends to the chain resuming increasing in height. If processing a long WAL backlog, the validator would be slow to catch up to the latest height because it would be waiting for timeouts on rounds from previous heights.
What did you expect to happen?
The validator should sync to the latest round pretty fast after being gossiped (or reading from WAL) +2/3 nil precommits on every round before the latest.
How to reproduce it
Reproduced by halting a chain with several completed rounds of +2/3 nil precommits. This could be done by, for example, hacking the code to reject all block proposals.
Then, restart a validator node at the latest height and round 0. It will take a long time to sync to the highest round because it will always timeout in the Precommit step even after seeing +2/3 nil precommits. Instead it should proceed directly to the next round.
This is further exacerbated by the increasing timeout periods that increase linearly per round, leading to the consumption of round^2 when trying to catch up to the latest round.
Where is the code
The incorrect code seems to be here
The spec can be found here
PR where this code was introduced. Some comment on that PR that this should be changed.