
Clarify the operation of a recovering node #493

@cason

Description


CometBFT assumes the crash-recovery failure model, meaning that nodes may crash and then recover, rejoining the distributed computation in a consistent state. For this to happen, nodes must persist relevant information and state changes during their regular operation, so that, upon recovery, they can restore the state they had just before crashing.

Recovering the state of a node after a crash is a tricky operation. Several CometBFT modules persist information that they are expected to recover after a crash. The consensus protocol keeps a Write-Ahead Log (WAL) to persist crucial information. The block store, the state store, the evidence reactor, the transaction indexer, and the address book persist data to their own DBs. The application itself should also adhere to the crash-recovery failure model by implementing its own persistence strategy.

Among these modules, the recovery procedure of ABCI applications is probably the best documented. The consensus WAL is only superficially covered, while the other DBs are essentially undocumented. In any case, the assumptions regarding the persisted state and its recovery are not documented.

It is worth noting that when state persistence is delegated to a database, the recovery procedure tends to be straightforward, as it is provided by the database implementation. As far as I know, consensus is the only module that adopts transactional semantics for persisted data, based on a WAL. The recovery of the consensus WAL is particularly tricky and undocumented.

Definition of Done:

  • List all databases used by CometBFT modules, summarize their persistence assumptions, and document, where applicable, the relevant aspects of their recovery procedures
  • Document the consensus Write-Ahead Log and the operation of the consensus protocol during recovery. This should include the interaction between consensus and the ABCI application, which the existing documentation covers only from the application side.


Labels: spec (Specification-related)
