Is this a BUG REPORT or FEATURE REQUEST? (choose one):
FEATURE REQUEST
This is somewhat complementary to, or perhaps even orthogonal to, issue 1136.
A validator node can fail in many ways, at least two are:
- network connectivity loss (minor protocol penalty for being offline). This could be fixed via issue 1136
- machine breaks (major protocol penalty for double signing, or perhaps even for not respecting prevote-the-lock). Issue 1136 does not really address this validator obligation.
Is there a guarantee in the implementation that the state in /data is always consistent, in the sense that a node restarted from a copy of this serialized state behaves correctly? If so, what are the current conditions of this guarantee?
Additionally, a production validator node may want to store state on an SSD (which would be lost under machine failure), rather than on some kind of "safe but slow" NAS. A poor man's version could be running on the SSD, but using something like "minio mirror" to sync the SSD to a backup. This has its own failure modes, though...
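The "minio mirror" approach might look something like the following; the paths and bucket name here are hypothetical, and this is only a sketch of the idea, not a recommended setup:

```shell
# Continuously mirror the validator's data directory from the local SSD
# to a MinIO-backed bucket (the "poor man's" replication mentioned above).
# "backup" is an assumed mc alias; "validator-state" an assumed bucket.
mc mirror --watch --overwrite ~/.tendermint/data backup/validator-state
```

Note that mirroring lags writes, so a crash between a signature and the corresponding sync is exactly the failure mode alluded to above.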
So I'm wondering whether the idea of "raftifying" the validator node has been thought through. I think it would certainly be feasible. The basic idea is:
- run 3 validator node instances (each with a copy of the HSM key)
- one of them is the leader, participating in Tendermint as "the" validator
- if it fails then another one takes over via Raft
- perhaps additional nodes, not participating in the Raft leader/candidate mechanism, could listen and store the state onto a backup NAS.
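To make the failover idea concrete, here is a minimal, self-contained Go sketch of the leader-gated signing scheme described above. All names are hypothetical, and the trivial "lowest live ID wins" election merely stands in for a real Raft library (e.g. hashicorp/raft):

```go
package main

import "fmt"

// Replica models one of the three validator instances, each holding a
// copy of the HSM key. Only the current leader may sign.
type Replica struct {
	ID    int
	Alive bool
}

// electLeader picks the lowest-ID live replica. This stands in for a
// real Raft election; the point is that exactly one replica leads.
func electLeader(rs []*Replica) *Replica {
	for _, r := range rs {
		if r.Alive {
			return r
		}
	}
	return nil
}

// sign simulates "the" validator signing a vote at some height.
func sign(leader *Replica, height int) string {
	return fmt.Sprintf("vote h=%d signed by replica %d", height, leader.ID)
}

func main() {
	replicas := []*Replica{
		{ID: 1, Alive: true}, {ID: 2, Alive: true}, {ID: 3, Alive: true},
	}

	leader := electLeader(replicas)
	fmt.Println(sign(leader, 100)) // replica 1 participates in Tendermint

	// Leader machine breaks; a new leader is elected and takes over.
	leader.Alive = false
	leader = electLeader(replicas)
	fmt.Println(sign(leader, 101)) // replica 2 takes over
}
```

The hard part, of course, is not the election but ensuring the new leader's signing state is at least as fresh as the old leader's, so that failover cannot cause double signing.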
I have been thinking about forking the repo and trying to implement this as an experiment. One problem is that it is not clear to me what exactly the "internal state" of a validator node is (the state that would be replicated via Raft), as opposed to the "external state" it participates in according to the Tendermint BFT protocol.
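One candidate for that "internal state" is the double-sign protection record Tendermint keeps in priv_validator.json: the last height/round/step signed. The Go sketch below is only an assumption about its shape (field names are mine), illustrating the invariant a Raft-replicated signer would have to preserve before each signature:

```go
package main

import "fmt"

// SignState sketches the minimal state a validator must never lose:
// the last (height, round, step) it signed at. Tendermint persists
// similar fields in priv_validator.json; exact names here are assumed.
type SignState struct {
	LastHeight int64
	LastRound  int
	LastStep   int8 // assumed encoding: propose=1, prevote=2, precommit=3
}

// SafeToSign reports whether signing at (height, round, step) could
// conflict with something already signed, i.e. whether it strictly
// advances the (height, round, step) tuple.
func (s SignState) SafeToSign(height int64, round int, step int8) bool {
	if height != s.LastHeight {
		return height > s.LastHeight
	}
	if round != s.LastRound {
		return round > s.LastRound
	}
	return step > s.LastStep
}

func main() {
	s := SignState{LastHeight: 10, LastRound: 0, LastStep: 3}
	fmt.Println(s.SafeToSign(11, 0, 1)) // true: a new height is safe
	fmt.Println(s.SafeToSign(10, 0, 2)) // false: would regress to an earlier step
}
```

Under the Raft scheme, a leader would replicate the updated SignState to a quorum before releasing each signature, so a failed-over replica can never pass SafeToSign for something the old leader already signed.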
For inspiration, here is a project doing this for SQLite.