
Chain halts when validator with >1/3 voting power is unable to sign replayed prevote #8739

@sergio-mena


Tendermint version (use tendermint version or git rev-parse --verify HEAD if installed from source):

master (v0.37.x), commit is e84ca61

ABCI app

e2e APP

Environment:

Tendermint's CI

What happened:

When running e2e nightly tests, one network fails to make progress at a given point (networks/nightly/gen-group00-0007.toml)

What you expected to happen:

e2e tests should pass

Have you tried the latest version: yes/no

Latest version tried was CI e2e-nightly-test, on 2022-06-08

How to reproduce it (as minimally and precisely as possible):

Couldn't repro on laptop so far. Looks like CI is able to hit it frequently

Logs (paste a small part showing an error (< 10 lines) or link a pastebin, gist, etc. containing more of the log file):

See the GitHub run

Config (you can paste only the changes you've made):

The testnet is networks/nightly/gen-group00-0007.toml

node command runtime flags:

Please check e2e logs

Please provide the output from the http://<ip>:<port>/dump_consensus_state RPC endpoint for consensus bugs

N/A

Anything else we need to know:

After some inspection in logs and source code, this is what we currently know:

  • validator02 restarts [02:52:07], just after it has proposed in height 45, round 0, and it has prevoted for it
  • while validator02 is down, network can't progress: validator02 holds >1/3 voting power
  • when validator02 restarts it handshakes (OK) [02:52:20], switches to consensus [02:52:21], and replays the WAL
  • validator02 replays its own proposal [02:52:21], which is then seen by the other validators for the first time (so there is connectivity)
  • at the moment of replaying its prevote for its own proposal [02:52:21], it finds a conflict between what it last signed (FilePVLastSignState) and what it is supposed to sign now as part of the replay. So it errors out and doesn't send its prevote (as it can't sign it)
  • the error seen in the log contains the replayed prevote; judging by its contents it appears to be a nil prevote, which is strange given that it was validator02's own proposal
  • finally, because validator02 errors out without sending its prevote (whether nil or not) for height 45, round 0, all processes get stuck in height 45, round 0 forever (probably after having prevoted nil): they can't gather enough prevote voting power in height 45, round 0 to start their timeoutPrevote timer and advance to round 1.
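The last point can be illustrated with a small sketch of the quorum arithmetic. The voting powers below are made up for illustration (the actual values are in the testnet manifest), and `hasTwoThirdsAny` is a hypothetical helper written for this example, not the function from the Tendermint code base. It only shows that, once a validator with >1/3 of the power is silent, the remaining prevotes can never exceed 2/3, so the timeoutPrevote timer is never started.

```go
package main

import "fmt"

// hasTwoThirdsAny reports whether the prevotes received so far (for any
// block or for nil) carry strictly more than 2/3 of the total voting power.
// Hypothetical helper for illustration only.
func hasTwoThirdsAny(received []int64, total int64) bool {
	var sum int64
	for _, p := range received {
		sum += p
	}
	// Integer form of sum/total > 2/3, avoiding floating point.
	return 3*sum > 2*total
}

func main() {
	// Assumed powers: validator02 holds 40 out of 100 (>1/3).
	others := []int64{20, 20, 20}
	total := int64(100)

	// Without validator02's prevote, the others hold only 60/100.
	fmt.Println("quorum without validator02:", hasTwoThirdsAny(others, total))

	// With validator02's prevote (nil or not), the threshold is reached.
	fmt.Println("quorum with validator02:", hasTwoThirdsAny(append(others, 40), total))
}
```

Under these assumed powers, the first check fails and the second succeeds, which matches the observed behavior: every round-0 prevote except validator02's is present, yet no node can move on to round 1.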

Based on this, the problem seems to lie in replaying prevotes when the crashed process holds >1/3 of the voting power.

The next step should be to get a repro with increased log verbosity, so we can compare the two prevotes that validator02 treats as potential double-signing here.
