Description
Tendermint version (use tendermint version or git rev-parse --verify HEAD if installed from source):
master (v0.37.x), commit is e84ca61
ABCI app
e2e APP
Environment:
Tendermint's CI
What happened:
When running e2e nightly tests, one network fails to make progress at a given point (networks/nightly/gen-group00-0007.toml)
What you expected to happen:
e2e tests should pass
Have you tried the latest version: yes/no
Latest version tried was CI e2e-nightly-test, on 2022-06-08
How to reproduce it (as minimally and precisely as possible):
Couldn't repro on laptop so far. Looks like CI is able to hit it frequently
Logs (paste a small part showing an error (< 10 lines) or link a pastebin, gist, etc. containing more of the log file):
See the Github run
Config (you can paste only the changes you've made):
The testnet is networks/nightly/gen-group00-0007.toml
node command runtime flags:
Please check e2e logs
Please provide the output from the http://<ip>:<port>/dump_consensus_state RPC endpoint for consensus bugs
N/A
Anything else we need to know:
After some inspection in logs and source code, this is what we currently know:
- validator02 restarts [02:52:07], just after it has proposed in height 45, round 0, and it has prevoted for it
- while validator02 is down, network can't progress: validator02 holds >1/3 voting power
- when validator02 restarts it handshakes (OK) [02:52:20], switches to consensus [02:52:21], and replays the WAL
- validator02 replays its own proposal [02:52:21], which is then seen by the other validators for the first time (so there is connectivity)
- at the moment of replaying its prevote for its own proposal [02:52:21], it finds a conflict between what it last signed (FilePVLastSignState) and what it is supposed to sign now as part of the replay. So it errors out and doesn't send its prevote (as it can't sign it)
- the error seen in the log contains the replayed prevote; looking at its contents, it seems to be a nil prevote, which is strange as it was its own proposal
- finally, since validator02 errors out without being able to send its prevote (whether nil or not) for height 45, round 0, all processes get stuck in height 45, round 0 forever (probably after having prevoted nil): they can't gather enough voting power worth of prevotes in height 45, round 0 to set up their timeoutPrevote timer and advance to round 1.
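To make the last point concrete, here is a minimal sketch of the quorum arithmetic, assuming the timeoutPrevote timer is only scheduled once prevotes from more than 2/3 of the total voting power (for any block or nil) have been received in the round. The function name and the voting powers are made up for illustration and are not taken from the failing testnet.

```go
package main

import "fmt"

// hasTwoThirdsAny reports whether the prevotes received so far carry
// strictly more than 2/3 of the total voting power.
// Hypothetical helper, not the Tendermint implementation.
func hasTwoThirdsAny(receivedPower, totalPower int64) bool {
	return receivedPower*3 > totalPower*2
}

func main() {
	// Assumed powers: validator02 holds 40 out of 100 (>1/3).
	const total, validator02 = 100, 40
	others := int64(total - validator02)

	// Without validator02's prevote the threshold is never reached,
	// so the timer that would move everyone to round 1 is never set.
	fmt.Println(hasTwoThirdsAny(others, total))

	// With validator02's prevote (nil or not) the threshold is reached.
	fmt.Println(hasTwoThirdsAny(others+validator02, total))
}
```

With these numbers, the remaining validators hold 60 of 100, which is below the strict 2/3 threshold, so any prevote from validator02 would be enough to unblock the round.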
As a result of this, the problem seems to be in replaying the prevotes, when the crashed process has >1/3 of the voting power.
The next step should be to get a repro with increased log verbosity, so we can compare the two prevotes that validator02 is treating as potential double-signing here.
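For reference while comparing the two prevotes, this is a simplified model of the kind of check that rejects the replayed prevote. It is a sketch under assumptions, not the FilePV code: the type, method, and error names are hypothetical, and it leaves out details of the real signer (for example, as far as I recall, votes that differ only by timestamp are tolerated).

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
)

// lastSignState is a simplified, hypothetical stand-in for the state a
// validator persists about the last message it signed.
type lastSignState struct {
	Height    int64
	Round     int32
	Step      int8
	SignBytes []byte
}

var errConflict = errors.New("conflicting data")

// checkReplay decides whether signing signBytes at (height, round, step)
// is safe given the last signed state:
//   - an older height/round/step is a regression and is refused,
//   - the same height/round/step is only allowed for identical bytes,
//   - anything newer is allowed.
func (s lastSignState) checkReplay(height int64, round int32, step int8, signBytes []byte) error {
	switch {
	case height < s.Height,
		height == s.Height && round < s.Round,
		height == s.Height && round == s.Round && step < s.Step:
		return errors.New("height/round/step regression")
	case height == s.Height && round == s.Round && step == s.Step:
		if !bytes.Equal(signBytes, s.SignBytes) {
			return errConflict
		}
	}
	return nil
}

func main() {
	const stepPrevote = 2 // assumed numbering, for illustration only

	// Before the restart: a prevote for the validator's own block was signed.
	last := lastSignState{Height: 45, Round: 0, Step: stepPrevote, SignBytes: []byte("prevote:block-A")}

	// During replay: a nil prevote is produced for the same height/round/step.
	fmt.Println(last.checkReplay(45, 0, stepPrevote, []byte("prevote:nil")))

	// Replaying the identical prevote would have been accepted.
	fmt.Println(last.checkReplay(45, 0, stepPrevote, []byte("prevote:block-A")))
}
```

If the verbose logs show that the two prevotes differ in their block ID (one for the proposed block, one nil), that would match the first case in `main` and point at the replay producing a different vote than the one signed before the crash.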