consensus: Restarting node causes DataCorruptionError in WAL #3295

@jackzampolin

(originally posted by @fkbenjamin in cosmos/cosmos-sdk#3603)

Reproduced by @jackzampolin with sudo systemctl stop gaiad && sudo systemctl start gaiad

Tendermint Version: v0.30.0-rc0
Gaia Version: v0.31.1

Summary of Bug

On GoS6, I changed values in gaiad.toml on two of my nodes (one validating node, one non-validating node). After stopping a node and immediately restarting it, I get the following:

I[2019-02-11|15:53:34.544] Starting ABCI with Tendermint                module=main
E[2019-02-11|15:53:35.878] Error dialing peer                           module=p2p err="dial tcp X:X:X:X:12345: i/o timeout"
E[2019-02-11|15:53:36.032] Corrupted entry. Skipping...                 module=consensus wal=/home/ubuntu/.gaiad/data/cs.wal/wal err="DataCorruptionError[failed to read data: EOF]"
E[2019-02-11|15:53:36.202] data has been corrupted in last height of consensus WAL module=consensus err="DataCorruptionError[failed to read data: EOF]" height=1944
E[2019-02-11|15:53:36.202] Encountered corrupt WAL file                 module=consensus err="DataCorruptionError[failed to read data: EOF]"
E[2019-02-11|15:53:36.202] Please repair the WAL file before restarting module=consensus
You can attempt to repair the WAL as follows:

----
WALFILE=~/.tendermint/data/cs.wal/wal
cp $WALFILE ${WALFILE}.bak # backup the file
go run scripts/wal2json/main.go $WALFILE > wal.json # this will panic, but can be ignored
rm $WALFILE # remove the corrupt file
go run scripts/json2wal/main.go wal.json $WALFILE # rebuild the file without corruption
----
E[2019-02-11|15:53:36.202] Error starting conS                          module=consensus err="DataCorruptionError[failed to read data: EOF]"
E[2019-02-11|15:53:36.669] Error dialing peer                           module=p2p err="dial tcp X:X:X:X:12345: connect: connection refused"
E[2019-02-11|15:53:37.559] Error dialing peer                           module=p2p err="dial tcp X:X:X:X:12345: i/o timeout"
E[2019-02-11|15:53:37.813] Connection failed @ recvRoutine (reading byte) module=p2p peer=censoredcensored@X:X:X:X:12345 conn=MConn{X:X:X:X:12345} err=EOF
E[2019-02-11|15:53:37.813] Stopping peer for error                      module=p2p peer="Peer{MConn{X:X:X:X:12345} censoredcensored out}" err=EOF
E[2019-02-11|15:53:38.146] Connection failed @ recvRoutine (reading byte) module=p2p peer=censoredcensored@X:X:X:X:12345 conn=MConn{X:X:X:X:12345} err=EOF
E[2019-02-11|15:53:38.146] Stopping peer for error                      module=p2p peer="Peer{MConn{X:X:X:X:12345} censoredcensored out}" err=EOF
E[2019-02-11|15:53:38.146] MConnection flush failed                     module=p2p peer=censoredcensored@X:X:X:X:12345 err="write tcp X:X:X:X:59324->X:X:X:X:12345: use of closed network connection"
E[2019-02-11|15:53:38.479] Connection failed @ recvRoutine (reading byte) module=p2p peer=censoredcensored@X:X:X:X:12345 conn=MConn{X:X:X:X:12345} err=EOF

This caused me to miss a few blocks on Game of Stakes 6. Stopping the node again, waiting a few seconds, and then restarting it resolved the issue.
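The stop, wait, restart workaround can be sketched as a small shell helper. This is only an illustration of the workaround, not a fix for the underlying WAL corruption. The function name `restart_with_delay`, the `SYSTEMCTL` override variable, and the 5-second default are assumptions made for this sketch; the report only says "a few seconds", and the service name `gaiad` is taken from the reproduction command in this issue.

```shell
# Hypothetical helper (not part of gaiad or Tendermint): restart a systemd
# service with a pause between stop and start, instead of chaining the two
# commands back to back.
restart_with_delay() {
  service="$1"
  delay="${2:-5}"                      # assumed default; the report only says "a few seconds"
  ctl="${SYSTEMCTL:-sudo systemctl}"   # overridable, e.g. for a dry run with SYSTEMCTL=echo

  # Stop the service; give up if the stop itself fails.
  $ctl stop "$service" || return 1

  # Pause before starting again.
  sleep "$delay"

  $ctl start "$service"
}
```

For example, `restart_with_delay gaiad 10` stops gaiad, waits ten seconds, and starts it again; `SYSTEMCTL=echo restart_with_delay gaiad 0` only prints the two commands it would run.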

Steps to Reproduce

cosmos-sdk: 0.31.1
git commit: b9e523212ec47910a00db00be2f1b7935e201ee7
vendor hash: 85e6c5c7a700e822cccf169c97f4a3974312dfd1
go version go1.11.5 linux/amd64
  1. Change value in gaiad.toml (although I don't think it has anything to do with this)
  2. Stop Node
  3. Immediately start node again with gaiad start
  4. See corrupted WAL Errors

For Admin Use

  • Not duplicate issue
  • Appropriate labels applied
  • Appropriate contributors tagged
  • Contributor assigned/self-assigned

Labels

T:bug (Type Bug (Confirmed)), T:validator (Type: Validator related)
