-
Notifications
You must be signed in to change notification settings - Fork 780
Description
When starting a set of validators using gaia v9, built with the race detector, and inside linux/amd64 Docker containers, I was seeing an intermittent but somewhat frequent data race. The top of the stack looked like this:
WARNING: DATA RACE
Write at 0x00c00029d9c8 by goroutine 67:
github.com/tendermint/tendermint/consensus.(*State).reconstructLastCommit()
github.com/tendermint/tendermint@v0.34.26/consensus/state.go:577 +0xb7
github.com/tendermint/tendermint/consensus.(*Reactor).SwitchToConsensus()
github.com/tendermint/tendermint@v0.34.26/consensus/reactor.go:113 +0xd7
github.com/tendermint/tendermint/blockchain/v0.(*BlockchainReactor).poolRoutine()
github.com/tendermint/tendermint@v0.34.26/blockchain/v0/reactor.go:324 +0x1a73
github.com/tendermint/tendermint/blockchain/v0.(*BlockchainReactor).OnStart.func1()
github.com/tendermint/tendermint@v0.34.26/blockchain/v0/reactor.go:112 +0x3b
Previous read at 0x00c00029d9c8 by goroutine 70:
runtime.racereadrange()
<autogenerated>:1 +0x1b
github.com/tendermint/tendermint/consensus.(*Reactor).updateRoundStateRoutine()
github.com/tendermint/tendermint@v0.34.26/consensus/reactor.go:546 +0x149
github.com/tendermint/tendermint/consensus.(*Reactor).OnStart.func2()
github.com/tendermint/tendermint@v0.34.26/consensus/reactor.go:82 +0x39
Version details are:
- gaia v9.0.0
- cosmos-sdk@v0.45.13-ics
- tendermint v0.34.26
The setup for starting the testnet in this configuration is a bit complicated, but luckily we saw a seemingly identical data race during CI on the Cosmos SDK: https://github.com/cosmos/cosmos-sdk/actions/runs/4295303107/jobs/7485538761#step:6:41
So then I attempted to reproduce that failure. It took more runs than I expected, but I have been able to reproduce it on a Linux docker image. It might reproduce on an M1 Mac, but I didn't see a repro within a few dozen runs. The Linux session looked like this:
$ docker run --rm -it golang:1.19
root@b33bddfadfe0:/go# git clone --depth=1 -b v0.45.15-rc.0 https://github.com/cosmos/cosmos-sdk
root@b33bddfadfe0:/go# cd cosmos-sdk/testutil/network/
root@b33bddfadfe0:/go/cosmos-sdk/testutil/network# echo 'package network_test
import (
"testing"
"github.com/cosmos/cosmos-sdk/testutil/network"
"github.com/stretchr/testify/require"
)
func TestNetwork(t *testing.T) {
n := network.New(t, network.DefaultConfig())
_, err := n.WaitForHeight(1)
require.NoError(t, err)
}' > ./network_test.go
root@b33bddfadfe0:/go/cosmos-sdk/testutil/network# go test -race -c .
root@b33bddfadfe0:/go/cosmos-sdk/testutil/network# i=0; while time ./network.test -test.count=1; do date; echo "$i"; i=$((i+1)); done; echo "failed after $i runs"
PASS
real 1m2.042s
user 0m49.385s
sys 0m22.292s
Thu Mar 2 21:41:22 UTC 2023
0
...Full stack from testnet run: gaia9-race.1.txt
Full stack from a unit test repro run:
unit-test-repro-race.txt
Overall this data race looks about as innocuous as a data race can be. But, because it risks happening at startup, this risks interfering with searches for less innocuous data races in application chains. For example, using GORACE=halt_on_error=1 is meant to stop execution as soon as the first data race is encountered; if this data race happens first, it is an inconvenience that the application will need to be restarted again in hopes of not encountering this data race.