Intermittent data race in consensus.State during startup #487

@mark-rushakoff

Description

When starting a set of validators using gaia v9, built with the race detector and running inside linux/amd64 Docker containers, I saw an intermittent but fairly frequent data race. The top of the stack looked like this:

WARNING: DATA RACE
Write at 0x00c00029d9c8 by goroutine 67:
  github.com/tendermint/tendermint/consensus.(*State).reconstructLastCommit()
      github.com/tendermint/tendermint@v0.34.26/consensus/state.go:577 +0xb7
  github.com/tendermint/tendermint/consensus.(*Reactor).SwitchToConsensus()
      github.com/tendermint/tendermint@v0.34.26/consensus/reactor.go:113 +0xd7
  github.com/tendermint/tendermint/blockchain/v0.(*BlockchainReactor).poolRoutine()
      github.com/tendermint/tendermint@v0.34.26/blockchain/v0/reactor.go:324 +0x1a73
  github.com/tendermint/tendermint/blockchain/v0.(*BlockchainReactor).OnStart.func1()
      github.com/tendermint/tendermint@v0.34.26/blockchain/v0/reactor.go:112 +0x3b

Previous read at 0x00c00029d9c8 by goroutine 70:
  runtime.racereadrange()
      <autogenerated>:1 +0x1b
  github.com/tendermint/tendermint/consensus.(*Reactor).updateRoundStateRoutine()
      github.com/tendermint/tendermint@v0.34.26/consensus/reactor.go:546 +0x149
  github.com/tendermint/tendermint/consensus.(*Reactor).OnStart.func2()
      github.com/tendermint/tendermint@v0.34.26/consensus/reactor.go:82 +0x39

Version details: gaia v9, tendermint v0.34.26 (per the stack trace above), running in the golang:1.19 Docker image on linux/amd64.

The setup for starting the testnet in this configuration is a bit complicated, but luckily we saw a seemingly identical data race during CI on the Cosmos SDK: https://github.com/cosmos/cosmos-sdk/actions/runs/4295303107/jobs/7485538761#step:6:41

So I then attempted to reproduce that failure. It took more runs than I expected, but I was able to reproduce it in a Linux Docker container. It may also reproduce on an M1 Mac, but I didn't see a repro there within a few dozen runs. The Linux session looked like this:

$ docker run --rm -it golang:1.19

root@b33bddfadfe0:/go# git clone --depth=1 -b v0.45.15-rc.0 https://github.com/cosmos/cosmos-sdk

root@b33bddfadfe0:/go# cd cosmos-sdk/testutil/network/

root@b33bddfadfe0:/go/cosmos-sdk/testutil/network# echo 'package network_test

import (
	"testing"

	"github.com/cosmos/cosmos-sdk/testutil/network"
	"github.com/stretchr/testify/require"
)

func TestNetwork(t *testing.T) {
	n := network.New(t, network.DefaultConfig())

	_, err := n.WaitForHeight(1)
	require.NoError(t, err)
}' > ./network_test.go
root@b33bddfadfe0:/go/cosmos-sdk/testutil/network# go test -race -c .

root@b33bddfadfe0:/go/cosmos-sdk/testutil/network#  i=0; while time ./network.test -test.count=1; do date; echo "$i"; i=$((i+1)); done; echo "failed after $i runs"
PASS

real	1m2.042s
user	0m49.385s
sys	0m22.292s
Thu Mar  2 21:41:22 UTC 2023
0
...

Full stack from testnet run: gaia9-race.1.txt

Full stack from a unit test repro run:
unit-test-repro-race.txt

Overall, this data race looks about as innocuous as a data race can be. But because it can happen at startup, it risks interfering with searches for less innocuous data races in application chains. For example, GORACE=halt_on_error=1 is meant to stop execution as soon as the first data race is encountered; if this race fires first, the application has to be restarted in the hope of not hitting it again, which gets in the way of surfacing more serious races.
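
For reference, the halt-on-first-race behavior mentioned above is configured through the GORACE environment variable when running a race-instrumented binary, e.g. the network.test binary built in the repro session:

```shell
# Stop the race-instrumented test binary at the first detected data race
# instead of logging it and continuing.
GORACE="halt_on_error=1" ./network.test -test.count=1
```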
