-
Notifications
You must be signed in to change notification settings - Fork 780
Description
Bug Report
Setup
CometBFT version (use cometbft version or git rev-parse --verify HEAD if installed from source): v0.37.0
Have you tried the latest version: no
ABCI app (name for built-in, URL for self-written if it's publicly available): simapp in Cosmos SDK
Environment:
- OS (e.g. from /etc/os-release):
go testinside macOS - Install tools:
- Others:
node command runtime flags:
Config
Default config.
What happened?
In the Cosmos SDK, I am working on a set of tests that run a real CometBFT *node.Node in-process along with a simapp instance.
The tests defer node.Stop(); node.Wait(). Usually they pass.
Occasionally, there are failures like: testing.go:1206: TempDir RemoveAll cleanup: unlinkat /var/folders/.../TestCometStarter_PortContentionattempt_0278742771/001/config: directory not empty. If I switch away from using t.TempDir so that I can inspect the config directory which isn't empty, it always contains addrbook.json.
If I print debug.Stack() from inside (*pex.addrBook).saveToFile, I see that the final call writing out addrbook.jsonis the finala.saveToFilecall in(*pex.addrBook).saveRoutine`:
Lines 494 to 509 in 365b0a7
| func (a *addrBook) saveRoutine() { | |
| defer a.wg.Done() | |
| saveFileTicker := time.NewTicker(dumpAddressInterval) | |
| out: | |
| for { | |
| select { | |
| case <-saveFileTicker.C: | |
| a.saveToFile(a.filePath) | |
| case <-a.Quit(): | |
| break out | |
| } | |
| } | |
| saveFileTicker.Stop() | |
| a.saveToFile(a.filePath) | |
| } |
At the top of that function you will see a defer a.wg.Done() call. If we call (*pex.addrBook).Wait() (which simply calls a.wg.Wait()), then we would block until a.SaveRoutine returned.
However... nothing in the codebase calls (*pex.addrBook).Wait(). If you comment out the function, go test -short ./... passes without any compilation errors.
What did you expect to happen?
How to reproduce it
$ git clone -b mr/testnet-node-stop-wait --depth=1 https://github.com/mark-rushakoff/cosmos-sdk
$ cd cosmos-sdk/simapp
$ while GORACE=halt_on_error=1 GODEBUG=tracebackancestors=3 go test ./internal/testnet/ -run=Contention -count=10 -failfast -race -v; do date; done
There are usually several successful runs before it fails.
Logs
n/a
dump_consensus_state output
n/a
Anything else we need to know
The simplest solution looks like it would be:
- Update the
pex.AddrBookinterface to include aWait()method. - Update
(*pex.addrBook).Wait() to calla.BaseService.Wait(); a.wg.Wait()`. - Add a new
Wait()method to*node.Nodewhich callsn.BaseService.Wait(); n.addrBook.Wait()
Patching CometBFT v0.37.0 with those changes appears to fully prevent the non-empty temp directory test failure that I could previously reproduce.
diff --git a/node/node.go b/node/node.go
index 4e5238d1d..be6af893f 100644
--- a/node/node.go
+++ b/node/node.go
@@ -1313,6 +1313,11 @@ func (n *Node) IsListening() bool {
return n.isListening
}
+func (n *Node) Wait() {
+ n.BaseService.Wait()
+ n.addrBook.Wait()
+}
+
// NodeInfo returns the Node's Info from the Switch.
func (n *Node) NodeInfo() p2p.NodeInfo {
return n.nodeInfo
diff --git a/p2p/pex/addrbook.go b/p2p/pex/addrbook.go
index 857a6a063..86933616b 100644
--- a/p2p/pex/addrbook.go
+++ b/p2p/pex/addrbook.go
@@ -35,6 +35,8 @@ const (
// peers to dial.
// TODO: break this up?
type AddrBook interface {
+ Wait()
+
service.Service
// Add our own addresses so we don't later add ourselves
@@ -171,6 +173,7 @@ func (a *addrBook) OnStop() {
func (a *addrBook) Wait() {
a.wg.Wait()
+ a.BaseService.Wait()
}
func (a *addrBook) FilePath() string {