Skip to content

(*pex.addrBook).Wait() never called, occasionally continues writing after (*node.Node).Wait() returns #646

@mark-rushakoff

Description

@mark-rushakoff

Bug Report

Setup

CometBFT version (use cometbft version or git rev-parse --verify HEAD if installed from source): v0.37.0

Have you tried the latest version: no

ABCI app (name for built-in, URL for self-written if it's publicly available): simapp in Cosmos SDK

Environment:

  • OS (e.g. from /etc/os-release): go test inside macOS
  • Install tools:
  • Others:

node command runtime flags:

Config

Default config.

What happened?

In the Cosmos SDK, I am working on a set of tests that run a real CometBFT *node.Node in-process along with a simapp instance.

The tests defer node.Stop(); node.Wait(). Usually they pass.

Occasionally, there are failures like: testing.go:1206: TempDir RemoveAll cleanup: unlinkat /var/folders/.../TestCometStarter_PortContentionattempt_0278742771/001/config: directory not empty. If I switch away from using t.TempDir so that I can inspect the config directory which isn't empty, it always contains addrbook.json.

If I print debug.Stack() from inside (*pex.addrBook).saveToFile, I see that the final call writing out addrbook.jsonis the finala.saveToFilecall in(*pex.addrBook).saveRoutine`:

func (a *addrBook) saveRoutine() {
defer a.wg.Done()
saveFileTicker := time.NewTicker(dumpAddressInterval)
out:
for {
select {
case <-saveFileTicker.C:
a.saveToFile(a.filePath)
case <-a.Quit():
break out
}
}
saveFileTicker.Stop()
a.saveToFile(a.filePath)
}

At the top of that function you will see a defer a.wg.Done() call. If we call (*pex.addrBook).Wait() (which simply calls a.wg.Wait()), then we would block until a.SaveRoutine returned.

However... nothing in the codebase calls (*pex.addrBook).Wait(). If you comment out the function, go test -short ./... passes without any compilation errors.

What did you expect to happen?

How to reproduce it

$ git clone -b mr/testnet-node-stop-wait --depth=1 https://github.com/mark-rushakoff/cosmos-sdk
$ cd cosmos-sdk/simapp
$ while GORACE=halt_on_error=1 GODEBUG=tracebackancestors=3 go test ./internal/testnet/ -run=Contention -count=10 -failfast -race -v; do date; done

There are usually several successful runs before it fails.

Logs

n/a

dump_consensus_state output

n/a

Anything else we need to know

The simplest solution looks like it would be:

  1. Update the pex.AddrBook interface to include a Wait() method.
  2. Update (*pex.addrBook).Wait() to call a.BaseService.Wait(); a.wg.Wait()`.
  3. Add a new Wait() method to *node.Node which calls n.BaseService.Wait(); n.addrBook.Wait()

Patching CometBFT v0.37.0 with those changes appears to fully prevent the non-empty temp directory test failure that I could previously reproduce.

diff --git a/node/node.go b/node/node.go
index 4e5238d1d..be6af893f 100644
--- a/node/node.go
+++ b/node/node.go
@@ -1313,6 +1313,11 @@ func (n *Node) IsListening() bool {
        return n.isListening
 }

+func (n *Node) Wait() {
+       n.BaseService.Wait()
+       n.addrBook.Wait()
+}
+
 // NodeInfo returns the Node's Info from the Switch.
 func (n *Node) NodeInfo() p2p.NodeInfo {
        return n.nodeInfo
diff --git a/p2p/pex/addrbook.go b/p2p/pex/addrbook.go
index 857a6a063..86933616b 100644
--- a/p2p/pex/addrbook.go
+++ b/p2p/pex/addrbook.go
@@ -35,6 +35,8 @@ const (
 // peers to dial.
 // TODO: break this up?
 type AddrBook interface {
+       Wait()
+
        service.Service

        // Add our own addresses so we don't later add ourselves
@@ -171,6 +173,7 @@ func (a *addrBook) OnStop() {

 func (a *addrBook) Wait() {
        a.wg.Wait()
+       a.BaseService.Wait()
 }

 func (a *addrBook) FilePath() string {

Metadata

Metadata

Assignees

Labels

bugSomething isn't workingp2p

Type

No type

Projects

Milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions