Skip to content

kvserver: "async consensus" span spuriously leaks #60677

@tbg

Description

@tbg

Describe the problem

While working on #59992, I got spurious test failures of this kind:

    testcluster.go:134: condition failed to evaluate within 45s: unexpectedly found active spans:
             0.000ms      0.000ms    === operation:async consensus _unfinished:1 node:1 range:37/1:/{Tenant/10-Max} store:1
        goroutine 65 [running]:
        runtime/debug.Stack(0xc005bbb820, 0x56be0a0, 0xc0011a3780)
        	/usr/local/go/src/runtime/debug/stack.go:24 +0x9f
        github.com/cockroachdb/cockroach/pkg/testutils.SucceedsSoon(0x58a1b20, 0xc000580a80, 0xc005bbb820)
        	/home/tobias/go/src/github.com/cockroachdb/cockroach/pkg/testutils/soon.go:36 +0x6d
        github.com/cockroachdb/cockroach/pkg/testutils/testcluster.(*TestCluster).stopServers(0xc0010d4000, 0x5758a40, 0xc00019e018)
        	/home/tobias/go/src/github.com/cockroachdb/cockroach/pkg/testutils/testcluster/testcluster.go:134 +0x23d
        github.com/cockroachdb/cockroach/pkg/testutils/testcluster.(*TestCluster).Start.func2()
        	/home/tobias/go/src/github.com/cockroachdb/cockroach/pkg/testutils/testcluster/testcluster.go:334 +0x45
        github.com/cockroachdb/cockroach/pkg/util/stop.CloserFn.Close(0xc000f793e0)
        	/home/tobias/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:109 +0x25
        github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).Stop(0xc000eb39e0, 0x5758a40, 0xc00019e018)
        	/home/tobias/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:479 +0x268
        github.com/cockroachdb/cockroach/pkg/sql/logictest.(*logicTest).shutdownCluster(0xc000f87000)
        	/home/tobias/go/src/github.com/cockroachdb/cockroach/pkg/sql/logictest/logic.go:1621 +0x1da
        github.com/cockroachdb/cockroach/pkg/sql/logictest.(*logicTest).close(0xc000f87000)
        	/home/tobias/go/src/github.com/cockroachdb/cockroach/pkg/sql/logictest/logic.go:1232 +0x3d
        github.com/cockroachdb/cockroach/pkg/sql/logictest.(*logicTest).runFile(0xc000f87000, 0xc000c49180, 0x33, 0x49ec1a0, 0xc, 0x3, 0x0, 0x0, 0x0, 0x0, ...)
        	/home/tobias/go/src/github.com/cockroachdb/cockroach/pkg/sql/logictest/logic.go:3124 +0x196
        github.com/cockroachdb/cockroach/pkg/sql/logictest.RunLogicTestWithDefaultConfig.func1.1(0xc000580a80)
        	/home/tobias/go/src/github.com/cockroachdb/cockroach/pkg/sql/logictest/logic.go:3316 +0x318
        testing.tRunner(0xc000580a80, 0xc0004fe200)
        	/usr/local/go/src/testing/testing.go:1123 +0xef
        created by testing.(*T).Run
        	/usr/local/go/src/testing/testing.go:1168 +0x2b3

which come from this span:

proposal.ctx, proposal.sp = tracing.ForkSpan(ctx, "async consensus")

To Reproduce

It happens in a lot of tests, but the repro I used was

make stress PKG=./pkg/ccl/backupccl/ PKG=./pkg/ccl/logictestccl TESTS=TestTenantLogic/3node-tenant/distinct_on

(typically no more than three minutes in)

I added this code hoping that it would fix it:

diff --git a/pkg/kv/kvserver/store.go b/pkg/kv/kvserver/store.go
index fee2ca5a6a..1b26580915 100644
--- a/pkg/kv/kvserver/store.go
+++ b/pkg/kv/kvserver/store.go
@@ -1608,6 +1608,24 @@ func (s *Store) Start(ctx context.Context, stopper *stop.Stopper) error {
 	// Set the started flag (for unittests).
 	atomic.StoreInt32(&s.started, 1)
 
+	s.stopper.AddCloser(stop.CloserFn(func() {
+		s.VisitReplicas(func(r *Replica) (more bool) {
+			r.mu.Lock()
+			for k, prop := range r.mu.proposals {
+				delete(r.mu.proposals, k)
+				prop.finishApplication(
+					context.Background(),
+					proposalResult{
+						Err: roachpb.NewError(roachpb.NewAmbiguousResultError("store is stopping")),
+					},
+				)
+			}
+			r.mu.Unlock()
+			fmt.Println("stopped store")
+			return true
+		})
+	}))
+
 	return nil
 }

It did not. I think this proves that the proposal with the leaked span is removed from the proposal map without calling prop.sp.Finish().

I added the following code in #59992 to effectively disable this forked span:

diff --git a/pkg/kv/kvserver/replica_raft.go b/pkg/kv/kvserver/replica_raft.go
index 09d5d86bb5..f8f1fe1b1b 100644
--- a/pkg/kv/kvserver/replica_raft.go
+++ b/pkg/kv/kvserver/replica_raft.go
@@ -129,6 +129,10 @@ func (r *Replica) evalAndPropose(
 		// Fork the proposal's context span so that the proposal's context
 		// can outlive the original proposer's context.
 		proposal.ctx, proposal.sp = tracing.ForkSpan(ctx, "async consensus")
+		{
+			proposal.sp.Finish()
+			proposal.sp = nil
+		}
 
 		// Signal the proposal's response channel immediately.
 		reply := *proposal.Local.Reply

We'll need to go back and understand and fix this.

Metadata

Metadata

Assignees

Labels

A-kv-replicationRelating to Raft, consensus, and coordination.A-tracingRelating to tracing in CockroachDB.C-bugCode not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions