-
Notifications
You must be signed in to change notification settings - Fork 4.1k
kvserver: "async consensus" span spuriously leaks #60677
Copy link
Copy link
Closed
Labels
A-kv-replicationRelating to Raft, consensus, and coordination.Relating to Raft, consensus, and coordination.A-tracingRelating to tracing in CockroachDB.Relating to tracing in CockroachDB.C-bugCode not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.
Description
Describe the problem
While working on #59992, I got spurious test failures of this kind:
testcluster.go:134: condition failed to evaluate within 45s: unexpectedly found active spans:
0.000ms 0.000ms === operation:async consensus _unfinished:1 node:1 range:37/1:/{Tenant/10-Max} store:1
goroutine 65 [running]:
runtime/debug.Stack(0xc005bbb820, 0x56be0a0, 0xc0011a3780)
/usr/local/go/src/runtime/debug/stack.go:24 +0x9f
github.com/cockroachdb/cockroach/pkg/testutils.SucceedsSoon(0x58a1b20, 0xc000580a80, 0xc005bbb820)
/home/tobias/go/src/github.com/cockroachdb/cockroach/pkg/testutils/soon.go:36 +0x6d
github.com/cockroachdb/cockroach/pkg/testutils/testcluster.(*TestCluster).stopServers(0xc0010d4000, 0x5758a40, 0xc00019e018)
/home/tobias/go/src/github.com/cockroachdb/cockroach/pkg/testutils/testcluster/testcluster.go:134 +0x23d
github.com/cockroachdb/cockroach/pkg/testutils/testcluster.(*TestCluster).Start.func2()
/home/tobias/go/src/github.com/cockroachdb/cockroach/pkg/testutils/testcluster/testcluster.go:334 +0x45
github.com/cockroachdb/cockroach/pkg/util/stop.CloserFn.Close(0xc000f793e0)
/home/tobias/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:109 +0x25
github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).Stop(0xc000eb39e0, 0x5758a40, 0xc00019e018)
/home/tobias/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:479 +0x268
github.com/cockroachdb/cockroach/pkg/sql/logictest.(*logicTest).shutdownCluster(0xc000f87000)
/home/tobias/go/src/github.com/cockroachdb/cockroach/pkg/sql/logictest/logic.go:1621 +0x1da
github.com/cockroachdb/cockroach/pkg/sql/logictest.(*logicTest).close(0xc000f87000)
/home/tobias/go/src/github.com/cockroachdb/cockroach/pkg/sql/logictest/logic.go:1232 +0x3d
github.com/cockroachdb/cockroach/pkg/sql/logictest.(*logicTest).runFile(0xc000f87000, 0xc000c49180, 0x33, 0x49ec1a0, 0xc, 0x3, 0x0, 0x0, 0x0, 0x0, ...)
/home/tobias/go/src/github.com/cockroachdb/cockroach/pkg/sql/logictest/logic.go:3124 +0x196
github.com/cockroachdb/cockroach/pkg/sql/logictest.RunLogicTestWithDefaultConfig.func1.1(0xc000580a80)
/home/tobias/go/src/github.com/cockroachdb/cockroach/pkg/sql/logictest/logic.go:3316 +0x318
testing.tRunner(0xc000580a80, 0xc0004fe200)
/usr/local/go/src/testing/testing.go:1123 +0xef
created by testing.(*T).Run
/usr/local/go/src/testing/testing.go:1168 +0x2b3
which come from this span:
cockroach/pkg/kv/kvserver/replica_raft.go
Line 131 in 218a5a3
| proposal.ctx, proposal.sp = tracing.ForkSpan(ctx, "async consensus") |
To Reproduce
It happens in a lot of tests, but the repro I used was
make stress PKG=./pkg/ccl/backupccl/ PKG=./pkg/ccl/logictestccl TESTS=TestTenantLogic/3node-tenant/distinct_on
(typically no more than three minutes in)
I added this code hoping that it would fix it:
diff --git a/pkg/kv/kvserver/store.go b/pkg/kv/kvserver/store.go
index fee2ca5a6a..1b26580915 100644
--- a/pkg/kv/kvserver/store.go
+++ b/pkg/kv/kvserver/store.go
@@ -1608,6 +1608,24 @@ func (s *Store) Start(ctx context.Context, stopper *stop.Stopper) error {
// Set the started flag (for unittests).
atomic.StoreInt32(&s.started, 1)
+ s.stopper.AddCloser(stop.CloserFn(func() {
+ s.VisitReplicas(func(r *Replica) (more bool) {
+ r.mu.Lock()
+ for k, prop := range r.mu.proposals {
+ delete(r.mu.proposals, k)
+ prop.finishApplication(
+ context.Background(),
+ proposalResult{
+ Err: roachpb.NewError(roachpb.NewAmbiguousResultError("store is stopping")),
+ },
+ )
+ }
+ r.mu.Unlock()
+ fmt.Println("stopped store")
+ return true
+ })
+ }))
+
return nil
}It did not. I think this proves that the proposal with the leaked span is removed from the proposal map without calling prop.sp.Finish().
I added the following code in #59992 to effectively disable this forked span:
diff --git a/pkg/kv/kvserver/replica_raft.go b/pkg/kv/kvserver/replica_raft.go
index 09d5d86bb5..f8f1fe1b1b 100644
--- a/pkg/kv/kvserver/replica_raft.go
+++ b/pkg/kv/kvserver/replica_raft.go
@@ -129,6 +129,10 @@ func (r *Replica) evalAndPropose(
// Fork the proposal's context span so that the proposal's context
// can outlive the original proposer's context.
proposal.ctx, proposal.sp = tracing.ForkSpan(ctx, "async consensus")
+ {
+ proposal.sp.Finish()
+ proposal.sp = nil
+ }
// Signal the proposal's response channel immediately.
reply := *proposal.Local.ReplyWe'll need to go back and understand and fix this.
Reactions are currently unavailable
Metadata
Metadata
Assignees
Labels
A-kv-replicationRelating to Raft, consensus, and coordination.Relating to Raft, consensus, and coordination.A-tracingRelating to tracing in CockroachDB.Relating to tracing in CockroachDB.C-bugCode not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.