Skip to content

kvserver: plug a tracing span leak#61279

Merged
craig[bot] merged 1 commit intocockroachdb:masterfrom
irfansharif:210301.prop-span-leak
Mar 3, 2021
Merged

kvserver: plug a tracing span leak#61279
craig[bot] merged 1 commit intocockroachdb:masterfrom
irfansharif:210301.prop-span-leak

Conversation

@irfansharif
Copy link
Copy Markdown
Contributor

Fixes #60677, removing a stop-gap introduced in #59992.

We were previously leaking "async consensus" spans, which was possible
when a given proposal was never flushed out of the replica's proposal
buffer. On server shut down, this buffered proposal was never finished,
and thus the embedded span never closed. We now add a closer to clean up
after ourselves.

Release justification: bug fixes and low-risk updates to new
functionality.

Release note: None


+cc @angelapwen / @knz / @erikgrinaker for pod-visibility.

@irfansharif irfansharif requested a review from tbg March 1, 2021 19:27
@cockroach-teamcity
Copy link
Copy Markdown
Member

This change is Reviewable

@irfansharif irfansharif changed the title storage: plug a tracing span leak kvserver: plug a tracing span leak Mar 1, 2021
@irfansharif irfansharif force-pushed the 210301.prop-span-leak branch from 4668c3c to 0c032ac Compare March 1, 2021 19:28
Copy link
Copy Markdown
Contributor

@knz knz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is right but I'll let Tobias confirm.

Reviewed 3 of 3 files at r1.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @irfansharif and @tbg)


pkg/testutils/testcluster/testcluster.go, line 145 at r1 (raw file):

			}
			var buf strings.Builder
			buf.WriteString(fmt.Sprintf("unexpectedly found %d active spans:\n", len(sps)))

fmt.Fprintf(&buf, "unexpectedly ...")

Copy link
Copy Markdown
Member

@tbg tbg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! I had basically written this code but failed to realize that the proposal buffer might be where the proposal is hiding.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @irfansharif)

@irfansharif irfansharif force-pushed the 210301.prop-span-leak branch from 0c032ac to 94304d6 Compare March 2, 2021 14:12
Copy link
Copy Markdown
Contributor Author

@irfansharif irfansharif left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

bors r+

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @knz)


pkg/testutils/testcluster/testcluster.go, line 145 at r1 (raw file):

Previously, knz (kena) wrote…

fmt.Fprintf(&buf, "unexpectedly ...")

Done.

@craig
Copy link
Copy Markdown
Contributor

craig bot commented Mar 2, 2021

Build failed:

Fixes cockroachdb#60677, removing a stop-gap introduced in cockroachdb#59992.

We were previously leaking  "async consensus" spans, which was possible
when a given proposal was never flushed out of the replica's proposal
buffer. On server shut down, this buffered proposal was never finished,
and thus the embedded span never closed. We now add a closer to clean up
after ourselves.

Release justification: bug fixes and low-risk updates to new
functionality.

Release note: None
@irfansharif
Copy link
Copy Markdown
Contributor Author

I think CI tripped over #61120? Dunno.

bors r+

@craig
Copy link
Copy Markdown
Contributor

craig bot commented Mar 2, 2021

Build failed (retrying...):

@craig
Copy link
Copy Markdown
Contributor

craig bot commented Mar 3, 2021

Build succeeded:

@craig craig bot merged commit 50641ee into cockroachdb:master Mar 3, 2021
@irfansharif irfansharif deleted the 210301.prop-span-leak branch March 3, 2021 01:21
irfansharif added a commit to irfansharif/cockroach that referenced this pull request Apr 19, 2021
Fixes cockroachdb#63761. As of cockroachdb#61279, it was possible for us to deadlock due to
inconsistent lock orderings between Stopper.mu and Replica.mu. We were
previously holding onto Stopper.mu while executing all closers,
including those that may acquire other locks. Because closers can be
defined anywhere (and may consequently grab any in-scope lock), we
should order Stopper.mu to come after all other locks in the system.

The closer added in cockroachdb#61279 iterated over all non-destroyed replicas,
locking Replica.mu to check for the replica's destroy status (in
Store.VisitReplicas). This deadlocked with the lease acquisition code
path that first grabs an exclusive lock over Replica.mu (see
InitOrJoinRequest), and uses the stopper to kick off an async task
acquiring the lease. The stopper internally locks Stopper.mu to check
whether or not it was already stopped.

Release note: None
irfansharif added a commit to irfansharif/cockroach that referenced this pull request Apr 19, 2021
Fixes cockroachdb#63761. As of cockroachdb#61279, it was possible for us to deadlock due to
inconsistent lock orderings between Stopper.mu and Replica.mu. We were
previously holding onto Stopper.mu while executing all closers,
including those that may acquire other locks. Because closers can be
defined anywhere (and may consequently grab any in-scope lock), we
should order Stopper.mu to come after all other locks in the system.

The closer added in cockroachdb#61279 iterated over all non-destroyed replicas,
locking Replica.mu to check for the replica's destroy status (in
Store.VisitReplicas). This deadlocked with the lease acquisition code
path that first grabs an exclusive lock over Replica.mu (see
InitOrJoinRequest), and uses the stopper to kick off an async task
acquiring the lease. The stopper internally locks Stopper.mu to check
whether or not it was already stopped.

Release note: None
irfansharif added a commit to irfansharif/cockroach that referenced this pull request Apr 20, 2021
Fixes cockroachdb#63761. As of cockroachdb#61279, it was possible for us to deadlock due to
inconsistent lock orderings between Stopper.mu and Replica.mu. We were
previously holding onto Stopper.mu while executing all closers,
including those that may acquire other locks. Because closers can be
defined anywhere (and may consequently grab any in-scope lock), we
should order Stopper.mu to come after all other locks in the system.

The closer added in cockroachdb#61279 iterated over all non-destroyed replicas,
locking Replica.mu to check for the replica's destroy status (in
Store.VisitReplicas). This deadlocked with the lease acquisition code
path that first grabs an exclusive lock over Replica.mu (see
InitOrJoinRequest), and uses the stopper to kick off an async task
acquiring the lease. The stopper internally locks Stopper.mu to check
whether or not it was already stopped.

Release note: None
craig bot pushed a commit that referenced this pull request Apr 20, 2021
63867: stop: break deadlock between Stopper.mu and Replica.mu r=irfansharif a=irfansharif

Fixes #63761. As of #61279, it was possible for us to deadlock due to
inconsistent lock orderings between Stopper.mu and Replica.mu. We were
previously holding onto Stopper.mu while executing all closers,
including those that may acquire other locks. Because closers can be
defined anywhere (and may consequently grab any in-scope lock), we
should order Stopper.mu to come after all other locks in the system.

The closer added in #61279 iterated over all non-destroyed replicas,
locking Replica.mu to check for the replica's destroy status (in
Store.VisitReplicas). This deadlocked with the lease acquisition code
path that first grabs an exclusive lock over Replica.mu (see
InitOrJoinRequest), and uses the stopper to kick off an async task
acquiring the lease. The stopper internally locks Stopper.mu to check
whether or not it was already stopped.

Release note: None

Co-authored-by: irfan sharif <irfanmahmoudsharif@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

kvserver: "async consensus" span spuriously leaks

4 participants