Skip to content

acceptance, ccl, cli, cmd, config, gossip, internal, jobs: cleanup lock usages#108064

Merged
craig[bot] merged 1 commit intocockroachdb:masterfrom
Santamaura:assess-locks-1
Sep 8, 2023
Merged

acceptance, ccl, cli, cmd, config, gossip, internal, jobs: cleanup lock usages#108064
craig[bot] merged 1 commit intocockroachdb:masterfrom
Santamaura:assess-locks-1

Conversation

@Santamaura
Copy link
Copy Markdown
Contributor

@Santamaura Santamaura commented Aug 2, 2023

This PR updates unsafe/unnecessary manual unlocks to defer unlocks.
The criteria for leaving some unlocks as is:

  • will not cause a leak
  • releases the resources much earlier
  • deferring will cause a deadlock without significant refactor

Part of #107946

Release note: None

Signoffs

  • acceptance
  • backupccl
  • changefeedccl
  • multitenantccl
  • sqlproxyccl
  • cli
  • cmd
  • config
  • gossip
  • internal
  • jobs

@cockroach-teamcity
Copy link
Copy Markdown
Member

This change is Reviewable

Copy link
Copy Markdown
Collaborator

@dhartunian dhartunian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I took a brief look and gave some advice on a few instances where your refactor should be reconsidered. Maybe as a first pass, you can make a PR for all the "simple" refactors where adding a defer is less disruptive.

One thing I'm not 100% clear on is the impact of adding the defer inside an immediately executed anonymous function. Maybe a good topic to discuss at tech pod.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @Santamaura)


pkg/acceptance/localcluster/cluster.go line 488 at r1 (raw file):

func (n *Node) StatusClient(ctx context.Context) serverpb.StatusClient {
	n.Lock()
	defer n.Unlock()

Adding this defer here, moves the unlock until after the GRPCUnvalidatedDial...Connect() call which could take an unknown amount of time.

The unlock was likely put at the top here in order to keep it from being held for a long time. Basically just to grab the statusClient safely.

Consider doing this within an anonymous function like:

var existingClient ...
func () {
  n.Lock()
  defer n.Unlock()
  existingClient = ...
}()

Although in this particular case it's impossible for the code between the Lock/Unlock to panic so it's less of a concern. But consider this pattern in the situations below.


pkg/acceptance/localcluster/cluster.go line 704 at r1 (raw file):

	}
	n.Lock()
	defer n.Unlock()

Replacing multiple Lock/Unlock calls with a single one could radically change the performance profile of the code. It's unlikely that this is a valid refactor.

The locks were carefully taken in the original code around specific variables that needed to be lifted out from n.

One thing that's a hint much like the gRPC example above is that we have an os.ReadFile that's originally done without a lock held which after your change does happen during the Lock. Reading a file could take an extremely long time compared to a Lock being held. This causes problems when requests are made concurrently on the function and all concurrent requests then have to wait for the Lock. If you hold the Lock while reading the file, no one else can make progress until you're done reading the file, whereas if you maintain the fine-grained locking, many calls can proceed concurrently because they block each other for only a miniscule period of time.

Consider the measured difference on pages like https://colin-scott.github.io/personal_website/research/interactive_latency.html where mutex lock/unlock is on the order of 17ns while SSD random read is 16,000ns. So do not perceive many Lock/Unlock operations as "bad" necessarily.


pkg/ccl/backupccl/restore_job.go line 2133 at r1 (raw file):

					}
					mu.Lock()
					defer mu.Unlock()

defer moves the mu.Unlock() to the end of the function call, not the select or the for loop. In this case you would immediately create a deadlock because the for loop runs forever and would never unlock.

This is another case where an anonymous function immediately executed can help.

@Santamaura Santamaura force-pushed the assess-locks-1 branch 5 times, most recently from de4b2b8 to e879113 Compare August 9, 2023 15:53
@Santamaura Santamaura marked this pull request as ready for review August 9, 2023 16:45
@Santamaura Santamaura requested review from a team as code owners August 9, 2023 16:45
@Santamaura Santamaura requested review from DrewKimball, adityamaru, bananabrick and jayshrivastava and removed request for a team August 9, 2023 16:45
Copy link
Copy Markdown
Contributor

@jayshrivastava jayshrivastava left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changefeedccl/ changes lgtm!

Copy link
Copy Markdown
Member

@RaduBerinde RaduBerinde left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made a pass. I suggest going back through all the changes and making sure the critical sections are not modified. We want this to be a cosmetic change and nothing more. I'd error on the side of adding the nolint comment rather than changing what the critical section covers. Sometimes innocuous looking changes are not that innocuous.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @adityamaru, @bananabrick, @DrewKimball, and @Santamaura)


pkg/acceptance/localcluster/cluster.go line 697 at r2 (raw file):

	for r := retry.Start(opts); r.Next(); {
		var pid int
		if n.cmd != nil {

Here we're accessing n.cmd outside of the lock but in one of the fragments below (line 784) we lock just to get n.cmd


pkg/acceptance/localcluster/cluster.go line 698 at r2 (raw file):

			pid = n.cmd.Process.Pid
		}
		n.Unlock()

[nit] I would just add the linter comment here, it's cleaner the old way here


pkg/ccl/changefeedccl/event_processing.go line 661 at r2 (raw file):

	// that there are no more events.
	if notifyFlush {
		c.flushCh <- struct{}{}

We weren't holding the lock here before.. In principle, this can block.


pkg/ccl/changefeedccl/sink_kafka.go line 55 at r2 (raw file):

func (l *maybeLocker) Unlock() {
	if l.locked {
		defer l.wrapped.Unlock()

This is probably incorrect (we set locked=false before actually unlocking)


pkg/ccl/changefeedccl/kvevent/blocking_buffer.go line 157 at r2 (raw file):

	if canFlush {
		select {
		case b.signalCh <- struct{}{}:

We weren't holding the lock here before. In principle it could matter if signalCh is unbuffered and the receiver needs the lock.


pkg/ccl/sqlproxyccl/tenantdirsvr/test_directory_svr.go line 203 at r2 (raw file):

				case <-s.stopper.ShouldQuiesce():
					func() {
						s.mu.eventListeners.Remove(elem)

missing lock/unlock here. Consider adding a method on s for removing an event listener


pkg/gossip/client.go line 156 at r2 (raw file):

// timestamps.
func (c *client) requestGossip(g *Gossip, stream Gossip_GossipClient) error {
	args := func() *Request {

[nit] it would be cleaner if this func only returned NodeAddr and highwaterstamps


pkg/internal/sqlsmith/bulkio.go line 219 at r2 (raw file):

		defer s.lock.Unlock()
		return len(s.bulkExports) == 0
	}(); !noBulkExports {

The ! here seems wrong, we want to exit if there are no bulkExport. Also, dropping the lock and reacquiring is not the same as before. In principle, bulkExports could change in-between. I think we want a function that returns the files. We already check for empty files afterwards.

Copy link
Copy Markdown
Contributor

@otan otan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

spatial changes look ok, but unconfident on reviewing the rest

@Santamaura Santamaura force-pushed the assess-locks-1 branch 2 times, most recently from d0b74b4 to 0a81c70 Compare August 10, 2023 19:21
Copy link
Copy Markdown
Contributor Author

@Santamaura Santamaura left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TFTRs!

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @adityamaru, @bananabrick, @dhartunian, @DrewKimball, and @RaduBerinde)


pkg/acceptance/localcluster/cluster.go line 698 at r2 (raw file):

Previously, RaduBerinde wrote…

[nit] I would just add the linter comment here, it's cleaner the old way here

Done


pkg/ccl/backupccl/restore_job.go line 2133 at r1 (raw file):

Previously, dhartunian (David Hartunian) wrote…

defer moves the mu.Unlock() to the end of the function call, not the select or the for loop. In this case you would immediately create a deadlock because the for loop runs forever and would never unlock.

This is another case where an anonymous function immediately executed can help.

Done.


pkg/ccl/changefeedccl/event_processing.go line 661 at r2 (raw file):

Previously, RaduBerinde wrote…

We weren't holding the lock here before.. In principle, this can block.

Updated


pkg/ccl/changefeedccl/sink_kafka.go line 55 at r2 (raw file):

Previously, RaduBerinde wrote…

This is probably incorrect (we set locked=false before actually unlocking)

Done


pkg/ccl/changefeedccl/kvevent/blocking_buffer.go line 157 at r2 (raw file):

Previously, RaduBerinde wrote…

We weren't holding the lock here before. In principle it could matter if signalCh is unbuffered and the receiver needs the lock.

Updated


pkg/ccl/sqlproxyccl/tenantdirsvr/test_directory_svr.go line 203 at r2 (raw file):

Previously, RaduBerinde wrote…

missing lock/unlock here. Consider adding a method on s for removing an event listener

Added


pkg/gossip/client.go line 156 at r2 (raw file):

Previously, RaduBerinde wrote…

[nit] it would be cleaner if this func only returned NodeAddr and highwaterstamps

Done


pkg/internal/sqlsmith/bulkio.go line 219 at r2 (raw file):

Previously, RaduBerinde wrote…

The ! here seems wrong, we want to exit if there are no bulkExport. Also, dropping the lock and reacquiring is not the same as before. In principle, bulkExports could change in-between. I think we want a function that returns the files. We already check for empty files afterwards.

Updated

@blathers-crl blathers-crl bot requested a review from otan August 10, 2023 19:32
@Santamaura Santamaura requested a review from a team as a code owner August 28, 2023 20:42
@Santamaura Santamaura force-pushed the assess-locks-1 branch 2 times, most recently from e7e8b4b to 9b8984d Compare September 5, 2023 19:15
Copy link
Copy Markdown
Contributor

@adityamaru adityamaru left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

backupccl lgtm, thanks

@Santamaura
Copy link
Copy Markdown
Contributor Author

Could I have someone from @cockroachdb/sqlproxy-prs review the sqlproxyccl changes? Thanks in advance.

Copy link
Copy Markdown
Collaborator

@jeffswenson jeffswenson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sqlproxyccl and tenantcostclient changes LGTM

Copy link
Copy Markdown
Contributor

@adityamaru adityamaru left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

jobs lgtm

Copy link
Copy Markdown
Contributor

@nvb nvb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

gossip LGTM

As we make these changes, are we verifying that escape analysis is doing the right thing and avoiding new heap allocations around these closures?

Reviewed 3 of 18 files at r5.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @bananabrick, @dhartunian, @DrewKimball, @otan, @RaduBerinde, and @Santamaura)

Copy link
Copy Markdown
Contributor Author

@Santamaura Santamaura left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From what I understand, as long as any captured variables are only used within the scope of the main functions then it will not allocate to the heap.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @bananabrick, @dhartunian, @DrewKimball, @otan, and @RaduBerinde)

@nvb
Copy link
Copy Markdown
Contributor

nvb commented Sep 6, 2023

That should be the case, but escape analysis has a way of being finicky around closures, especially if those closures are calling other functions that accept pointer arguments. It's worth verifying now instead of searching for small performance regressions later.

Copy link
Copy Markdown
Contributor

@knz knz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 4 of 22 files at r2, 9 of 18 files at r5, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @bananabrick, @dhartunian, @DrewKimball, @otan, @RaduBerinde, and @Santamaura)


pkg/gossip/client.go line 129 at r5 (raw file):

				func() {
					g.mu.RLock()
					defer g.mu.RUnlock()

let's avoid logging under this lock. try this:

peerID, addr := func() { /* locking here */}()
if peerID != 0 {
   /* use copy here */
} else { /*ditto*/ }

pkg/gossip/gossip.go line 546 at r5 (raw file):

	var n int
	var status redact.SafeString
	func() {

nit: n, status := func() ...


pkg/internal/sqlsmith/bulkio.go line 215 at r5 (raw file):

	var exp string
	// Find all CSV files created by the EXPORT.
	files := func() tree.Exprs {

nit: consider making the function return exp too.


pkg/internal/sqlsmith/bulkio.go line 240 at r5 (raw file):

	var schema string
	var tableSchema []byte
	func() {

nit: schema, tableSchema := func() ...


pkg/jobs/adopt.go line 433 at r5 (raw file):

	var username username.SQLUsername
	var typ jobspb.Type
	func() {

nit: username, typ := func() ...


pkg/jobs/progress.go line 143 at r5 (raw file):

// new progress amount.
func (p *ProgressUpdateBatcher) Add(ctx context.Context, delta float32) error {
	var shouldReport bool

nit: shouldReport, completed := func() ...

@Santamaura Santamaura force-pushed the assess-locks-1 branch 2 times, most recently from 575663b to dc6e0c2 Compare September 6, 2023 16:45
Copy link
Copy Markdown
Contributor Author

@Santamaura Santamaura left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @bananabrick, @dhartunian, @DrewKimball, @knz, @nvanbenschoten, @otan, and @RaduBerinde)


pkg/gossip/client.go line 129 at r5 (raw file):

Previously, knz (Raphael 'kena' Poss) wrote…

let's avoid logging under this lock. try this:

peerID, addr := func() { /* locking here */}()
if peerID != 0 {
   /* use copy here */
} else { /*ditto*/ }

Done.


pkg/gossip/gossip.go line 546 at r5 (raw file):

Previously, knz (Raphael 'kena' Poss) wrote…

nit: n, status := func() ...

Done.


pkg/internal/sqlsmith/bulkio.go line 215 at r5 (raw file):

Previously, knz (Raphael 'kena' Poss) wrote…

nit: consider making the function return exp too.

Done.


pkg/internal/sqlsmith/bulkio.go line 240 at r5 (raw file):

Previously, knz (Raphael 'kena' Poss) wrote…

nit: schema, tableSchema := func() ...

Done.


pkg/jobs/adopt.go line 433 at r5 (raw file):

Previously, knz (Raphael 'kena' Poss) wrote…

nit: username, typ := func() ...

Done.


pkg/jobs/progress.go line 143 at r5 (raw file):

Previously, knz (Raphael 'kena' Poss) wrote…

nit: shouldReport, completed := func() ...

Done.

gossip, internal, jobs: cleanup lock usages

This PR updates unsafe/unnecessary manual unlocks to
defer unlocks.  The criteria for
leaving some unlocks as is:
 - will not cause a leak
 - releases the resources much earlier
 - deferring will cause a deadlock without significant refactor

Part of cockroachdb#107946

Release note: None
Copy link
Copy Markdown
Contributor

@knz knz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 4 of 5 files at r6, 1 of 1 files at r7, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @bananabrick, @dhartunian, @DrewKimball, @otan, @RaduBerinde, and @Santamaura)

Copy link
Copy Markdown
Contributor Author

@Santamaura Santamaura left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That should be the case, but escape analysis has a way of being finicky around closures, especially if those closures are calling other functions that accept pointer arguments. It's worth verifying now instead of searching for small performance regressions later.

I found your goescape script it was very useful 🙂 . I reviewed all the files and there's only 1 occurrence of a closure from the changes escaping to the heap, which I think should have minimal perf impact.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @bananabrick, @dhartunian, @DrewKimball, @otan, and @RaduBerinde)

@Santamaura
Copy link
Copy Markdown
Contributor Author

bors r+

@craig
Copy link
Copy Markdown
Contributor

craig bot commented Sep 8, 2023

Build succeeded:

@craig craig bot merged commit b7840ea into cockroachdb:master Sep 8, 2023
@Santamaura Santamaura deleted the assess-locks-1 branch September 8, 2023 17:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

10 participants