acceptance, ccl, cli, cmd, config, gossip, internal, jobs: cleanup lock usages #108064
craig[bot] merged 1 commit into cockroachdb:master
Conversation
dhartunian
left a comment
I took a brief look and gave some advice on a few instances where your refactor should be reconsidered. Maybe as a first pass, you can make a PR for all the "simple" refactors where adding a defer is less disruptive.
One thing I'm not 100% clear on is the impact of adding the defer inside an immediately executed anonymous function. Maybe a good topic to discuss at tech pod.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @Santamaura)
pkg/acceptance/localcluster/cluster.go line 488 at r1 (raw file):
func (n *Node) StatusClient(ctx context.Context) serverpb.StatusClient {
	n.Lock()
	defer n.Unlock()
Adding this defer here moves the unlock until after the GRPCUnvalidatedDial...Connect() call, which could take an unknown amount of time.
The unlock was likely put at the top here in order to keep the lock from being held for a long time, basically just to grab the statusClient safely.
Consider doing this within an anonymous function like:
var existingClient ...
func() {
n.Lock()
defer n.Unlock()
existingClient = ...
}()
In this particular case it's impossible for the code between the Lock/Unlock to panic, so it's less of a concern here. But consider this pattern in the situations below.
pkg/acceptance/localcluster/cluster.go line 704 at r1 (raw file):
}
n.Lock()
defer n.Unlock()
Replacing multiple Lock/Unlock calls with a single one could radically change the performance profile of the code. It's unlikely that this is a valid refactor.
The locks were carefully taken in the original code around specific variables that needed to be lifted out from n.
One hint, much like the gRPC example above, is that we have an os.ReadFile that was originally done without the lock held, and after your change it happens while the Lock is held. Reading a file can take an extremely long time compared to holding a Lock. This causes problems when the function is called concurrently: all concurrent calls then have to wait for the Lock. If you hold the Lock while reading the file, no one else can make progress until you're done reading, whereas if you maintain the fine-grained locking, many calls can proceed concurrently because they block each other for only a minuscule period of time.
Consider the measured differences on pages like https://colin-scott.github.io/personal_website/research/interactive_latency.html, where a mutex lock/unlock is on the order of 17ns while an SSD random read is 16,000ns. So do not necessarily perceive many Lock/Unlock operations as "bad".
pkg/ccl/backupccl/restore_job.go line 2133 at r1 (raw file):
}
mu.Lock()
defer mu.Unlock()
defer moves the mu.Unlock() to the end of the function, not the end of the select or the for loop. In this case you would immediately create a deadlock, because the for loop runs forever and would never unlock.
This is another case where an anonymous function immediately executed can help.
de4b2b8 to
e879113
Compare
jayshrivastava
left a comment
changefeedccl/ changes lgtm!
RaduBerinde
left a comment
I made a pass. I suggest going back through all the changes and making sure the critical sections are not modified. We want this to be a cosmetic change and nothing more. I'd error on the side of adding the nolint comment rather than changing what the critical section covers. Sometimes innocuous looking changes are not that innocuous.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @adityamaru, @bananabrick, @DrewKimball, and @Santamaura)
pkg/acceptance/localcluster/cluster.go line 697 at r2 (raw file):
for r := retry.Start(opts); r.Next(); {
	var pid int
	if n.cmd != nil {
Here we're accessing n.cmd outside of the lock, but in one of the fragments below (line 784) we lock just to get n.cmd.
pkg/acceptance/localcluster/cluster.go line 698 at r2 (raw file):
	pid = n.cmd.Process.Pid
}
n.Unlock()
[nit] I would just add the linter comment here; it's cleaner the old way.
pkg/ccl/changefeedccl/event_processing.go line 661 at r2 (raw file):
// that there are no more events.
if notifyFlush {
	c.flushCh <- struct{}{}
We weren't holding the lock here before. In principle, this can block.
pkg/ccl/changefeedccl/sink_kafka.go line 55 at r2 (raw file):
func (l *maybeLocker) Unlock() {
	if l.locked {
		defer l.wrapped.Unlock()
This is probably incorrect (we set locked=false before actually unlocking).
pkg/ccl/changefeedccl/kvevent/blocking_buffer.go line 157 at r2 (raw file):
if canFlush {
	select {
	case b.signalCh <- struct{}{}:
We weren't holding the lock here before. In principle it could matter if signalCh is unbuffered and the receiver needs the lock.
pkg/ccl/sqlproxyccl/tenantdirsvr/test_directory_svr.go line 203 at r2 (raw file):
case <-s.stopper.ShouldQuiesce():
	func() {
		s.mu.eventListeners.Remove(elem)
Missing lock/unlock here. Consider adding a method on s for removing an event listener.
pkg/gossip/client.go line 156 at r2 (raw file):
// timestamps.
func (c *client) requestGossip(g *Gossip, stream Gossip_GossipClient) error {
	args := func() *Request {
[nit] it would be cleaner if this func only returned NodeAddr and highwaterstamps
pkg/internal/sqlsmith/bulkio.go line 219 at r2 (raw file):
	defer s.lock.Unlock()
	return len(s.bulkExports) == 0
}(); !noBulkExports {
The ! here seems wrong; we want to exit if there are no bulkExports. Also, dropping the lock and reacquiring it is not the same as before: in principle, bulkExports could change in between. I think we want a function that returns the files. We already check for empty files afterwards.
otan
left a comment
spatial changes look ok, but unconfident on reviewing the rest
d0b74b4 to
0a81c70
Compare
Santamaura
left a comment
TFTRs!
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @adityamaru, @bananabrick, @dhartunian, @DrewKimball, and @RaduBerinde)
pkg/acceptance/localcluster/cluster.go line 698 at r2 (raw file):
Previously, RaduBerinde wrote…
[nit] I would just add the linter comment here, it's cleaner the old way here
Done
pkg/ccl/backupccl/restore_job.go line 2133 at r1 (raw file):
Previously, dhartunian (David Hartunian) wrote…
defer moves the mu.Unlock() to the end of the function, not the select or the for loop. In this case you would immediately create a deadlock because the for loop runs forever and would never unlock. This is another case where an anonymous function immediately executed can help.
Done.
pkg/ccl/changefeedccl/event_processing.go line 661 at r2 (raw file):
Previously, RaduBerinde wrote…
We weren't holding the lock here before.. In principle, this can block.
Updated
pkg/ccl/changefeedccl/sink_kafka.go line 55 at r2 (raw file):
Previously, RaduBerinde wrote…
This is probably incorrect (we set locked=false before actually unlocking).
Done
pkg/ccl/changefeedccl/kvevent/blocking_buffer.go line 157 at r2 (raw file):
Previously, RaduBerinde wrote…
We weren't holding the lock here before. In principle it could matter if signalCh is unbuffered and the receiver needs the lock.
Updated
pkg/ccl/sqlproxyccl/tenantdirsvr/test_directory_svr.go line 203 at r2 (raw file):
Previously, RaduBerinde wrote…
Missing lock/unlock here. Consider adding a method on s for removing an event listener.
Added
pkg/gossip/client.go line 156 at r2 (raw file):
Previously, RaduBerinde wrote…
[nit] it would be cleaner if this func only returned NodeAddr and highwaterstamps
Done
pkg/internal/sqlsmith/bulkio.go line 219 at r2 (raw file):
Previously, RaduBerinde wrote…
The ! here seems wrong; we want to exit if there are no bulkExports. Also, dropping the lock and reacquiring it is not the same as before: in principle, bulkExports could change in between. I think we want a function that returns the files. We already check for empty files afterwards.
Updated
0a81c70 to
aa07d8d
Compare
aa07d8d to
ba13c8e
Compare
ba13c8e to
15c24e9
Compare
e7e8b4b to
9b8984d
Compare
adityamaru
left a comment
backupccl lgtm, thanks
Could I have someone from @cockroachdb/sqlproxy-prs review the
jeffswenson
left a comment
sqlproxyccl and tenantcostclient changes LGTM
nvb
left a comment
gossip LGTM
As we make these changes, are we verifying that escape analysis is doing the right thing and avoiding new heap allocations around these closures?
Reviewed 3 of 18 files at r5.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @bananabrick, @dhartunian, @DrewKimball, @otan, @RaduBerinde, and @Santamaura)
Santamaura
left a comment
From what I understand, as long as any captured variables are only used within the scope of the enclosing function, the closure will not allocate on the heap.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @bananabrick, @dhartunian, @DrewKimball, @otan, and @RaduBerinde)
That should be the case, but escape analysis has a way of being finicky around closures, especially if those closures are calling other functions that accept pointer arguments. It's worth verifying now instead of searching for small performance regressions later.
knz
left a comment
Reviewed 4 of 22 files at r2, 9 of 18 files at r5, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @bananabrick, @dhartunian, @DrewKimball, @otan, @RaduBerinde, and @Santamaura)
pkg/gossip/client.go line 129 at r5 (raw file):
func() {
	g.mu.RLock()
	defer g.mu.RUnlock()
Let's avoid logging under this lock. Try this:
peerID, addr := func() { /* locking here */ }()
if peerID != 0 {
/* use copy here */
} else { /* ditto */ }

pkg/gossip/gossip.go line 546 at r5 (raw file):
var n int
var status redact.SafeString
func() {
nit: n, status := func() ...
pkg/internal/sqlsmith/bulkio.go line 215 at r5 (raw file):
var exp string
// Find all CSV files created by the EXPORT.
files := func() tree.Exprs {
nit: consider making the function return exp too.
pkg/internal/sqlsmith/bulkio.go line 240 at r5 (raw file):
var schema string
var tableSchema []byte
func() {
nit: schema, tableSchema := func() ...
pkg/jobs/adopt.go line 433 at r5 (raw file):
var username username.SQLUsername
var typ jobspb.Type
func() {
nit: username, typ := func() ...
pkg/jobs/progress.go line 143 at r5 (raw file):
// new progress amount.
func (p *ProgressUpdateBatcher) Add(ctx context.Context, delta float32) error {
	var shouldReport bool
nit: shouldReport, completed := func() ...
575663b to
dc6e0c2
Compare
Santamaura
left a comment
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @bananabrick, @dhartunian, @DrewKimball, @knz, @nvanbenschoten, @otan, and @RaduBerinde)
pkg/gossip/client.go line 129 at r5 (raw file):
Previously, knz (Raphael 'kena' Poss) wrote…
let's avoid logging under this lock. try this:
peerID, addr := func() { /* locking here */ }()
if peerID != 0 {
	/* use copy here */
} else { /* ditto */ }
Done.
pkg/gossip/gossip.go line 546 at r5 (raw file):
Previously, knz (Raphael 'kena' Poss) wrote…
nit: n, status := func() ...
Done.
pkg/internal/sqlsmith/bulkio.go line 215 at r5 (raw file):
Previously, knz (Raphael 'kena' Poss) wrote…
nit: consider making the function return exp too.
Done.
pkg/internal/sqlsmith/bulkio.go line 240 at r5 (raw file):
Previously, knz (Raphael 'kena' Poss) wrote…
nit: schema, tableSchema := func() ...
Done.
pkg/jobs/adopt.go line 433 at r5 (raw file):
Previously, knz (Raphael 'kena' Poss) wrote…
nit: username, typ := func() ...
Done.
pkg/jobs/progress.go line 143 at r5 (raw file):
Previously, knz (Raphael 'kena' Poss) wrote…
nit: shouldReport, completed := func() ...
Done.
dc6e0c2 to
180a422
Compare
gossip, internal, jobs: cleanup lock usages

This PR updates unsafe/unnecessary manual unlocks to defer unlocks.
The criteria for leaving some unlocks as is:
- will not cause a leak
- releases the resources much earlier
- deferring will cause a deadlock without significant refactor

Part of cockroachdb#107946

Release note: None
180a422 to
d1c1727
Compare
knz
left a comment
Reviewed 4 of 5 files at r6, 1 of 1 files at r7, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @bananabrick, @dhartunian, @DrewKimball, @otan, @RaduBerinde, and @Santamaura)
Santamaura
left a comment
That should be the case, but escape analysis has a way of being finicky around closures, especially if those closures are calling other functions that accept pointer arguments. It's worth verifying now instead of searching for small performance regressions later.
I found your goescape script; it was very useful 🙂. I reviewed all the files, and there's only one occurrence of a closure from the changes escaping to the heap, which I think should have minimal perf impact.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @bananabrick, @dhartunian, @DrewKimball, @otan, and @RaduBerinde)
bors r+
Build succeeded: