kv: log on excessive latch hold duration #115746
craig[bot] merged 2 commits into cockroachdb:master
Conversation
It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR? Thank you for contributing to CockroachDB. Please ensure you have followed the guidelines for creating a PR. My owl senses detect your PR is good for review. Please keep an eye out for any test failures in CI. I was unable to automatically find a reviewer. You can try CCing one of the following members:
🦉 Hoot! I am Blathers, a bot for CockroachDB. My owner is dev-inf.
Force-pushed 0d2fb34 to ad83ea0
Force-pushed ad83ea0 to ff67a23
Force-pushed ff67a23 to 5a8da46
Force-pushed 5a8da46 to ef15fe7
Force-pushed ef15fe7 to 57bc081
Force-pushed 57bc081 to a332c17
Force-pushed a332c17 to b80d02f
lyang24 left a comment
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten)
pkg/kv/kvserver/spanlatch/manager.go line 640 at r1 (raw file):
held := now.Sub(lg.acquireTime)
log.Warningf(context.Background(), "%s has held latch for %s. Some possible causes are:"+
I'd appreciate help with the wording on some of the reasons that cause a long latch hold here.
Force-pushed d24c3f5 to 76b957d
nvb left a comment
Nice! This is looking good.
Reviewed 4 of 5 files at r1, 1 of 1 files at r2, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @lyang24)
-- commits line 4 at r2:
s/helding/holding/
-- commits line 5 at r2:
s/system variable/cluster setting/
"that is introduced"
-- commits line 12 at r2:
Thanks for writing this! I don't think this specific change needs a release note though.
pkg/kv/kvserver/spanlatch/manager.go line 640 at r1 (raw file):
Previously, lyang24 (Lanqing Yang) wrote…
appreciate help with the wording on some of the reasons that causes the long latch held here
How about something like:
request %s held latches for %s. Some possible causes are slow disk reads, slow raft replication, and expensive request processing.
Intent resolution for a txn isn't performed while holding (it's async), so I omitted that here.
pkg/kv/kvserver/spanlatch/manager.go line 147 at r2 (raw file):
// checking of conflicts, and waiting.
snap *snapshot
acquireTime time.Time
time.Time is unfortunately 24 bytes large. We can grab the nanos out of it with UnixNano() and drop it down to 8 bytes, which will make this change less impactful on the memory size of Guard.
pkg/kv/kvserver/spanlatch/manager.go line 248 at r2 (raw file):
	return nil, err
}
lg.acquireTime = timeutil.Now()
What do you think about pushing this assignment into wait (right before return nil) so that we don't need to duplicate it?
pkg/kv/kvserver/spanlatch/manager.go line 637 at r2 (raw file):
m.mu.Unlock()
now := time.Now()
if !lg.acquireTime.IsZero() && m.settings != nil && m.everySecondLogger.ShouldLog() && now.Sub(lg.acquireTime) > longLatchHoldDuration.Get(&m.settings.SV) {
Instead of conditioning this on m.everySecondLogger.ShouldLog() and then omitting this log from traces when it was recently logged, we can do something like:
if !lg.acquireTime.IsZero() && m.settings != nil && now.Sub(lg.acquireTime) > longLatchHoldDuration.Get(&m.settings.SV) {
msg := "..."
if m.everySecondLogger.ShouldLog() {
log.Warningf(ctx, msg, lg.baFmt, held)
} else {
log.VEventf(ctx, 2, msg, lg.baFmt, held)
}
}
pkg/kv/kvserver/spanlatch/manager.go line 638 at r2 (raw file):
now := time.Now()
if !lg.acquireTime.IsZero() && m.settings != nil && m.everySecondLogger.ShouldLog() && now.Sub(lg.acquireTime) > longLatchHoldDuration.Get(&m.settings.SV) {
	held := now.Sub(lg.acquireTime)
Let's pull this up so that we don't need to duplicate it in the if logic.
pkg/kv/kvserver/spanlatch/manager.go line 639 at r2 (raw file):
if !lg.acquireTime.IsZero() && m.settings != nil && m.everySecondLogger.ShouldLog() && now.Sub(lg.acquireTime) > longLatchHoldDuration.Get(&m.settings.SV) {
	held := now.Sub(lg.acquireTime)
	log.Warningf(context.Background(),
In a new commit, we should plumb a context down through managerImpl.FinishReq and into Manager.Release so that we don't need to use context.Background(). We can change this to context.TODO() if you'd like to do that in a separate PR after.
pkg/kv/kvserver/spanlatch/settings.go line 1 at r2 (raw file):
// Copyright 2020 The Cockroach Authors.
s/2020/2024/
Force-pushed 76b957d to c74177b
Force-pushed c74177b to 1d23d47
Force-pushed 018c388 to 0df1d8b
lyang24 left a comment
Fixed the comments and added a new commit that plumbs context all the way down to the latch manager's release. The suggestion really helped; it turned out nicely.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten)
pkg/kv/kvserver/spanlatch/manager.go line 147 at r2 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
time.Time is unfortunately 24 bytes large. We can grab the nanos out of it with UnixNano() and drop it down to 8 bytes, which will make this change less impactful on the memory size of Guard.
Wow, yeah, UnixNano could totally do the job while saving space.
pkg/kv/kvserver/spanlatch/manager.go line 639 at r2 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
In a new commit, we should plumb a context down through managerImpl.FinishReq and into Manager.Release so that we don't need to use context.Background(). We can change this to context.TODO() if you'd like to do that in a separate PR after.
added a new commit
Force-pushed 0df1d8b to 06d0c60
nvb left a comment
Reviewed 1 of 3 files at r4, 9 of 9 files at r6, 8 of 8 files at r7, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @lyang24)
pkg/kv/kvserver/spanlatch/settings.go line 23 at r6 (raw file):
)

const LongLatchHeldMsg = "%s has held latch for %d ns. Some possible causes are " +
Why extract this into an exported constant and separate it from its users? Unless we need it for testing, we should probably just declare it inline in Release, right above where it is used.
Something like:
if lg.acquireTime != 0 && m.settings != nil && held > longLatchHoldDuration.Get(&m.settings.SV).Nanoseconds() {
	const msg = "..."
if m.everySecondLogger.ShouldLog() {
log.Warningf(context.Background(), msg, lg.baFmt, held)
} else {
log.VEventf(context.Background(), 2, msg, lg.baFmt, held)
}
}
pkg/kv/kvserver/spanlatch/manager_test.go line 175 at r7 (raw file):
// a b c d
ctx := context.Background()
nit: move this above the comment about which latches are being acquired.
pkg/kv/kvserver/spanlatch/manager_test.go line 184 at r7 (raw file):
lg2C := m.MustAcquireCh(spans("a", "b", read, ts0))
lg2 := testLatchSucceeds(t, lg2C)
m.Release(context.Background(), lg2)
ctx
pkg/kv/kvserver/spanlatch/manager_test.go line 239 at r7 (raw file):
for _, lgC := range attempts {
	lg := testLatchSucceeds(t, lgC)
	m.Release(context.Background(), lg)
ctx
pkg/kv/kvserver/spanlatch/manager_test.go line 740 at r7 (raw file):
snap.close()
if len(lgBuf) == cap(lgBuf) {
	m.Release(context.Background(), <-lgBuf)
Since this is a benchmark, we don't want to add a function call here (even though it should be inlined), so let's extract out a ctx variable outside the readsPerWrite loop.
Force-pushed d299b2b to 8987790
This commit aims to help observability by logging requests that hold latches over a threshold. long_latch_hold_duration is a new cluster setting that is introduced to set the latch-holding time threshold; latches held over the threshold will be logged at most every second. To achieve this logging, spanlatch.Manager now holds a pointer to the cluster settings.

Fixes: cockroachdb#114609

Release note: None
Force-pushed 8987790 to fe7c1ed
This commit plumbs a context from the concurrency manager's finishReq method down to the latch manager's release method. The goal is to supply the right context when logging events on latch release.

Informs: cockroachdb#114609

Release note: None
Force-pushed fe7c1ed to 338aa9c
lyang24 left a comment
Appreciate the meticulous reviews. I had some fumbles when editing the stacked commits, but it seems like Reviewable captured the diff correctly. I think it's good for another review once the tests pass.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten)
pkg/kv/kvserver/spanlatch/settings.go line 23 at r6 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Why extract this into an exported constant and separate it from its users? Unless we need it for testing, we should probably just declare it inline in Release, right above where it is used. Something like:

if lg.acquireTime != 0 && m.settings != nil && held > longLatchHoldDuration.Get(&m.settings.SV).Nanoseconds() {
	const msg = "..."
	if m.everySecondLogger.ShouldLog() {
		log.Warningf(context.Background(), msg, lg.baFmt, held)
	} else {
		log.VEventf(context.Background(), 2, msg, lg.baFmt, held)
	}
}
Oh, I was wondering whether the msg variable would be allocated every time Release is called vs. initializing it once outside the call. That seems like a micro-optimization, so it should be fine; I moved msg inside the call.
pkg/kv/kvserver/spanlatch/manager_test.go line 740 at r7 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Since this is a benchmark, we don't want to add a function call here (even though it should be inlined), so let's extract out a ctx variable outside the readsPerWrite loop.
thank you for catching this subtle change
pkg/kv/kvserver/spanlatch/manager_test.go line 690 at r9 (raw file):
lg3 := m.AcquireOptimistic(spans("a", "e", write, zeroTS), poison.Policy_Error, nil)
require.True(t, m.CheckOptimisticNoConflicts(lg3, spans("a", "e", write, zeroTS)))
lg3, err = m.WaitUntilAcquired(ctx, lg3)
Since context.Background() is an empty context, I changed m.WaitUntilAcquired in this function to use the ctx variable for consistency.
nvb left a comment
Reviewed 9 of 9 files at r8, 8 of 8 files at r9, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @lyang24)
pkg/kv/kvserver/spanlatch/settings.go line 23 at r6 (raw file):
Previously, lyang24 (Lanqing Yang) wrote…
Oh, I was wondering whether the msg variable would be allocated every time Release is called vs. initializing it once outside the call. That seems like a micro-optimization, so it should be fine; I moved msg inside the call.
String literals in Go are always statically allocated, so there is no cost to scoping this more locally to its one user.
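This can be demonstrated with testing.AllocsPerRun; a standalone sketch (the message text here is illustrative, not the PR's actual log message):

```go
package main

import (
	"fmt"
	"testing"
)

// format mimics declaring the message inline in Release: the string
// literal lives in the binary's read-only data, so returning it causes
// no per-call heap allocation.
func format() string {
	const msg = "request %s held latches for %s"
	return msg
}

func main() {
	// AllocsPerRun reports the average heap allocations per call.
	allocs := testing.AllocsPerRun(1000, func() { _ = format() })
	fmt.Println(allocs) // 0
}
```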
bors r+
Build succeeded: