Skip to content

kv: log on excessive latch hold duration#115746

Merged
craig[bot] merged 2 commits intocockroachdb:masterfrom
lyang24:long_latch_held
Feb 2, 2024
Merged

kv: log on excessive latch hold duration#115746
craig[bot] merged 2 commits intocockroachdb:masterfrom
lyang24:long_latch_held

Conversation

@lyang24
Copy link
Copy Markdown
Contributor

@lyang24 lyang24 commented Dec 7, 2023

This commit aims to help observability by logging request holding latches over
threshold. long_latch_hold_duration is a new cluster setting that is introduced
to set the latch holding time threshold, latches held over the threshold will
be logged at most every second. To achieve logging spanlatch.manager now
contains a pointer to cluster setting.

Fixes: #114609

Release note: None

@blathers-crl
Copy link
Copy Markdown

blathers-crl bot commented Dec 7, 2023

It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR?

Thank you for contributing to CockroachDB. Please ensure you have followed the guidelines for creating a PR.

My owl senses detect your PR is good for review. Please keep an eye out for any test failures in CI.

I was unable to automatically find a reviewer. You can try CCing one of the following members:

  • A person you worked with closely on this PR.
  • The person who created the ticket, or a CRDB organization member involved with the ticket (author, commenter, etc.).
  • Join our community slack channel and ask on #contributors.
  • Try find someone else from here.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@blathers-crl blathers-crl bot added O-community Originated from the community X-blathers-untriaged blathers was unable to find an owner labels Dec 7, 2023
@cockroach-teamcity
Copy link
Copy Markdown
Member

This change is Reviewable

@blathers-crl
Copy link
Copy Markdown

blathers-crl bot commented Dec 7, 2023

It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR?

Thank you for updating your pull request.

My owl senses detect your PR is good for review. Please keep an eye out for any test failures in CI.

I was unable to automatically find a reviewer. You can try CCing one of the following members:

  • A person you worked with closely on this PR.
  • The person who created the ticket, or a CRDB organization member involved with the ticket (author, commenter, etc.).
  • Join our community slack channel and ask on #contributors.
  • Try find someone else from here.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@annrpom annrpom removed the X-blathers-untriaged blathers was unable to find an owner label Dec 14, 2023
@blathers-crl
Copy link
Copy Markdown

blathers-crl bot commented Jan 8, 2024

Thank you for updating your pull request.

My owl senses detect your PR is good for review. Please keep an eye out for any test failures in CI.

I was unable to automatically find a reviewer. You can try CCing one of the following members:

  • A person you worked with closely on this PR.
  • The person who created the ticket, or a CRDB organization member involved with the ticket (author, commenter, etc.).
  • Join our community slack channel and ask on #contributors.
  • Try find someone else from here.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@blathers-crl blathers-crl bot added the X-blathers-untriaged blathers was unable to find an owner label Jan 8, 2024
@lyang24 lyang24 changed the title kv(wip): log on excessive latch hold duration kv: log on excessive latch hold duration Jan 8, 2024
@blathers-crl
Copy link
Copy Markdown

blathers-crl bot commented Jan 8, 2024

Thank you for updating your pull request.

My owl senses detect your PR is good for review. Please keep an eye out for any test failures in CI.

I have added a few people who may be able to assist in reviewing:

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@blathers-crl blathers-crl bot added the X-blathers-triaged blathers was able to find an owner label Jan 8, 2024
@blathers-crl blathers-crl bot requested a review from nvb January 8, 2024 07:02
@lyang24 lyang24 removed the X-blathers-untriaged blathers was unable to find an owner label Jan 13, 2024
@blathers-crl
Copy link
Copy Markdown

blathers-crl bot commented Jan 13, 2024

It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR?

Thank you for updating your pull request.

My owl senses detect your PR is good for review. Please keep an eye out for any test failures in CI.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@blathers-crl
Copy link
Copy Markdown

blathers-crl bot commented Jan 23, 2024

Thank you for updating your pull request.

My owl senses detect your PR is good for review. Please keep an eye out for any test failures in CI.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@lyang24 lyang24 marked this pull request as ready for review January 23, 2024 18:17
@lyang24 lyang24 requested a review from a team as a code owner January 23, 2024 18:17
Copy link
Copy Markdown
Contributor Author

@lyang24 lyang24 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten)


pkg/kv/kvserver/spanlatch/manager.go line 640 at r1 (raw file):

		held := now.Sub(lg.acquireTime)
		log.Warningf(context.Background(),
			"%s has held latch for %s. Some possible causes are:"+

appreciate help with the wording on some of the reasons that causes the long latch held here

@lyang24 lyang24 force-pushed the long_latch_held branch 2 times, most recently from d24c3f5 to 76b957d Compare January 23, 2024 20:35
Copy link
Copy Markdown
Contributor

@nvb nvb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! This is looking good.

Reviewed 4 of 5 files at r1, 1 of 1 files at r2, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @lyang24)


-- commits line 4 at r2:
s/helding/holding/


-- commits line 5 at r2:
s/system variable/cluster setting/

"that is introduced"


-- commits line 12 at r2:
Thanks for writing this! I don't think this specific change needs a release note though.


pkg/kv/kvserver/spanlatch/manager.go line 640 at r1 (raw file):

Previously, lyang24 (Lanqing Yang) wrote…

appreciate help with the wording on some of the reasons that causes the long latch held here

How about something like:

request %s held latches for %s. Some possible causes are slow disk reads, slow raft replication, and expensive request processing.

Intent resolution for a txn isn't performed while holding (it's async), so I omitted that here.


pkg/kv/kvserver/spanlatch/manager.go line 147 at r2 (raw file):

	// checking of conflicts, and waiting.
	snap        *snapshot
	acquireTime time.Time

time.Time is unfortunately 24 bytes large. We can grab the nanos out of it with UnixNano() and drop it down to 8 bytes, which will make this change less impactful on the memory size of Guard.


pkg/kv/kvserver/spanlatch/manager.go line 248 at r2 (raw file):

		return nil, err
	}
	lg.acquireTime = timeutil.Now()

What do you think about pushing this assignment into wait (right before return nil) so that we don't need to duplicate it?


pkg/kv/kvserver/spanlatch/manager.go line 637 at r2 (raw file):

	m.mu.Unlock()
	now := time.Now()
	if !lg.acquireTime.IsZero() && m.settings != nil && m.everySecondLogger.ShouldLog() && now.Sub(lg.acquireTime) > longLatchHoldDuration.Get(&m.settings.SV) {

Instead of conditioning this on m.everySecondLogger.ShouldLog() and then omitting this log from traces when it was recently logged, we can do something like:

if !lg.acquireTime.IsZero() && m.settings != nil && now.Sub(lg.acquireTime) > longLatchHoldDuration.Get(&m.settings.SV) {
    msg := "..."
    if m.everySecondLogger.ShouldLog() {
        log.Warningf(ctx, msg, lg.baFmt, held)
    } else {
        log.VEventf(ctx, 2, msg, lg.baFmt, held)
    }
}

pkg/kv/kvserver/spanlatch/manager.go line 638 at r2 (raw file):

	now := time.Now()
	if !lg.acquireTime.IsZero() && m.settings != nil && m.everySecondLogger.ShouldLog() && now.Sub(lg.acquireTime) > longLatchHoldDuration.Get(&m.settings.SV) {
		held := now.Sub(lg.acquireTime)

Let's pull this up so that we don't need to duplicate it in the if logic.


pkg/kv/kvserver/spanlatch/manager.go line 639 at r2 (raw file):

	if !lg.acquireTime.IsZero() && m.settings != nil && m.everySecondLogger.ShouldLog() && now.Sub(lg.acquireTime) > longLatchHoldDuration.Get(&m.settings.SV) {
		held := now.Sub(lg.acquireTime)
		log.Warningf(context.Background(),

In a new commit, we should plumb a context down through managerImpl.FinishReq and into Manager.Release so that we don't need to use context.Background(). We can change this to context.TODO() if you'd like to do that in a separate PR after.


pkg/kv/kvserver/spanlatch/settings.go line 1 at r2 (raw file):

// Copyright 2020 The Cockroach Authors.

s/2020/2024/

@lyang24
Copy link
Copy Markdown
Contributor Author

lyang24 commented Jan 25, 2024

still need to plumb down context before the new review

@blathers-crl
Copy link
Copy Markdown

blathers-crl bot commented Jan 26, 2024

Thank you for updating your pull request.

Before a member of our team reviews your PR, I have some potential action items for you:

  • We notice you have more than one commit in your PR. We try break logical changes into separate commits, but commits such as "fix typo" or "address review commits" should be squashed into one commit and pushed with --force
  • When CI has completed, please ensure no errors have appeared.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@cockroach-teamcity
Copy link
Copy Markdown
Member

cockroach-teamcity commented Jan 26, 2024

CLA assistant check
All committers have signed the CLA.

@lyang24 lyang24 force-pushed the long_latch_held branch 3 times, most recently from 018c388 to 0df1d8b Compare January 26, 2024 08:04
Copy link
Copy Markdown
Contributor Author

@lyang24 lyang24 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed comments and added a new commit that plumbed context all the way down to latch manager release. The suggest really helped it turned out nicely.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten)


pkg/kv/kvserver/spanlatch/manager.go line 147 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

time.Time is unfortunately 24 bytes large. We can grab the nanos out of it with UnixNano() and drop it down to 8 bytes, which will make this change less impactful on the memory size of Guard.

wow yea unix nano could totally do the job while saving space


pkg/kv/kvserver/spanlatch/manager.go line 639 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

In a new commit, we should plumb a context down through managerImpl.FinishReq and into Manager.Release so that we don't need to use context.Background(). We can change this to context.TODO() if you'd like to do that in a separate PR after.

added a new commit

Copy link
Copy Markdown
Contributor

@nvb nvb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 3 files at r4, 9 of 9 files at r6, 8 of 8 files at r7, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @lyang24)


pkg/kv/kvserver/spanlatch/settings.go line 23 at r6 (raw file):

)

const LongLatchHeldMsg = "%s has held latch for %d ns. Some possible causes are " +

Why extract this into an exported constant and separate it from its users? Unless we need it for testing, we should probably just declare it inline in Release, right above where it is used.

Something like:

if lg.acquireTime != 0 && m.settings != nil && held > longLatchHoldDuration.Get(&m.settings.SV).Nanoseconds() {  
  const msg := "..."
  if m.everySecondLogger.ShouldLog() {  
   log.Warningf(context.Background(), msg, lg.baFmt, held)  
  } else {  
   log.VEventf(context.Background(), 2, msg, lg.baFmt, held)  
  }  
}

pkg/kv/kvserver/spanlatch/manager_test.go line 175 at r7 (raw file):

	//    a   b      c   d
	//
	ctx := context.Background()

nit: move this above the comment about which latches are being acquired.


pkg/kv/kvserver/spanlatch/manager_test.go line 184 at r7 (raw file):

	lg2C := m.MustAcquireCh(spans("a", "b", read, ts0))
	lg2 := testLatchSucceeds(t, lg2C)
	m.Release(context.Background(), lg2)

ctx


pkg/kv/kvserver/spanlatch/manager_test.go line 239 at r7 (raw file):

	for _, lgC := range attempts {
		lg := testLatchSucceeds(t, lgC)
		m.Release(context.Background(), lg)

ctx


pkg/kv/kvserver/spanlatch/manager_test.go line 740 at r7 (raw file):

				snap.close()
				if len(lgBuf) == cap(lgBuf) {
					m.Release(context.Background(), <-lgBuf)

Since this is a benchmark, we don't want to add a function call here (even though it should be inlined), so let's extract out a ctx variable outside the readsPerWrite loop.

@lyang24 lyang24 force-pushed the long_latch_held branch 3 times, most recently from d299b2b to 8987790 Compare February 1, 2024 20:03
This commit aims to help observability by logging request holding latches over
threshold. long_latch_hold_duration is a new cluster setting that is introduced
to set the latch holding time threshold, latches held over the threshold will
be logged at most every second. To achieve logging spanlatch.manager now
contains a pointer to cluster setting.

Fixes: cockroachdb#114609

Release note: None
This commit adds context from concurrency manager's finishReq method to latch
manager's release method. The goal is to supply the right context when logging
events on latch release.

Informs: cockroachdb#114609

Release note: None
Copy link
Copy Markdown
Contributor Author

@lyang24 lyang24 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Appreciate the meticulous reviews. I had some fumble when editing stacked commits but it seems like reviewable captures the diff correctly. I think its good for another review once the test passes.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten)


pkg/kv/kvserver/spanlatch/settings.go line 23 at r6 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Why extract this into an exported constant and separate it from its users? Unless we need it for testing, we should probably just declare it inline in Release, right above where it is used.

Something like:

if lg.acquireTime != 0 && m.settings != nil && held > longLatchHoldDuration.Get(&m.settings.SV).Nanoseconds() {  
  const msg := "..."
  if m.everySecondLogger.ShouldLog() {  
   log.Warningf(context.Background(), msg, lg.baFmt, held)  
  } else {  
   log.VEventf(context.Background(), 2, msg, lg.baFmt, held)  
  }  
}

oh I was wondering if the msg variable will be allocated in memory everything release it called vs initialize it once out side of the call. That seems like micro optimization it should be fine, i changed msg inside the call.


pkg/kv/kvserver/spanlatch/manager_test.go line 740 at r7 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Since this is a benchmark, we don't want to add a function call here (even though it should be inlined), so let's extract out a ctx variable outside the readsPerWrite loop.

thank you for catching this subtle change


pkg/kv/kvserver/spanlatch/manager_test.go line 690 at r9 (raw file):

	lg3 := m.AcquireOptimistic(spans("a", "e", write, zeroTS), poison.Policy_Error, nil)
	require.True(t, m.CheckOptimisticNoConflicts(lg3, spans("a", "e", write, zeroTS)))
	lg3, err = m.WaitUntilAcquired(ctx, lg3)

since context.background() is empty context i changed m.WaitUntilAcquiredin this function to use ctx variable for consistency

Copy link
Copy Markdown
Contributor

@nvb nvb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewed 9 of 9 files at r8, 8 of 8 files at r9, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @lyang24)


pkg/kv/kvserver/spanlatch/settings.go line 23 at r6 (raw file):

Previously, lyang24 (Lanqing Yang) wrote…

oh I was wondering if the msg variable will be allocated in memory everything release it called vs initialize it once out side of the call. That seems like micro optimization it should be fine, i changed msg inside the call.

String literals in Go are always statically allocated, so there is no cost to scoping this more locally to its one user.

@lyang24
Copy link
Copy Markdown
Contributor Author

lyang24 commented Feb 1, 2024

bors r+

@craig
Copy link
Copy Markdown
Contributor

craig bot commented Feb 2, 2024

Build succeeded:

@craig craig bot merged commit 3d7a44a into cockroachdb:master Feb 2, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

O-community Originated from the community X-blathers-triaged blathers was able to find an owner

Projects

None yet

Development

Successfully merging this pull request may close these issues.

kv: log on excessive latch hold duration

4 participants