insights: identify statements with high retry counts #86415

Merged
craig[bot] merged 1 commit into cockroachdb:master from matthewtodd:insights-high-retry-count
Aug 19, 2022

Conversation

@matthewtodd matthewtodd commented Aug 18, 2022

Addresses #85827.

This change undoes the previous "concerns" concept from #85345 and
replaces it with a "problems" field.[^1][^2] We then mark slow statements
retried more than 10 times as having a "HighRetryCount" problem.[^3] Slow
statements without any specifically identified problems will have
"Unknown" in their list of problems. We will add support for further
problem detection in future commits.

Release justification: Category 2: Bug fixes and low-risk updates to new
functionality

Release note (ops change): The new
`sql.insights.high_retry_count.threshold` cluster setting may be used to
configure how many times a slow statement (as identified by the
execution insights system) must have been retried to be marked as having
a high retry count.

[^1]: Every execution insight we offer happens because a statement was
  slow, so the "SlowExecution" concern was the only one we'd ever have.

[^2]: The protocol buffer changes are safe because we haven't deployed
  this code anywhere yet.

[^3]: This threshold may be configured via the
  `sql.insights.high_retry_count.threshold` cluster setting.
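
As a rough illustration of the classification this description lays out (the type and identifier names below are hypothetical stand-ins, not the PR's actual code):

```go
package insights

// Problem and Statement are hypothetical stand-ins for the PR's
// proto-generated types; only the logic mirrors the description above.
type Problem int

const (
	Problem_Unknown Problem = iota
	Problem_HighRetryCount
)

type Statement struct {
	Retries int64
}

// problems sketches the rule described above: a slow statement retried
// more than the configured threshold is marked HighRetryCount, and a
// slow statement with no specifically identified problem is marked
// Unknown. Further detectors would append to the same slice.
func problems(stmt *Statement, retryThreshold int64) []Problem {
	var result []Problem
	if stmt.Retries > retryThreshold {
		result = append(result, Problem_HighRetryCount)
	}
	if len(result) == 0 {
		result = append(result, Problem_Unknown)
	}
	return result
}
```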

@matthewtodd matthewtodd requested review from a team August 18, 2022 19:43
@cockroach-teamcity (Member)

This change is Reviewable

@matthewtodd matthewtodd (Author) left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained


pkg/sql/sqlstats/insights/insights.proto line 25 at r2 (raw file):

  HighWaitTime = 3;
  HighRetryCount = 4;
  FailedExecution = 5;

I'm working through documentation for these values now.

@maryliag maryliag (Contributor) left a comment

Reviewed 13 of 13 files at r1, 1 of 4 files at r2.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @maryliag and @matthewtodd)


pkg/sql/sqlstats/insights/detector.go line 48 at r1 (raw file):

	result := false
	for _, d := range a.detectors {
		result = d.isSlow(statement) || result

Can you add a check: if the value is already true, break out of the loop, so you don't need to continue checking?


pkg/sql/sqlstats/insights/insights.go line 86 at r1 (raw file):

	settings.TenantWritable,
	"sql.insights.high_retry_count.threshold",
	"slow statements retried more than this number of times will be marked as having a high retry count",

Should you add something like "... marked as an insight with high ..."? Just want to clarify that this will be marked on the insights, in case they want to check.

@matthewtodd matthewtodd (Author) left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @maryliag)


pkg/sql/sqlstats/insights/detector.go line 48 at r1 (raw file):

Previously, maryliag (Marylia Gutierrez) wrote…

Can you add a check: if the value is already true, break out of the loop, so you don't need to continue checking?

That's what the comment is about -- we don't short-circuit the for loop so that all the detectors have the chance to observe the statement. This is important for the (fancy) anomaly detector, so that it gets to build up its sense of normal.
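
For concreteness, here is a minimal sketch of that pattern. It is a hypothetical reconstruction around the excerpt above (the real types live in pkg/sql/sqlstats/insights/detector.go); the point is the operand order in the accumulated OR:

```go
package insights

type Statement struct{}

// detector is a stand-in for the PR's detector interface.
type detector interface {
	isSlow(*Statement) bool
}

type compositeDetector struct {
	detectors []detector
}

// isSlow deliberately does not break out of the loop early: every
// registered detector gets to observe every statement, so stateful
// detectors (like the anomaly detector) keep building their sense of
// "normal" even after another detector has already returned true.
func (a compositeDetector) isSlow(statement *Statement) bool {
	result := false
	for _, d := range a.detectors {
		// d.isSlow(statement) is evaluated unconditionally; writing
		// `result || d.isSlow(statement)` would short-circuit it away.
		result = d.isSlow(statement) || result
	}
	return result
}
```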


pkg/sql/sqlstats/insights/insights.go line 86 at r1 (raw file):

Previously, maryliag (Marylia Gutierrez) wrote…

Should you add something like "... marked as an insight with high ..."? Just want to clarify that this will be marked on the insights, in case they want to check.

Yeah, I'm playing with how to use the names. Our insight is that this statement was slow because of being retried a bunch. I nodded toward that with the phrase "slow statements" here, leaning on the insights namespace in the cluster setting to provide the context. But I think we can do more to make this clear, especially calling out that we won't be identifying all highly-retried statements, just the slow ones.

I've just updated to "the number of retries a slow statement must have undergone for its high retry count to be highlighted as a potential problem," leaning on the word "problem" from the insights proto / table. WDYT?
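
For context, a sketch of what the reworded registration might look like. Only the setting name and description string come from this thread; the variable name and the default of 10 are assumptions based on the commit message:

```go
package insights

import "github.com/cockroachdb/cockroach/pkg/settings"

// A sketch, not necessarily the PR's exact code.
var highRetryCountThreshold = settings.RegisterIntSetting(
	settings.TenantWritable,
	"sql.insights.high_retry_count.threshold",
	"the number of retries a slow statement must have undergone "+
		"for its high retry count to be highlighted as a potential problem",
	10, // assumed default, matching "more than 10 times" in the commit message
)
```

A detector could then read the current value with `highRetryCountThreshold.Get(&st.SV)`, where `st` is the cluster's `*cluster.Settings` (again an assumption about the surrounding code, following the usual CockroachDB settings pattern).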

@maryliag maryliag (Contributor) left a comment

Just one question, otherwise :lgtm:
Great to see the room for all other problems we can just add in 😄

Reviewed 5 of 6 files at r4, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @maryliag and @matthewtodd)


pkg/sql/sqlstats/insights/detector.go line 48 at r1 (raw file):

Previously, matthewtodd (Matthew Todd) wrote…

That's what the comment is about -- we don't short-circuit the for loop so that all the detectors have the chance to observe the statement. This is important for the (fancy) anomaly detector, so that it gets to build up its sense of normal.

ahhh I see what you mean now!


pkg/sql/sqlstats/insights/insights.go line 86 at r1 (raw file):

Previously, matthewtodd (Matthew Todd) wrote…

Yeah, I'm playing with how to use the names. Our insight is that this statement was slow because of being retried a bunch. I nodded toward that with the phrase "slow statements" here, leaning on the insights namespace in the cluster setting to provide the context. But I think we can do more to make this clear, especially calling out that we won't be identifying all highly-retried statements, just the slow ones.

I've just updated to "the number of retries a slow statement must have undergone for its high retry count to be highlighted as a potential problem," leaning on the word "problem" from the insights proto / table. WDYT?

I like this approach! Thank you for updating!


pkg/sql/sqlstats/insights/insights.proto line 36 at r4 (raw file):

  // This statement was slow because of being retried multiple times, again due
  // to contention. The "high" threshold may be configured by the

Is a retry always because of contention?

@matthewtodd matthewtodd (Author) left a comment

TFTR, @maryliag!

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @maryliag and @matthewtodd)


pkg/sql/sqlstats/insights/insights.proto line 36 at r4 (raw file):

Previously, maryliag (Marylia Gutierrez) wrote…

Is a retry always because of contention?

AFAIK? I threw these together quickly from the Problems table in the doc.

@matthewtodd (Author) commented

bors r+

craig bot commented Aug 19, 2022

Build failed (retrying...):

craig bot commented Aug 19, 2022

Build failed:

@matthewtodd (Author) commented

bors r+

craig bot commented Aug 19, 2022

Build succeeded:

@craig craig bot merged commit 8eabfb7 into cockroachdb:master Aug 19, 2022
@matthewtodd matthewtodd deleted the insights-high-retry-count branch August 19, 2022 15:22