insights: identify statements with high retry counts #86415
craig[bot] merged 1 commit into cockroachdb:master from matthewtodd:insights-high-retry-count
Conversation
matthewtodd
left a comment
Reviewable status: complete! 0 of 0 LGTMs obtained
pkg/sql/sqlstats/insights/insights.proto line 25 at r2 (raw file):
HighWaitTime = 3;
HighRetryCount = 4;
FailedExecution = 5;
I'm working through documentation for these values now.
maryliag
left a comment
Reviewed 13 of 13 files at r1, 1 of 4 files at r2.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @maryliag and @matthewtodd)
pkg/sql/sqlstats/insights/detector.go line 48 at r1 (raw file):
result := false
for _, d := range a.detectors {
    result = d.isSlow(statement) || result
}
Can you add a check: if the value is already true, break out of the for loop so you don't need to continue checking?
pkg/sql/sqlstats/insights/insights.go line 86 at r1 (raw file):
settings.TenantWritable,
"sql.insights.high_retry_count.threshold",
"slow statements retried more than this number of times will be marked as having a high retry count",
Should you add something like "... marked as an insight with high ..."? I just want to clarify that this will be marked on the insights, in case users want to check.
Addresses #85827.

This change undoes the previous "concerns" concept from #85345 and replaces it with a "problems" field.[^1][^2] We then mark slow statements retried more than 10 times as having a "HighRetryCount" problem.[^3] Slow statements without any specifically identified problems will have "Unknown" in their list of problems. We will add support for further problem detection in future commits.

[^1]: Every execution insight we offer happens because a statement was slow, so the "SlowExecution" concern was the only one we'd ever have.
[^2]: The protocol buffer changes are safe because we haven't deployed this code anywhere yet.
[^3]: This threshold may be configured via the `sql.insights.high_retry_count.threshold` cluster setting.

Release justification: Category 2: Bug fixes and low-risk updates to new functionality

Release note (ops change): The new `sql.insights.high_retry_count.threshold` cluster setting may be used to configure how many times a slow statement (as identified by the execution insights system) must have been retried to be marked as having a high retry count.
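The classification rule described above (an already-slow statement gets "HighRetryCount" if it was retried more than the threshold, otherwise "Unknown") can be sketched as follows. This is a hypothetical, simplified illustration, not the actual CockroachDB code: the `Problem` type and `problemsFor` function are invented names, and the integer threshold stands in for the `sql.insights.high_retry_count.threshold` cluster setting.

```go
package main

import "fmt"

// Problem is a hypothetical stand-in for the enum discussed in insights.proto.
type Problem int

const (
	ProblemUnknown Problem = iota
	ProblemHighRetryCount
)

func (p Problem) String() string {
	switch p {
	case ProblemHighRetryCount:
		return "HighRetryCount"
	default:
		return "Unknown"
	}
}

// problemsFor classifies a statement that has ALREADY been identified as slow.
// The threshold parameter models the cluster setting mentioned in the PR.
func problemsFor(retries, threshold int) []Problem {
	var problems []Problem
	if retries > threshold {
		problems = append(problems, ProblemHighRetryCount)
	}
	if len(problems) == 0 {
		// Slow, but with no specifically identified problem.
		problems = append(problems, ProblemUnknown)
	}
	return problems
}

func main() {
	fmt.Println(problemsFor(12, 10)) // retried more than the threshold
	fmt.Println(problemsFor(3, 10))  // slow for some other, unidentified reason
}
```

Note that only statements already flagged as slow are classified at all; a heavily retried but fast statement never reaches this step, which is the subtlety discussed in the review thread below.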
matthewtodd
left a comment
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @maryliag)
pkg/sql/sqlstats/insights/detector.go line 48 at r1 (raw file):
Previously, maryliag (Marylia Gutierrez) wrote…
can you do a check if the value is already true, break for loop, so you don't need to continue checking
That's what the comment is about -- we don't short-circuit the for loop so that all the detectors have the chance to observe the statement. This is important for the (fancy) anomaly detector, so that it gets to build up its sense of normal.
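The reason for not short-circuiting can be made concrete with a small sketch. This is an illustrative toy, not the actual detector code: the `detector` interface, `thresholdDetector`, and `anomalyDetector` names are hypothetical, and the anomaly detector here just tracks a running mean as a stand-in for "building up its sense of normal."

```go
package main

import "fmt"

// detector is a hypothetical version of the interface under review.
type detector interface {
	isSlow(latencySeconds float64) bool
}

// thresholdDetector flags anything over a fixed latency.
type thresholdDetector struct{ threshold float64 }

func (d *thresholdDetector) isSlow(latency float64) bool {
	return latency > d.threshold
}

// anomalyDetector keeps a running mean. It must observe EVERY statement,
// including ones another detector already flagged, or its notion of
// "normal" latency drifts.
type anomalyDetector struct {
	count int
	mean  float64
}

func (d *anomalyDetector) isSlow(latency float64) bool {
	d.count++
	d.mean += (latency - d.mean) / float64(d.count)
	return d.count > 1 && latency > 2*d.mean
}

func main() {
	detectors := []detector{
		&thresholdDetector{threshold: 1.0},
		&anomalyDetector{},
	}
	for _, latency := range []float64{0.1, 0.2, 5.0} {
		result := false
		// Deliberately no early break: every detector sees every statement,
		// even after result is already true.
		for _, d := range detectors {
			result = d.isSlow(latency) || result
		}
		fmt.Printf("latency=%.1f slow=%v\n", latency, result)
	}
}
```

With an early `break`, the anomaly detector would silently miss any statement a cheaper detector had already flagged, skewing its running statistics.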
pkg/sql/sqlstats/insights/insights.go line 86 at r1 (raw file):
Previously, maryliag (Marylia Gutierrez) wrote…
should you add something like "... marked as an insight with high..." ? Just want to clarify that this will be marked on the insights, in case they want to check
Yeah, I'm playing with how to use the names. Our insight is that this statement was slow because of being retried a bunch. I nodded toward that with the phrase "slow statements" here, leaning on the insights namespace in the cluster setting to provide the context. But I think we can do more to make this clear, especially calling out that we won't be identifying all highly-retried statements, just the slow ones.
I've just updated to "the number of retries a slow statement must have undergone for its high retry count to be highlighted as a potential problem," leaning on the word "problem" from the insights proto / table. WDYT?
maryliag
left a comment
Just one question, otherwise
Great to see there's room for all the other problems we can just add in 😄
Reviewed 5 of 6 files at r4, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @maryliag and @matthewtodd)
pkg/sql/sqlstats/insights/detector.go line 48 at r1 (raw file):
Previously, matthewtodd (Matthew Todd) wrote…
That's what the comment is about -- we don't short-circuit the for loop so that all the detectors have the chance to observe the statement. This is important for the (fancy) anomaly detector, so that it gets to build up its sense of normal.
ahhh I see what you mean now!
pkg/sql/sqlstats/insights/insights.go line 86 at r1 (raw file):
Previously, matthewtodd (Matthew Todd) wrote…
Yeah, I'm playing with how to use the names. Our insight is that this statement was slow because of being retried a bunch. I nodded toward that with the phrase "slow statements" here, leaning on the insights namespace in the cluster setting to provide the context. But I think we can do more to make this clear, especially calling out that we won't be identifying all highly-retried statements, just the slow ones.
I've just updated to "the number of retries a slow statement must have undergone for its high retry count to be highlighted as a potential problem," leaning on the word "problem" from the insights proto / table. WDYT?
I like this approach! Thank you for updating!
pkg/sql/sqlstats/insights/insights.proto line 36 at r4 (raw file):
// This statement was slow because of being retried multiple times, again due
// to contention. The "high" threshold may be configured by the
is a retry always because of contention?
matthewtodd
left a comment
TFTR, @maryliag!
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @maryliag and @matthewtodd)
pkg/sql/sqlstats/insights/insights.proto line 36 at r4 (raw file):
Previously, maryliag (Marylia Gutierrez) wrote…
is a retry always because of contention?
AFAIK? I threw these together quickly from the Problems table in the doc.
bors r+
Build failed (retrying...):
Build failed:
bors r+
Build succeeded:
Addresses #85827.

This change undoes the previous "concerns" concept from #85345 and
replaces it with a "problems" field.[1][2] We then mark slow statements
retried more than 5 times as having a "HighRetryCount" problem.[3] Slow
statements without any specifically identified problems will have
"Unknown" in their list of problems. We will add support for further
problem detection in future commits.

Release justification: Category 2: Bug fixes and low-risk updates to new
functionality

Release note (ops change): The new
`sql.insights.high_retry_count.threshold` cluster setting may be used to
configure how many times a slow statement (as identified by the
execution insights system) must have been retried to be marked as having
a high retry count.

Footnotes
[1] Every execution insight we offer happens because a statement was
slow, so the "SlowExecution" concern was the only one we'd ever have.
[2] The protocol buffer changes are safe because we haven't deployed
this code anywhere yet.
[3] This threshold may be configured via the
`sql.insights.high_retry_count.threshold` cluster setting.