outliers: latency quantile detector #82473
craig[bot] merged 1 commit into cockroachdb:master from matthewtodd:outliers-p99-latency
Conversation
Prototyping something that could help cockroachdb#80666. I'm hoping we can get to the following ability: pick a statement fingerprint, declare a sampling probability that controls when verbose tracing is enabled for it, and a latency threshold above which a trace is persisted/logged/etc. Maybe even a count of how many traces I actually want recorded? Some of these controls could be in the UI, or traces made visible in the UI, but I’d happily take a set of dubious cluster settings or table writes to install these to-be-traced-with-some-probability fingerprints, or just vanilla log outputs.

With a given statement rate (say 1000/s) and a given percentile I’m trying to capture (say p99.9), we have 1 stmt/s in the 99.9th percentile. I need to make this math less handwavy, but I should be able to set a sampling probability P such that with high likelihood (>95%) we capture at least one trace over the next S seconds. The sampling rate lets you control the overhead you’re introducing for those statements in aggregate; dialing it up higher lets you lower S. You might want to do such a thing for infrequently executed statements.

Still not sure how to figure out what statement to trace (we don't capture per-stmt %-iles -- too expensive? Maybe we will in some form after cockroachdb#82473?). Also not sure what we can do better with captured traces than just logging them (stash them under /tracez somehow?), or writing them to some system table capped in number of rows, with row-level TTL to evict old captures and make room for newer, more recent ones.

Anyway, this first stab does the simplest thing: ask for all input through shoddy cluster settings and just log stuff. A quick and dirty attempt to catch an outlier execution for the history table write in TPC-C: I didn't see an overall increase in p99s as a result of the sampling, and it took about 10s to capture the trace below, showing the outlier was due to latch waits.
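The handwavy math can be made concrete. Assuming each execution is sampled independently, the chance of tracing at least one beyond-quantile execution in S seconds is 1 - (1-P)^n, where n is the expected number of such executions in the window. A sketch in Go — the function and parameter names are mine, not anything in this PR:

```go
package main

import (
	"fmt"
	"math"
)

// samplingProbability returns the per-execution sampling probability P
// needed so that, with the requested confidence, at least one execution
// beyond the given quantile is traced during a window of windowSeconds.
// Illustrative only; not part of the PR's code.
func samplingProbability(stmtRate, quantile, windowSeconds, confidence float64) float64 {
	// Expected number of beyond-quantile executions in the window.
	nOutliers := stmtRate * (1 - quantile) * windowSeconds
	// Each is traced independently with probability P, so
	// P(at least one traced) = 1 - (1-P)^n >= confidence.
	return 1 - math.Pow(1-confidence, 1/nOutliers)
}

func main() {
	// 1000 stmt/s, chasing the p99.9 tail, 95% confidence over 10s.
	fmt.Printf("%.3f\n", samplingProbability(1000, 0.999, 10, 0.95))
}
```

With these numbers we expect 10 outlier executions in the window, giving P ≈ 0.26; a longer window or higher statement rate drives P (and the aggregate tracing overhead) down.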
```
> SELECT encode(fingerprint_id, 'hex'), (statistics -> 'statistics' ->> 'cnt')::INT AS count, metadata ->> 'query' AS query FROM system.statement_statistics ORDER BY count DESC LIMIT 10;

       encode      | count |                                                        query
-------------------+-------+----------------------------------------------------------------------------------------------------------------------
  ...
  4e4214880f87d799 |  2680 | INSERT INTO history(h_c_id, h_c_d_id, h_c_w_id, h_d_id, h_w_id, h_amount, h_date, h_data) VALUES ($1, $2, __more6__)

> SET CLUSTER SETTING trace.fingerprint = '4e4214880f87d799';
> SET CLUSTER SETTING trace.fingerprint.probability = 0.01;
> SET CLUSTER SETTING trace.fingerprint.threshold = '90ms';

 0.521ms  0.005ms  event:kv/kvserver/concurrency/concurrency_manager.go:260 [n1,s1,r105/1:/Table/109/1/{90/7/2…-105/10…}] acquiring latches
 0.532ms  0.011ms  event:kv/kvserver/spanlatch/manager.go:532 [n1,s1,r105/1:/Table/109/1/{90/7/2…-105/10…}] waiting to acquire read latch /Table/109/1/105/3/519/0@0,0, held by write latch /Table/109/1/105/3/519/0@1654849354.483984000,0
99.838ms 99.306ms  event:kv/kvserver/concurrency/concurrency_manager.go:301 [n1,s1,r105/1:/Table/109/1/{90/7/2…-105/10…}] scanning lock table for conflicting locks
99.851ms  0.013ms  event:kv/kvserver/replica_read.go:251 [n1,s1,r105/1:/Table/109/1/{90/7/2…-105/10…}] executing read-only batch
```

Release note: None
dhartunian
left a comment
There was a problem hiding this comment.
Very nice commit message, thanks for the detail. Some nits below.
Are there any benchmark targets that would need to be reached for this to be enabled by default or will this always be disabled out of the box?
I like how the outlier detection is structured and how you can register additional custom detectors. Hadn't seen that code before.
Reviewed 3 of 13 files at r1, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @dhartunian and @matthewtodd)
pkg/sql/sqlstats/outliers/outliers.go line 98 at r1 (raw file):
```go
// NewMetrics builds a new instance of our Metrics struct.
func NewMetrics() *Metrics {
```
Is there a reason to return a pointer to `Metrics` here? I see that in other cases, like `txnidcache/metrics.go`, we return a value.
pkg/sql/sqlstats/outliers/detector_test.go line 68 at r1 (raw file):
```go
}

func TestLatencyQuantileDetector(t *testing.T) {
```
I didn't read super closely but could be a candidate for table-based tests especially if you're seeding the outlier detector with a bunch of data beforehand that the tests can then all use.
pkg/sql/sqlstats/outliers/detector_test.go line 77 at r1 (raw file):
```go
st := cluster.MakeTestingClusterSettings()
LatencyQuantileDetectionEnabled.Override(context.Background(), &st.SV, true)
d := newLatencyQuantileDetector(st, NewMetrics())
```
nit: does it matter to test that the enabled status isn't cached on `new`? Maybe move this line before the override?
pkg/sql/sqlstats/outliers/detector_test.go line 139 at r1 (raw file):
```go
// without any eviction that would put us around 2 megabytes of memory used, surpassing our
// 1 megabyte cap and triggering some evictions.
for i := 0; i < 20_000; i++ {
```
nit: you can probably test this with a lower number if you bump the memory threshold down, right?
maryliag
left a comment
There was a problem hiding this comment.
Looks awesome! Just a small nit from me, plus the test improvements David mentioned; otherwise LGTM.
Reviewed 12 of 13 files at r1, all commit messages.
Reviewable status: complete! 2 of 0 LGTMs obtained (waiting on @dhartunian, @maryliag, and @matthewtodd)
pkg/sql/sqlstats/outliers/detector.go line 96 at r1 (raw file):
```go
}

func newLatencyQuantileDetector(settings *cluster.Settings, metrics *Metrics) detector {
```
Does the use of `new` here mean you're detecting the newest latency, or that the detector itself is new? If it refers to the latency, I would use `latestLatency...`; but if it refers to the function, I would remove the `new` (since we can have a new detector in the future, and this `new` would then be old, and so on).
Closes #79451.

Note that this detector comes disabled by default. To enable, use the `sql.stats.outliers.experimental.latency_quantile_detection.enabled` cluster setting.

This new detection "algorithm" is a simple, perhaps naïve, heuristic, though it may be good enough to be useful: we consider a statement an outlier if its execution latency is > p99 latency for its fingerprint, so long as it's also greater than twice the median latency and greater than an arbitrary user-defined threshold.[^1]

We expect the operation of this detector to be memory constrained. To avoid OOM errors, we offer a cluster setting[^2] to cap the overall memory used, along with metrics reporting the number of fingerprints being tracked, memory usage, and "evictions," or fingerprint histories being dropped due to memory pressure. If a high rate of evictions is observed, it will be necessary to raise either one or both of the memory limit and the "interesting" threshold settings to regain a steady state.

Note also that this detector is not designed to be used concurrently by multiple goroutines. Correctness currently relies on the global lock in `outliers.Registry`, to be addressed next in #81021.

[^1]: This threshold is defined by a cluster setting, `sql.stats.outliers.experimental.latency_quantile_detection.interesting_threshold`. Without such a threshold, we would otherwise flag a statement that, say, normally executes in 3ms taking 7ms; but 7ms is probably still fast enough that there's nothing to be done.

[^2]: `sql.stats.outliers.experimental.latency_quantile_detection.memory_limit`

Release note: None
matthewtodd
left a comment
There was a problem hiding this comment.
Thank you both for your reviews!
Good question on the benchmark targets, @dhartunian. Yes, the aim is to enable outlier detection by default. 🤞 I'm going to start thinking about what matters to us and tuning runtime performance to hit those targets next in #81021.
Since you both last looked, I've also fixed a bug here that was allowing the memory usage to grow beyond the cap: we were previously only triggering cache evictions when we saw new fingerprints; now we also try as the per-fingerprint summaries themselves grow. (This required inlining the LRU UnorderedCache structure into a linked list and a map holding an index into that list. It's not an insignificant change, but it was straightforward and mechanical enough -- and, per the benchmark, reduced the number of allocations per run -- that I'm going to go ahead and merge here, which should be safe since the whole outliers system is disabled by default; and I think I'm also ready to regain some momentum after being out for so long.)
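The inlined structure described above — a recency-ordered linked list plus a map from fingerprint to list element, with evictions triggered whenever tracked memory grows, not only on insertion — can be sketched roughly as follows. This is a toy model with made-up names and sizes, not the PR's actual code:

```go
package main

import (
	"container/list"
	"fmt"
)

// lruCache models the described layout: a doubly linked list ordered by
// recency plus a map from fingerprint to its list element, so evictions
// can run whenever tracked memory grows. Illustrative only.
type lruCache struct {
	limit int // max total size before eviction
	used  int
	order *list.List               // front = most recently used
	index map[string]*list.Element // fingerprint -> element
}

type entry struct {
	key  string
	size int
}

func newLRUCache(limit int) *lruCache {
	return &lruCache{limit: limit, order: list.New(), index: make(map[string]*list.Element)}
}

// observe records growth of a fingerprint's summary, then evicts the
// least recently used entries until usage is back under the limit
// (always retaining at least one entry).
func (c *lruCache) observe(key string, growth int) {
	el, ok := c.index[key]
	if !ok {
		el = c.order.PushFront(&entry{key: key})
		c.index[key] = el
	} else {
		c.order.MoveToFront(el)
	}
	el.Value.(*entry).size += growth
	c.used += growth
	for c.used > c.limit && c.order.Len() > 1 {
		back := c.order.Back()
		e := back.Value.(*entry)
		c.used -= e.size
		delete(c.index, e.key)
		c.order.Remove(back)
	}
}

func main() {
	c := newLRUCache(100)
	c.observe("a", 60)
	c.observe("b", 30)
	c.observe("a", 50) // growth alone pushes past the limit; "b" is evicted
	fmt.Println(c.order.Len(), c.used)
}
```

The key point is the second half of `observe`: because eviction runs on every growth event, not just on new-key insertion, the cache can't silently drift past its cap as existing summaries fatten.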
bors r+
Reviewable status: complete! 0 of 0 LGTMs obtained (and 2 stale) (waiting on @dhartunian and @maryliag)
pkg/sql/sqlstats/outliers/detector.go line 96 at r1 (raw file):
Previously, maryliag (Marylia Gutierrez) wrote…
Does the use of `new` here mean you're detecting the newest latency, or that the detector itself is new? If it refers to the latency, I would use `latestLatency...`; but if it refers to the function, I would remove the `new` (since we can have a new detector in the future, and this `new` would then be old, and so on).
Ah, I see what you mean. I was using `new` here to signal that this is a constructor function, making a new instance of this kind of detector.
pkg/sql/sqlstats/outliers/outliers.go line 98 at r1 (raw file):
Previously, dhartunian (David Hartunian) wrote…
Is there a reason to return a pointer to `Metrics` here? I see that in other cases, like `txnidcache/metrics.go`, we return a value.
Nope! Adjusted to return a value.
pkg/sql/sqlstats/outliers/detector_test.go line 68 at r1 (raw file):
Previously, dhartunian (David Hartunian) wrote…
I didn't read super closely but could be a candidate for table-based tests especially if you're seeding the outlier detector with a bunch of data beforehand that the tests can then all use.
Good call, reworked into a couple of blocks of table-driven tests.
pkg/sql/sqlstats/outliers/detector_test.go line 77 at r1 (raw file):
Previously, dhartunian (David Hartunian) wrote…
nit: does it matter to test that the enabled status isn't cached on `new`? Maybe move this line before the override?
Nice idea, moved the override after the constructor just in case.
pkg/sql/sqlstats/outliers/detector_test.go line 139 at r1 (raw file):
Previously, dhartunian (David Hartunian) wrote…
nit: can probably test this with a lower number if you bump the memory threshold down, right?
Yes, done! Down from 20,000 to just 20 now. :-)
Build succeeded.