outliers: latency quantile detector #82473
craig[bot] merged 1 commit into cockroachdb:master from matthewtodd:outliers-p99-latency
Conversation
Prototyping something that could help cockroachdb#80666. I'm hoping we can get to the following ability: pick a statement fingerprint, declare a sampling probability that controls when verbose tracing is enabled for it, and a latency threshold above which a trace is persisted/logged/etc. Maybe even a count of how many traces I actually want recorded? Some of these controls could be in the UI, or traces made visible in the UI, but I’d happily take a set of dubious cluster settings or table writes to install these to-be-traced-with-some-probability fingerprints, or just vanilla log outputs.

With a given statement rate (say 1000/s) and a given percentile I’m trying to capture (say p99.9), we have 1 stmt/s in the 99.9th percentile. I need to make this math less handwavy, but I should be able to set a sampling probability P such that with high likelihood (>95%) we capture at least one trace over the next S seconds. The sampling rate lets you control the overhead you’re introducing for those statements in aggregate; dialing it up higher lets you lower S. You might want to do such a thing for infrequently executed statements.

Still not sure how to figure out what statement to trace (we don't capture per-stmt %-iles -- too expensive? Maybe we will in some form after cockroachdb#82473?). Also not sure what we can do better with captured traces than just logging them (stash them under /tracez somehow?), or writing them to some system table capped in number of rows, with row-level TTL to evict old captures and make room for newer, more recent ones.

Anyway, this first stab does the simplest thing: ask for all input through shoddy cluster settings and just log stuff. A quick and dirty attempt to catch an outlier execution for the history table write in TPC-C: I didn't see an overall increase in p99s as a result of the sampling, and it took about 10s to capture the trace below, showing the outlier was due to latch waits.
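The handwavy math can be made concrete. Assuming each execution is sampled independently, the chance of tracing at least one beyond-quantile execution in S seconds is 1 - (1-P)^n, where n is the expected number of such executions in the window. A sketch in Go — the function and parameter names are mine, not anything in this PR:

```go
package main

import (
	"fmt"
	"math"
)

// samplingProbability returns the per-execution sampling probability P
// needed so that, with the requested confidence, at least one execution
// beyond the given quantile is traced during a window of windowSeconds.
// Illustrative only; not part of the PR's code.
func samplingProbability(stmtRate, quantile, windowSeconds, confidence float64) float64 {
	// Expected number of beyond-quantile executions in the window.
	nOutliers := stmtRate * (1 - quantile) * windowSeconds
	// Each is traced independently with probability P, so
	// P(at least one traced) = 1 - (1-P)^n >= confidence.
	return 1 - math.Pow(1-confidence, 1/nOutliers)
}

func main() {
	// 1000 stmt/s, chasing the p99.9 tail, 95% confidence over 10s.
	fmt.Printf("%.3f\n", samplingProbability(1000, 0.999, 10, 0.95))
}
```

With these numbers we expect 10 outlier executions in the window, giving P ≈ 0.26; a longer window or higher statement rate drives P (and the aggregate tracing overhead) down.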
```
> SELECT encode(fingerprint_id, 'hex'), (statistics -> 'statistics' ->> 'cnt')::INT AS count, metadata ->> 'query' AS query FROM system.statement_statistics ORDER BY count DESC LIMIT 10;

       encode      | count |                                                        query
-------------------+-------+----------------------------------------------------------------------------------------------------------------------
  ...
  4e4214880f87d799 |  2680 | INSERT INTO history(h_c_id, h_c_d_id, h_c_w_id, h_d_id, h_w_id, h_amount, h_date, h_data) VALUES ($1, $2, __more6__)

> SET CLUSTER SETTING trace.fingerprint = '4e4214880f87d799';
> SET CLUSTER SETTING trace.fingerprint.probability = 0.01;
> SET CLUSTER SETTING trace.fingerprint.threshold = '90ms';

 0.521ms  0.005ms  event:kv/kvserver/concurrency/concurrency_manager.go:260 [n1,s1,r105/1:/Table/109/1/{90/7/2…-105/10…}] acquiring latches
 0.532ms  0.011ms  event:kv/kvserver/spanlatch/manager.go:532 [n1,s1,r105/1:/Table/109/1/{90/7/2…-105/10…}] waiting to acquire read latch /Table/109/1/105/3/519/0@0,0, held by write latch /Table/109/1/105/3/519/0@1654849354.483984000,0
99.838ms 99.306ms  event:kv/kvserver/concurrency/concurrency_manager.go:301 [n1,s1,r105/1:/Table/109/1/{90/7/2…-105/10…}] scanning lock table for conflicting locks
99.851ms  0.013ms  event:kv/kvserver/replica_read.go:251 [n1,s1,r105/1:/Table/109/1/{90/7/2…-105/10…}] executing read-only batch
```

Release note: None
dhartunian
left a comment
There was a problem hiding this comment.
Very nice commit message, thanks for the detail. Some nits below.
Are there any benchmark targets that would need to be reached for this to be enabled by default or will this always be disabled out of the box?
I like how the outlier detection is structured and how you can register additional custom detectors. Hadn't seen that code before.
Reviewed 3 of 13 files at r1, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @dhartunian and @matthewtodd)
pkg/sql/sqlstats/outliers/outliers.go line 98 at r1 (raw file):
```go
// NewMetrics builds a new instance of our Metrics struct.
func NewMetrics() *Metrics {
```
Is there a reason to return a pointer to `Metrics` here? I see that in other cases, like `txnidcache/metrics.go`, we return a value.
pkg/sql/sqlstats/outliers/detector_test.go line 68 at r1 (raw file):
```go
}

func TestLatencyQuantileDetector(t *testing.T) {
```
I didn't read super closely but could be a candidate for table-based tests especially if you're seeding the outlier detector with a bunch of data beforehand that the tests can then all use.
pkg/sql/sqlstats/outliers/detector_test.go line 77 at r1 (raw file):
```go
st := cluster.MakeTestingClusterSettings()
LatencyQuantileDetectionEnabled.Override(context.Background(), &st.SV, true)
d := newLatencyQuantileDetector(st, NewMetrics())
```
nit: does it matter to test that the enabled status isn't cached on `new`? Maybe move this line before the override?
pkg/sql/sqlstats/outliers/detector_test.go line 139 at r1 (raw file):
```go
// without any eviction that would put us around 2 megabytes of memory used, surpassing our
// 1 megabyte cap and triggering some evictions.
for i := 0; i < 20_000; i++ {
```
nit: you can probably test this with a lower number if you bump the memory threshold down, right?
maryliag
left a comment
There was a problem hiding this comment.
Looks awesome! Just a small nit from me, plus the test improvements David mentioned; otherwise LGTM.
Reviewed 12 of 13 files at r1, all commit messages.
Reviewable status: complete! 2 of 0 LGTMs obtained (waiting on @dhartunian, @maryliag, and @matthewtodd)
pkg/sql/sqlstats/outliers/detector.go line 96 at r1 (raw file):
```go
}

func newLatencyQuantileDetector(settings *cluster.Settings, metrics *Metrics) detector {
```
Does the use of `new` here mean you're detecting the newest latency, or that the detector itself is new? If it refers to the latency, I would use `latestLatency...`; but if it refers to the function, I would remove the `new` (since we can have a new detector in the future, and this `new` would then be old, and so on).
Closes #79451.

Note that this detector comes disabled by default. To enable, use the `sql.stats.outliers.experimental.latency_quantile_detection.enabled` cluster setting.

This new detection "algorithm" is a simple, perhaps naïve, heuristic, though it may be good enough to be useful: we consider a statement an outlier if its execution latency is > p99 latency for its fingerprint, so long as it's also greater than twice the median latency and greater than an arbitrary user-defined threshold.[^1]

We expect the operation of this detector to be memory constrained. To avoid OOM errors, we offer a cluster setting[^2] to cap the overall memory used, along with metrics reporting the number of fingerprints being tracked, memory usage, and "evictions," or fingerprint histories being dropped due to memory pressure. If a high rate of evictions is observed, it will be necessary to raise either one or both of the memory limit and the "interesting" threshold settings to regain a steady state.

Note also that this detector is not designed to be used concurrently by multiple goroutines. Correctness currently relies on the global lock in `outliers.Registry`, to be addressed next in #81021.

[^1]: This threshold is defined by a cluster setting, `sql.stats.outliers.experimental.latency_quantile_detection.interesting_threshold`. Without such a threshold, we would otherwise flag a statement that, say, normally executes in 3ms taking 7ms; but 7ms is probably still fast enough that there's nothing to be done.

[^2]: `sql.stats.outliers.experimental.latency_quantile_detection.memory_limit`

Release note: None
matthewtodd
left a comment
There was a problem hiding this comment.
Thank you both for your reviews!
Good question on the benchmark targets, @dhartunian. Yes, the aim is to enable outlier detection by default. 🤞 I'm going to start thinking about what matters to us and tuning runtime performance to hit those targets next in #81021.
Since you both last looked, I've also fixed a bug here that was allowing the memory usage to grow beyond the cap: we were previously only triggering cache evictions when we saw new fingerprints; now we also try as the per-fingerprint summaries themselves grow. (This required inlining the LRU UnorderedCache structure into a linked list and a map holding an index into that list. It's not an insignificant change, but it was straightforward and mechanical enough -- and, per the benchmark, reduced the number of allocations per run -- that I'm going to go ahead and merge here, which should be safe since the whole outliers system is disabled by default; and I think I'm also ready to regain some momentum after being out for so long.)
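The inlined structure described above — a recency-ordered linked list plus a map from fingerprint to list element, with evictions triggered whenever tracked memory grows, not only on insertion — can be sketched roughly as follows. This is a toy model with made-up names and sizes, not the PR's actual code:

```go
package main

import (
	"container/list"
	"fmt"
)

// lruCache models the described layout: a doubly linked list ordered by
// recency plus a map from fingerprint to its list element, so evictions
// can run whenever tracked memory grows. Illustrative only.
type lruCache struct {
	limit int // max total size before eviction
	used  int
	order *list.List               // front = most recently used
	index map[string]*list.Element // fingerprint -> element
}

type entry struct {
	key  string
	size int
}

func newLRUCache(limit int) *lruCache {
	return &lruCache{limit: limit, order: list.New(), index: make(map[string]*list.Element)}
}

// observe records growth of a fingerprint's summary, then evicts the
// least recently used entries until usage is back under the limit
// (always retaining at least one entry).
func (c *lruCache) observe(key string, growth int) {
	el, ok := c.index[key]
	if !ok {
		el = c.order.PushFront(&entry{key: key})
		c.index[key] = el
	} else {
		c.order.MoveToFront(el)
	}
	el.Value.(*entry).size += growth
	c.used += growth
	for c.used > c.limit && c.order.Len() > 1 {
		back := c.order.Back()
		e := back.Value.(*entry)
		c.used -= e.size
		delete(c.index, e.key)
		c.order.Remove(back)
	}
}

func main() {
	c := newLRUCache(100)
	c.observe("a", 60)
	c.observe("b", 30)
	c.observe("a", 50) // growth alone pushes past the limit; "b" is evicted
	fmt.Println(c.order.Len(), c.used)
}
```

The key point is the second half of `observe`: because eviction runs on every growth event, not just on new-key insertion, the cache can't silently drift past its cap as existing summaries fatten.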
bors r+
Reviewable status: complete! 0 of 0 LGTMs obtained (and 2 stale) (waiting on @dhartunian and @maryliag)
pkg/sql/sqlstats/outliers/detector.go line 96 at r1 (raw file):
Previously, maryliag (Marylia Gutierrez) wrote…
Does the use of `new` here mean you're detecting the newest latency, or that the detector itself is new? If it refers to the latency, I would use `latestLatency...`; but if it refers to the function, I would remove the `new` (since we can have a new detector in the future, and this `new` would then be old, and so on).
Ah, I see what you mean. I was using `new` here to signal that this is a constructor function, making a new instance of this kind of detector.
pkg/sql/sqlstats/outliers/outliers.go line 98 at r1 (raw file):
Previously, dhartunian (David Hartunian) wrote…
Is there a reason to return a pointer to `Metrics` here? I see that in other cases, like `txnidcache/metrics.go`, we return a value.
Nope! Adjusted to return a value.
pkg/sql/sqlstats/outliers/detector_test.go line 68 at r1 (raw file):
Previously, dhartunian (David Hartunian) wrote…
I didn't read super closely but could be a candidate for table-based tests especially if you're seeding the outlier detector with a bunch of data beforehand that the tests can then all use.
Good call, reworked into a couple of blocks of table-driven tests.
pkg/sql/sqlstats/outliers/detector_test.go line 77 at r1 (raw file):
Previously, dhartunian (David Hartunian) wrote…
nit: does it matter to test that the enabled status isn't cached on `new`? Maybe move this line before the override?
Nice idea, moved the override after the constructor just in case.
pkg/sql/sqlstats/outliers/detector_test.go line 139 at r1 (raw file):
Previously, dhartunian (David Hartunian) wrote…
nit: can probably test this with a lower number if you bump the memory threshold down, right?
Yes, done! Down from 20,000 to just 20 now. :-)
Build succeeded.