
outliers: latency quantile detector#82473

Merged
craig[bot] merged 1 commit into cockroachdb:master from matthewtodd:outliers-p99-latency
Jul 5, 2022

Conversation

@matthewtodd

Closes #79451.

Note that this detector comes disabled by default. To enable, use the
`sql.stats.outliers.experimental.latency_quantile_detection.enabled`
cluster setting.

This new detection "algorithm" is a simple, perhaps naïve, heuristic,
though it may be good enough to be useful: we consider a statement an
outlier if its execution latency is > p99 latency for its fingerprint,
so long as it's also greater than twice the median latency and greater
than an arbitrary user-defined threshold.[^1]

We expect the operation of this detector to be memory constrained. To
avoid OOM errors, we offer a cluster setting[^2] to cap the overall
memory used, along with metrics reporting the number of fingerprints
being tracked, memory usage, and "evictions," or fingerprint histories
being dropped due to memory pressure. If a high rate of evictions is
observed, it will be necessary to raise the memory limit, the
"interesting" threshold, or both to regain a steady state.
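The capped-memory behavior can be sketched roughly as follows (all names hypothetical, not the actual `outliers` package API): a byte-budgeted LRU keyed by fingerprint, which evicts the least recently used history and bumps an eviction counter whenever estimated usage exceeds the limit.

```go
package main

import (
	"container/list"
	"fmt"
)

// capped tracks per-fingerprint histories under a byte budget. When the
// estimated memory used exceeds the limit, the least recently used history
// is dropped and the evictions counter is incremented, mirroring the
// metrics the PR describes.
type capped struct {
	limitBytes int64
	usedBytes  int64
	evictions  int64
	order      *list.List               // front = most recently used
	index      map[string]*list.Element // fingerprint -> list node
}

type entry struct {
	id   string
	size int64
}

func newCapped(limit int64) *capped {
	return &capped{limitBytes: limit, order: list.New(), index: map[string]*list.Element{}}
}

func (c *capped) observe(id string, size int64) {
	if el, ok := c.index[id]; ok {
		c.order.MoveToFront(el)
		c.usedBytes += size - el.Value.(*entry).size
		el.Value.(*entry).size = size
	} else {
		c.index[id] = c.order.PushFront(&entry{id: id, size: size})
		c.usedBytes += size
	}
	// Evict from the back until we fit again (keeping at least one entry).
	for c.usedBytes > c.limitBytes && c.order.Len() > 1 {
		back := c.order.Back()
		e := back.Value.(*entry)
		c.order.Remove(back)
		delete(c.index, e.id)
		c.usedBytes -= e.size
		c.evictions++
	}
}

func main() {
	c := newCapped(100)
	c.observe("a", 60)
	c.observe("b", 60) // usage hits 120 > 100, so "a" is evicted
	fmt.Println(c.evictions, c.usedBytes)
}
```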

Note also that this detector is not designed to be used concurrently by
multiple goroutines. Correctness currently relies on the global lock in
`outliers.Registry`, to be addressed next in #81021.

Release note: None

[^1]: This threshold is defined by a cluster setting,
`sql.stats.outliers.experimental.latency_quantile_detection.interesting_threshold`.
Without such a threshold, we would otherwise flag a statement that, say,
normally executes in 3ms taking 7ms; but 7ms is probably still fast
enough that there's nothing to be done.

[^2]: `sql.stats.outliers.experimental.latency_quantile_detection.memory_limit`

@cockroach-teamcity
Member

This change is Reviewable

@matthewtodd matthewtodd requested a review from a team June 6, 2022 19:04
@matthewtodd matthewtodd marked this pull request as ready for review June 6, 2022 19:04
@matthewtodd matthewtodd requested review from a team June 6, 2022 19:04
irfansharif added a commit to irfansharif/cockroach that referenced this pull request Jun 10, 2022
Prototyping something that could help cockroachdb#80666. I'm hoping we can get to
the following ability:

  Pick a stmt fingerprint, declare a sampling probability which controls
  when verbose tracing is enabled for it, and a latency threshold for
  which a trace is persisted/logged/etc. Maybe even a count of how many
  traces I actually want recorded? Some of these controls could be in the
  UI, or traces made visible in the UI, but I’d happily take a set of
  dubious cluster settings or table writes to install these
  to-be-traced-with-some-probability fingerprints, or just vanilla log
  outputs. With a given stmt rate (say 1000/s) and a given percentile I’m
  trying to capture (say p99.9), we have 1 stmt/s in the 99.9th
  percentile. I need to make this math less handwavey, but I should be
  able to set a sampling probability P such that with high likelihood
  (>95%) we capture at least one trace over the next S seconds. The
  sampling probability lets you control the overhead you're introducing for
  those statements in aggregate, which if dialed up higher lets you
  lower S. You might want to do such a thing for infrequently executed
  statements.
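One way to make that math less handwavy (my sketch, not from the prototype): model each execution as landing in the tail with probability q (0.001 for p99.9) and being traced independently with probability P, so the chance of at least one tail capture over S seconds at rate r is 1 - (1 - qP)^(rS). Solving that for the smallest P meeting a confidence target:

```go
package main

import (
	"fmt"
	"math"
)

// minSamplingProb returns the smallest sampling probability p such that,
// at `rate` executions/sec with a tail fraction `tail` (e.g. 0.001 for
// p99.9), the chance of capturing at least one tail trace within `secs`
// seconds is at least `conf`:
//
//	1 - (1 - tail*p)^(rate*secs) >= conf
//
// All names are illustrative; this is not code from the prototype.
func minSamplingProb(rate, tail, secs, conf float64) float64 {
	n := rate * secs // expected number of executions in the window
	p := (1 - math.Pow(1-conf, 1/n)) / tail
	return math.Min(p, 1)
}

func main() {
	// 1000 stmt/s, p99.9 tail, 10s window, 95% confidence: roughly a 30%
	// sampling probability suffices.
	fmt.Printf("%.4f\n", minSamplingProb(1000, 0.001, 10, 0.95))
}
```

Dialing P up shrinks the window S needed for the same confidence, which matches the trade-off described above for infrequently executed statements.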

Still not sure how to figure out what statement to trace (we don't
capture per-stmt %-iles -- too expensive? Maybe we will in some form
after cockroachdb#82473?). And also not sure what we can do better than just
logging with captured traces (stash them under /tracez somehow?) or in
some system table capped in # of rows + with row-level TTL to evict
captures + make room for new, more recent captures? Anyway, this first
stab does the simplest thing: ask for all input through shoddy cluster
settings and just log stuff. A quick and dirty attempt to try and catch
an outlier execution for the history table write in TPC-C. Didn't see an
overall increase in p99s as a result of the sampling + took about 10s to
capture this, showing it was due to latch waits.

    > SELECT
      encode(fingerprint_id, 'hex'),
      (statistics -> 'statistics' ->> 'cnt')::INT AS count,
      metadata ->> 'query' AS query
    FROM system.statement_statistics
    ORDER BY count DESC LIMIT 10;

           encode    | count |                                             query
    -----------------+-------+--------------------------------------------------------------------------------------------------------------------
    ...
    4e4214880f87d799 |  2680 | INSERT INTO history(h_c_id, h_c_d_id, h_c_w_id, h_d_id, h_w_id, h_amount, h_date, h_data) VALUES ($1, $2, __more6__)

    > SET CLUSTER SETTING trace.fingerprint = '4e4214880f87d799';
    > SET CLUSTER SETTING trace.fingerprint.probability = 0.01;
    > SET CLUSTER SETTING trace.fingerprint.threshold = '90ms';

 0.521ms      0.005ms    event:kv/kvserver/concurrency/concurrency_manager.go:260 [n1,s1,r105/1:/Table/109/1/{90/7/2…-105/10…}] acquiring latches
 0.532ms      0.011ms    event:kv/kvserver/spanlatch/manager.go:532 [n1,s1,r105/1:/Table/109/1/{90/7/2…-105/10…}] waiting to acquire read latch /Table/109/1/105/3/519/0@0,0, held by write latch /Table/109/1/105/3/519/0@1654849354.483984000,0
99.838ms     99.306ms    event:kv/kvserver/concurrency/concurrency_manager.go:301 [n1,s1,r105/1:/Table/109/1/{90/7/2…-105/10…}] scanning lock table for conflicting locks
99.851ms      0.013ms    event:kv/kvserver/replica_read.go:251 [n1,s1,r105/1:/Table/109/1/{90/7/2…-105/10…}] executing read-only batch

Release note: None
Collaborator

@dhartunian dhartunian left a comment


Very nice commit message. Thanks for the detail. :lgtm: with some nits below.

Are there any benchmark targets that would need to be reached for this to be enabled by default or will this always be disabled out of the box?

I like how the outlier detection is structured and how you can register additional custom detectors. Hadn't seen that code before.

Reviewed 3 of 13 files at r1, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @dhartunian and @matthewtodd)


pkg/sql/sqlstats/outliers/outliers.go line 98 at r1 (raw file):

// NewMetrics builds a new instance of our Metrics struct.
func NewMetrics() *Metrics {

Is there a reason to return a pointer to Metrics here? I see that in other cases, like txnidcache/metrics.go, we return a value.


pkg/sql/sqlstats/outliers/detector_test.go line 68 at r1 (raw file):

}

func TestLatencyQuantileDetector(t *testing.T) {

I didn't read super closely, but this could be a candidate for table-based tests, especially if you're seeding the outlier detector with a bunch of data beforehand that the tests can then all use.


pkg/sql/sqlstats/outliers/detector_test.go line 77 at r1 (raw file):

		st := cluster.MakeTestingClusterSettings()
		LatencyQuantileDetectionEnabled.Override(context.Background(), &st.SV, true)
		d := newLatencyQuantileDetector(st, NewMetrics())

nit: does it matter to test that the enabled status isn't cached on new? maybe move this line before the override?


pkg/sql/sqlstats/outliers/detector_test.go line 139 at r1 (raw file):

		// without any eviction that would put us around 2 megabytes of memory used, surpassing our
		// 1 megabyte cap and triggering some evictions.
		for i := 0; i < 20_000; i++ {

nit: can probably test this with a lower number if you bump the memory threshold down, right?

Contributor

@maryliag maryliag left a comment


Looks awesome! Just a small nit from me, plus the improvements David mentioned on the tests; otherwise :lgtm:

Reviewed 12 of 13 files at r1, all commit messages.
Reviewable status: :shipit: complete! 2 of 0 LGTMs obtained (waiting on @dhartunian, @maryliag, and @matthewtodd)


pkg/sql/sqlstats/outliers/detector.go line 96 at r1 (raw file):

}

func newLatencyQuantileDetector(settings *cluster.Settings, metrics *Metrics) detector {

Does the use of "new" here mean you're detecting the newest latency, or that the detector itself is new? If it refers to the latency, I would use latestLatency...; but if it refers to the function, I would drop the "new" (since we could have a new detector in the future, and then this "new" one would be old, and so on).

@matthewtodd matthewtodd requested a review from a team as a code owner July 1, 2022 17:19
Author

@matthewtodd matthewtodd left a comment


Thank you both for your reviews!

Good question on the benchmark targets, @dhartunian. Yes, the aim is to enable outlier detection by default. 🤞 I'm going to start thinking about what matters to us and tuning runtime performance to hit those targets next in #81021.

Since you both last looked, I've also fixed a bug here that was allowing the memory usage to grow beyond the cap: we were previously only triggering cache evictions when we saw new fingerprints; now we also try as the per-fingerprint summaries themselves grow. (This required inlining the LRU UnorderedCache structure into a linked list and a map holding an index into that list. It's not an insignificant change, but it was straightforward and mechanical enough -- and, per the benchmark, reduced the number of allocations per run -- that I'm going to go ahead and merge here, which should be safe since the whole outliers system is disabled by default; and I think I'm also ready to regain some momentum after being out for so long.)
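Roughly what that inlining looks like (hypothetical names, not the actual change): the list links live directly on each record, with a map as the index into the list, so moves and evictions touch pointers rather than allocating wrapper nodes, and eviction can run whenever any per-fingerprint summary grows, not only on insert.

```go
package main

import "fmt"

// node carries the LRU links inline on the record itself.
type node struct {
	id         string
	prev, next *node
}

// lru is a hand-rolled doubly linked list plus a map index into it.
type lru struct {
	head, tail *node // head = most recently used
	index      map[string]*node
}

func newLRU() *lru { return &lru{index: map[string]*node{}} }

// touch marks id as most recently used, inserting it if unseen.
func (l *lru) touch(id string) {
	n, ok := l.index[id]
	if !ok {
		n = &node{id: id}
		l.index[id] = n
	} else if n == l.head {
		return
	} else {
		// Unlink from the current position.
		if n.prev != nil {
			n.prev.next = n.next
		}
		if n.next != nil {
			n.next.prev = n.prev
		}
		if n == l.tail {
			l.tail = n.prev
		}
	}
	// Splice onto the front.
	n.prev, n.next = nil, l.head
	if l.head != nil {
		l.head.prev = n
	}
	l.head = n
	if l.tail == nil {
		l.tail = n
	}
}

// evictOldest drops and returns the least recently used id, if any.
func (l *lru) evictOldest() (string, bool) {
	if l.tail == nil {
		return "", false
	}
	old := l.tail
	l.tail = old.prev
	if l.tail != nil {
		l.tail.next = nil
	} else {
		l.head = nil
	}
	delete(l.index, old.id)
	return old.id, true
}

func main() {
	l := newLRU()
	l.touch("a")
	l.touch("b")
	l.touch("a") // "b" is now the oldest
	id, _ := l.evictOldest()
	fmt.Println(id)
}
```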

bors r+

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 2 stale) (waiting on @dhartunian and @maryliag)


pkg/sql/sqlstats/outliers/detector.go line 96 at r1 (raw file):

Previously, maryliag (Marylia Gutierrez) wrote…

Does the use of "new" here mean you're detecting the newest latency, or that the detector itself is new? If it refers to the latency, I would use latestLatency...; but if it refers to the function, I would drop the "new" (since we could have a new detector in the future, and then this "new" one would be old, and so on).

Ah, I see what you mean. I was using new here to signal that this is a constructor function, making a new instance of this kind of detector.


pkg/sql/sqlstats/outliers/outliers.go line 98 at r1 (raw file):

Previously, dhartunian (David Hartunian) wrote…

Is there a reason to return a pointer to Metrics here? I see that in other cases, like txnidcache/metrics.go, we return a value.

Nope! Adjusted to return a value.


pkg/sql/sqlstats/outliers/detector_test.go line 68 at r1 (raw file):

Previously, dhartunian (David Hartunian) wrote…

I didn't read super closely, but this could be a candidate for table-based tests, especially if you're seeding the outlier detector with a bunch of data beforehand that the tests can then all use.

Good call, reworked into a couple of blocks of table-driven tests.


pkg/sql/sqlstats/outliers/detector_test.go line 77 at r1 (raw file):

Previously, dhartunian (David Hartunian) wrote…

nit: does it matter to test that the enabled status isn't cached on new? maybe move this line before the override?

Nice idea, moved the override after the constructor just in case.


pkg/sql/sqlstats/outliers/detector_test.go line 139 at r1 (raw file):

Previously, dhartunian (David Hartunian) wrote…

nit: can probably test this with a lower number if you bump the memory threshold down, right?

Yes, done! Down from 20,000 to just 20 now. :-)

@craig craig bot merged commit 97f8422 into cockroachdb:master Jul 5, 2022
@craig
Contributor

craig bot commented Jul 5, 2022

Build succeeded:



Development

Successfully merging this pull request may close these issues.

outliers: configurable statistical detection per fingerprint
