
release-22.1: sql,tracing: per-stmt probabilistic trace sampling #89774

Merged
irfansharif merged 1 commit into cockroachdb:release-22.1 from irfansharif:221011.tail-tracing-22.1 on Oct 14, 2022

Conversation

@irfansharif (Contributor) commented Oct 11, 2022

This commit introduces a backportable alternative to #82750 and #83020, to help solve for #82896. It gives a palatable alternative to sql.trace.stmt.enable_threshold. See those PRs/issues for verbose commentary, but roughly we can do the following:

  Pick a stmt fingerprint, declare a sampling probability which controls
  when verbose tracing is enabled for it, and a latency threshold above
  which a trace is logged. With a given stmt rate (say 1000/s) and a
  given percentile we're trying to capture (say p99.9), we have N stmt/s
  in the 99.9th percentile (1 in our example). We should be able to set
  a sampling probability P such that with high likelihood (>95%) we
  capture at least one trace over the next S seconds. The sampling
  probability lets you control the overhead you're introducing for those
  statements in aggregate; dialing it up lets you lower S. You might
  want to do such a thing for infrequently executed statements.
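To make the relationship between P, N and S concrete: if outliers arrive at N per second and each execution is sampled independently with probability P, the chance of capturing at least one trace over S seconds is 1 - (1 - P)^(N*S). Solving for the smallest P that reaches a target likelihood gives a quick way to pick the setting. This is our own back-of-the-envelope sketch (the function name and parameters are illustrative, not part of the PR):

```python
def required_sampling_probability(outlier_rate, window_secs, target=0.95):
    """Smallest per-stmt sampling probability P such that
    1 - (1 - P)**(outlier_rate * window_secs) >= target."""
    expected_outliers = outlier_rate * window_secs
    return 1.0 - (1.0 - target) ** (1.0 / expected_outliers)

# 1 outlier/s (the p99.9 of 1000 stmt/s), captured within 5 minutes:
p = required_sampling_probability(outlier_rate=1, window_secs=300)
print(f"{p:.4f}")  # ~0.0099, i.e. roughly a 1% sampling probability
```

Note how quickly P shrinks as the window grows: for frequently executed statements a tiny probability suffices, while rare statements need a higher P (or a longer wait).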

This commit does the simplest thing: ask for all input through shoddy cluster settings and just log traces to our usual files. Below we outline an example of how to use these settings to catch outlier executions for writes to the history table in TPC-C. When using it, we did not see an overall increase in p99s as a result of the sampling. It also took only about 10s to capture this data, showing clearly that it was due to latch waits.

  > SELECT
      encode(fingerprint_id, 'hex'),
      (statistics -> 'statistics' ->> 'cnt')::INT AS count,
      metadata ->> 'query' AS query
    FROM system.statement_statistics
    ORDER BY count DESC LIMIT 10;

           encode    | count |                                             query
    -----------------+-------+--------------------------------------------------------------------------------------------------------------------
    ...
    4e4214880f87d799 |  2680 | INSERT INTO history(h_c_id, h_c_d_id, h_c_w_id, h_d_id, h_w_id, h_amount, h_date, h_data) VALUES ($1, $2, __more6__)

    > SET CLUSTER SETTING trace.fingerprint = '4e4214880f87d799'; -- fingerprint
    > SET CLUSTER SETTING trace.fingerprint.probability = 0.01;   -- 1% sampling probability
    > SET CLUSTER SETTING trace.fingerprint.threshold = '90ms';   -- latency threshold

     0.521ms      0.005ms    event:kv/kvserver/concurrency/concurrency_manager.go:260 [n1,s1,r105/1:/Table/109/1/{90/7/2…-105/10…}] acquiring latches
     0.532ms      0.011ms    event:kv/kvserver/spanlatch/manager.go:532 [n1,s1,r105/1:/Table/109/1/{90/7/2…-105/10…}] waiting to acquire read latch /Table/109/1/105/3/519/0@0,0, held by write latch /Table/109/1/105/3/519/0@1654849354.483984000,0
    99.838ms     99.306ms    event:kv/kvserver/concurrency/concurrency_manager.go:301 [n1,s1,r105/1:/Table/109/1/{90/7/2…-105/10…}] scanning lock table for conflicting locks
    99.851ms      0.013ms    event:kv/kvserver/replica_read.go:251 [n1,s1,r105/1:/Table/109/1/{90/7/2…-105/10…}] executing read-only batch

Compare this to sql.trace.stmt.enable_threshold, which enables verbose tracing for all statements with 100% probability. It introduces far too much overhead to be used in reasonable production clusters, and that overhead often masks the very problems we're looking to capture.

Release note (general change): We introduced three cluster settings to collect trace data for outlier executions with low overhead. These are only available in 22.1; in 22.2 and beyond we have other mechanisms to collect outlier traces. Traces come in handy when investigating latency spikes, and these three settings are intended to supplant most uses of sql.trace.stmt.enable_threshold. That setting enables verbose tracing for all statements with 100% probability, which can cause a lot of overhead in production clusters, as well as a lot of logging pressure. Instead we introduce the following:

  • trace.fingerprint
  • trace.fingerprint.probability
  • trace.fingerprint.threshold

Put together (all three have to be set), these enable tracing only for the statement with the given hex-encoded fingerprint, do so probabilistically (with probability trace.fingerprint.probability), and log the trace only if the latency threshold is exceeded (configured using trace.fingerprint.threshold). To obtain a hex-encoded fingerprint, consider looking at the contents of system.statement_statistics. For example:
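The combined gating the three settings perform can be sketched as follows. This is a hypothetical illustration of the decision logic, not the actual CockroachDB implementation; the constant and function names are ours:

```python
import random

# Hypothetical stand-ins for the three cluster settings.
TRACE_FINGERPRINT = "4e4214880f87d799"
TRACE_PROBABILITY = 0.01
TRACE_THRESHOLD_SECS = 0.090

def should_trace(stmt_fingerprint):
    """Pre-execution decision: enable verbose tracing only for the
    configured fingerprint, and only with the configured probability."""
    return (stmt_fingerprint == TRACE_FINGERPRINT
            and random.random() < TRACE_PROBABILITY)

def should_log(traced, latency_secs):
    """Post-execution decision: log the trace only if one was collected
    and the statement exceeded the latency threshold."""
    return traced and latency_secs > TRACE_THRESHOLD_SECS
```

The key property is that the sampling decision happens before execution (so only a small fraction of executions pay the verbose-tracing overhead), while the threshold check happens after (so only the slow outliers are logged).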

  > SELECT
      encode(fingerprint_id, 'hex'),
      (statistics -> 'statistics' ->> 'cnt')::INT AS count,
      metadata ->> 'query' AS query
    FROM system.statement_statistics
    ORDER BY count DESC LIMIT 10;

           encode    | count |                                             query
    -----------------+-------+--------------------------------------------------------------------------------------------------------------------
    ...
    4e4214880f87d799 |  2680 | INSERT INTO history(h_c_id, h_c_d_id, h_c_w_id, h_d_id, h_w_id, h_amount, h_date, h_data) VALUES ($1, $2, __more6__)

Release justification: adds helpful instrumentation possibilities for latency investigations.

@blathers-crl (bot) commented Oct 11, 2022

Thanks for opening a backport.

Please check the backport criteria before merging:

  • Patches should only be created for serious issues or test-only changes.
  • Patches should not break backwards-compatibility.
  • Patches should change as little code as possible.
  • Patches should not change on-disk formats or node communication protocols.
  • Patches should not add new functionality.
  • Patches must not add, edit, or otherwise modify cluster versions; or add version gates.
If some of the basic criteria cannot be satisfied, ensure that the exceptional criteria are satisfied within.
  • There is a high priority need for the functionality that cannot wait until the next release and is difficult to address in another way.
  • The new functionality is additive-only and only runs for clusters which have specifically “opted in” to it (e.g. by a cluster setting).
  • New code is protected by a conditional check that is trivial to verify and ensures that it only runs for opt-in clusters.
  • The PM and TL on the team that owns the changed code have signed off that the change obeys the above rules.

Add a brief release justification to the body of your PR to justify this backport.

Some other things to consider:

  • What did we do to ensure that a user that doesn’t know & care about this backport, has no idea that it happened?
  • Will this work in a cluster of mixed patch versions? Did we test that?
  • If a user upgrades a patch version, uses this feature, and then downgrades, what happens?


irfansharif force-pushed the 221011.tail-tracing-22.1 branch from 0774848 to 0ad199d on October 11, 2022.
irfansharif requested review from a team and tbg on October 14, 2022.
@irfansharif (Contributor, Author) commented:

(Bump.)
