stmtdiagnostics: support continuous bundle collection#83020

Merged
craig[bot] merged 1 commit into cockroachdb:master from irfansharif:220616.continuous-diagnostic-collection
Sep 28, 2022

Conversation

@irfansharif
Contributor

..until expiry. Informs #82896 (more specifically this is a short-term
alternative to the part pertaining to continuous tail capture). The
issue has more background, but we repeat some below for posterity.

It's desirable to draw from a set of tail execution traces collected
over time when investigating tail latencies. #82750 introduced a
probabilistic mechanism to capture a single tail event for individual
stmts with bounded overhead (determined by the sampling probability,
trading off how long until a single capture is obtained). This PR
introduces a sql.stmt_diagnostics.collect_continuously_until_expired
cluster setting to collect captures continuously over some period of
time for aggregate
analysis. Longer term we'd want:

  • Controls over the maximum number of captures we'd want stored over
    some period of time;
  • Eviction of older bundles, assuming they're less relevant, making room
    for newer captures.

To safeguard against misuse (in this current form we should only use it
for experiments or escalations under controlled environments), we only
act on this setting provided the diagnostics request has an expiration
timestamp and a specified probability, crude measures to prevent
unbounded growth.


To get some idea of how this can be used, consider the kinds of
experiments we're running as part of #75066. Specifically we have a
reproduction where we can observe spikes in latencies for foreground
traffic in the presence of concurrent backups (incremental/full). In an
experiment with incremental backups running every 10m, with full backups
running every 35m (RECURRING '*/10 * * * *' FULL BACKUP '35 * * * *'),
we observe latency spikes during overlap periods. With this cluster
setting we were able to set up trace captures over a 10h window to get a
set of representative outlier traces to investigate further.

> SELECT crdb_internal.request_statement_bundle(
  'INSERT INTO new_order(no_o_id, ...)', -- stmt fingerprint
  0.05,                                  -- 5% sampling probability
  '30ms'::INTERVAL,                      -- 30ms target (p99.9)
  '10h'::INTERVAL                        -- capture window
);

> WITH histogram AS
     (SELECT extract('minute', collected_at) AS minute,
             count(*) FROM system.statement_diagnostics
      GROUP BY minute)
SELECT minute, repeat('*', (30 * count/(max(count) OVER ()))::INT8) AS freq
FROM histogram
ORDER BY count DESC
LIMIT 10;

  minute |              freq
---------+---------------------------------
      36 | ******************************
      38 | *********************
      35 | *********************
      00 | *********************
      37 | ********************
      30 | ********************
      40 | *****************
      20 | **************
      10 | *************
      50 | ***********
(10 rows)

We see that we captured just the set of bundles/traces we were interested in.

Release note: None

@irfansharif irfansharif requested a review from a team as a code owner June 16, 2022 21:32
@cockroach-teamcity
Member

This change is Reviewable

Member

@yuzefovich yuzefovich left a comment


Reviewed 8 of 8 files at r1, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei and @irfansharif)


pkg/sql/stmtdiagnostics/statement_diagnostics.go line 677 at r1 (raw file):

				//   there's a latency threshold);
				// - Requests need to have an expiration set;
				if (req.samplingProbability > 0 && req.samplingProbability < 1) &&

What's the reasoning for not allowing 1.0 probability?


pkg/sql/stmtdiagnostics/statement_diagnostics_test.go line 339 at r1 (raw file):

	})

	t.Run("continuous capture disabled without sampling probability", func(t *testing.T) {

I'm not sure that I follow what this test case is testing - given the name of the test case, I'd expect we'd enable continuous capture, run the same query twice, and assert that only one bundle was collected. Can you add some commentary here and for the next test case about the goals?

Contributor

@andreimatei andreimatei left a comment


Longer term we'd want:
Controls over the maximum number of captures we'd want stored over
some period of time;

Why introduce a cluster setting for this, as opposed to a per-diagnostics-request count of samples to store? I say let's go straight to that. To hedge against various columns in the diagnostics requests table not surviving the test of time, let's group some of these request params in a JSON column (I'd also put the sampling probability in there). I'm thinking a limit on the count of bundles to store, whereby the request is considered satisfied once this limit is reached, is good for now (i.e. no need for replacing "less interesting" samples with "more interesting" ones).

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @irfansharif)

Contributor Author

@irfansharif irfansharif left a comment


I fell asleep on this PR but it's still something I want to make it into 22.2, so I'll hold off on more invasive changes like those you're suggesting, Andrei (also I'm a bit lazy). This whole thing is a bit of a hack: we're able to dump all bundles into this table since we never introduced a unique FK constraint (I'm not sure that was the intent given how it's wired up in the UI). Speaking of the UI, when we do collect additional bundles for a single request, it's not currently equipped to show all of them, but that's something I'll also defer to others, including perhaps surfacing the sampling probability in the stmt diagnostics modal. This PR has a purely let-me-run-experiments agenda, perhaps under supervision on actual production clusters for latency-sensitive customers. To make this more featureful I would do what you're saying, e.g. enforce some max bundle limit instead of an expiration time -- as is, it can be a bit of a DDoS vector. But for now I feel this is OK, including as a backport to 22.2.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @yuzefovich)


pkg/sql/stmtdiagnostics/statement_diagnostics.go line 677 at r1 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

What's the reasoning for not allowing 1.0 probability?

No good reason, removed.


pkg/sql/stmtdiagnostics/statement_diagnostics_test.go line 339 at r1 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

I'm not sure that I follow what this test case is testing - given the name of the test case, I'd expect we'd enable continuous capture, run the same query twice, and assert that only one bundle was collected. Can you add some commentary here and for the next test case about the goals?

Done.

Code quote:

continuous capture disabled without sampling probability

@irfansharif irfansharif force-pushed the 220616.continuous-diagnostic-collection branch from 8c7d19d to 4a6b57f Compare September 24, 2022 00:11
@irfansharif irfansharif requested review from a team as code owners September 24, 2022 00:11
Member

@yuzefovich yuzefovich left a comment


:lgtm:

Reviewed 8 of 8 files at r2, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @irfansharif)

@irfansharif
Contributor Author

CI failure was fixed in #88613.

bors r+

@andreimatei
Contributor

Please give me a chance to review this.

@irfansharif
Contributor Author

Have at it.

bors r-

@craig
Contributor

craig bot commented Sep 26, 2022

Canceled.

Contributor

@andreimatei andreimatei left a comment


Reviewed 1 of 8 files at r2.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @irfansharif)


pkg/sql/stmtdiagnostics/statement_diagnostics.go line 66 at r2 (raw file):

// today (best you can do is `cockroach statement-diag delete --all`). This
// setting also captures multiple bundles for a single diagnostic request which
// does not fit super well with our current scheme of

s/super well/well

But add more words saying how it doesn't fit. Say something about the completed field on the request, and about what the UI will show.


pkg/sql/stmtdiagnostics/statement_diagnostics.go line 68 at r2 (raw file):

retrievable using cockroach statement-diag download <bundle-id>

They are retrievable like this or they're not?


pkg/sql/stmtdiagnostics/statement_diagnostics.go line 133 at r2 (raw file):

// NewRegistry constructs a new Registry.
func NewRegistry(
	ie sqlutil.InternalExecutor, db *kv.DB, st *cluster.Settings, knobs *TestingKnobs,

pass the knobs by value


pkg/sql/stmtdiagnostics/statement_diagnostics.go line 204 at r2 (raw file):

		case <-pollIntervalChanged:
			continue // go back around and maybe reset the timer
		case <-testingPollingCh:

this can be <-r.knobs.PollingCh; you don't need the conditional assignment of testingPollingCh


pkg/sql/stmtdiagnostics/statement_diagnostics.go line 305 at r2 (raw file):

				samplingProbability)
		}
		if minExecutionLatency.Nanoseconds() == 0 {

do you need .Nanoseconds() ?


pkg/sql/stmtdiagnostics/statement_diagnostics.go line 307 at r2 (raw file):

		if minExecutionLatency.Nanoseconds() == 0 {
			return 0, errors.AssertionFailedf(
				"malformed input: got non-zero sampling probability %f and empty min exec latency",

which "input" is malformed? Is this really an assertion error (i.e. programming error)?
Similarly above.


pkg/sql/stmtdiagnostics/statement_diagnostics.go line 614 at r2 (raw file):

				// capturing future traces, i.e. not mark this request as
				// completed.
				// - Requests need to be of the sampling sort (also implies

suggest why. Otherwise this comment just repeats the code.


pkg/sql/stmtdiagnostics/statement_diagnostics.go line 617 at r2 (raw file):

				//   there's a latency threshold);
				// - Requests need to have an expiration set.
				if req.samplingProbability > 0 && !req.expiresAt.IsZero() {

nit: shouldMarkCompleted = req.samplingProb == 0 || ...


pkg/sql/stmtdiagnostics/statement_diagnostics.go line 624 at r2 (raw file):

				sessiondata.InternalExecutorOverride{User: username.RootUserName()},
				"UPDATE system.statement_diagnostics_requests "+
					"SET completed = $1, statement_diagnostics_id = $2 WHERE id = $3",

so requests with completed == false can now have statement_diagnostics_id set? And namely it is set to the last bundle that was collected? Why is doing this necessary, as opposed to not updating the row at all when we're not marking it as completed?
What does the UI show for such a request?


pkg/sql/stmtdiagnostics/statement_diagnostics.go line 718 at r2 (raw file):

				samplingProbability = float64(*prob)
				if samplingProbability < 0 || samplingProbability > 1 {
					log.Warningf(ctx, "malformed sampling probability: %f (expected in range [0, 1]), resetting to 1.0",

include the request id in the message


pkg/sql/stmtdiagnostics/statement_diagnostics.go line 738 at r2 (raw file):

// TestingFindRequest exports findRequest for testing purposes.
func (r *Registry) TestingFindRequest(requestID RequestID) bool {

move to helpers_test


pkg/sql/stmtdiagnostics/statement_diagnostics.go line 748 at r2 (raw file):

	// CollectUntilExpirationOverride lets tests intercept+override the value
	// read from collectUntilExpiration cluster setting.
	CollectUntilExpirationOverride func() bool

Do you really need this? Can't the test set the cluster setting?


pkg/sql/stmtdiagnostics/statement_diagnostics.go line 751 at r2 (raw file):

	// PollingCh lets tests directly induce registry-internal polling of statement
	// requests.
	PollingCh chan struct{}

nit: consider PollingC.


pkg/sql/stmtdiagnostics/statement_diagnostics.go line 751 at r2 (raw file):

	// PollingCh lets tests directly induce registry-internal polling of statement
	// requests.
	PollingCh chan struct{}

instead of this, can the test just muck with sql.stmt_diagnostics.poll_interval?


pkg/sql/stmtdiagnostics/statement_diagnostics.go line 754 at r2 (raw file):

}

// ModuleTestingKnobs is part of the base.ModuleTestingKnobs interface.

nit: this is the only method in that interface, so the canonical comment is ModuleTK implements base.ModuleTestingKnobs.


pkg/sql/stmtdiagnostics/statement_diagnostics_test.go line 43 at r2 (raw file):

	defer log.Scope(t).Close(t)

	mu := struct {

syncutil.AtomicBool ?


pkg/sql/stmtdiagnostics/statement_diagnostics_test.go line 335 at r2 (raw file):

		samplingProbability, expiresAfter := 0.5, time.Second
		_, err := registry.InsertRequestInternal(ctx, "SELECT pg_sleep(_)",
			samplingProbability, 0*time.Nanosecond, expiresAfter)

Comment the 0*time.Nanosecond arg pls.
Do you need the *time.Nanosecond?


pkg/sql/stmtdiagnostics/statement_diagnostics_test.go line 342 at r2 (raw file):

		// We validate that continuous captures is disabled when a sampling
		// probability of 0.0 is used. We know that it's disabled given the
		// diagnostic request is marked as completed despite us not getting to

maybe spell it out that continuous capture doesn't mark the request as completed.


pkg/sql/stmtdiagnostics/statement_diagnostics_test.go line 365 at r2 (raw file):

	t.Run("continuous capture disabled without expiration timestamp", func(t *testing.T) {
		samplingProbability, minExecutionLatency, expiresAfter := 0.999, time.Microsecond, 0*time.Hour

point to the previous test, which has some commentary.


pkg/sql/stmtdiagnostics/statement_diagnostics_test.go line 431 at r2 (raw file):

		select {
		case <-time.After(30 * time.Second):
			t.Fatalf("timed out")

say more about what

@irfansharif irfansharif force-pushed the 220616.continuous-diagnostic-collection branch from 4a6b57f to 27cc345 Compare September 27, 2022 11:12
Contributor Author

@irfansharif irfansharif left a comment


PTAL.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @andreimatei and @yuzefovich)


pkg/sql/stmtdiagnostics/statement_diagnostics.go line 66 at r2 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

s/super well/well

But add more words saying how it doesn't fit. Say something about the completed field on the request, and about what the UI will show.

Done.

Code quote:

does not fit super well with our current scheme of

pkg/sql/stmtdiagnostics/statement_diagnostics.go line 68 at r2 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

retrievable using cockroach statement-diag download <bundle-id>

They are retrievable like this or they're not?

They are, rephrased.


pkg/sql/stmtdiagnostics/statement_diagnostics.go line 133 at r2 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

pass the knobs by value

Done.


pkg/sql/stmtdiagnostics/statement_diagnostics.go line 204 at r2 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

this can be <-r.knobs.PollingCh; you don't need the conditional assignment of testingPollingCh

Done.


pkg/sql/stmtdiagnostics/statement_diagnostics.go line 305 at r2 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

do you need .Nanoseconds() ?

Done.

Code quote:

minExecutionLatency.Nanoseconds() 

pkg/sql/stmtdiagnostics/statement_diagnostics.go line 307 at r2 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

which "input" is malformed? Is this really an assertion error (i.e. programming error)?
Similarly above.

There are a few code paths that get to this point, which is why the assertion sits here. I do think it's a programming error, or rather will be, if not appropriately sanitized in the UI. If you're hooking into this through the crdb_internal builtin, the error appears as is. I never remember what the right error type to use is, but I've downgraded it to the Newf kind and omitted the "malformed input" prefix.


pkg/sql/stmtdiagnostics/statement_diagnostics.go line 614 at r2 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

suggest why. Otherwise this comment just repeats the code.

Done.


pkg/sql/stmtdiagnostics/statement_diagnostics.go line 617 at r2 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

nit: shouldMarkCompleted = req.samplingProb == 0 || ...

The current thing reads a bit more easily to me.


pkg/sql/stmtdiagnostics/statement_diagnostics.go line 624 at r2 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

so requests with completed == false can now have statement_diagnostics_id set? And namely it is set to the last bundle that was collected? Why is doing this necessary, as opposed to not updating the row at all when we're not marking it as completed?
What does the UI show for such a request?

Did it this way to make sure that if there was at least one bundle that was collected before the request completed/expired, it would be recorded. The UI ends up showing just the last bundle, and only once the request is completed. It shows the request as "Waiting" up until then.


pkg/sql/stmtdiagnostics/statement_diagnostics.go line 718 at r2 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

include the request id in the message

Done.


pkg/sql/stmtdiagnostics/statement_diagnostics.go line 738 at r2 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

move to helpers_test

Done.


pkg/sql/stmtdiagnostics/statement_diagnostics.go line 748 at r2 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

Do you really need this? Can't the test set the cluster setting?

Very true, done.


pkg/sql/stmtdiagnostics/statement_diagnostics.go line 751 at r2 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

instead of this, can the test muck set sql.stmt_diagnostics.poll_interval ?

Done (though it makes the test slightly slower).


pkg/sql/stmtdiagnostics/statement_diagnostics.go line 754 at r2 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

nit: this is the only method in that interface, so the canonical comment is ModuleTK implements base.ModuleTestingKnobs.

Done, but you need to start enforcing this canon on more PRs 😛

$ rg "ModuleTestingKnobs is part of the base.ModuleTestingKnobs interface." | wc -l
22

$ rg "ModuleTestingKnobs implements base.ModuleTestingKnobs" | wc -l
6

pkg/sql/stmtdiagnostics/statement_diagnostics_test.go line 43 at r2 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

syncutil.AtomicBool ?

TIL, done.


pkg/sql/stmtdiagnostics/statement_diagnostics_test.go line 335 at r2 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

Comment the 0*time.Nanosecond arg pls.
Do you need the *time.Nanosecond?

Done and Done.

Code quote:

collectUntilExpirationOverride bool

pkg/sql/stmtdiagnostics/statement_diagnostics_test.go line 342 at r2 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

maybe spell it out that continuous capture doesn't mark the request as completed.

Done.


pkg/sql/stmtdiagnostics/statement_diagnostics_test.go line 365 at r2 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

point to the previous test, which has some commentary.

Done.


pkg/sql/stmtdiagnostics/statement_diagnostics_test.go line 431 at r2 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

say more about what

Done.

..until expiry. Informs cockroachdb#82896 (more specifically this is a short-term
alternative to the part pertaining to continuous tail capture). The
issue has more background, but we repeat some below for posterity.

It's desirable to draw from a set of tail execution traces collected
over time when investigating tail latencies. cockroachdb#82750 introduced a
probabilistic mechanism to capture a single tail event for individual
stmts with bounded overhead (determined by the sampling probability,
trading off how long until a single capture is obtained). This PR
introduces a sql.stmt_diagnostics.collect_continuously.enabled cluster
setting to collect captures continuously over some period of time for
aggregate
analysis. To get some idea of how this can be used, consider the kinds
of experiments we're running as part of cockroachdb#75066. Specifically we have a
reproduction where we can observe spikes in latencies for foreground
traffic in the presence of concurrent backups (incremental/full). In an
experiment with incremental backups running every 10m, with full backups
running every 35m (`RECURRING '*/10 * * * *' FULL BACKUP '35 * * * *'`),
we observe latency spikes during overlap periods. With this cluster
setting we were able to set up trace captures over a 10h window to get a
set of representative outlier traces to investigate further.

    SELECT crdb_internal.request_statement_bundle(
      'INSERT INTO new_order(no_o_id, ...)', -- stmt fingerprint
      0.05,                                  -- 5% sampling probability
      '30ms'::INTERVAL,                      -- 30ms target (p99.9)
      '10h'::INTERVAL                        -- capture window
    );

    WITH histogram AS
         (SELECT extract('minute', collected_at) AS minute,
                 count(*) FROM system.statement_diagnostics
          GROUP BY minute)
    SELECT minute, repeat('*', (30 * count/(max(count) OVER ()))::INT8) AS freq
    FROM histogram
    ORDER BY count DESC
    LIMIT 10;

      minute |              freq
    ---------+---------------------------------
          36 | ******************************
          38 | *********************
          35 | *********************
          00 | *********************
          37 | ********************
          30 | ********************
          40 | *****************
          20 | **************
          10 | *************
          50 | ***********
    (10 rows)

We see that we captured just the set of bundles/traces we were
interested in. Longer term we'd want:
- Controls over the maximum number of captures we'd want stored over
  some period of time;
- Eviction of older bundles, assuming they're less relevant, making room
  for newer captures.
To safeguard against misuse (in this current form we should only use it
for experiments or escalations under controlled environments), we only
act on this setting provided the diagnostics request has an expiration
timestamp and a specified probability, crude measures to prevent
unbounded growth.

Release note: None
@irfansharif irfansharif force-pushed the 220616.continuous-diagnostic-collection branch from 27cc345 to 0c9b96f Compare September 27, 2022 14:08
Contributor

@andreimatei andreimatei left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @andreimatei and @yuzefovich)

Contributor

@andreimatei andreimatei left a comment


:lgtm:

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (and 1 stale) (waiting on @andreimatei and @yuzefovich)

@irfansharif
Contributor Author

thx dude

bors r+

@craig
Contributor

craig bot commented Sep 27, 2022

Build failed (retrying...):

@craig
Contributor

craig bot commented Sep 27, 2022

Build failed (retrying...):

@craig
Contributor

craig bot commented Sep 28, 2022

Build succeeded:

@craig craig bot merged commit cdd8b6e into cockroachdb:master Sep 28, 2022
@irfansharif irfansharif deleted the 220616.continuous-diagnostic-collection branch September 28, 2022 03:06
irfansharif added a commit to irfansharif/cockroach that referenced this pull request Oct 11, 2022
This commit introduces a backportable alternative to cockroachdb#82750 and cockroachdb#83020,
to help solve for cockroachdb#82896. It gives a palatable alternative to
sql.trace.stmt.enable_threshold. See those PRs/issues for verbose
commentary, but roughly we can do the following:

  Pick a stmt fingerprint, declare a sampling probability which controls
  when verbose tracing is enabled for it, and a latency threshold for
  which a trace is logged. With a given stmt rate (say 1000/s) and a
  given percentile we're trying to capture (say p99.9), we have N stmt/s
  in the 99.9th percentile (1 in our example). We should be able to set
  a sampling probability P such that with high likelihood (>95%) we
  capture at least one trace over the next S seconds. The sampling
  rate lets you control the overhead you're introducing for those
  statements in aggregate; dialing it up higher lets you lower S. You
  might want to do such a thing for infrequently executed
  statements.
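
As a back-of-the-envelope check, that sizing math can be sketched as follows. This is a standalone illustration, not code from the commit; the function name is made up. Each of the n = tailRate*S qualifying executions is traced independently with probability p, so the chance of at least one capture is 1 - (1-p)^n, and solving for the minimum p gives:

```go
package main

import (
	"fmt"
	"math"
)

// minSamplingProbability returns the smallest sampling probability p such
// that, with tailRate qualifying executions per second over a window of
// windowSecs seconds, at least one trace is captured with the given
// likelihood: 1 - (1-p)^(tailRate*windowSecs) >= likelihood.
func minSamplingProbability(tailRate, windowSecs, likelihood float64) float64 {
	n := tailRate * windowSecs
	return 1 - math.Pow(1-likelihood, 1/n)
}

func main() {
	// The commit's example: 1000 stmt/s at p99.9 => ~1 tail event/s.
	// Over a 60s window with 95% confidence, ~4.9% sampling suffices.
	p := minSamplingProbability(1.0, 60, 0.95)
	fmt.Printf("%.4f\n", p) // prints "0.0487"
}
```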

This commit does the simplest thing: ask for all input through shoddy
cluster settings and just log traces to our usual files. Below we
outline an example of how to use these settings to catch outlier
executions for writes to the history table in TPC-C. When using it, we
did not see an overall increase in p99s as a result of the sampling. It
also took only about 10s to capture this data, showing clearly that it
was due to latch waits.

  > SELECT
      encode(fingerprint_id, 'hex'),
      (statistics -> 'statistics' ->> 'cnt')::INT AS count,
      metadata ->> 'query' AS query
    FROM system.statement_statistics
    ORDER BY COUNT DESC limit 10;

           encode    | count |                                             query
    -----------------+-------+--------------------------------------------------------------------------------------------------------------------
    ...
    4e4214880f87d799 |  2680 | INSERT INTO history(h_c_id, h_c_d_id, h_c_w_id, h_d_id, h_w_id, h_amount, h_date, h_data) VALUES ($1, $2, __more6__)

    > SET CLUSTER SETTING trace.fingerprint = '4e4214880f87d799'; -- fingerprint
    > SET CLUSTER SETTING trace.fingerprint.probability = 0.01;   -- 1% sampling probability
    > SET CLUSTER SETTING trace.fingerprint.threshold = '90ms';   -- latency threshold

     0.521ms      0.005ms    event:kv/kvserver/concurrency/concurrency_manager.go:260 [n1,s1,r105/1:/Table/109/1/{90/7/2…-105/10…}] acquiring latches›
     0.532ms      0.011ms    event:kv/kvserver/spanlatch/manager.go:532 [n1,s1,r105/1:/Table/109/1/{90/7/2…-105/10…}] waiting to acquire read latch /Table/109/1/105/3/519/0@0,0, held by write latch /Table/109/1/105/3/519/0@1654849354.483984000,0›
    99.838ms     99.306ms    event:kv/kvserver/concurrency/concurrency_manager.go:301 [n1,s1,r105/1:/Table/109/1/{90/7/2…-105/10…}] scanning lock table for conflicting locks›
    99.851ms      0.013ms    event:kv/kvserver/replica_read.go:251 [n1,s1,r105/1:/Table/109/1/{90/7/2…-105/10…}] executing read-only batch›

Compare this to sql.trace.stmt.enable_threshold which enables verbose
tracing for all statements with 100% probability. It introduces far too
much overhead for it to be used in reasonable production clusters. The
overhead also often masks the problems we're looking to capture.

Release note (general change): We introduced a trifecta of cluster
settings to collect trace data for outlier executions with low overhead.
This is only going to be available in 22.1; in 22.2 and beyond we have
other mechanisms to collect outlier traces. Traces come in handy when
looking to investigate latency spikes, and these three settings are
intended to supplant most uses of sql.trace.stmt.enable_threshold. That
setting enables verbose tracing for all statements with 100% probability
which can cause a lot of overhead in production clusters, and also a lot
of logging pressure. Instead we introduce the following:
- trace.fingerprint
- trace.fingerprint.probability
- trace.fingerprint.threshold
Put together (all have to be set), these only enable tracing for the
statement with the given hex-encoded fingerprint, do so
probabilistically (with probability trace.fingerprint.probability), and
log the trace only if the latency threshold is exceeded (configured
using trace.fingerprint.threshold). To obtain a hex-encoded fingerprint,
consider looking at the contents of system.statement_statistics. For
example:

  > SELECT
      encode(fingerprint_id, 'hex'),
      (statistics -> 'statistics' ->> 'cnt')::INT AS count,
      metadata ->> 'query' AS query
    FROM system.statement_statistics
    ORDER BY COUNT DESC limit 10;

           encode    | count |                                             query
    -----------------+-------+--------------------------------------------------------------------------------------------------------------------
    ...
    4e4214880f87d799 |  2680 | INSERT INTO history(h_c_id, h_c_d_id, h_c_w_id, h_d_id, h_w_id, h_amount, h_date, h_data) VALUES ($1, $2, __more6__)