stmtdiagnostics: support continuous bundle collection #83020
Conversation
6ff8162 to 8c7d19d
yuzefovich
left a comment
Reviewed 8 of 8 files at r1, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei and @irfansharif)
pkg/sql/stmtdiagnostics/statement_diagnostics.go line 677 at r1 (raw file):
// there's a latency threshold);
// - Requests need to have an expiration set;
if (req.samplingProbability > 0 && req.samplingProbability < 1) &&
What's the reasoning for not allowing 1.0 probability?
pkg/sql/stmtdiagnostics/statement_diagnostics_test.go line 339 at r1 (raw file):
})
t.Run("continuous capture disabled without sampling probability", func(t *testing.T) {
I'm not sure that I follow what this test case is testing - given the name of the test case, I'd expect we'd enable continuous capture, run the same query twice, and assert that only one bundle was collected. Can you add some commentary here and for the next test case about the goals?
andreimatei
left a comment
Longer term we'd want:
Controls over the maximum number of captures we'd want stored over
some period of time;
Why introduce a cluster setting for this, as opposed to a per-diagnostics-request count of samples to store? I say let's go straight to that. To hedge against various columns in the diagnostics requests table not surviving the test of time, let's group some of these request params in a JSON column (I'd also put the sampling probability in there). I'm thinking a limit on the count of bundles to store, whereby the request is considered satisfied once this limit is reached, is good for now (i.e. no need for replacing "less interesting" samples with "more interesting" ones).
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @irfansharif)
irfansharif
left a comment
I fell asleep on this PR, but it's still something I want to see make it into 22.2, so I'll hold off on more invasive changes like those you're suggesting, Andrei (also I'm a bit lazy). This whole thing is a bit of a hack: we're able to dump all bundles into this table since we never introduced a unique FK constraint (I'm not sure that was the intent given how it's wired up in the UI). Speaking of the UI, when we do collect additional bundles for a single request, it's not currently equipped to show all of them, but that's something I'll also defer to others, including perhaps surfacing the sampling probability in the stmt diagnostics modal. This PR has a purely let-me-run-experiments agenda, perhaps under supervision on actual production clusters for latency-sensitive customers. To make this more featureful I would do what you're saying, like enforce some max limit instead of an expiration time -- as is, it can be a bit of a DDoS vector. But for now I feel this is ok, including as a backport to 22.2.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @yuzefovich)
pkg/sql/stmtdiagnostics/statement_diagnostics.go line 677 at r1 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
What's the reasoning for not allowing 1.0 probability?
No good reason, removed.
pkg/sql/stmtdiagnostics/statement_diagnostics_test.go line 339 at r1 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
I'm not sure that I follow what this test case is testing - given the name of the test case, I'd expect we'd enable continuous capture, run the same query twice, and assert that only one bundle was collected. Can you add some commentary here and for the next test case about the goals?
Done.
Code quote:
continuous capture disabled without sampling probability

8c7d19d to 4a6b57f
yuzefovich
left a comment
Reviewed 8 of 8 files at r2, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @irfansharif)
CI failure was fixed in #88613.
bors r+

Please give me a chance to review this.

Have at it.
bors r-

Canceled.
andreimatei
left a comment
Reviewed 1 of 8 files at r2.
Reviewable status:complete! 1 of 0 LGTMs obtained (waiting on @irfansharif)
pkg/sql/stmtdiagnostics/statement_diagnostics.go line 66 at r2 (raw file):
// today (best you can do is `cockroach statement-diag delete --all`). This
// setting also captures multiple bundles for a single diagnostic request which
// does not fit super well with our current scheme of
s/super well/well
But add more words saying how it doesn't fit. Say something about the completed field on the request, and about what the UI will show.
pkg/sql/stmtdiagnostics/statement_diagnostics.go line 68 at r2 (raw file):
retrievable using
cockroach statement-diag download <bundle-id>
They are retrievable like this or they're not?
pkg/sql/stmtdiagnostics/statement_diagnostics.go line 133 at r2 (raw file):
// NewRegistry constructs a new Registry.
func NewRegistry(
	ie sqlutil.InternalExecutor, db *kv.DB, st *cluster.Settings, knobs *TestingKnobs,
pass the knobs by value
pkg/sql/stmtdiagnostics/statement_diagnostics.go line 204 at r2 (raw file):
case <-pollIntervalChanged:
	continue // go back around and maybe reset the timer
case <-testingPollingCh:
this can be <-r.knobs.PollingCh; you don't need the conditional assignment of testingPollingCh
pkg/sql/stmtdiagnostics/statement_diagnostics.go line 305 at r2 (raw file):
	samplingProbability)
}
if minExecutionLatency.Nanoseconds() == 0 {
do you need .Nanoseconds() ?
pkg/sql/stmtdiagnostics/statement_diagnostics.go line 307 at r2 (raw file):
if minExecutionLatency.Nanoseconds() == 0 {
	return 0, errors.AssertionFailedf(
		"malformed input: got non-zero sampling probability %f and empty min exec latency",
which "input" is malformed? Is this really an assertion error (i.e. programming error)?
Similarly above.
pkg/sql/stmtdiagnostics/statement_diagnostics.go line 614 at r2 (raw file):
// capturing future traces, i.e. not mark this request as
// completed.
// - Requests need to be of the sampling sort (also implies
suggest why. Otherwise this comment just repeats the code.
pkg/sql/stmtdiagnostics/statement_diagnostics.go line 617 at r2 (raw file):
// there's a latency threshold);
// - Requests need to have an expiration set.
if req.samplingProbability > 0 && !req.expiresAt.IsZero() {
nit: shouldMarkCompleted = req.samplingProb == 0 || ...
pkg/sql/stmtdiagnostics/statement_diagnostics.go line 624 at r2 (raw file):
sessiondata.InternalExecutorOverride{User: username.RootUserName()},
"UPDATE system.statement_diagnostics_requests "+
	"SET completed = $1, statement_diagnostics_id = $2 WHERE id = $3",
so requests with completed == false can now have statement_diagnostics_id set? And namely it is set to the last bundle that was collected? Why is doing this necessary, as opposed to not updating the row at all when we're not marking it as completed?
What does the UI show for such a request?
pkg/sql/stmtdiagnostics/statement_diagnostics.go line 718 at r2 (raw file):
samplingProbability = float64(*prob)
if samplingProbability < 0 || samplingProbability > 1 {
	log.Warningf(ctx, "malformed sampling probability: %f (expected in range [0, 1]), resetting to 1.0",
include the request id in the message
pkg/sql/stmtdiagnostics/statement_diagnostics.go line 738 at r2 (raw file):
// TestingFindRequest exports findRequest for testing purposes.
func (r *Registry) TestingFindRequest(requestID RequestID) bool {
move to helpers_test
pkg/sql/stmtdiagnostics/statement_diagnostics.go line 748 at r2 (raw file):
// CollectUntilExpirationOverride lets tests intercept+override the value
// read from the collectUntilExpiration cluster setting.
CollectUntilExpirationOverride func() bool
Do you really need this? Can't the test set the cluster setting?
pkg/sql/stmtdiagnostics/statement_diagnostics.go line 751 at r2 (raw file):
// PollingCh lets tests directly induce registry-internal polling of statement
// requests.
PollingCh chan struct{}
nit: consider PollingC.
pkg/sql/stmtdiagnostics/statement_diagnostics.go line 751 at r2 (raw file):
// PollingCh lets tests directly induce registry-internal polling of statement
// requests.
PollingCh chan struct{}
instead of this, can the test just set sql.stmt_diagnostics.poll_interval?
pkg/sql/stmtdiagnostics/statement_diagnostics.go line 754 at r2 (raw file):
}

// ModuleTestingKnobs is part of the base.ModuleTestingKnobs interface.
nit: this is the only method in that interface, so the canonical comment is ModuleTestingKnobs implements base.ModuleTestingKnobs.
pkg/sql/stmtdiagnostics/statement_diagnostics_test.go line 43 at r2 (raw file):
defer log.Scope(t).Close(t)

mu := struct {
syncutil.AtomicBool ?
pkg/sql/stmtdiagnostics/statement_diagnostics_test.go line 335 at r2 (raw file):
samplingProbability, expiresAfter := 0.5, time.Second
_, err := registry.InsertRequestInternal(ctx, "SELECT pg_sleep(_)", samplingProbability, 0*time.Nanosecond, expiresAfter)
Comment the 0*time.Nanosecond arg pls.
Do you need the *time.Nanosecond?
pkg/sql/stmtdiagnostics/statement_diagnostics_test.go line 342 at r2 (raw file):
// We validate that continuous capture is disabled when a sampling
// probability of 0.0 is used. We know that it's disabled given the
// diagnostic request is marked as completed despite us not getting to
maybe spell it out that continuous capture doesn't mark the request as completed.
pkg/sql/stmtdiagnostics/statement_diagnostics_test.go line 365 at r2 (raw file):
t.Run("continuous capture disabled without expiration timestamp", func(t *testing.T) {
	samplingProbability, minExecutionLatency, expiresAfter := 0.999, time.Microsecond, 0*time.Hour
point to the previous test, which has some commentary.
pkg/sql/stmtdiagnostics/statement_diagnostics_test.go line 431 at r2 (raw file):
select {
case <-time.After(30 * time.Second):
	t.Fatalf("timed out")
say more about what
4a6b57f to 27cc345
irfansharif
left a comment
PTAL.
Reviewable status:
complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @andreimatei and @yuzefovich)
pkg/sql/stmtdiagnostics/statement_diagnostics.go line 66 at r2 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
s/super well/well
But add more words saying how it doesn't fit. Say something about the
completed field on the request, and about what the UI will show.
Done.
Code quote:
does not fit super well with our current scheme of

pkg/sql/stmtdiagnostics/statement_diagnostics.go line 68 at r2 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
retrievable using
cockroach statement-diag download <bundle-id>

They are retrievable like this or they're not?
They are, rephrased.
pkg/sql/stmtdiagnostics/statement_diagnostics.go line 133 at r2 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
pass the knobs by value
Done.
pkg/sql/stmtdiagnostics/statement_diagnostics.go line 204 at r2 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
this can be <-r.knobs.PollingCh; you don't need the conditional assignment of testingPollingCh
Done.
pkg/sql/stmtdiagnostics/statement_diagnostics.go line 305 at r2 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
do you need
.Nanoseconds()?
Done.
Code quote:
minExecutionLatency.Nanoseconds()

pkg/sql/stmtdiagnostics/statement_diagnostics.go line 307 at r2 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
which "input" is malformed? Is this really an assertion error (i.e. programming error)?
Similarly above.
There are a few code paths to get to this point, which is why the assertion sits here. I do think it's a programming error, or rather will be, if not appropriately sanitized in the UI. If you're hooking into this through the crdb_internal builtin, the error appears as is. I never remember what the right error type to use is, but I've downgraded it to the Newf kind and omitted the "malformed input" prefix.
pkg/sql/stmtdiagnostics/statement_diagnostics.go line 614 at r2 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
suggest why. Otherwise this comment just repeats the code.
Done.
pkg/sql/stmtdiagnostics/statement_diagnostics.go line 617 at r2 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
nit:
shouldMarkCompleted = req.samplingProb == 0 || ...
The current thing reads a bit more easily to me.
pkg/sql/stmtdiagnostics/statement_diagnostics.go line 624 at r2 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
so requests with completed == false can now have statement_diagnostics_id set? And namely it is set to the last bundle that was collected? Why is doing this necessary, as opposed to not updating the row at all when we're not marking it as completed?
What does the UI show for such a request?
Did it this way to make sure that if there was at least one bundle that was collected before the request completed/expired, it would be recorded. The UI ends up showing just the last bundle, and only once the request is completed. It shows the request as "Waiting" up until then.
pkg/sql/stmtdiagnostics/statement_diagnostics.go line 718 at r2 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
include the request id in the message
Done.
pkg/sql/stmtdiagnostics/statement_diagnostics.go line 738 at r2 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
move to
helpers_test
Done.
pkg/sql/stmtdiagnostics/statement_diagnostics.go line 748 at r2 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
Do you really need this? Can't the test set the cluster setting?
Very true, done.
pkg/sql/stmtdiagnostics/statement_diagnostics.go line 751 at r2 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
instead of this, can the test just set
sql.stmt_diagnostics.poll_interval?
Done (though it makes the test slightly slower).
pkg/sql/stmtdiagnostics/statement_diagnostics.go line 754 at r2 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
nit: this is the only method in that interface, so the canonical comment is
ModuleTestingKnobs implements base.ModuleTestingKnobs.
Done, but you need to start enforcing this canon on more PRs 😛
$ rg "ModuleTestingKnobs is part of the base.ModuleTestingKnobs interface." | wc -l
22
$ rg "ModuleTestingKnobs implements base.ModuleTestingKnobs" | wc -l
6
pkg/sql/stmtdiagnostics/statement_diagnostics_test.go line 43 at r2 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
syncutil.AtomicBool?
TIL, done.
pkg/sql/stmtdiagnostics/statement_diagnostics_test.go line 335 at r2 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
Comment the 0*time.Nanosecond arg pls.
Do you need the *time.Nanosecond?
Done and Done.
Code quote:
collectUntilExpirationOverride bool

pkg/sql/stmtdiagnostics/statement_diagnostics_test.go line 342 at r2 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
maybe spell it out that continuous capture doesn't mark the request as completed.
Done.
pkg/sql/stmtdiagnostics/statement_diagnostics_test.go line 365 at r2 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
point to the previous test, which has some commentary.
Done.
pkg/sql/stmtdiagnostics/statement_diagnostics_test.go line 431 at r2 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
say more about what
Done.
..until expiry. Informs cockroachdb#82896 (more specifically, this is a short-term alternative to the part pertaining to continuous tail capture). The issue has more background, but we repeat some below for posterity.

It's desirable to draw from a set of tail execution traces collected over time when investigating tail latencies. cockroachdb#82750 introduced a probabilistic mechanism to capture a single tail event for individual stmts with bounded overhead (determined by the sampling probability, trading off how long until a single capture is obtained). This PR introduces a sql.stmt_diagnostics.collect_continuously.enabled cluster setting to collect captures continuously over some period of time for aggregate analysis.

To get some idea of how this can be used, consider the kinds of experiments we're running as part of cockroachdb#75066. Specifically we have a reproduction where we can observe spikes in latencies for foreground traffic in the presence of concurrent backups (incremental/full). In an experiment with incremental backups running every 10m and full backups running every 35m (`RECURRING '*/10 * * * *' FULL BACKUP '35 * * * *'`), we observe latency spikes during overlap periods. With this cluster setting we were able to set up trace captures over a 10h window to get a set of representative outlier traces to investigate further.

    SELECT crdb_internal.request_statement_bundle(
      'INSERT INTO new_order(no_o_id, ...)', -- stmt fingerprint
      0.05,                                  -- 5% sampling probability
      '30ms'::INTERVAL,                      -- 30ms target (p99.9)
      '10h'::INTERVAL                        -- capture window
    );

    WITH histogram AS (
      SELECT extract('minute', collected_at) AS minute, count(*)
      FROM system.statement_diagnostics
      GROUP BY minute
    )
    SELECT minute, repeat('*', (30 * count/(max(count) OVER ()))::INT8) AS freq
    FROM histogram
    ORDER BY count DESC
    LIMIT 10;

      minute |             freq
    ---------+-------------------------------
          36 | ******************************
          38 | *********************
          35 | *********************
          00 | *********************
          37 | ********************
          30 | ********************
          40 | *****************
          20 | **************
          10 | *************
          50 | ***********
    (10 rows)

We see that we captured just the set of bundles/traces we were interested in. Longer term we'd want:
- Controls over the maximum number of captures we'd want stored over some period of time;
- Eviction of older bundles, assuming they're less relevant, making room for newer captures.

To safeguard against misuse (in this current form we should only use it for experiments or escalations under controlled environments), we only act on this setting provided the diagnostics request has an expiration timestamp and a specified probability, crude measures to prevent unbounded growth.

Release note: None
27cc345 to 0c9b96f
andreimatei
left a comment
Reviewable status:
complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @andreimatei and @yuzefovich)
andreimatei
left a comment
Reviewable status:
complete! 1 of 0 LGTMs obtained (and 1 stale) (waiting on @andreimatei and @yuzefovich)
thx dude
bors r+

Build failed (retrying...)

Build failed (retrying...)

Build succeeded
This commit introduces a backportable alternative to cockroachdb#82750 and cockroachdb#83020, to help solve for cockroachdb#82896. It gives a palatable alternative to sql.trace.stmt.enable_threshold. See those PRs/issues for verbose commentary, but roughly we can do the following: pick a stmt fingerprint, declare a sampling probability which controls when verbose tracing is enabled for it, and a latency threshold for which a trace is logged.

With a given stmt rate (say 1000/s) and a given percentile we're trying to capture (say p99.9), we have N stmt/s in the 99.9th percentile (1 in our example). We should be able to set a sampling probability P such that with high likelihood (>95%) we capture at least one trace over the next S seconds. The sampling rate lets you control how much overhead you're introducing for those statements in aggregate, which if dialed up higher lets you lower S. You might want to do such a thing for infrequently executed statements.

This commit does the simplest thing: ask for all input through shoddy cluster settings and just log traces to our usual files. Below we outline an example of how to use these settings to catch outlier executions for writes to the history table in TPC-C. When using it, we did not see an overall increase in p99s as a result of the sampling. It also took only about 10s to capture this data, showing clearly that it was due to latch waits.

    > SELECT encode(fingerprint_id, 'hex'),
             (statistics -> 'statistics' ->> 'cnt')::INT AS count,
             metadata ->> 'query' AS query
      FROM system.statement_statistics
      ORDER BY count DESC LIMIT 10;

           encode      | count | query
      -----------------+-------+----------------------------------------------------------------------------------------------------------------------
      ...
      4e4214880f87d799 | 2680  | INSERT INTO history(h_c_id, h_c_d_id, h_c_w_id, h_d_id, h_w_id, h_amount, h_date, h_data) VALUES ($1, $2, __more6__)

    > SET CLUSTER SETTING trace.fingerprint = '4e4214880f87d799';  -- fingerprint
    > SET CLUSTER SETTING trace.fingerprint.probability = 0.01;    -- 1% sampling probability
    > SET CLUSTER SETTING trace.fingerprint.threshold = '90ms';    -- latency threshold

     0.521ms    0.005ms    event:kv/kvserver/concurrency/concurrency_manager.go:260 [n1,s1,r105/1:/Table/109/1/{90/7/2…-105/10…}] acquiring latches
     0.532ms    0.011ms    event:kv/kvserver/spanlatch/manager.go:532 [n1,s1,r105/1:/Table/109/1/{90/7/2…-105/10…}] waiting to acquire read latch /Table/109/1/105/3/519/0@0,0, held by write latch /Table/109/1/105/3/519/0@1654849354.483984000,0
    99.838ms   99.306ms    event:kv/kvserver/concurrency/concurrency_manager.go:301 [n1,s1,r105/1:/Table/109/1/{90/7/2…-105/10…}] scanning lock table for conflicting locks
    99.851ms    0.013ms    event:kv/kvserver/replica_read.go:251 [n1,s1,r105/1:/Table/109/1/{90/7/2…-105/10…}] executing read-only batch

Compare this to sql.trace.stmt.enable_threshold, which enables verbose tracing for all statements with 100% probability. It introduces far too much overhead for it to be used in reasonable production clusters. The overhead also often masks the problems we're looking to capture.

Release note (general change): We introduced a trifecta of cluster settings to collect trace data for outlier executions with low overhead. This is only going to be available in 22.1; in 22.2 and beyond we have other mechanisms to collect outlier traces. Traces come in handy when looking to investigate latency spikes, and these three settings are intended to supplant most uses of sql.trace.stmt.enable_threshold. That setting enables verbose tracing for all statements with 100% probability, which can cause a lot of overhead in production clusters, and also a lot of logging pressure.

Instead we introduce the following:
- trace.fingerprint
- trace.fingerprint.probability
- trace.fingerprint.threshold

Put together (all have to be set), it only enables tracing for the statement with the set hex-encoded fingerprint, does so probabilistically (where the probability is whatever trace.fingerprint.probability is set to), logging it only if the latency threshold is exceeded (configured using trace.fingerprint.threshold). To obtain a hex-encoded fingerprint, consider looking at the contents of system.statement_statistics. For example:

    > SELECT encode(fingerprint_id, 'hex'),
             (statistics -> 'statistics' ->> 'cnt')::INT AS count,
             metadata ->> 'query' AS query
      FROM system.statement_statistics
      ORDER BY count DESC LIMIT 10;

           encode      | count | query
      -----------------+-------+----------------------------------------------------------------------------------------------------------------------
      ...
      4e4214880f87d799 | 2680  | INSERT INTO history(h_c_id, h_c_d_id, h_c_w_id, h_d_id, h_w_id, h_amount, h_date, h_data) VALUES ($1, $2, __more6__)
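The probability reasoning in the commit message above (choose P so that 1 - (1-P)^(N*S) >= 95%) can be sketched as a small Go helper. The function name and the example numbers are illustrative only, not part of the PR:

```go
package main

import (
	"fmt"
	"math"
)

// requiredSamplingProbability returns the per-statement sampling
// probability P needed so that, with outlierRate outlier statements per
// second and a window of windowSecs seconds, at least one outlier is
// traced with probability >= confidence. It solves
// 1 - (1-P)^(outlierRate*windowSecs) >= confidence for P.
func requiredSamplingProbability(outlierRate, windowSecs, confidence float64) float64 {
	trials := outlierRate * windowSecs
	return 1 - math.Pow(1-confidence, 1/trials)
}

func main() {
	// 1000 stmt/s at p99.9 => ~1 outlier stmt/s; capture one within 60s
	// with 95% likelihood.
	fmt.Printf("%.4f\n", requiredSamplingProbability(1, 60, 0.95)) // ~0.0487
}
```

Dialing the window S up lets you dial P (and hence the tracing overhead) down, which is the trade-off the commit message describes for infrequently executed statements.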
..until expiry. Informs #82896 (more specifically this is a short-term
alternative to the part pertaining to continuous tail capture). The
issue has more background, but we repeat some below for posterity.
It's desirable to draw from a set of tail execution traces collected
over time when investigating tail latencies. #82750 introduced a
probabilistic mechanism to capture a single tail event for individual
stmts with bounded overhead (determined by the sampling probability,
trading off how long until a single capture is obtained). This PR
introduces a sql.stmt_diagnostics.collect_continuously_until_expired
cluster setting to collect captures continuously over some period of
time for aggregate analysis. Longer term we'd want:
- Controls over the maximum number of captures we'd want stored over
  some period of time;
- Eviction of older bundles, assuming they're less relevant, making room
  for newer captures.
To safeguard against misuse (in this current form we should only use it
for experiments or escalations under controlled environments), we only
act on this setting provided the diagnostics request has an expiration
timestamp and a specified probability, crude measures to prevent
unbounded growth.
To get some idea of how this can be used, consider the kinds of
experiments we're running as part of #75066. Specifically we have a
reproduction where we can observe spikes in latencies for foreground
traffic in the presence of concurrent backups (incremental/full). In an
experiment with incremental backups running every 10m, with full backups
running every 35m (`RECURRING '*/10 * * * *' FULL BACKUP '35 * * * *'`),
we observe latency spikes during overlap periods. With this cluster
setting we were able to set up trace captures over a 10h window to get a
set of representative outlier traces to investigate further.
We see that we captured just the set of bundles/traces we were interested in.
Release note: None