stmtdiagnostics: support continuous bundle collection #83020
Conversation
6ff8162 to 8c7d19d
yuzefovich
left a comment
Reviewed 8 of 8 files at r1, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei and @irfansharif)
pkg/sql/stmtdiagnostics/statement_diagnostics.go line 677 at r1 (raw file):
// there's a latency threshold);
// - Requests need to have an expiration set;
if (req.samplingProbability > 0 && req.samplingProbability < 1) &&
What's the reasoning for not allowing 1.0 probability?
pkg/sql/stmtdiagnostics/statement_diagnostics_test.go line 339 at r1 (raw file):
})
t.Run("continuous capture disabled without sampling probability", func(t *testing.T) {
I'm not sure that I follow what this test case is testing - given the name of the test case, I'd expect we'd enable continuous capture, run the same query twice, and assert that only one bundle was collected. Can you add some commentary here and for the next test case about the goals?
andreimatei
left a comment
Longer term we'd want:
Controls over the maximum number of captures we'd want stored over
some period of time;
Why introduce a cluster setting for this, as opposed to a per-diagnostics-request count of samples to store? I say let's go straight to that. To hedge against various columns in the diagnostics requests table not surviving the test of time, let's group some of these request params in a JSON column (I'd also put the sampling probability in there). I'm thinking a limit on the count of bundles to store, whereby the request is considered satisfied once this limit is reached, is good for now (i.e. no need for replacing "less interesting" samples with "more interesting" ones).
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @irfansharif)
irfansharif
left a comment
I fell asleep on this PR, but it's still something I want to see make it into 22.2, so I'll hold off on more invasive changes like those you're suggesting, Andrei (also I'm a bit lazy). This whole thing is a bit of a hack: we're able to dump all bundles into this table since we never introduced a unique FK constraint (I'm not sure that was the intent given how it's wired up in the UI). Speaking of the UI, when we do collect additional bundles for a single request, it's not currently equipped to show all of them, but that's something I'll also defer to others, including perhaps surfacing the sampling probability in the stmt diagnostics modal. This PR has a purely let-me-run-experiments agenda, perhaps under supervision on actual production clusters for latency-sensitive customers. To make this more featureful I would do what you're saying, like enforce some max limit instead of an expiration time -- as is, it can be a bit of a DDoS vector. But for now I feel this is ok, including as a backport to 22.2.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @yuzefovich)
pkg/sql/stmtdiagnostics/statement_diagnostics.go line 677 at r1 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
What's the reasoning for not allowing 1.0 probability?
No good reason, removed.
pkg/sql/stmtdiagnostics/statement_diagnostics_test.go line 339 at r1 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
I'm not sure that I follow what this test case is testing - given the name of the test case, I'd expect we'd enable continuous capture, run the same query twice, and assert that only one bundle was collected. Can you add some commentary here and for the next test case about the goals?
Done.
Code quote:
continuous capture disabled without sampling probability

8c7d19d to 4a6b57f
yuzefovich
left a comment
Reviewed 8 of 8 files at r2, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @irfansharif)
CI failure was fixed in #88613.
bors r+

Please give me a chance to review this.

Have at it.
bors r-

Canceled.
andreimatei
left a comment
Reviewed 1 of 8 files at r2.
Reviewable status:complete! 1 of 0 LGTMs obtained (waiting on @irfansharif)
pkg/sql/stmtdiagnostics/statement_diagnostics.go line 66 at r2 (raw file):
// today (best you can do is `cockroach statement-diag delete --all`). This
// setting also captures multiple bundles for a single diagnostic request which
// does not fit super well with our current scheme of
s/super well/well
But add more words saying how it doesn't fit. Say something about the completed field on the request, and about what the UI will show.
pkg/sql/stmtdiagnostics/statement_diagnostics.go line 68 at r2 (raw file):
retrievable using
cockroach statement-diag download <bundle-id>
They are retrievable like this or they're not?
pkg/sql/stmtdiagnostics/statement_diagnostics.go line 133 at r2 (raw file):
// NewRegistry constructs a new Registry.
func NewRegistry(
	ie sqlutil.InternalExecutor, db *kv.DB, st *cluster.Settings, knobs *TestingKnobs,
pass the knobs by value
pkg/sql/stmtdiagnostics/statement_diagnostics.go line 204 at r2 (raw file):
case <-pollIntervalChanged:
	continue // go back around and maybe reset the timer
case <-testingPollingCh:
this can be <-r.knobs.PollingCh; you don't need the conditional assignment of testingPollingCh
pkg/sql/stmtdiagnostics/statement_diagnostics.go line 305 at r2 (raw file):
	samplingProbability)
}
if minExecutionLatency.Nanoseconds() == 0 {
do you need .Nanoseconds() ?
pkg/sql/stmtdiagnostics/statement_diagnostics.go line 307 at r2 (raw file):
if minExecutionLatency.Nanoseconds() == 0 {
	return 0, errors.AssertionFailedf(
		"malformed input: got non-zero sampling probability %f and empty min exec latency",
which "input" is malformed? Is this really an assertion error (i.e. programming error)?
Similarly above.
pkg/sql/stmtdiagnostics/statement_diagnostics.go line 614 at r2 (raw file):
// capturing future traces, i.e. not mark this request as
// completed.
// - Requests need to be of the sampling sort (also implies
suggest why. Otherwise this comment just repeats the code.
pkg/sql/stmtdiagnostics/statement_diagnostics.go line 617 at r2 (raw file):
// there's a latency threshold);
// - Requests need to have an expiration set.
if req.samplingProbability > 0 && !req.expiresAt.IsZero() {
nit: shouldMarkCompleted = req.samplingProb == 0 || ...
pkg/sql/stmtdiagnostics/statement_diagnostics.go line 624 at r2 (raw file):
sessiondata.InternalExecutorOverride{User: username.RootUserName()},
"UPDATE system.statement_diagnostics_requests "+
	"SET completed = $1, statement_diagnostics_id = $2 WHERE id = $3",
so requests with completed == false can now have statement_diagnostics_id set? And namely it is set to the last bundle that was collected? Why is doing this necessary, as opposed to not updating the row at all when we're not marking it as completed?
What does the UI show for such a request?
pkg/sql/stmtdiagnostics/statement_diagnostics.go line 718 at r2 (raw file):
samplingProbability = float64(*prob)
if samplingProbability < 0 || samplingProbability > 1 {
	log.Warningf(ctx, "malformed sampling probability: %f (expected in range [0, 1]), resetting to 1.0",
include the request id in the message
pkg/sql/stmtdiagnostics/statement_diagnostics.go line 738 at r2 (raw file):
// TestingFindRequest exports findRequest for testing purposes.
func (r *Registry) TestingFindRequest(requestID RequestID) bool {
move to helpers_test
pkg/sql/stmtdiagnostics/statement_diagnostics.go line 748 at r2 (raw file):
// CollectUntilExpirationOverride lets tests intercept+override the value
// read from the collectUntilExpiration cluster setting.
CollectUntilExpirationOverride func() bool
Do you really need this? Can't the test set the cluster setting?
pkg/sql/stmtdiagnostics/statement_diagnostics.go line 751 at r2 (raw file):
// PollingCh lets tests directly induce registry-internal polling of statement
// requests.
PollingCh chan struct{}
nit: consider PollingC.
pkg/sql/stmtdiagnostics/statement_diagnostics.go line 751 at r2 (raw file):
// PollingCh lets tests directly induce registry-internal polling of statement
// requests.
PollingCh chan struct{}
instead of this, can the test just set sql.stmt_diagnostics.poll_interval?
pkg/sql/stmtdiagnostics/statement_diagnostics.go line 754 at r2 (raw file):
}

// ModuleTestingKnobs is part of the base.ModuleTestingKnobs interface.
nit: this is the only method in that interface, so the canonical comment is ModuleTestingKnobs implements base.ModuleTestingKnobs.
pkg/sql/stmtdiagnostics/statement_diagnostics_test.go line 43 at r2 (raw file):
defer log.Scope(t).Close(t)

mu := struct {
syncutil.AtomicBool ?
pkg/sql/stmtdiagnostics/statement_diagnostics_test.go line 335 at r2 (raw file):
samplingProbability, expiresAfter := 0.5, time.Second
_, err := registry.InsertRequestInternal(ctx, "SELECT pg_sleep(_)", samplingProbability, 0*time.Nanosecond, expiresAfter)
Comment the 0*time.Nanosecond arg pls.
Do you need the *time.Nanosecond?
pkg/sql/stmtdiagnostics/statement_diagnostics_test.go line 342 at r2 (raw file):
// We validate that continuous capture is disabled when a sampling
// probability of 0.0 is used. We know that it's disabled given the
// diagnostic request is marked as completed despite us not getting to
maybe spell it out that continuous capture doesn't mark the request as completed.
pkg/sql/stmtdiagnostics/statement_diagnostics_test.go line 365 at r2 (raw file):
t.Run("continuous capture disabled without expiration timestamp", func(t *testing.T) {
	samplingProbability, minExecutionLatency, expiresAfter := 0.999, time.Microsecond, 0*time.Hour
point to the previous test, which has some commentary.
pkg/sql/stmtdiagnostics/statement_diagnostics_test.go line 431 at r2 (raw file):
select {
case <-time.After(30 * time.Second):
	t.Fatalf("timed out")
say more about what
4a6b57f to 27cc345
irfansharif
left a comment
PTAL.
Reviewable status:
complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @andreimatei and @yuzefovich)
pkg/sql/stmtdiagnostics/statement_diagnostics.go line 66 at r2 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
s/super well/well
But add more words saying how it doesn't fit. Say something about the
completed field on the request, and about what the UI will show.
Done.
Code quote:
does not fit super well with our current scheme of

pkg/sql/stmtdiagnostics/statement_diagnostics.go line 68 at r2 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
retrievable using
cockroach statement-diag download <bundle-id>

They are retrievable like this or they're not?
They are, rephrased.
pkg/sql/stmtdiagnostics/statement_diagnostics.go line 133 at r2 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
pass the knobs by value
Done.
pkg/sql/stmtdiagnostics/statement_diagnostics.go line 204 at r2 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
this can be <-r.knobs.PollingCh; you don't need the conditional assignment of testingPollingCh
Done.
pkg/sql/stmtdiagnostics/statement_diagnostics.go line 305 at r2 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
do you need
.Nanoseconds()?
Done.
Code quote:
minExecutionLatency.Nanoseconds()

pkg/sql/stmtdiagnostics/statement_diagnostics.go line 307 at r2 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
which "input" is malformed? Is this really an assertion error (i.e. programming error)?
Similarly above.
There are a few code paths to get to this point, which is why the assertion sits here. I do think it's a programming error, or rather will be, if not appropriately sanitized in the UI. If you're hooking into this through the crdb_internal builtin, the error appears as is. I never remember what the right error type to use is, but I've downgraded it to the Newf kind and omitted the "malformed input" prefix.
pkg/sql/stmtdiagnostics/statement_diagnostics.go line 614 at r2 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
suggest why. Otherwise this comment just repeats the code.
Done.
pkg/sql/stmtdiagnostics/statement_diagnostics.go line 617 at r2 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
nit:
shouldMarkCompleted = req.samplingProb == 0 || ...
The current thing reads a bit more easily to me.
pkg/sql/stmtdiagnostics/statement_diagnostics.go line 624 at r2 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
so requests with completed == false can now have statement_diagnostics_id set? And namely it is set to the last bundle that was collected? Why is doing this necessary, as opposed to not updating the row at all when we're not marking it as completed?
What does the UI show for such a request?
Did it this way to make sure that if there was at least one bundle that was collected before the request completed/expired, it would be recorded. The UI ends up showing just the last bundle, and only once the request is completed. It shows the request as "Waiting" up until then.
pkg/sql/stmtdiagnostics/statement_diagnostics.go line 718 at r2 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
include the request id in the message
Done.
pkg/sql/stmtdiagnostics/statement_diagnostics.go line 738 at r2 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
move to
helpers_test
Done.
pkg/sql/stmtdiagnostics/statement_diagnostics.go line 748 at r2 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
Do you really need this? Can't the test set the cluster setting?
Very true, done.
pkg/sql/stmtdiagnostics/statement_diagnostics.go line 751 at r2 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
instead of this, can the test just set
sql.stmt_diagnostics.poll_interval?
Done (though it makes the test slightly slower).
pkg/sql/stmtdiagnostics/statement_diagnostics.go line 754 at r2 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
nit: this is the only method in that interface, so the canonical comment is
ModuleTestingKnobs implements base.ModuleTestingKnobs.
Done, but you need to start enforcing this canon on more PRs 😛
$ rg "ModuleTestingKnobs is part of the base.ModuleTestingKnobs interface." | wc -l
22
$ rg "ModuleTestingKnobs implements base.ModuleTestingKnobs" | wc -l
6
pkg/sql/stmtdiagnostics/statement_diagnostics_test.go line 43 at r2 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
syncutil.AtomicBool?
TIL, done.
pkg/sql/stmtdiagnostics/statement_diagnostics_test.go line 335 at r2 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
Comment the 0*time.Nanosecond arg pls.
Do you need the *time.Nanosecond?
Done and Done.
Code quote:
collectUntilExpirationOverride bool

pkg/sql/stmtdiagnostics/statement_diagnostics_test.go line 342 at r2 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
maybe spell it out that continuous capture doesn't mark the request as completed.
Done.
pkg/sql/stmtdiagnostics/statement_diagnostics_test.go line 365 at r2 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
point to the previous test, which has some commentary.
Done.
pkg/sql/stmtdiagnostics/statement_diagnostics_test.go line 431 at r2 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
say more about what
Done.
..until expiry. Informs cockroachdb#82896 (more specifically, this is a short-term alternative to the part pertaining to continuous tail capture). The issue has more background, but we repeat some below for posterity.

It's desirable to draw from a set of tail execution traces collected over time when investigating tail latencies. cockroachdb#82750 introduced a probabilistic mechanism to capture a single tail event for individual stmts with bounded overhead (determined by the sampling probability, trading off how long until a single capture is obtained). This PR introduces a sql.stmt_diagnostics.collect_continuously.enabled cluster setting to collect captures continuously over some period of time for aggregate analysis.

To get some idea of how this can be used, consider the kinds of experiments we're running as part of cockroachdb#75066. Specifically we have a reproduction where we can observe spikes in latencies for foreground traffic in the presence of concurrent backups (incremental/full). In an experiment with incremental backups running every 10m and full backups running every 35m (`RECURRING '*/10 * * * *' FULL BACKUP '35 * * * *'`), we observe latency spikes during overlap periods. With this cluster setting we were able to set up trace captures over a 10h window to get a set of representative outlier traces to investigate further.

    SELECT crdb_internal.request_statement_bundle(
      'INSERT INTO new_order(no_o_id, ...)', -- stmt fingerprint
      0.05,                                  -- 5% sampling probability
      '30ms'::INTERVAL,                      -- 30ms target (p99.9)
      '10h'::INTERVAL                        -- capture window
    );

    WITH histogram AS (
      SELECT extract('minute', collected_at) AS minute, count(*)
      FROM system.statement_diagnostics
      GROUP BY minute
    )
    SELECT minute, repeat('*', (30 * count/(max(count) OVER ()))::INT8) AS freq
    FROM histogram
    ORDER BY count DESC
    LIMIT 10;

      minute |             freq
    ---------+-------------------------------
          36 | ******************************
          38 | *********************
          35 | *********************
          00 | *********************
          37 | ********************
          30 | ********************
          40 | *****************
          20 | **************
          10 | *************
          50 | ***********
    (10 rows)

We see that we captured just the set of bundles/traces we were interested in. Longer term we'd want:
- Controls over the maximum number of captures we'd want stored over some period of time;
- Eviction of older bundles, assuming they're less relevant, making room for newer captures.

To safeguard against misuse (in this current form we should only use it for experiments or escalations under controlled environments), we only act on this setting provided the diagnostics request has an expiration timestamp and a specified probability, crude measures to prevent unbounded growth.

Release note: None
27cc345 to 0c9b96f
andreimatei
left a comment
Reviewable status:
complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @andreimatei and @yuzefovich)
andreimatei
left a comment
Reviewable status:
complete! 1 of 0 LGTMs obtained (and 1 stale) (waiting on @andreimatei and @yuzefovich)
thx dude
bors r+

Build failed (retrying...)

Build failed (retrying...)

Build succeeded
This commit introduces a backportable alternative to cockroachdb#82750 and cockroachdb#83020, to help solve for cockroachdb#82896. It gives a palatable alternative to sql.trace.stmt.enable_threshold. See those PRs/issues for verbose commentary, but roughly we can do the following: pick a stmt fingerprint, declare a sampling probability which controls when verbose tracing is enabled for it, and a latency threshold for which a trace is logged.

With a given stmt rate (say 1000/s) and a given percentile we're trying to capture (say p99.9), we have N stmt/s in the 99.9th percentile (1 in our example). We should be able to set a sampling probability P such that with high likelihood (>95%) we capture at least one trace over the next S seconds. The sampling rate lets you control how much overhead you're introducing for those statements in aggregate, which if dialed up higher lets you lower S. You might want to do such a thing for infrequently executed statements.

This commit does the simplest thing: ask for all input through shoddy cluster settings and just log traces to our usual files. Below we outline an example of how to use these settings to catch outlier executions for writes to the history table in TPC-C. When using it, we did not see an overall increase in p99s as a result of the sampling. It also took only about 10s to capture this data, showing clearly that it was due to latch waits.

    > SELECT encode(fingerprint_id, 'hex'),
             (statistics -> 'statistics' ->> 'cnt')::INT AS count,
             metadata ->> 'query' AS query
      FROM system.statement_statistics
      ORDER BY count DESC LIMIT 10;

           encode      | count | query
      -----------------+-------+----------------------------------------------------------------------------------------------------------------------
      ...
      4e4214880f87d799 | 2680  | INSERT INTO history(h_c_id, h_c_d_id, h_c_w_id, h_d_id, h_w_id, h_amount, h_date, h_data) VALUES ($1, $2, __more6__)

    > SET CLUSTER SETTING trace.fingerprint = '4e4214880f87d799';  -- fingerprint
    > SET CLUSTER SETTING trace.fingerprint.probability = 0.01;    -- 1% sampling probability
    > SET CLUSTER SETTING trace.fingerprint.threshold = '90ms';    -- latency threshold

     0.521ms    0.005ms    event:kv/kvserver/concurrency/concurrency_manager.go:260 [n1,s1,r105/1:/Table/109/1/{90/7/2…-105/10…}] acquiring latches
     0.532ms    0.011ms    event:kv/kvserver/spanlatch/manager.go:532 [n1,s1,r105/1:/Table/109/1/{90/7/2…-105/10…}] waiting to acquire read latch /Table/109/1/105/3/519/0@0,0, held by write latch /Table/109/1/105/3/519/0@1654849354.483984000,0
    99.838ms   99.306ms    event:kv/kvserver/concurrency/concurrency_manager.go:301 [n1,s1,r105/1:/Table/109/1/{90/7/2…-105/10…}] scanning lock table for conflicting locks
    99.851ms    0.013ms    event:kv/kvserver/replica_read.go:251 [n1,s1,r105/1:/Table/109/1/{90/7/2…-105/10…}] executing read-only batch

Compare this to sql.trace.stmt.enable_threshold, which enables verbose tracing for all statements with 100% probability. It introduces far too much overhead for it to be used in reasonable production clusters. The overhead also often masks the problems we're looking to capture.

Release note (general change): We introduced a trifecta of cluster settings to collect trace data for outlier executions with low overhead. This is only going to be available in 22.1; in 22.2 and beyond we have other mechanisms to collect outlier traces. Traces come in handy when looking to investigate latency spikes, and these three settings are intended to supplant most uses of sql.trace.stmt.enable_threshold. That setting enables verbose tracing for all statements with 100% probability, which can cause a lot of overhead in production clusters, and also a lot of logging pressure.

Instead we introduce the following:
- trace.fingerprint
- trace.fingerprint.probability
- trace.fingerprint.threshold

Put together (all have to be set), it only enables tracing for the statement with the set hex-encoded fingerprint, does so probabilistically (where the probability is whatever trace.fingerprint.probability is set to), logging it only if the latency threshold is exceeded (configured using trace.fingerprint.threshold). To obtain a hex-encoded fingerprint, consider looking at the contents of system.statement_statistics. For example:

    > SELECT encode(fingerprint_id, 'hex'),
             (statistics -> 'statistics' ->> 'cnt')::INT AS count,
             metadata ->> 'query' AS query
      FROM system.statement_statistics
      ORDER BY count DESC LIMIT 10;

           encode      | count | query
      -----------------+-------+----------------------------------------------------------------------------------------------------------------------
      ...
      4e4214880f87d799 | 2680  | INSERT INTO history(h_c_id, h_c_d_id, h_c_w_id, h_d_id, h_w_id, h_amount, h_date, h_data) VALUES ($1, $2, __more6__)
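The probability reasoning in the commit message above (choose P so that 1 - (1-P)^(N*S) >= 95%) can be sketched as a small Go helper. The function name and the example numbers are illustrative only, not part of the PR:

```go
package main

import (
	"fmt"
	"math"
)

// requiredSamplingProbability returns the per-statement sampling
// probability P needed so that, with outlierRate outlier statements per
// second and a window of windowSecs seconds, at least one outlier is
// traced with probability >= confidence. It solves
// 1 - (1-P)^(outlierRate*windowSecs) >= confidence for P.
func requiredSamplingProbability(outlierRate, windowSecs, confidence float64) float64 {
	trials := outlierRate * windowSecs
	return 1 - math.Pow(1-confidence, 1/trials)
}

func main() {
	// 1000 stmt/s at p99.9 => ~1 outlier stmt/s; capture one within 60s
	// with 95% likelihood.
	fmt.Printf("%.4f\n", requiredSamplingProbability(1, 60, 0.95)) // ~0.0487
}
```

Dialing the window S up lets you dial P (and hence the tracing overhead) down, which is the trade-off the commit message describes for infrequently executed statements.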
..until expiry. Informs #82896 (more specifically this is a short-term
alternative to the part pertaining to continuous tail capture). The
issue has more background, but we repeat some below for posterity.
It's desirable to draw from a set of tail execution traces collected
over time when investigating tail latencies. #82750 introduced a
probabilistic mechanism to capture a single tail event for individual
stmts with bounded overhead (determined by the sampling probability,
trading off how long until a single capture is obtained). This PR
introduces a sql.stmt_diagnostics.collect_continuously_until_expired
cluster setting to collect captures continuously over some period of
time for aggregate analysis. Longer term we'd want:
- Controls over the maximum number of captures we'd want stored over
  some period of time;
- Eviction of older bundles, assuming they're less relevant, making room
  for newer captures.
To safeguard against misuse (in this current form we should only use it
for experiments or escalations under controlled environments), we only
act on this setting provided the diagnostics request has an expiration
timestamp and a specified probability, crude measures to prevent
unbounded growth.
To get some idea of how this can be used, consider the kinds of
experiments we're running as part of #75066. Specifically we have a
reproduction where we can observe spikes in latencies for foreground
traffic in the presence of concurrent backups (incremental/full). In an
experiment with incremental backups running every 10m, with full backups
running every 35m (`RECURRING '*/10 * * * *' FULL BACKUP '35 * * * *'`),
we observe latency spikes during overlap periods. With this cluster
setting we were able to set up trace captures over a 10h window to get a
set of representative outlier traces to investigate further.
We see that we captured just the set of bundles/traces we were interested in.
Release note: None