[prototype] sql,tracing: per-stmt probabilistic trace sampling#82864
Closed
irfansharif wants to merge 1 commit intocockroachdb:masterfrom
Closed
[prototype] sql,tracing: per-stmt probabilistic trace sampling#82864irfansharif wants to merge 1 commit intocockroachdb:masterfrom
irfansharif wants to merge 1 commit intocockroachdb:masterfrom
Conversation
Prototyping something that could help cockroachdb#80666. I'm hoping we can get to the following ability: Pick a stmt fingerprint, declare a sampling probability which controls when verbose tracing is enabled for it, and a latency threshold for which a trace is persisted/logged/etc. Maybe even a count of how many traces I actually want recorded? Some of these controls could be in the UI, or traces made visible in the UI, but I’d happily take a set of dubious cluster settings or table writes to install these to-be-traced-with-some-probability fingerprints, or just vanilla log outputs. With a given stmt rate (say 1000/s) and a given percentile I’m trying to capture (say p99.9), we have 1 stmt/s in the 99.9th percentile. I need to make this math less handwavey, but I should be able to set a sampling probability P such that with high likelihood (>95%) we capture at least one trace captured over the next S seconds. The sampling rate lets you control how the overhead you’re introducing for those statements in aggregate, which if dialed up higher lets you lower S. You might want to do such a thing for infrequently executed statements. Still not sure how to figure out what statement to trace (we don't capture per-stmt %-iles -- too expensive? Maybe we will in some form after cockroachdb#82473?). And also not sure what we can do better than just logging with captured traces (stash them under /tracez somehow?) or in some system table capped in # of rows + with row-level TTL to evict captures + make room for new, more recent captures? Anyway, this first stab does the simplest thing: ask for all input through shoddy cluster settings and just log stuff. A quick and dirty attempt to try and catch an outlier execution for the history table write in TPC-C. Didn't see an overall increase in p99s as a result of the sampling + took about 10s to capture this, showing it was due to latch waits. > SELECT encode(fingerprint_id, 'hex'), (statistics -> 'statistics' ->> 'cnt')::INT AS count, metadata ->> 'query' AS query FROM system.statement_statistics ORDER BY COUNT DESC limit 10; encode | count | query -----------------+-------+-------------------------------------------------------------------------------------------------------------------- ... 4e4214880f87d799 | 2680 | INSERT INTO history(h_c_id, h_c_d_id, h_c_w_id, h_d_id, h_w_id, h_amount, h_date, h_data) VALUES ($1, $2, __more6__) > SET CLUSTER SETTING trace.fingerprint = '4e4214880f87d799'; > SET CLUSTER SETTING trace.fingerprint.probability = 0.01; > SET CLUSTER SETTING trace.fingerprint.threshold = '90ms'; 0.521ms 0.005ms event:kv/kvserver/concurrency/concurrency_manager.go:260 [n1,s1,r105/1:/Table/109/1/{90/7/2…-105/10…}] acquiring latches› 0.532ms 0.011ms event:kv/kvserver/spanlatch/manager.go:532 [n1,s1,r105/1:/Table/109/1/{90/7/2…-105/10…}] waiting to acquire read latch /Table/109/1/105/3/519/0@0,0, held by write latch /Table/109/1/105/3/519/0@1654849354.483984000,0› 99.838ms 99.306ms event:kv/kvserver/concurrency/concurrency_manager.go:301 [n1,s1,r105/1:/Table/109/1/{90/7/2…-105/10…}] scanning lock table for conflicting locks› 99.851ms 0.013ms event:kv/kvserver/replica_read.go:251 [n1,s1,r105/1:/Table/109/1/{90/7/2…-105/10…}] executing read-only batch› Release note: None
Member
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
This is intended to be a backportable form of #82750 for 21.2 & 22.1.