tracing,stmtdiagnostics: bounded overhead, continuous tail trace capture #82896
Description
Is your feature request related to a problem? Please describe.
It's difficult today to diagnose causes for high-tail executions. It's doubly difficult to do so when the high-tail events have occurred in the past as we don't do any form of always-on/continuous profiling.
Describe alternatives you've considered
Keep the status quo. We have two ways to get verbose traces for tail events today:
- sql.trace.stmt.enable_threshold, a somewhat blunt hammer that enables verbose tracing for every execution all the time. Because it's so coarse-grained, enabling it imposes a large performance penalty. It's likely that enabling the setting itself drives high-tail events, which is not what we want. Secondly, since it's not scoped to specific statements (it's possible for outlier events to only apply to specific statements, and for different statements to have different tail behaviours), it creates a very verbose haystack to sift through in CRDB logs to find the execution we're interested in. In previous escalations we've approached this setting somewhat timidly, manually switching it on and off at specific points in time in the hopes of capturing the right tail event.
- Specifying a minimum latency target when issuing a diagnostics bundle request. This is nice in that it's scoped to a specific fingerprint, but it has two limitations (exactly what this issue proposes we address):
  - When setting a latency target, it enables verbose tracing for all future executions until that target is met or the request expires. This can be expensive, especially if trying to capture a tail event for a frequently occurring statement.
  - It captures only the first event that meets the criteria (i.e. it's not continuous).
Describe the solution you'd like
As a first step, I want roughly the following ability:
Pick a stmt fingerprint, declare a sampling probability which controls
when verbose tracing is enabled for it, and a latency threshold above
which a trace is persisted/logged/etc. With a given stmt rate R (say
1000/s) and a given percentile I’m trying to capture (say p99.9), we
have 0.001R stmt/s in the 99.9th percentile (1/s in our example). We
should be able to set a sampling probability P such that with high
likelihood (>95%) we capture at least one trace over the next
S seconds. The sampling rate lets you control how much overhead you’re
introducing for those statements in aggregate; dialing it up
lets you lower S. You might want to do that for
infrequently executed statements.
This gives us a mechanism to get to bounded-overhead tail trace capture (bounded by probability and scope, targeting only a single stmt). A prototype of such a thing can be found here: #82750 (backportable form for 21.2, 22.1: #82864).
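To make the relationship between P, S and the capture likelihood concrete, here is a rough back-of-the-envelope sketch. It is not taken from the prototype. It assumes each tail execution is sampled independently with probability P, and it treats the expected number of tail executions in the window (tail fraction x R x S) as the actual count. The function and parameter names are made up for illustration.

```python
import math


def min_sampling_probability(stmt_rate, tail_fraction, window_secs, confidence=0.95):
    """Smallest P that captures at least one tail trace within window_secs
    with probability >= confidence (hypothetical helper, simplified model)."""
    # Expected number of tail executions in the window, e.g. 1000/s * 0.001 * S.
    tail_events = stmt_rate * tail_fraction * window_secs
    if tail_events <= 0:
        raise ValueError("expected number of tail events must be positive")
    # P(no capture) = (1 - P)^N must be <= 1 - confidence.
    return 1.0 - (1.0 - confidence) ** (1.0 / tail_events)


def capture_window_secs(stmt_rate, tail_fraction, sampling_probability, confidence=0.95):
    """Inverse view: how long S must be for a given P (hypothetical helper)."""
    if not 0.0 < sampling_probability <= 1.0:
        raise ValueError("sampling_probability must be in (0, 1]")
    tail_rate = stmt_rate * tail_fraction
    if sampling_probability == 1.0:
        # Every execution is traced, so a single tail event suffices.
        return 1.0 / tail_rate
    # Solve (1 - P)^N <= 1 - confidence for N, then convert N to seconds.
    needed_events = math.log(1.0 - confidence) / math.log(1.0 - sampling_probability)
    return needed_events / tail_rate


if __name__ == "__main__":
    p = min_sampling_probability(stmt_rate=1000, tail_fraction=0.001, window_secs=60)
    print(f"P for a 60s window: {p:.4f}")
    print(f"S for P=0.01: {capture_window_secs(1000, 0.001, 0.01):.0f}s")
```

Under these assumptions, a statement running at 1000/s needs only a few percent sampling to catch a p99.9 event within a minute, whereas an infrequently executed statement needs a much higher P (or a much longer S) for the same likelihood.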
For a second step, I'd like us to capture such outlier traces continuously, so that we have traces for high-tail events in the past (often what we want during escalations). For that I think we'd need to specify the maximum number of traces we want to capture over some period of time for a given fingerprint, plus some eviction policy so that we don't accrue garbage indefinitely and keep the more valuable traces around (probably just evict the oldest; all things equal, we probably only care about recent traces). A crude approximation of this (this = capture multiple bundles for a given request as long as the request is not expired) can be found here: #83020.
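As an illustration of the "bounded number of traces, evict the oldest" idea, here is a minimal in-memory sketch. It is not how #83020 or the stmtdiagnostics code works. The class name, method names and the in-memory storage are all assumptions, and it ignores the "over some period of time" dimension entirely.

```python
from collections import defaultdict, deque


class TailTraceStore:
    """Keeps at most max_traces_per_fingerprint traces per statement
    fingerprint, evicting the oldest first (hypothetical sketch)."""

    def __init__(self, max_traces_per_fingerprint):
        if max_traces_per_fingerprint < 1:
            raise ValueError("max_traces_per_fingerprint must be >= 1")
        self._max = max_traces_per_fingerprint
        # A deque with maxlen silently drops the oldest entry once full.
        self._traces = defaultdict(lambda: deque(maxlen=self._max))

    def record(self, fingerprint, latency_secs, threshold_secs, trace):
        """Persist the trace only if the execution exceeded the threshold.
        Returns True if the trace was kept."""
        if latency_secs < threshold_secs:
            return False
        self._traces[fingerprint].append((latency_secs, trace))
        return True

    def traces(self, fingerprint):
        """Retained (latency, trace) pairs for a fingerprint, oldest first."""
        return list(self._traces.get(fingerprint, ()))


if __name__ == "__main__":
    store = TailTraceStore(max_traces_per_fingerprint=2)
    store.record("SELECT _ FROM t", 0.2, 1.0, "fast, dropped")
    for i, latency in enumerate([1.5, 2.0, 3.2]):
        store.record("SELECT _ FROM t", latency, 1.0, f"trace-{i}")
    print(store.traces("SELECT _ FROM t"))
```

Evicting by age keeps the sketch simple. A policy that favours more valuable traces (for example, the slowest ones) would need a different structure, such as a heap keyed on latency.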
Additional context
Very closely relates to #80666.