tracing,*: always-on probabilistic tracing #90292
Description
Is your feature request related to a problem? Please describe.
It's difficult today to diagnose the causes of high-tail latency executions. It's doubly difficult when the high-tail events occurred in the past, since we don't do any form of always-on/continuous profiling. And because the primary latency metric chart we showcase includes executions against all databases/applications running in the tenant (including CRDB-internal ones), unless users have set up app-side monitoring it's difficult to know what specifically got slow -- just that something did. This is all a bit untenable.
Describe alternatives you've considered
With the work underneath #82896 we have manual mechanisms to probabilistically enable tracing for specific fingerprints (the probability is what controls the overhead) and to collect traces when execution exceeds some specified latency threshold. This lets you set up “traps” for specific fingerprints: subsequent executions are traced probabilistically (<1% works well for a frequently occurring statement, bounding the overhead), and traces that exceed the latency threshold are persisted continuously. Manual forms of this now exist through cluster settings in 22.1, and through builtins in 22.2. There are limitations:
- All traces that exceed the threshold are persisted. That’s no good; we should maintain some limit or rolling window -- during sustained periods of high latency we don’t want to hammer ourselves further by continuously collecting state and spamming logs. This is why this is currently so manual/only encouraged under L2 supervision.
- Traces are only collected after these “traps” are explicitly set, so we can only observe latency events from that point onward, not retroactively (i.e. when the incident actually occurred).
- A human is in the loop when selecting which statement to sample, at what probability, and with what latency threshold.
These are all things we can improve, and that's what this issue is for.
Describe the solution you'd like
We've developed machinery to automatically recognize which latencies are to be considered outliers for specific fingerprints over at #82473. We could combine this with probabilistic tracing to continuously record outlier executions. Trace collection here can't be unbounded, and the traces are perhaps something we'd consume as obs-service events now that we're planning on embedding it in CRDB (#88194). Enabling always-on probabilistic tracing for some set of frequently occurring statements (a mix of read-only, write-only, and read-write; segmented by app names?) would give us a ton of relevant coverage. In checklist form:
- Making sure we have a sane retention policy like "keep the last N traces" or "at most N traces over Y hours" -- anything better than "collect all traces that exceed the specified threshold", which can grow unboundedly.
- Plumbing into the outliers machinery to automatically figure out what latency threshold is interesting to capture ("3x the p99 over the last hour", so we capture the unexpected spikes). We can use any heuristic.
- Picking some small set of fingerprints to trace probabilistically ("the 5 most frequent stmt fingerprints in the last hour, where the stmt mix ideally includes (i) a read-only stmt, (ii) a write-only stmt, and (iii) a read-write stmt"). We can use any heuristic.
- Extending probabilistic tracing to entire txns, instead of just individual statements.
Additional context
See #90288, #80666, and the issues linked above. See this internal doc for more context around latency observability.
Jira issue: CRDB-20677