tracing,*: always-on probabilistic tracing #90292
Description
Is your feature request related to a problem? Please describe.
It's difficult today to diagnose the causes of high-tail latency executions. It's doubly difficult when the high-tail events occurred in the past, since we don't do any form of always-on/continuous profiling. And because the primary latency metric chart we showcase includes executions against all databases/applications running in the tenant (including CRDB-internal ones), unless users have set up app-side monitoring it's difficult to know what specifically got slow -- just that something did. This is all a bit untenable.
Describe alternatives you've considered
With the work underneath #82896 we have manual mechanisms to probabilistically enable tracing for specific fingerprints (the probability is what controls the overhead) and to collect traces when execution exceeds some specified latency threshold. This lets you set up “traps” for specific fingerprints: subsequent executions are traced probabilistically (<1% works well for a frequently occurring statement, bounding the overhead), and traces that exceed the latency threshold are persisted continuously. Manual forms of this now exist through cluster settings in 22.1, and through builtins in 22.2. There are limitations:
- All traces that exceed the threshold are persisted. That’s no good; we should maintain some limit or rolling window -- during sustained periods of high latency we don’t want to hammer ourselves further by continuously collecting state and spamming logs. This is why this is currently so manual/only encouraged under L2 supervision.
- Traces are only collected after these “traps” are explicitly set, so we can only observe latency events from that point onward, not retroactively (i.e. when the incident actually occurred).
- A human is in the loop when selecting which statement to sample, at what probability, and with what latency threshold.
These are all things we can improve, and that's what this issue is for.
Describe the solution you'd like
We've developed machinery to automatically recognize which latencies are to be considered outliers for specific fingerprints over at #82473. We could combine this with probabilistic tracing to continuously record outlier executions. Trace collection here can't be unbounded, and the traces are perhaps something we'd consume as obs-service events now that we're planning on embedding it in CRDB (#88194). Enabling always-on probabilistic tracing for some set of frequently occurring statements (a mix of read-only, write-only, and read-write; segmented by app names?) would give us a ton of relevant coverage. In checklist form:
- Making sure we have a sane retention policy like "keep the last N traces" or "at most N traces over Y hours" -- anything better than "collect all traces that exceed the specified threshold", which can grow unboundedly.
- Plumbing into the outliers machinery to automatically figure out what latency threshold is interesting to capture ("3x the p99 over the last hour", so we capture the unexpected spikes). We can use any heuristic.
- Picking some small set of fingerprints to trace probabilistically ("the 5 most frequent stmt fingerprints in the last hour, where the stmt mix ideally includes (i) a read-only stmt, (ii) a write-only stmt, and (iii) a read-write stmt"). We can use any heuristic.
- Extending probabilistic tracing to entire txns, instead of just individual statements.
Additional context
See #90288, #80666, and the issues linked above. See this internal doc for more context around latency observability.
Jira issue: CRDB-20677