
Provide E2E SQL and/or KV latency & error rate time-series metrics that indicate issues with CRDB rather than poor workload given cluster provisioning #71169

@joshimhoff

Description


TLDR: Skip to the "Describe the solution you'd like" section!!!

Is your feature request related to a problem? Please describe.
In case of our cloud product:

  • We want to page CRL eng including SRE in case of urgent & actionable issues with CRDB.
  • Corollary to above: We do NOT want to page CRL eng including SRE in case a customer writes a poor workload given how the cluster is provisioned (which in dedicated the customer controls).

The need for the above is clear in the case of our cloud product, but on-prem would benefit too. Being able to quickly tell issues with CRDB apart from issues with a customer workload makes DB eng, SRE, TSE, and customers more efficient.

Describe alternatives you've considered
The alternative is the current CC monitoring stack. We alert on various signals of definite trouble. For example:

  1. If the sqlprober can't SELECT a test row, page SRE.
  2. If the kvprober can't read & shadow write a key dedicated to probing on a "randomly" chosen range, page SRE.
  3. If proposals are "stuck in raft", that is, they don't get committed in a timely manner, page SRE.

This is fine. If the above signals indicate trouble, there is trouble, as they don't fire on contention or other workload issues (in the case of the probers, the workload is controlled by CRL eng and so is not the problem). That is, the false positive rate is low [1].

[1] Modulo that, until admission control rolls out, we will be paged when a dedicated cluster is out of CPU.

What's the problem then? The false negative rate is too high! This means we miss outages in production. It also means we can't be confident that, if no alerts are firing, any remaining problem is a workload issue. The latter possibly matters more to operational efficiency than the former, though both are important.

Examples of outages above alerting config might miss:

  1. Issues with the SQL table leasing machinery (unless issue happened to affect the tiny sqlprober tables).
  2. Badly shaped LSMs (unless so badly shaped that kvprober writes slowed down).
  3. kvserver deadlocks (could be caught by kvprober, but only slowly, given randomized range selection).
  4. Surely many more...

Describe the solution you'd like
What matters to users is that SQL is available and served at low latencies. Many, many different concrete production issues lead to high SQL latency or error rates. Why don't we alert on SQL latency & error rate then? Doing this would lead to a low false negative rate: if an issue actually matters to a customer, it translates into SQL being either unavailable or slow. Alerting on the symptoms the user cares about is great. Why alert on anything else? If we are not providing to users what they care about, of course we should page CRL eng including SRE!!!

The obvious issue is that doing this naively will lead to alerting on poor workload given how the cluster is provisioned, which will push ops load way way up & burn us all out. For example:

  1. If a cluster is out of CPU, latencies will shoot up.
  2. If a workload includes significant contention, latencies will shoot up.
  3. If a workload does lots of full table scans, latencies will shoot up.

The crux is: How can we measure E2E SQL latency & error rate via time-series metrics in a way that doesn't include the above workload issues? For now let's just consider latency.

For 1 & 2, we need to measure latency while excluding the portion of latency contributed by certain components:

  1. Don't include latency incurred sitting in admission control queues.
  2. Don't include latency incurred sitting in any of the KV concurrency machinery (lock table? etc.).

3 is harder! One idea is bucketing queries by their expected cost and setting different alerting thresholds per bucket (e.g. full table scans wouldn't have an SLA but single-row reads would). Another idea is normalizing the measured latency by expected cost, e.g. dividing each full table scan's latency by the number of rows read, or similar. These are half-ideas only. To the best of my knowledge, the bucketing approach has worked well for a Google time-series database. That database does not provide an interface as flexible, and thus as easy to foot-gun yourself with, as SQL. But I still contend that 3 is possible for CRDB too.

CC @JuanLeon1 @bdarnell @tbg @andreimatei @jordanlewis @jreut @udnay @sumeerbhola for thoughts!

Additional context
If anyone wants links to a CC-specific design doc about monitoring as extra context, please say so.

Jira issue: CRDB-10421

Epic CRDB-32145

Labels: A-monitoring, A-sql-observability (Related to observability of the SQL layer), C-enhancement (Solution expected to add code/behavior + preserve backward-compat; pg compat issues are an exception), O-sre (For issues SRE opened or otherwise cares about tracking), T-observability
