sql: surface KV contention statistics #55243
Description
A high priority for the KV team this release (21.1) is to make it easier for users to understand what is happening with their queries at the KV layer. Since the SQL Execution team is also responsible for query observability, a goal for 21.1 is to surface contention statistics provided by the KV team.
At a general level, we aim to surface these contention statistics in two ways:
- On a per-statement/transaction level accessed through the transactions page.
- On a cluster level, using a “contention heatmap”.
Design Proposal
This section outlines the work needed on the backend. It does not explicitly lay out what will happen on the frontend but takes as reference the customer needs outlined above to guide the work.
At a basic level, SQL Execution will have access to the following protobufs emitted after completion of a statement:
```go
type ContentionEvent struct {
	Key      roachpb.Key
	Txn      roachpb.Transaction // the "other" transaction, i.e. the transaction the current transaction is contending with
	Duration time.Duration       // time spent contending on the key against that txn in that one instance
	Outcome  enum                // pushed?
}
```
There is more detail about how this protobuf will be produced in #55583.
Per-statement
These contention events will be used to enrich existing EXPLAIN ANALYZE plans. Any operator that sends KV requests will intercept these contention events, keep track of the cumulative time spent contending, and output it as another execution stat. These contention seconds can be aggregated across a whole query much like other top-level stats. They can also provide the percentage of time that the query spends contending with other transactions, calculated as the ratio of cumulative contention time to cumulative total time (i.e. cumulative disk time + cumulative network time + cumulative execution time). This stat will be surfaced through EXPLAIN ANALYZE and the respective statement’s page.
The operator will also store the transaction IDs that the statement contended with. These will not be displayed in EXPLAIN ANALYZE, only surfaced to the statements page with enough information to display a list where each row links to the page of the transaction that was contending with that statement. TODO(asubiotto): It’s unclear at this moment how to tie a transaction ID back to a transaction page, since the transactions in the admin UI seem to be keyed by a hash of statement strings, but I haven’t looked into it too much.
Global View
Global contention statistics show the operator what the state of contention is in a cluster. Since each statement provides a small piece of this puzzle, the contention events will be sent back to the gateway as metadata and processed by the distsql receiver on that gateway, similar to what happens when the range cache is updated when a misplanned range is encountered.
Contention View
The contention view is primarily a time series metric that gives the operator the data to determine whether the cluster is suffering from contention or not, but no more. The goal of this view is to be a litmus test for whether further investigation is needed by looking at the contention map and specific statement contention data. The proposal is to surface the number of contended queries per second. This should be relatively simple to plumb down as a timeseries metric to the distsql receiver to be incremented if the query receives at least one contention event.
Contention Map
This map view is still pending design, but we can start work on the backend given the general idea. This “global” view should also be queryable via a virtual table. This kind of API is reminiscent of other features like SHOW QUERIES, where each server has a local view of its queries and there is a cluster-level API through the DB console as well as SQL shell to query the global state.
The proposal in this case is to keep an in-memory contention registry keyed by table ID/index ID pair that is updated by the DistSQL receiver when receiving contention events. When a StatusServer receives a request for a global contention view, it will broadcast a contention request to each node for its local contention view and merge the responses with its own local view.
Diving a bit deeper, what should this contention registry look like? It depends on the questions we want to answer. At a high level, we want to be able to answer which tables are experiencing contention and allow the user to dive deeper into a table to understand what key/row range is experiencing contention and which transactions are responsible for this contention. Therefore, the proposal is a top-level map keyed by a table ID with a value struct that looks something like:
```go
type tableContentionInfo struct {
	contentionEvents         uint64
	cumulativeContentionTime time.Duration
	orderedKeyContentionInfo []keyContentionInfo
}

type keyContentionInfo struct {
	key                    roachpb.Key
	contendingTransactions []roachpb.Transaction
}
```
This map cannot grow unboundedly, so an eviction policy needs to be put in place. As a simple starting point, we can keep track of the last contention event for a given table and use an LRU policy with a maximum size for the map. Within a given map entry, we can similarly use an LRU policy for keys.
Merging these structs should be relatively straightforward.
Observations and remaining questions
When talking about a global contention view, we’ve talked about showing contended spans, but contention events only produce keys. I’ll follow up with KV to better define what we’re going to do here; otherwise, I’m guessing we can slightly modify the contention event and have the table reader or cFetcher annotate the contention proto with the span that was passed to the fetcher and produced this contention event.
We also need to figure out how to use transaction IDs to get richer transaction information.
Work Items
- Add per-operator sum of contention time to EXPLAIN ANALYZE + testing cluster setting that generates mock contention events (rowexec/colexec: add cumulative contention time to EXPLAIN ANALYZE #56612).
- Add the sum of these as a calculation to TraceAnalyzer for further bubbling these up to the DB console (execstats: calculate total contention time #56797).
- Add a global contention view timeseries metric that the distSQLReceiver will increment if at least one contention event is observed (sql: add global contended queries per second timeseries metric #56798).
- Implement contention registry (sql: implement contention registry and expose through StatusServer and virtual table #57114).
- Hook up real contention data instead of the mock events (sql: replace mock roachpb.ContentionEvents with real ones #59639).
- Expose a global contention view through a virtual table (sql: implement contention registry and expose through StatusServer and virtual table #57114).
Out of scope
Getting information for currently hanging requests is out of scope. This will be exposed to users through a debug page by the KV team, so the information will be available. The proposal here is to surface the contention statistics the KV team will provide through execution metadata. High contention is usually a performance problem, not a stuck-queries problem, so it will be enough for users to be able to visualize the contention information for queries that have completed.