kvserver: add cluster settings to log traces for slow lease transfers and splits

**Is your feature request related to a problem? Please describe.**

In an internal escalation (https://github.com/cockroachlabs/support/issues/1533) we attributed high tail latencies to KV-internal requests like lease transfers and splits taking a while. This presented as the following:

```
‹     0.974ms      0.024ms                    event:kv/kvserver/spanlatch/manager.go:517 [n250,s576,r625905/2:/Table/62/16/1498341{40…-63…}] waiting to acquire read latch /Table/62/16/14983414721{8-9}@1651156512.523480877,0, held by write latch /M{in-ax}@0,0›
‹   447.326ms    446.352ms                    event:kv/kvserver/concurrency/concurrency_manager.go:300 [n250,s576,r625905/2:/Table/62/16/1498341{40…-63…}] scanning lock table for conflicting locks›
```

Here we're latched behind the lease transfer request that declares a [non-MVCC latch over the entire keyspace](https://github.com/cockroachdb/cockroach/blob/044cc4493fd8e316015985a0790a74ee00d5167d/pkg/kv/kvserver/batcheval/cmd_lease_transfer.go#L55), i.e. `/M{in-ax}@0,0`. It's difficult however to go from this to an understanding of why the lease transfer request itself was taking that long? Was it waiting for inflight reads to complete? Was it something else?

The same is true for split requests (also declares a non-MVCC write latch over the RHS):

```
‹     1.121ms      0.030ms event:kv/kvserver/spanlatch/manager.go:517 [n220,s459,r625867/1:/Table/60/1{0/2022-…-1/2022-…}] waiting to acquire write latch /Table/60/10/2022-04-28T14:30:03.790531Z/17049428001/0@1651156203.790535407,0, held by write latch /Table/60/1{0/2022-04-28T08:04:21.873826Z/9576597821-1/2022-03-15T03:13:16.248485Z/85629368671}@0,0›
‹  1297.601ms   1296.480ms event:kv/kvserver/concurrency/concurrency_manager.go:300 [n220,s459,r625867/1:/Table/60/10/2022-04-28T0…] scanning lock table for conflicting locks›
‹  1298.162ms      0.561ms event:kv/kvserver/replica_send.go:120 [n220,s459,r625867/1:/Table/60/10/2022-04-28T0…] replica.Send got error: key range /Table/60/10/2022-04-28T14:30:03.790531Z/17049428001/0-/Table/60/10/2022-04-28T14:30:03.790531Z/17049428001/0/NULL outside of bounds of range /Table/60/10/2022-04-28T03:03:48.097657Z/19000424899-/Table/60/10/2022-04-28T08:04:21.873826Z/9576597821; suggested ranges: [desc: r625867:/Table/60/10/2022-04-28T0{3:03:48.097657Z/19000424899-8:04:21.873826Z/9576597821} [(n220,s459):1, (n254,s595):2, (n243,s551):3, (n232,s506):4, (n246,s562):5, next=6, gen=106966, sticky=1651115328.121799069,0], lease: repl=(n220,s459):1 seq=1980 start=1650759449.968454111,0 epo=3 pro=1650759449.967785279,0, closed_timestamp_policy: LAG_BY_CLUSTER_SETTING]›
```

Is the split waiting for long-running inflight reads to finish? Is stats evaluation taking a while?

**Describe the solution you'd like**

We have `sql.trace.{stmt,txn}.enable_threshold` to log instances where individual statements take longer than some threshold we're trying to filter out. We could do something similar here for internal requests (I'm specifically interested in lease transfers and splits). 

**Additional Context**

Arguably `sql.trace.{stmt,txn}.enable_threshold` aren't great given they apply to all statements/txns (we don't have something like https://github.com/cockroachdb/cockroach/issues/70614 to target specific tables/indexes, or a specific stmt fingerprint). Setting thresholds for splits/lease transfers in the same way doesn't seem that bad though given they're down in KV and (should) apply uniformly across ranges and finish in a timely manner (if they don't, say for slow stats computations, that's something we'd want to capture threshold).

+cc @cockroachdb/kv-notifications.

Jira issue: CRDB-15123

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

kvserver: add cluster settings to log traces for slow lease transfers and splits #81152

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

kvserver: add cluster settings to log traces for slow lease transfers and splits #81152

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions