kv: operations on a range can stall during lease changes when under heavy load #32367
Description
When a range is under very heavy load and a lease change occurs, throughput on the range can stall for minutes at a time. It appears that requests pile up faster than the new leaseholder can work through them.
To Reproduce
The stall was observed while running the kv workload at high concurrency on large machines. The stall can be reliably reproduced with the following steps:
- Set up CockroachDB with 3 database nodes on large machines (I was using 64-core machines; the problem doesn't seem to reproduce on 4-core machines)
- Put load on the cluster using the following workload:
roachprod run ${CLUSTER}:4 -- ./workload run kv '{pgurl:1-3}' --init --read-percent=95 --concurrency=1024
- While the cluster is under load, execute the following SQL command to trigger a lease change:
ALTER TABLE kv.kv SCATTER;
- Watch throughput tank and the command hang
Expected behavior
The scatter should complete in a timely manner and the throughput should not be adversely affected for a long period of time.
Additional data / screenshots
The hypothesis is that when load is too high, by the time the newly elected leader catches up enough to recognize that it is the leader, there are too many outstanding requests for it to work through. Adding backoff in DistSender when a NotLeaseHolderError is encountered seems to resolve the problem. The small diff below has been verified to resolve the problem for the above reproduction steps.
diff --git a/pkg/kv/dist_sender.go b/pkg/kv/dist_sender.go
index 1ccf87b11a..8822551bdd 100644
--- a/pkg/kv/dist_sender.go
+++ b/pkg/kv/dist_sender.go
@@ -1324,6 +1324,7 @@ func (ds *DistSender) sendToReplicas(
}
br, err := transport.SendNext(ctx, ba)
+ r := retry.StartWithCtx(ctx, ds.rpcRetryOptions)
// This loop will retry operations that fail with errors that reflect
// per-replica state and may succeed on other replicas.
for {
@@ -1388,6 +1389,7 @@ func (ds *DistSender) sendToReplicas(
// via the NotLeaseHolderError or nil error paths, both of which update the
// leaseholder cache.
case *roachpb.NotLeaseHolderError:
+ r.Next()
ds.metrics.NotLeaseHolderErrCount.Inc(1)
if lh := tErr.LeaseHolder; lh != nil {
// Update the leaseholder cache. Naively this would also happen when the
@tbg, @nvanbenschoten mentioned you were making changes here recently and might want to take a look.
Environment:
- CockroachDB version v2.2.0-alpha.00000000-2316-ga25f97f
- Server OS: Ubuntu 16.04
- Client app: workload
- Server: GCE n1-standard-64 machines
Additional context
The problem was initially thought to relate to #22837, but this seems to be somewhat more specific.