kvserver: add setting to use expiration-based leases #94457

craig[bot] merged 2 commits into cockroachdb:master
Conversation
andrewbaptist
left a comment
This is really neat!
I'm not sure about the cost of the TODO comment. Does this get called on every request to a range?
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker and @nvanbenschoten)
pkg/kv/kvserver/replica_range_lease.go line 1433 at r1 (raw file):
```go
r.requiresExpiringLeaseRLocked()
if isExpirationLease != shouldUseExpirationLease {
```

Is this logic correct? If the current lease is an expiration lease and the setting is unset (using epoch), this will return true, but I think that is incorrect. I might just be misreading this.
erikgrinaker
left a comment
I'm not sure about the cost of the TODO comment. Does this get called on every request to a range?
Yes, it's called on every request. I was running some rough benchmarks with an earlier prototype that called this during request processing and Raft ticks, and it showed up prominently on CPU profiles (about 5%). I'm going to run more benchmarks before merging the final version, so this is mostly a reminder to myself to see if it still shows up in profiles.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @andrewbaptist and @nvanbenschoten)
pkg/kv/kvserver/replica_range_lease.go line 1433 at r1 (raw file):
Previously, andrewbaptist (Andrew Baptist) wrote…
Is this logic correct? If the current lease is an expiration lease and the setting is unset (using epoch), this will return true, but I think that is incorrect. I might just be misreading this.
That's correct, this will cause the lease to be asynchronously renewed as an epoch lease when the setting is turned off. Otherwise, the lease will remain an expiration lease until the leaseholder fails to extend it, which may take a very long time. The intent is for the setting change to take effect across all ranges within a reasonable time (a minute or so).
erikgrinaker
left a comment
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @andrewbaptist and @nvanbenschoten)
pkg/kv/kvserver/replica_range_lease.go line 1433 at r1 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
That's correct, this will cause the lease to be asynchronously renewed as an epoch lease when the setting is turned off. Otherwise, the lease will remain an expiration lease until the leaseholder fails to extend it, which may take a very long time. The intent is for the setting change to take effect across all ranges within a reasonable time (a minute or so).
The other option (and maybe this is what you meant) is to allow the expiration lease to expire when the setting is turned off, and reacquire it as an epoch lease afterwards. However, this will affect request latencies, so we instead renew it asynchronously as soon as possible while the previous lease is still valid.
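The mismatch check being discussed can be sketched roughly as follows. This is a minimal illustration with hypothetical names (`LeaseType`, `leaseTypeMismatch`), not the actual CockroachDB code: the predicate fires whenever the current lease type differs from the type the setting calls for, and a mismatch in either direction triggers the asynchronous renewal described above while the previous lease is still valid.

```go
package main

import "fmt"

// LeaseType is an illustrative enum for the two lease kinds discussed above.
type LeaseType int

const (
	EpochLease LeaseType = iota
	ExpirationLease
)

// leaseTypeMismatch reports whether the lease should be switched. It returns
// true both when the setting asks for expiration leases while we hold an
// epoch lease, and in the case discussed above: the setting is turned off
// while we still hold an expiration lease, so we renew asynchronously as an
// epoch lease rather than waiting for the expiration lease to lapse.
func leaseTypeMismatch(current LeaseType, shouldUseExpiration bool) bool {
	isExpiration := current == ExpirationLease
	return isExpiration != shouldUseExpiration
}

func main() {
	// Setting unset (epoch desired) but the current lease is expiration-based:
	// mismatch, so renew asynchronously as an epoch lease.
	fmt.Println(leaseTypeMismatch(ExpirationLease, false)) // true
	fmt.Println(leaseTypeMismatch(EpochLease, false))      // false
}
```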
andrewbaptist
left a comment
Reviewed 4 of 4 files at r1, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten)
pkg/kv/kvserver/replica_range_lease.go line 1433 at r1 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
The other option (and maybe this is what you meant) is to allow the expiration lease to expire when the setting is turned off, and reacquire it as an epoch lease afterwards. However, this will affect request latencies, so we instead renew it asynchronously as soon as possible while the previous lease is still valid.
OK - I understand now, I originally thought that if this returned true it would renew as an expiration lease, but the caller of this checks what to do and handles this correctly. Thanks for the clarification.
Force-pushed cf32e6a to ea64a02
Seems like replicas aren't properly quiescing here. On an idle 3-node 8-core cluster with 100,000 ranges, we're only quiescing 1/3 of replicas, even though there are only about 500 valid expiration leases. This in turn causes the nodes to run at ~80% CPU when idle, just to tick and heartbeat the replicas.

I initially expected this to be because we can't quiesce ranges with invalid leases, and we didn't get around to quiescing them before the leases expired, but logspy showed that the majority of replicas failed to quiesce because there was outstanding replication quota. There were occasional lease requests, but otherwise no other proposals on these ranges, so it's still unclear why there's outstanding quota, but I'll look into it.

This is all to say that further work is required here regardless of which approach we choose: if we allow idle ranges to quiesce then we need to make sure they properly do so, and if we instead eagerly renew the leases then we need to optimize ticking/stepping of unquiesced ranges.
I think the outstanding proposal quota thing is likely a red herring, caused by this condition which will claim pending proposal quota on followers (where `proposalQuota == nil`): cockroach/pkg/kv/kvserver/replica_raft.go, lines 522 to 525 in 14ce324.

It seems like this condition should instead return false for followers: cockroach/pkg/kv/kvserver/replica_raft_quiesce.go, lines 321 to 326 in 14ce324.

However, given that we can't quiesce on followers, this shouldn't really affect quiescence either way. Submitted #94558 to address this.
Force-pushed ea64a02 to 00c67c5
The problem was indeed that we didn't allow quiescing ranges with invalid leases. I added a condition to allow quiescing with expired expiration-based leases, and we now properly quiesce idle ranges, such that the CPU usage on an idle cluster with 100,000 ranges is about the same with the setting enabled and disabled (8% vs 6%).
Force-pushed 00c67c5 to 91b624e
I ran some quick 3-node 8-CPU kv95 benchmarks with 192 concurrency to look at large-scale changes in behavior/performance.

At 10,000 ranges, enabling the setting had a significant impact on performance (~20% throughput reduction). This appeared to mostly be due to the additional writes needed for lease extensions, with a doubling of IOPS (2k to 4k per node) and write bandwidth (8 MB/s to 16 MB/s). At 1,000 ranges, however, the workload impact was negligible due to the lower lease extension load. So the additional cost appears to mostly be due to the lease extension writes when we allow idle ranges to quiesce. With a read-only workload, I could run 10,000 ranges without a significant impact to the workload, likely because we had more IO headroom.

We'll need to run some more rigorous benchmarks, but I think this shows that the approach here (without eager lease extension and allowing quiescence) is viable. This is compelling because it avoids having to optimize idle unquiesced ranges and come up with a more sophisticated scheme that can eagerly renew thousands of leases per second. Interested to get your thoughts on this before taking this further.
94558: kvserver: don't consider followers to have pending proposal quota r=erikgrinaker a=erikgrinaker

`Replica.hasPendingProposalQuotaRLocked()` considered a replica with `proposalQuota == nil` to have pending proposal quota, which in turn prevented quiescence. This condition is only possible on followers, which won't be able to quiesce anyway since we explicitly check for leadership. This can be misleading in vmodule logs, which claim "not quiescing: replication quota outstanding" on all followers.

This patch instead considers followers to have no pending proposal quota, thus vmodule logs will emit the more accurate "not quiescing: not leader". This may lead to higher CPU usage on followers, because we now fall through to a `raftStatusRLocked` check which is more expensive. This should be optimized separately.

Touches #94457.

Epic: none
Release note: None

Co-authored-by: Erik Grinaker <grinaker@cockroachlabs.com>
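The shape of the #94558 fix, as described in the message above, can be sketched like this. The types and field layout here are hypothetical simplifications (the real `proposalQuota` is a quota pool, not an integer): before the fix, a nil `proposalQuota` (only possible on followers) counted as "quota outstanding" and produced the misleading log line; after, followers report no pending quota.

```go
package main

import "fmt"

// replica is a hypothetical, stripped-down stand-in: only leaders maintain a
// proposal quota pool, so proposalQuota is nil on followers.
type replica struct {
	proposalQuota *int64
}

// hasPendingProposalQuota reflects the post-#94558 behavior described above:
// a follower (nil pool) has no pending quota, so quiescence checks fall
// through to the leadership check and log "not quiescing: not leader".
func hasPendingProposalQuota(r *replica) bool {
	if r.proposalQuota == nil {
		return false // follower: no quota pool, nothing outstanding
	}
	return *r.proposalQuota > 0
}

func main() {
	fmt.Println(hasPendingProposalQuota(&replica{})) // false: follower
	q := int64(5)
	fmt.Println(hasPendingProposalQuota(&replica{proposalQuota: &q})) // true: leader with outstanding quota
}
```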
@tbg points out that this will result in high latencies for table scans across idle ranges, e.g.
Force-pushed 91b624e to f12cee0
Force-pushed 31eb1c8 to e55a02d
Force-pushed 979b58e to dbd5e03
Force-pushed a4ddd4b to d28d4fa
Force-pushed cedffa7 to 5aabd50
Force-pushed 7bef504 to a0cabaa
Force-pushed a0cabaa to f8db1e9
andrewbaptist
left a comment
Just one comment related to the use of `requiresExpirationLeaseRLocked` vs `shouldUseExpirationLeaseRLocked` in two places.
Reviewed 7 of 10 files at r14.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @erikgrinaker and @nvanbenschoten)
pkg/kv/kvserver/replica_range_lease.go line 779 at r15 (raw file):
```go
// epoch-based leases, but may temporarily use expiration based leases during
// lease transfers.
func (r *Replica) requiresExpirationLeaseRLocked() bool {
```
It's a little hard to follow the cases where code should use `requiresExpirationLeaseRLocked` vs `shouldUseExpirationLeaseRLocked`. The only "real" difference seems to be in `leasePostApplyLocked`. For simpler usage, it might be better to rename this method to `requireActiveLeaseRenewal` and change the next method to `useExpirationLeaseRLocked`. The two other places where this method is used seem incorrect (`maybeCampaignOnWakeLocked` and `shouldCampaignOnRedirect`). It seems like we want expiration leases to campaign in both of these cases.
Expiration leases is the common term for these.

Epic: none
Release note: None
Reviewable status:
complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @andrewbaptist and @nvanbenschoten)
pkg/kv/kvserver/replica_range_lease.go line 779 at r15 (raw file):
Previously, andrewbaptist (Andrew Baptist) wrote…
This is a little hard to follow the cases where code should use `requiresExpirationLeaseRLocked` vs `shouldUseExpirationLeaseRLocked`. The only "real" difference seems to be in `leasePostApplyLocked`. For simpler usage, it might be better to rename this method to `requireActiveLeaseRenewal` and change the next method to `useExpirationLeaseRLocked`. The two other places where this method is used seem incorrect (`maybeCampaignOnWakeLocked` and `shouldCampaignOnRedirect`). It seems like we want expiration leases to campaign in both of these cases.
The campaign cases also need `requiresExpirationLeaseRLocked`; it's used to avoid a circular dependency on liveness here:

cockroach/pkg/kv/kvserver/replica_raft.go, lines 2000 to 2004 in 81a9f1d
cockroach/pkg/kv/kvserver/replica_raft.go, lines 2079 to 2089 in 81a9f1d

`requiresExpirationLeaseRLocked` specifies that a range can't, under any circumstances, use an epoch lease, because it causes a circular dependency on liveness and stuff breaks. It requires an expiration lease. I feel like that's clearer than `requireActiveLeaseRenewal` -- in particular because all leases require active renewal, but epoch leases do so via node heartbeats.
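To summarize the distinction being discussed, here's a minimal sketch with hypothetical types and names (not CockroachDB's actual code): "requires" covers ranges that can never use an epoch lease because that would create a circular dependency on the liveness range (per the PR's commit message, the meta and liveness ranges), while "should use" additionally honors the `kv.expiration_leases_only.enabled` setting.

```go
package main

import "fmt"

// rangeDesc is an illustrative stand-in for a range descriptor.
type rangeDesc struct {
	isMetaOrLivenessRange bool
}

// requiresExpirationLease: the range can't, under any circumstances, use an
// epoch lease, because epoch leases depend on the liveness range being
// available (circular dependency).
func requiresExpirationLease(d rangeDesc) bool {
	return d.isMetaOrLivenessRange
}

// shouldUseExpirationLease: the range either requires an expiration lease,
// or the cluster setting asks all ranges to use them.
func shouldUseExpirationLease(d rangeDesc, settingEnabled bool) bool {
	return requiresExpirationLease(d) || settingEnabled
}

func main() {
	meta := rangeDesc{isMetaOrLivenessRange: true}
	user := rangeDesc{}
	fmt.Println(shouldUseExpirationLease(meta, false)) // true: required regardless of setting
	fmt.Println(shouldUseExpirationLease(user, false)) // false: epoch lease is fine
	fmt.Println(shouldUseExpirationLease(user, true))  // true: setting enabled
}
```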
erikgrinaker
left a comment
Reviewable status:
complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @andrewbaptist and @nvanbenschoten)
pkg/kv/kvserver/replica_range_lease.go line 779 at r15 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…

The campaign cases also need `requiresExpirationLeaseRLocked`; it's used to avoid a circular dependency on liveness here:

cockroach/pkg/kv/kvserver/replica_raft.go, lines 2000 to 2004 in a64df6f
cockroach/pkg/kv/kvserver/replica_raft.go, lines 2079 to 2089 in 81a9f1d

`requiresExpirationLeaseRLocked` specifies that a range can't, under any circumstances, use an epoch lease, because it causes a circular dependency on liveness and stuff breaks. It requires an expiration lease. I feel like that's clearer than `requireActiveLeaseRenewal` -- in particular because all leases require active renewal, but epoch leases do so via node heartbeats.
Btw, the `leasePostApplyLocked` case for `requiresExpirationLeaseRLocked` actually goes away once we implement the lease scheduler. See e.g. 0593c45#diff-488a090afc4b6eaf56cd6d13b347bac67cb3313ce11c49df9ee8cd95fd73b3e8L385.
erikgrinaker
left a comment
Reviewable status:
complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @andrewbaptist and @nvanbenschoten)
pkg/kv/kvserver/replica_range_lease.go line 779 at r15 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
Btw, the `leasePostApplyLocked` case for `requiresExpirationLeaseRLocked` actually goes away once we implement the lease scheduler. See e.g. 0593c45#diff-488a090afc4b6eaf56cd6d13b347bac67cb3313ce11c49df9ee8cd95fd73b3e8L385.
Nevermind, I reread the campaign logic and you're right -- we should always bail out if the range should be using an expiration lease, since this logic is only relevant for epoch leases. I'll update the PR. Thanks for catching this.
I think I have a mild preference for just using the existing naming for now, and we'll remove `requiresExpirationLeaseRLocked` entirely when we implement the lease scheduler.
Agreed - the names are fine for now, maybe just add a TODO comment to rename/clean up later. Thanks!
Force-pushed f8db1e9 to 386e0f9
This patch adds the experimental cluster setting `kv.expiration_leases_only.enabled`, which uses expiration-based leases for all ranges. The setting is marked as experimental because, while we believe the system will function correctly, it has performance implications that need to be mapped out and optimized.

Expiration-based leases are compelling because they are much more robust than epoch-based leases, and better handle failure modes such as partial/asymmetric network partitions, disk stalls, deadlocks, etc. They require a Raft roundtrip to extend a lease, which ensures that the lease must be functional, while epoch leases only require an RPC request to the liveness leaseholder regardless of whether the lease actually works.

Except for the meta and liveness ranges, expiration leases are only extended when a request is processed in the last half of the lease interval (i.e. in the last 3 seconds of the 6 second lease). Otherwise, we allow the lease to expire. This reduces the cost of idle ranges in large clusters, since we avoid the periodic lease extension writes for every range, and can let the ranges quiesce as usual. However, it incurs a synchronous lease acquisition on the next request to the range.

Because expiration leases incur one Raft write per range per lease extension, as well as a lease acquisition for idle ranges, they currently come with significant performance overhead. In TPC-E experiments at 100,000 customers with various transaction types, p50 latencies increased 5-70%, p99 latencies increased 20-80%, and pMax latencies increased 0-1000%. A kv95 workload on 10,000 ranges with active leases showed a throughput reduction of about 10%, most likely due to the ~3,000 lease extension writes per second.

When the setting is changed, leases are asynchronously switched to the appropriate type (either expiration-based or epoch-based) via the replicate queue. This can take up to the `ScanInterval` to complete, 10 minutes by default (more on nodes with large numbers of ranges).

Epic: none
Release note: None
Force-pushed 386e0f9 to 04f4284
erikgrinaker
left a comment
Reviewable status:
complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @andrewbaptist and @nvanbenschoten)
pkg/kv/kvserver/replica_range_lease.go line 779 at r15 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
Nevermind, I reread the campaign logic and you're right -- we should always bail out if the range should be using an expiration lease, since this logic is only relevant for epoch leases. I'll update the PR. Thanks for catching this.
I think I have a mild preference for just using the existing naming for now, and we'll remove `requiresExpirationLeaseRLocked` entirely when we implement the lease scheduler.
I realized that we do want to check `requiresExpirationLease` in `shouldCampaignOnWake`, because it allows us to more quickly detect a dead leader and call an election when unquiescing a range that uses an expiration-based lease. But I changed `shouldCampaignOnLeaseRequestRedirect` to use `shouldUseExpirationLease`, because the logic only applies to epoch leases. I've updated the code and comments, PTAL.
andrewbaptist
left a comment
Reviewable status:
complete! 1 of 0 LGTMs obtained (waiting on @nvanbenschoten)
Thanks for the quick turnaround on the changes! I'm glad we are getting this in.

TFTR!

bors r+

Build failed:

bors retry

Build failed:

bors retry

Build succeeded:
kvserver: rename `Replica.requiresExpiringLease()` to `requiresExpirationLease()`

Expiration leases is the common term for these.

Epic: none
Release note: None
kvserver: add setting to use expiration-based leases

This patch adds the experimental cluster setting `kv.expiration_leases_only.enabled`, which uses expiration-based leases for all ranges. The setting is marked as experimental because, while we believe the system will function correctly, it has performance implications that need to be mapped out and optimized.

Expiration-based leases are compelling because they are much more robust than epoch-based leases, and better handle failure modes such as partial/asymmetric network partitions, disk stalls, deadlocks, etc. They require a Raft roundtrip to extend a lease, which ensures that the lease must be functional, while epoch leases only require an RPC request to the liveness leaseholder regardless of whether the lease actually works.

Except for the meta and liveness ranges, expiration leases are only extended when a request is processed in the last half of the lease interval (i.e. in the last 3 seconds of the 6 second lease). Otherwise, we allow the lease to expire. This reduces the cost of idle ranges in large clusters, since we avoid the periodic lease extension writes for every range, and can let the ranges quiesce as usual. However, it incurs a synchronous lease acquisition on the next request to the range.

Because expiration leases incur one Raft write per range per lease extension, as well as a lease acquisition for idle ranges, they currently come with significant performance overhead. In TPC-E experiments at 100,000 customers with various transaction types, p50 latencies increased 5-70%, p99 latencies increased 20-80%, and pMax latencies increased 0-1000%. A kv95 workload on 10,000 ranges with active leases showed a throughput reduction of about 10%, most likely due to the ~3,000 lease extension writes per second.

When the setting is changed, leases are asynchronously switched to the appropriate type (either expiration-based or epoch-based) via the replicate queue. This can take up to the `ScanInterval` to complete, 10 minutes by default (more on nodes with large numbers of ranges).

Resolves #93903.

Epic: none
Release note: None