
kvserver: add setting to use expiration-based leases #94457

Merged
craig[bot] merged 2 commits into cockroachdb:master from erikgrinaker:expiration-lease-setting
Mar 17, 2023

Conversation

@erikgrinaker
Contributor

@erikgrinaker erikgrinaker commented Dec 30, 2022

kvserver: rename Replica.requiresExpiringLease() to requiresExpirationLease()

Expiration leases is the common term for these.

Epic: none
Release note: None

kvserver: add setting to use expiration-based leases

This patch adds the experimental cluster setting kv.expiration_leases_only.enabled, which uses expiration-based leases for all ranges. The setting is marked as experimental because, while we believe the system will function correctly, it has performance implications that need to be mapped out and optimized.

Expiration-based leases are compelling because they are much more robust than epoch-based leases, and better handle failure modes such as partial/asymmetric network partitions, disk stalls, deadlocks, etc. They require a Raft roundtrip to extend a lease, which ensures that the lease must be functional, while epoch leases only require an RPC request to the liveness leaseholder regardless of whether the lease actually works.

Except for the meta and liveness ranges, expiration leases are only extended when a request is processed in the last half of the lease interval (i.e. in the last 3 seconds of the 6 second lease). Otherwise, we allow the lease to expire. This reduces the cost of idle ranges in large clusters, since we avoid the periodic lease extension writes for every range, and can let the ranges quiesce as usual. However, it incurs a synchronous lease acquisition on the next request to the range.

Because expiration leases incur one Raft write per range per lease extension, as well as a lease acquisition for idle ranges, they currently come with significant performance overhead. In TPC-E experiments at 100,000 customers with various transaction types, p50 latencies increased 5-70%, p99 latencies increased 20-80%, and pMax latencies increased 0-1000%. A kv95 workload on 10,000 ranges with active leases showed a throughput reduction of about 10%, most likely due to the ~3,000 lease extension writes per second.

When the setting is changed, leases are asynchronously switched to the appropriate type (either expiration-based or epoch-based) via the replicate queue. This can take up to the ScanInterval to complete, 10 minutes by default (more on nodes with large numbers of ranges).

Resolves #93903.

Epic: none
Release note: None

@erikgrinaker erikgrinaker self-assigned this Dec 30, 2022
@cockroach-teamcity
Member

This change is Reviewable


@andrewbaptist andrewbaptist left a comment


This is really neat!

I'm not sure about the cost of the TODO comment. Does this get called on every request to a range?

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker and @nvanbenschoten)


pkg/kv/kvserver/replica_range_lease.go line 1433 at r1 (raw file):

		r.requiresExpiringLeaseRLocked()

	if isExpirationLease != shouldUseExpirationLease {

Is this logic correct? If the current lease is an expiration lease and the setting is unset (using epoch leases), this will return true, but I think that is incorrect. I might just be misreading this.

Contributor Author

@erikgrinaker erikgrinaker left a comment


I'm not sure about the cost of the TODO comment. Does this get called on every request to a range?

Yes, it's called on every request. I was running some rough benchmarks with an earlier prototype that called this during request processing and Raft ticks, and it showed up prominently on CPU profiles (about 5%). I'm going to run more benchmarks before merging the final version, so this is mostly a reminder to myself to see if it still shows up in profiles.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andrewbaptist and @nvanbenschoten)


pkg/kv/kvserver/replica_range_lease.go line 1433 at r1 (raw file):

Previously, andrewbaptist (Andrew Baptist) wrote…

Is this logic correct? If the current lease is an expiration lease and the setting is unset (using epoch leases), this will return true, but I think that is incorrect. I might just be misreading this.

That's correct: this will cause the lease to be asynchronously renewed as an epoch lease when the setting is turned off. Otherwise, the lease would remain an expiration lease until the leaseholder fails to extend it, which may take a very long time. The intent is for the setting change to take effect across all ranges within a reasonable time (a minute or so).

Contributor Author

@erikgrinaker erikgrinaker left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andrewbaptist and @nvanbenschoten)


pkg/kv/kvserver/replica_range_lease.go line 1433 at r1 (raw file):

Previously, erikgrinaker (Erik Grinaker) wrote…

That's correct: this will cause the lease to be asynchronously renewed as an epoch lease when the setting is turned off. Otherwise, the lease would remain an expiration lease until the leaseholder fails to extend it, which may take a very long time. The intent is for the setting change to take effect across all ranges within a reasonable time (a minute or so).

The other option (and maybe this is what you meant) is to allow the expiration lease to expire when the setting is turned off, and reacquire it as an epoch lease afterwards. However, this will affect request latencies, so we instead renew it asynchronously as soon as possible while the previous lease is still valid.


@andrewbaptist andrewbaptist left a comment


Reviewed 4 of 4 files at r1, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten)


pkg/kv/kvserver/replica_range_lease.go line 1433 at r1 (raw file):

Previously, erikgrinaker (Erik Grinaker) wrote…

The other option (and maybe this is what you meant) is to allow the expiration lease to expire when the setting is turned off, and reacquire it as an epoch lease afterwards. However, this will affect request latencies, so we instead renew it asynchronously as soon as possible while the previous lease is still valid.

OK, I understand now. I originally thought that if this returned true it would renew as an expiration lease, but the caller of this checks what to do and handles it correctly. Thanks for the clarification.

@erikgrinaker erikgrinaker force-pushed the expiration-lease-setting branch from cf32e6a to ea64a02 on December 31, 2022 11:24
@erikgrinaker
Contributor Author

Seems like replicas aren't properly quiescing here. On an idle 3-node 8-core cluster with 100,000 ranges, we're only quiescing 1/3 of replicas, even though there are only about 500 valid expiration leases. This in turn causes the nodes to run at ~80% CPU when idle, just to tick and heartbeat the replicas.

I initially expected this to be because we can't quiesce ranges with invalid leases, and we didn't get around to quiescing them before the leases expired, but logspy showed that the majority of replicas failed to quiesce because there was outstanding replication quota. There were occasional lease requests, but otherwise no other proposals on these ranges, so it's still unclear why there's outstanding quota, but I'll look into it.

This is all to say that further work is required here regardless of which approach we choose: if we allow idle ranges to quiesce then we need to make sure they properly do so, and if we instead eagerly renew the leases then we need to optimize ticking/stepping of unquiesced ranges.

@erikgrinaker
Contributor Author

erikgrinaker commented Dec 31, 2022

I think the outstanding proposal quota thing is likely a red herring, caused by this condition, which claims pending proposal quota on followers (where proposalQuota is nil):

if r.mu.proposalQuota == nil {
	return true
}
return !r.mu.proposalQuota.Full()

It seems like this condition should return false instead. We have another condition which checks that the replica is the leader:

if status.SoftState.RaftState != raft.StateLeader {
	if log.V(4) {
		log.Infof(ctx, "not quiescing: not leader")
	}
	return nil, nil, false
}

However, given that we can't quiesce on followers, this shouldn't really affect quiescence either way.

Submitted #94558 to address this.

@erikgrinaker
Contributor Author

The problem was indeed that we didn't allow quiescing ranges with invalid leases. I added a condition to allow quiescing with expired expiration-based leases, and we now properly quiesce idle ranges, such that the CPU usage on an idle cluster with 100,000 ranges is about the same with the setting enabled and disabled (8% vs 6%).

@erikgrinaker erikgrinaker force-pushed the expiration-lease-setting branch from 00c67c5 to 91b624e on January 1, 2023 15:16
@erikgrinaker
Contributor Author

I ran some quick 3-node 8-CPU kv95 benchmarks with 192 concurrency to look at large-scale changes in behavior/performance. At 10,000 ranges, enabling the setting had a significant impact on performance (~20% throughput reduction). This appeared to mostly be due to the additional writes needed for lease extensions, with a doubling of IOPS (2k to 4k per node) and write bandwidth (8 MB/s to 16 MB/s). At 1,000 ranges, however, the workload impact was negligible due to the lower lease extension load. So the additional cost appears to mostly come from the lease extension writes when we allow idle ranges to quiesce. With a read-only workload, I could run 10,000 ranges without a significant impact to the workload, likely because we had more IO headroom.

We'll need to run some more rigorous benchmarks, but I think this shows that the approach here (without eager lease extension and allowing quiescence) is viable. This is compelling because it avoids having to optimize idle unquiesced ranges and come up with a more sophisticated scheme that can eagerly renew thousands of leases per second. Interested to get your thoughts on this before taking this further.

craig bot pushed a commit that referenced this pull request Jan 3, 2023
94558: kvserver: don't consider followers to have pending proposal quota r=erikgrinaker a=erikgrinaker

`Replica.hasPendingProposalQuotaRLocked()` considered a replica with `proposalQuota == nil` to have pending proposal quota, which in turn prevented quiescence. This condition is only possible on followers, which won't be able to quiesce anyway since we explicitly check for leadership. This can be misleading in vmodule logs, which claim "not quiescing: replication quota outstanding" on all followers.

This patch instead considers followers to have no pending proposal quota, thus vmodule logs will emit the more accurate "not quiescing: not leader".

This may lead to higher CPU usage on followers, because we now fall through to a `raftStatusRLocked` check which is more expensive. This should be optimized separately.

Touches #94457.

Epic: none
Release note: None

Co-authored-by: Erik Grinaker <grinaker@cockroachlabs.com>
@erikgrinaker
Contributor Author

erikgrinaker commented Jan 5, 2023

@tbg points out that this will result in high latencies for table scans across idle ranges, e.g. count(*), because we're acquiring the leases sequentially. A quick test with 50k empty ranges took 4 seconds with active epoch leases, and 43 seconds with expired expiration leases.

@erikgrinaker erikgrinaker force-pushed the expiration-lease-setting branch from 91b624e to f12cee0 on January 5, 2023 12:13
@erikgrinaker erikgrinaker force-pushed the expiration-lease-setting branch from 31eb1c8 to e55a02d on January 15, 2023 16:01
@erikgrinaker erikgrinaker force-pushed the expiration-lease-setting branch 3 times, most recently from 979b58e to dbd5e03 on February 1, 2023 11:57
@erikgrinaker erikgrinaker force-pushed the expiration-lease-setting branch 2 times, most recently from a4ddd4b to d28d4fa on February 7, 2023 09:57
@erikgrinaker erikgrinaker force-pushed the expiration-lease-setting branch 4 times, most recently from cedffa7 to 5aabd50 on February 18, 2023 13:06
@erikgrinaker erikgrinaker marked this pull request as ready for review February 18, 2023 13:06

@andrewbaptist andrewbaptist left a comment


Just one comment, related to the use of requiresExpirationLeaseRLocked vs shouldUseExpirationLeaseRLocked in two places.

Reviewed 7 of 10 files at r14.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @erikgrinaker and @nvanbenschoten)


pkg/kv/kvserver/replica_range_lease.go line 779 at r15 (raw file):

// epoch-based leases, but may temporarily use expiration based leases during
// lease transfers.
func (r *Replica) requiresExpirationLeaseRLocked() bool {

It's a little hard to follow in which cases code should use requiresExpirationLeaseRLocked vs shouldUseExpirationLeaseRLocked. The only "real" difference seems to be in leasePostApplyLocked. For simpler usage, it might be better to rename this method to requireActiveLeaseRenewal and change the next method to useExpirationLeaseRLocked. The two other places where this method is used seem incorrect (maybeCampaignOnWakeLocked and shouldCampaignOnRedirect). It seems like we want expiration leases to campaign in both of these cases.

Contributor Author

@erikgrinaker erikgrinaker left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @andrewbaptist and @nvanbenschoten)


pkg/kv/kvserver/replica_range_lease.go line 779 at r15 (raw file):

Previously, andrewbaptist (Andrew Baptist) wrote…

It's a little hard to follow in which cases code should use requiresExpirationLeaseRLocked vs shouldUseExpirationLeaseRLocked. The only "real" difference seems to be in leasePostApplyLocked. For simpler usage, it might be better to rename this method to requireActiveLeaseRenewal and change the next method to useExpirationLeaseRLocked. The two other places where this method is used seem incorrect (maybeCampaignOnWakeLocked and shouldCampaignOnRedirect). It seems like we want expiration leases to campaign in both of these cases.

The campaign cases also need requiresExpirationLeaseRLocked, it's used to avoid a circular dependency on liveness here:

// Avoid a circular dependency on liveness and skip the is leader alive
// check for ranges that always use expiration based leases.
if requiresExpiringLease {
	return false
}

// Avoid a circular dependency on liveness and skip the is leader alive check
// for ranges that always use expiration based leases. These ranges don't need
// to campaign based on liveness state because there can never be a case where
// a node can retain Raft leadership but still be unable to acquire the lease.
// This is possible on ranges that use epoch-based leases because the Raft
// leader may be partitioned from the liveness range.
// See TestRequestsOnFollowerWithNonLiveLeaseholder for an example of a test
// that demonstrates this case.
if requiresExpiringLease {
	return false
}

requiresExpirationLeaseRLocked specifies that a range can't, under any circumstances, use an epoch lease because it causes a circular dependency on liveness and stuff breaks. It requires an expiration lease. I feel like that's clearer than requireActiveLeaseRenewal -- in particular because all leases require active renewal, but epoch leases do so via node heartbeats.

Contributor Author

@erikgrinaker erikgrinaker left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @andrewbaptist and @nvanbenschoten)


pkg/kv/kvserver/replica_range_lease.go line 779 at r15 (raw file):

Previously, erikgrinaker (Erik Grinaker) wrote…

The campaign cases also need requiresExpirationLeaseRLocked, it's used to avoid a circular dependency on liveness here:

// Avoid a circular dependency on liveness and skip the is leader alive
// check for ranges that always use expiration based leases.
if requiresExpirationLease {
	return false
}

// Avoid a circular dependency on liveness and skip the is leader alive check
// for ranges that always use expiration based leases. These ranges don't need
// to campaign based on liveness state because there can never be a case where
// a node can retain Raft leadership but still be unable to acquire the lease.
// This is possible on ranges that use epoch-based leases because the Raft
// leader may be partitioned from the liveness range.
// See TestRequestsOnFollowerWithNonLiveLeaseholder for an example of a test
// that demonstrates this case.
if requiresExpiringLease {
	return false
}

requiresExpirationLeaseRLocked specifies that a range can't, under any circumstances, use an epoch lease because it causes a circular dependency on liveness and stuff breaks. It requires an expiration lease. I feel like that's clearer than requireActiveLeaseRenewal -- in particular because all leases require active renewal, but epoch leases do so via node heartbeats.

Btw, the leasePostApplyLocked case for requiresExpirationLeaseRLocked actually goes away once we implement the lease scheduler. See e.g. 0593c45#diff-488a090afc4b6eaf56cd6d13b347bac67cb3313ce11c49df9ee8cd95fd73b3e8L385.

Contributor Author

@erikgrinaker erikgrinaker left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @andrewbaptist and @nvanbenschoten)


pkg/kv/kvserver/replica_range_lease.go line 779 at r15 (raw file):

Previously, erikgrinaker (Erik Grinaker) wrote…

Btw, the leasePostApplyLocked case for requiresExpirationLeaseRLocked actually goes away once we implement the lease scheduler. See e.g. 0593c45#diff-488a090afc4b6eaf56cd6d13b347bac67cb3313ce11c49df9ee8cd95fd73b3e8L385.

Nevermind, I reread the campaign logic and you're right -- we should always bail out if the range should be using an expiration lease, since this logic is only relevant for epoch leases. I'll update the PR. Thanks for catching this.

I think I have a mild preference for just using the existing naming for now, and we'll remove requiresExpirationLeaseRLocked entirely when we implement the lease scheduler.

@andrewbaptist

Agreed - the names are fine for now, maybe just add a TODO comment to rename/clean-up later. Thanks!

@erikgrinaker erikgrinaker force-pushed the expiration-lease-setting branch from f8db1e9 to 386e0f9 on March 16, 2023 16:39
@erikgrinaker erikgrinaker force-pushed the expiration-lease-setting branch from 386e0f9 to 04f4284 on March 16, 2023 16:45
Contributor Author

@erikgrinaker erikgrinaker left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @andrewbaptist and @nvanbenschoten)


pkg/kv/kvserver/replica_range_lease.go line 779 at r15 (raw file):

Previously, erikgrinaker (Erik Grinaker) wrote…

Nevermind, I reread the campaign logic and you're right -- we should always bail out if the range should be using an expiration lease, since this logic is only relevant for epoch leases. I'll update the PR. Thanks for catching this.

I think I have a mild preference for just using the existing naming for now, and we'll remove requiresExpirationLeaseRLocked entirely when we implement the lease scheduler.

I realized that we do want to check requiresExpirationLease in shouldCampaignOnWake, because it allows us to more quickly detect a dead leader and call an election when unquiescing a range that uses an expiration-based lease. But I changed shouldCampaignOnLeaseRequestRedirect to use shouldUseExpirationLease, because the logic only applies to epoch leases. I've updated the code and comments, PTAL.


@andrewbaptist andrewbaptist left a comment


:lgtm:

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @nvanbenschoten)

@andrewbaptist

Thanks for the quick turnaround on the changes! I'm glad we are getting this in.

@erikgrinaker
Contributor Author

TFTR!

bors r+

@craig
Contributor

craig bot commented Mar 17, 2023

Build failed:

@erikgrinaker
Contributor Author

bors retry

@craig
Contributor

craig bot commented Mar 17, 2023

Build failed:

@erikgrinaker
Contributor Author

bors retry

@craig
Contributor

craig bot commented Mar 17, 2023

Build succeeded:

@craig craig bot merged commit 0a4b383 into cockroachdb:master Mar 17, 2023
@erikgrinaker erikgrinaker deleted the expiration-lease-setting branch March 21, 2023 08:47


Development

Successfully merging this pull request may close these issues.

kvserver: explore setting to only use expiration-based leases

3 participants