release-21.2: multitenant: implement fallback rate and pass next live instance ID #70727
Thanks for opening a backport. Please check the backport criteria before merging:
If some of the basic criteria cannot be satisfied, ensure that the exceptional criteria are satisfied within.
Add a brief release justification to the body of your PR to justify this backport. Some other things to consider:
andy-kimball
left a comment
Reviewable status:
complete! 1 of 0 LGTMs obtained
ajwerner
left a comment
Reviewed 2 of 6 files at r1, 8 of 15 files at r2, 7 of 11 files at r3, 5 of 12 files at r4, 10 of 10 files at r5, 4 of 4 files at r6, all commit messages.
Reviewable status:complete! 2 of 0 LGTMs obtained (waiting on @RaduBerinde)
pkg/ccl/multitenantccl/tenantcostserver/token_bucket.go, line 72 at r6 (raw file):
} } if string(instance.Lease) != string(in.InstanceLease) {
nit I didn't catch on the main review, could be !bytes.Equal.
RaduBerinde
left a comment
Reviewable status:
complete! 2 of 0 LGTMs obtained (waiting on @ajwerner)
pkg/ccl/multitenantccl/tenantcostserver/token_bucket.go, line 72 at r6 (raw file):
Previously, ajwerner wrote…
nit I didn't catch on the main review, could be !bytes.Equal.
Is the compiler not smart enough to just compare bytes in this case? Anyway, I'll keep that in mind and incorporate it in one of the next changes.
ajwerner
left a comment
Reviewable status:
complete! 2 of 0 LGTMs obtained (waiting on @RaduBerinde)
pkg/ccl/multitenantccl/tenantcostserver/token_bucket.go, line 72 at r6 (raw file):
Previously, RaduBerinde wrote…
Is the compiler not smart enough to just compare bytes in this case? Anyway, I'll keep that in mind and incorporate it in one of the next changes.
Heh I got sniped and the answer is a good one. Nothing to do here.
// Equal reports whether a and b
// are the same length and contain the same bytes.
// A nil argument is equivalent to an empty slice.
func Equal(a, b []byte) bool {
	// Neither cmd/compile nor gccgo allocates for these string conversions.
	return string(a) == string(b)
}
https://cs.opensource.google/go/go/+/refs/tags/go1.17.1:src/bytes/bytes.go;l=18-21
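For anyone who wants to convince themselves, here is a minimal sketch comparing the two spellings of the lease check. The helper names `leasesDifferString` and `leasesDifferBytes` are made up for illustration and are not from token_bucket.go.

```go
package main

import (
	"bytes"
	"fmt"
)

// leasesDifferString mirrors the comparison style quoted from the review:
// convert both slices to strings and compare. (Hypothetical helper.)
func leasesDifferString(a, b []byte) bool {
	return string(a) != string(b)
}

// leasesDifferBytes is the spelling suggested in the nit. (Hypothetical helper.)
func leasesDifferBytes(a, b []byte) bool {
	return !bytes.Equal(a, b)
}

func main() {
	cases := [][2][]byte{
		{[]byte("lease-1"), []byte("lease-1")},
		{[]byte("lease-1"), []byte("lease-2")},
		{nil, []byte{}}, // nil and empty compare as equal in both forms
	}
	for _, c := range cases {
		fmt.Println(leasesDifferString(c[0], c[1]), leasesDifferBytes(c[0], c[1]))
	}
}
```

Both forms agree on every case, including nil versus empty, which matches the doc comment on `bytes.Equal` quoted above.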
Backport #70163 and #70520, both needed for Serverless cost control.
/cc @cockroachdb/release
tenantcostclient: maintain a "buffer" of RUs
This change adjusts the tenant cost controller logic to try to
maintain a "buffer" of 5000 RUs. This is useful to prevent waiting for
more RUs if an otherwise lightly loaded pod suddenly gets a spike of
traffic.
Release note: None
multitenant: implement a fallback rate
This change implements a fallback throttling rate that a SQL pod can
use if it stops being able to complete token bucket
requests.
The goal is to keep tenants without burst RUs throttled and tenants with
lots of RUs unthrottled (or throttled at a high rate). To achieve
this, we calculate a rate at which the tenant would burn through all
their available RUs within 1 hour. The premise here is that if we have
some kind of infrastructure problem, 1 hour is a reasonable time frame
to address it. Beyond 1 hour, the tenant will continue at the same
rate, consuming more RUs than they had available.
Informs #68479.
Release note: None
Release justification: Necessary fix for the distributed rate limiting
functionality, which is vital for the upcoming Serverless MVP release.
It allows CRDB to throttle clusters that have run out of free or paid
request units (which measure CPU and I/O usage). This functionality is
only enabled in multi-tenant scenarios and should have no impact on
our dedicated customers.
tenantcostclient: plumb SQL instance and session ID
This commit plumbs the SQL Instance ID and corresponding
sqlliveness.SessionID to the tenant cost controller.
Release note: None
Release justification: Necessary fix for the distributed rate limiting
functionality, which is vital for the upcoming Serverless MVP release.
It allows CRDB to throttle clusters that have run out of free or paid
request units (which measure CPU and I/O usage). This functionality is
only enabled in multi-tenant scenarios and should have no impact on
our dedicated customers.
tenantcost: implement request sequence number
This commit adds a monotonic request sequence number for the
TokenBucket API. This is used to detect duplicate requests and avoid
double-charging.
Release note: None
Release justification: Necessary fix for the distributed rate limiting
functionality, which is vital for the upcoming Serverless MVP release.
It allows CRDB to throttle clusters that have run out of free or paid
request units (which measure CPU and I/O usage). This functionality is
only enabled in multi-tenant scenarios and should have no impact on
our dedicated customers.
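To illustrate how a monotonic sequence number avoids double-charging, here is a minimal sketch. It assumes the server remembers the highest sequence number seen per instance and that numbering starts at 1; the `instanceState` type and its fields are hypothetical and do not reflect the actual tenant_usage schema.

```go
package main

import "fmt"

// instanceState is a hypothetical per-instance record kept by the server.
type instanceState struct {
	lastSeq    int64   // highest sequence number processed so far
	consumedRU float64 // total RUs charged to this instance
}

// charge applies the consumption only if seq is newer than anything seen
// before. It reports whether the request was applied; a duplicate or
// out-of-date request is ignored.
func (s *instanceState) charge(seq int64, ru float64) bool {
	if seq <= s.lastSeq {
		return false
	}
	s.lastSeq = seq
	s.consumedRU += ru
	return true
}

func main() {
	var s instanceState
	fmt.Println(s.charge(1, 100)) // first request is applied
	fmt.Println(s.charge(1, 100)) // retry of the same request is ignored
	fmt.Println(s.charge(2, 50))  // next request is applied
	fmt.Println(s.consumedRU)
}
```

A retried request carries the same sequence number as the original, so the server can skip it without charging the tenant twice.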
tenantcostserver: cleanup stale instances
This commit implements the server-side logic for cleaning up stale
instances from the tenant_usage table, according to the following
scheme:
 - each tenant sends the ID of the next instance in circular order.
   The live instance set is maintained on the tenant side by a
   separate subsystem.
 - the server uses this information as a "hint" that some instances
   might be stale. When the next ID does not match the expected value,
   a cleanup for a specific instance ID range is triggered. The
   cleanup ultimately checks that the last update is stale, so that
   stale information from the tenant-side doesn't cause incorrect
   removals.
Instances are cleaned up one at a time, with a limit of 10 instance
removals per request. This solution avoids queries that scan ranges of
the table which may contain a lot of tombstones.
Release note: None
Release justification: Necessary fix for the distributed rate limiting
functionality, which is vital for the upcoming Serverless MVP release.
It allows CRDB to throttle clusters that have run out of free or paid
request units (which measure CPU and I/O usage). This functionality is
only enabled in multi-tenant scenarios and should have no impact on
our dedicated customers.
tenantcostclient: pass next live instance ID
This change adds reporting of the next live instance ID to the tenant
cost controller.
For now we query the sqlinstance.Provider at most once a minute. This
is temporary until the Provider is changed to a range-feed-driven
cache (#69976).
Release note: None
Release justification: Necessary fix for the distributed rate limiting
functionality, which is vital for the upcoming Serverless MVP release.
It allows CRDB to throttle clusters that have run out of free or paid
request units (which measure CPU and I/O usage). This functionality is
only enabled in multi-tenant scenarios and should have no impact on
our dedicated customers.