release-21.2: multitenant: implement fallback rate and pass next live instance ID by RaduBerinde · Pull Request #70727 · cockroachdb/cockroach

RaduBerinde · 2021-09-24T21:29:37Z

Backport #70163 and #70520, both needed for Serverless cost control.

/cc @cockroachdb/release

tenantcostclient: maintain a "buffer" of RUs

This change adjusts the tenant cost controller logic to try to
maintain a "buffer" of 5000 RUs. This is useful to prevent waiting for
more RUs if an otherwise lightly loaded pod suddenly gets a spike of
traffic.

Release note: None

multitenant: implement a fallback rate

This change implements a fallback throttling rate that a SQL pod can
use if it stops being able to complete token bucket
requests.

The goal is to keep tenants without burst RUs throttled and tenants with
lots of RUs unthrottled (or throttled at a high rate). To achieve
this, we calculate a rate at which the tenant would burn through all
their available RUs within 1 hour. The premise here is that if we have
some kind of infrastructure problem, 1 hour is a reasonable time frame
to address it. Beyond 1 hour, the tenant will continue at the same
rate, consuming more RUs than they had available.
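
As a rough illustration of the calculation described above, here is a minimal sketch that assumes the fallback rate is simply the available RUs divided by the 1-hour window. The names `fallbackRate` and `fallbackWindow` are hypothetical and not taken from the actual change. A real controller would likely also need a baseline rate so that tenants with no available RUs are throttled rather than stopped; that part is omitted here.

```go
package main

import (
	"fmt"
	"time"
)

// fallbackWindow is the period over which the tenant would burn through
// all of its available RUs (1 hour, per the commit message).
const fallbackWindow = time.Hour

// fallbackRate returns a rate in RUs per second such that consuming at this
// rate exhausts availableRU in exactly `window`.
// Hypothetical helper, not the actual tenantcostclient code.
func fallbackRate(availableRU float64, window time.Duration) float64 {
	if availableRU <= 0 {
		return 0
	}
	return availableRU / window.Seconds()
}

func main() {
	// 360000 RUs spread over 3600 seconds.
	fmt.Println(fallbackRate(360000, fallbackWindow)) // prints 100
}
```

Because the rate is fixed when it is computed, a pod that stays cut off for longer than the window keeps running at the same rate, which matches the "consuming more RUs than they had available" caveat.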

Informs #68479.

Release note: None

Release justification: Necessary fix for the distributed rate limiting
functionality, which is vital for the upcoming Serverless MVP release.
It allows CRDB to throttle clusters that have run out of free or paid
request units (which measure CPU and I/O usage). This functionality is
only enabled in multi-tenant scenarios and should have no impact on
our dedicated customers.

tenantcostclient: plumb SQL instance and session ID

This commit plumbs the SQL Instance ID and corresponding
sqlliveness.SessionID to the tenant cost controller.

Release note: None

Release justification: Necessary fix for the distributed rate limiting
functionality, which is vital for the upcoming Serverless MVP release.
It allows CRDB to throttle clusters that have run out of free or paid
request units (which measure CPU and I/O usage). This functionality is
only enabled in multi-tenant scenarios and should have no impact on
our dedicated customers.

tenantcost: implement request sequence number

This commit adds a monotonic request sequence number for the
TokenBucket API. This is used to detect duplicate requests and avoid
double-charging.
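
To illustrate how a monotonic sequence number can guard against double-charging, here is a minimal sketch. The `bucketState` type and its `apply` method are hypothetical stand-ins for the server-side state, not the actual TokenBucket API; the sketch assumes the server remembers the highest sequence number it has processed per instance.

```go
package main

import "fmt"

// bucketState is a hypothetical per-instance record kept by the server.
type bucketState struct {
	lastSeq  int64   // highest request sequence number processed so far
	consumed float64 // total RUs charged to the tenant
}

// apply charges ru to the tenant unless seq shows that the request was
// already processed (for example, a retry of a request whose response was
// lost). It reports whether the charge was applied.
func (s *bucketState) apply(seq int64, ru float64) bool {
	if seq <= s.lastSeq {
		return false // duplicate or outdated request: do not charge again
	}
	s.lastSeq = seq
	s.consumed += ru
	return true
}

func main() {
	var s bucketState
	fmt.Println(s.apply(1, 10)) // true: first time this request is seen
	fmt.Println(s.apply(1, 10)) // false: retry of the same request
	fmt.Println(s.consumed)     // 10
}
```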

Release note: None

Release justification: Necessary fix for the distributed rate limiting
functionality, which is vital for the upcoming Serverless MVP release.
It allows CRDB to throttle clusters that have run out of free or paid
request units (which measure CPU and I/O usage). This functionality is
only enabled in multi-tenant scenarios and should have no impact on
our dedicated customers.

tenantcostserver: cleanup stale instances

This commit implements the server-side logic for cleaning up stale
instances from the tenant_usage table, according to the following
scheme:

- each tenant sends the ID of the next instance in circular order.
  The live instance set is maintained on the tenant side by a
  separate subsystem.
- the server uses this information as a "hint" that some instances
  might be stale. When the next ID does not match the expected value,
  a cleanup for a specific instance ID range is triggered. The
  cleanup ultimately checks that the last update is stale, so that
  stale information from the tenant-side doesn't cause incorrect
  removals.

Instances are cleaned up one at a time, with a limit of 10 instance
removals per request. This solution avoids queries that scan ranges of
the table which may contain a lot of tombstones.
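
The scheme above can be sketched roughly as follows. This is an in-memory illustration only: the `instanceRow` type, the `cleanupStale` function, the map standing in for the tenant_usage table, and the `staleAfter` threshold are all assumptions made for the example, not the actual tenantcostserver code. It assumes the server records, for each instance, the ID of its successor, so removals can follow point lookups instead of a range scan.

```go
package main

import (
	"fmt"
	"time"
)

// maxRemovalsPerRequest mirrors the limit of 10 from the commit message.
const maxRemovalsPerRequest = 10

// instanceRow is a hypothetical stand-in for a per-instance table row.
type instanceRow struct {
	next       int       // successor in circular order, as recorded by the server
	lastUpdate time.Time // time of the last request from this instance
}

// cleanupStale treats reportedNext as a hint. If it differs from the
// recorded successor of id, it removes the instances in between, one at a
// time, but only those whose last update is older than staleAfter.
func cleanupStale(rows map[int]*instanceRow, id, reportedNext int,
	now time.Time, staleAfter time.Duration) []int {
	self, ok := rows[id]
	if !ok || self.next == reportedNext {
		return nil // hint matches the expected value: nothing to do
	}
	var removed []int
	cur := self.next
	for cur != reportedNext && cur != id && len(removed) < maxRemovalsPerRequest {
		row, ok := rows[cur]
		if !ok || now.Sub(row.lastUpdate) < staleAfter {
			break // unknown or recently updated: the hint itself was stale
		}
		delete(rows, cur)
		removed = append(removed, cur)
		cur = row.next
		self.next = cur
	}
	return removed
}

// demo builds the ring 1 -> 2 -> 3 -> 4 -> 1 where instance 1 reports 4 as
// its next live instance. Instance 2 is stale; instance 3 is stale unless
// freshThree is set.
func demo(freshThree bool) ([]int, int) {
	now := time.Date(2021, 9, 24, 12, 0, 0, 0, time.UTC)
	old := now.Add(-10 * time.Minute)
	rows := map[int]*instanceRow{
		1: {next: 2, lastUpdate: now},
		2: {next: 3, lastUpdate: old},
		3: {next: 4, lastUpdate: old},
		4: {next: 1, lastUpdate: now},
	}
	if freshThree {
		rows[3].lastUpdate = now
	}
	removed := cleanupStale(rows, 1, 4, now, time.Minute)
	return removed, rows[1].next
}

func main() {
	fmt.Println(demo(false)) // prints [2 3] 4
	fmt.Println(demo(true))  // prints [2] 3
}
```

The second call shows the safety check: instance 3 was updated recently, so it is kept even though the tenant-side hint suggested it was gone.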

Release note: None

Release justification: Necessary fix for the distributed rate limiting
functionality, which is vital for the upcoming Serverless MVP release.
It allows CRDB to throttle clusters that have run out of free or paid
request units (which measure CPU and I/O usage). This functionality is
only enabled in multi-tenant scenarios and should have no impact on
our dedicated customers.

tenantcostclient: pass next live instance ID

This change adds reporting of the next live instance ID to the tenant
cost controller.

For now we query the sqlinstance.Provider at most once a minute. This
is temporary until the Provider is changed to a range-feed-driven
cache (#69976).
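
A minimal sketch of the two ideas in this commit message: picking the next live instance ID in circular order, and refreshing the live set at most once a minute. The `nextLiveInstanceID` function, the `nextIDCache` type, and its `fetch` callback (a stand-in for querying the sqlinstance.Provider) are hypothetical. The sketch also assumes instance IDs are positive and uses 0 to mean "no other live instance".

```go
package main

import (
	"fmt"
	"time"
)

// refreshInterval is the minimum time between queries of the live set.
const refreshInterval = time.Minute

// nextLiveInstanceID returns the smallest live ID greater than self,
// wrapping around to the smallest live ID. It returns 0 if no instance
// other than self is live.
func nextLiveInstanceID(self int, live []int) int {
	next, lowest := 0, 0
	for _, id := range live {
		if id == self {
			continue
		}
		if lowest == 0 || id < lowest {
			lowest = id
		}
		if id > self && (next == 0 || id < next) {
			next = id
		}
	}
	if next != 0 {
		return next
	}
	return lowest
}

// nextIDCache remembers the last computed value and only calls fetch again
// once refreshInterval has elapsed.
type nextIDCache struct {
	self      int
	fetch     func() []int // hypothetical: returns the live instance IDs
	lastFetch time.Time
	cached    int
}

func (c *nextIDCache) get(now time.Time) int {
	if c.lastFetch.IsZero() || now.Sub(c.lastFetch) >= refreshInterval {
		c.cached = nextLiveInstanceID(c.self, c.fetch())
		c.lastFetch = now
	}
	return c.cached
}

// demo shows a refresh, a cached read, and a second refresh.
func demo() (int, int, int) {
	live := []int{1, 3, 7}
	c := &nextIDCache{self: 7, fetch: func() []int { return live }}
	t0 := time.Date(2021, 9, 24, 0, 0, 0, 0, time.UTC)
	a := c.get(t0) // fetches: next after 7 wraps around to 1
	live = []int{3, 7}
	b := c.get(t0.Add(30 * time.Second)) // within a minute: still cached
	d := c.get(t0.Add(61 * time.Second)) // refreshed: instance 1 is gone
	return a, b, d
}

func main() {
	fmt.Println(demo()) // prints 1 1 3
}
```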

Release note: None

Release justification: Necessary fix for the distributed rate limiting
functionality, which is vital for the upcoming Serverless MVP release.
It allows CRDB to throttle clusters that have run out of free or paid
request units (which measure CPU and I/O usage). This functionality is
only enabled in multi-tenant scenarios and should have no impact on
our dedicated customers.

blathers-crl · 2021-09-24T21:29:41Z

cockroach-teamcity · 2021-09-24T21:29:49Z

andy-kimball

Reviewable status: complete! 1 of 0 LGTMs obtained

ajwerner

Reviewed 2 of 6 files at r1, 8 of 15 files at r2, 7 of 11 files at r3, 5 of 12 files at r4, 10 of 10 files at r5, 4 of 4 files at r6, all commit messages.
Reviewable status: complete! 2 of 0 LGTMs obtained (waiting on @RaduBerinde)

pkg/ccl/multitenantccl/tenantcostserver/token_bucket.go, line 72 at r6 (raw file):

			}
		}
		if string(instance.Lease) != string(in.InstanceLease) {

nit I didn't catch on the main review, could be !bytes.Equal.

RaduBerinde

Reviewable status: complete! 2 of 0 LGTMs obtained (waiting on @ajwerner)

pkg/ccl/multitenantccl/tenantcostserver/token_bucket.go, line 72 at r6 (raw file):

Previously, ajwerner wrote…

nit I didn't catch on the main review, could be !bytes.Equal.

Is the compiler not smart enough to just compare bytes in this case? Anyway, I'll keep that in mind and incorporate it in one of the next changes.

ajwerner

Reviewable status: complete! 2 of 0 LGTMs obtained (waiting on @RaduBerinde)

pkg/ccl/multitenantccl/tenantcostserver/token_bucket.go, line 72 at r6 (raw file):

Previously, RaduBerinde wrote…

Is the compiler not smart enough to just compare bytes in this case? Anyway, I'll keep that in mind and incorporate it in one of the next changes.

Heh I got sniped and the answer is a good one. Nothing to do here.

// Equal reports whether a and b
// are the same length and contain the same bytes.
// A nil argument is equivalent to an empty slice.
func Equal(a, b []byte) bool {
	// Neither cmd/compile nor gccgo allocates for these string conversions.
	return string(a) == string(b)
}

https://cs.opensource.google/go/go/+/refs/tags/go1.17.1:src/bytes/bytes.go;l=18-21
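
For illustration, a small check that the two spellings discussed in this thread agree, including for nil versus empty slices. The helper names and sample values are made up for the example; only `bytes.Equal` and the string conversion come from the thread.

```go
package main

import (
	"bytes"
	"fmt"
)

// sameByString compares the way the reviewed line does.
func sameByString(a, b []byte) bool { return string(a) == string(b) }

// sameByBytes compares the way the nit suggests.
func sameByBytes(a, b []byte) bool { return bytes.Equal(a, b) }

// agree reports whether both comparisons give the same answer on a few
// sample pairs, including nil versus empty.
func agree() bool {
	pairs := [][2][]byte{
		{[]byte("lease-1"), []byte("lease-1")},
		{[]byte("lease-1"), []byte("lease-2")},
		{nil, []byte{}},
		{nil, []byte("x")},
	}
	for _, p := range pairs {
		if sameByString(p[0], p[1]) != sameByBytes(p[0], p[1]) {
			return false
		}
	}
	return true
}

func main() {
	fmt.Println(agree()) // prints true
}
```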

RaduBerinde added 6 commits September 24, 2021 14:21

tenantcostclient: maintain a "buffer" of RUs

091e3ef

RaduBerinde requested a review from andy-kimball September 24, 2021 21:29

RaduBerinde requested a review from a team as a code owner September 24, 2021 21:29

andy-kimball approved these changes Sep 24, 2021

ajwerner approved these changes Sep 27, 2021

RaduBerinde commented Sep 27, 2021

ajwerner approved these changes Sep 27, 2021

RaduBerinde merged commit 218fbf5 into cockroachdb:release-21.2 Sep 28, 2021

RaduBerinde deleted the backport21.2-70163-70520 branch September 30, 2021 01:34

RaduBerinde mentioned this pull request Sep 30, 2021

multitenant: tasklist for cost control MVP #68479

Closed

15 tasks
