Skip to content

release-21.2: multitenant: implement fallback rate and pass next live instance ID#70727

Merged
RaduBerinde merged 6 commits intocockroachdb:release-21.2from
RaduBerinde:backport21.2-70163-70520
Sep 28, 2021
Merged

release-21.2: multitenant: implement fallback rate and pass next live instance ID#70727
RaduBerinde merged 6 commits intocockroachdb:release-21.2from
RaduBerinde:backport21.2-70163-70520

Conversation

@RaduBerinde
Copy link
Copy Markdown
Member

Backport #70163 and #70520, both needed for Serverless cost control.

/cc @cockroachdb/release

tenantcostclient: maintain a "buffer" of RUs

This change adjusts the tenant cost controller logic to try to
maintain a "buffer" of 5000 RUs. This is useful to prevent waiting for
more RUs if an otherwise lightly loaded pod suddenly gets a spike of
traffic.

Release note: None

multitenant: implement a fallback rate

This change implements a fallback throttling rate that a SQL pod can
use if it stops being able to complete token bucket
requests.

The goal is keep tenants without burst RUs throttled and tenants with
lots of RUs unthrottled (or throttled at a high rate). To achieve
this, we calculate a rate at which the tenant would burn through all
their available RUs within 1 hour. The premise here is that if we have
some kind of infrastructure problem, 1 hour is a reasonable time frame
to address it. Beyond 1 hour, the tenant will continue at the same
rate, consuming more RUs than they had available.

Informs #68479.

Release note: None

Release justification: Necessary fix for the distributed rate limiting
functionality, which is vital for the upcoming Serverless MVP release.
It allows CRDB to throttle clusters that have run out of free or paid
request units (which measure CPU and I/O usage). This functionality is
only enabled in multi-tenant scenarios and should have no impact on
our dedicated customers.

tenantcostclient: plumb SQL instance and session ID

This commit plumbs the SQL Instance ID and corresponding
sqlliveness.SessionID to the tenant cost controller.

Release note: None

Release justification: Necessary fix for the distributed rate limiting
functionality, which is vital for the upcoming Serverless MVP release.
It allows CRDB to throttle clusters that have run out of free or paid
request units (which measure CPU and I/O usage). This functionality is
only enabled in multi-tenant scenarios and should have no impact on
our dedicated customers.

tenantcost: implement request sequence number

This commit adds a monotonic request sequence number for the
TokenBucket API. This is used to detect duplicate requests and avoid
double-charging.

Release note: None

Release justification: Necessary fix for the distributed rate limiting
functionality, which is vital for the upcoming Serverless MVP release.
It allows CRDB to throttle clusters that have run out of free or paid
request units (which measure CPU and I/O usage). This functionality is
only enabled in multi-tenant scenarios and should have no impact on
our dedicated customers.

tenantcostserver: cleanup stale instances

This commit implements the server-side logic for cleaning up stale
instances from the tenant_usage table, according to the following
scheme:

  • each tenant sends the ID of the next instance in circular order.
    The live instance set is maintained on the tenant side by a
    separate subsystem.

  • the server uses this information as a "hint" that some instances
    might be stale. When the next ID does not match the expected value,
    a cleanup for a specific instance ID range is triggered. The
    cleanup ultimately checks that the last update is stale, so that
    stale information from the tenant-side doesn't cause incorrect
    removals.

Instances are cleaned up one at a time, with a limit of 10 instance
removals per request. This solution avoids queries that scan ranges of
the table which may contain a lot of tombstones.

Release note: None

Release justification: Necessary fix for the distributed rate limiting
functionality, which is vital for the upcoming Serverless MVP release.
It allows CRDB to throttle clusters that have run out of free or paid
request units (which measure CPU and I/O usage). This functionality is
only enabled in multi-tenant scenarios and should have no impact on
our dedicated customers.

tenantcostclient: pass next live instance ID

This change adds reporting of the next live instance ID to the tenant
cost controller.

For now we query the sqlinstance.Provider at most once a minute. This
is temporary until the Provider is changed to a range-feed-driven
cache (##69976).

Release note: None

Release justification: Necessary fix for the distributed rate limiting
functionality, which is vital for the upcoming Serverless MVP release.
It allows CRDB to throttle clusters that have run out of free or paid
request units (which measure CPU and I/O usage). This functionality is
only enabled in multi-tenant scenarios and should have no impact on
our dedicated customers.

This change adjusts the tenant cost controller logic to try to
maintain a "buffer" of 5000 RUs. This is useful to prevent waiting for
more RUs if an otherwise lightly loaded pod suddenly gets a spike of
traffic.

Release note: None
This change implements a fallback throttling rate that a SQL pod can
use if it stops being able to complete token bucket
requests.

The goal is keep tenants without burst RUs throttled and tenants with
lots of RUs unthrottled (or throttled at a high rate). To achieve
this, we calculate a rate at which the tenant would burn through all
their available RUs within 1 hour. The premise here is that if we have
some kind of infrastructure problem, 1 hour is a reasonable time frame
to address it. Beyond 1 hour, the tenant will continue at the same
rate, consuming more RUs than they had available.

Informs cockroachdb#68479.

Release note: None

Release justification: Necessary fix for the distributed rate limiting
functionality, which is vital for the upcoming Serverless MVP release.
It allows CRDB to throttle clusters that have run out of free or paid
request units (which measure CPU and I/O usage). This functionality is
only enabled in multi-tenant scenarios and should have no impact on
our dedicated customers.
This commit plumbs the SQL Instance ID and corresponding
sqlliveness.SessionID to the tenant cost controller.

Release note: None

Release justification: Necessary fix for the distributed rate limiting
functionality, which is vital for the upcoming Serverless MVP release.
It allows CRDB to throttle clusters that have run out of free or paid
request units (which measure CPU and I/O usage). This functionality is
only enabled in multi-tenant scenarios and should have no impact on
our dedicated customers.
This commit adds a monotonic request sequence number for the
TokenBucket API. This is used to detect duplicate requests and avoid
double-charging.

Release note: None

Release justification: Necessary fix for the distributed rate limiting
functionality, which is vital for the upcoming Serverless MVP release.
It allows CRDB to throttle clusters that have run out of free or paid
request units (which measure CPU and I/O usage). This functionality is
only enabled in multi-tenant scenarios and should have no impact on
our dedicated customers.
This commit implements the server-side logic for cleaning up stale
instances from the tenant_usage table, according to the following
scheme:

 - each tenant sends the ID of the next instance in circular order.
   The live instance set is maintained on the tenant side by a
   separate subsystem.

 - the server uses this information as a "hint" that some instances
   might be stale. When the next ID does not match the expected value,
   a cleanup for a specific instance ID range is triggered. The
   cleanup ultimately checks that the last update is stale, so that
   stale information from the tenant-side doesn't cause incorrect
   removals.

Instances are cleaned up one at a time, with a limit of 10 instance
removals per request. This solution avoids queries that scan ranges of
the table which may contain a lot of tombstones.

Release note: None

Release justification: Necessary fix for the distributed rate limiting
functionality, which is vital for the upcoming Serverless MVP release.
It allows CRDB to throttle clusters that have run out of free or paid
request units (which measure CPU and I/O usage). This functionality is
only enabled in multi-tenant scenarios and should have no impact on
our dedicated customers.
This change adds reporting of the next live instance ID to the tenant
cost controller.

For now we query the sqlinstance.Provider at most once a minute. This
is temporary until the Provider is changed to a range-feed-driven
cache (#cockroachdb#69976).

Release note: None

Release justification: Necessary fix for the distributed rate limiting
functionality, which is vital for the upcoming Serverless MVP release.
It allows CRDB to throttle clusters that have run out of free or paid
request units (which measure CPU and I/O usage). This functionality is
only enabled in multi-tenant scenarios and should have no impact on
our dedicated customers.
@RaduBerinde RaduBerinde requested a review from a team as a code owner September 24, 2021 21:29
@blathers-crl
Copy link
Copy Markdown

blathers-crl bot commented Sep 24, 2021

Thanks for opening a backport.

Please check the backport criteria before merging:

  • Patches should only be created for serious issues.
  • Patches should not break backwards-compatibility.
  • Patches should change as little code as possible.
  • Patches should not change on-disk formats or node communication protocols.
  • Patches should not add new functionality.
  • Patches must not add, edit, or otherwise modify cluster versions; or add version gates.
If some of the basic criteria cannot be satisfied, ensure that the exceptional criteria are satisfied within.
  • There is a high priority need for the functionality that cannot wait until the next release and is difficult to address in another way.
  • The new functionality is additive-only and only runs for clusters which have specifically “opted in” to it (e.g. by a cluster setting).
  • New code is protected by a conditional check that is trivial to verify and ensures that it only runs for opt-in clusters.
  • The PM and TL on the team that owns the changed code have signed off that the change obeys the above rules.

Add a brief release justification to the body of your PR to justify this backport.

Some other things to consider:

  • What did we do to ensure that a user that doesn’t know & care about this backport, has no idea that it happened?
  • Will this work in a cluster of mixed patch versions? Did we test that?
  • If a user upgrades a patch version, uses this feature, and then downgrades, what happens?

@cockroach-teamcity
Copy link
Copy Markdown
Member

This change is Reviewable

Copy link
Copy Markdown
Contributor

@andy-kimball andy-kimball left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained

Copy link
Copy Markdown
Contributor

@ajwerner ajwerner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewed 2 of 6 files at r1, 8 of 15 files at r2, 7 of 11 files at r3, 5 of 12 files at r4, 10 of 10 files at r5, 4 of 4 files at r6, all commit messages.
Reviewable status: :shipit: complete! 2 of 0 LGTMs obtained (waiting on @RaduBerinde)


pkg/ccl/multitenantccl/tenantcostserver/token_bucket.go, line 72 at r6 (raw file):

			}
		}
		if string(instance.Lease) != string(in.InstanceLease) {

nit I didn't catch on the main review, could be !bytes.Equal.

Copy link
Copy Markdown
Member Author

@RaduBerinde RaduBerinde left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 2 of 0 LGTMs obtained (waiting on @ajwerner)


pkg/ccl/multitenantccl/tenantcostserver/token_bucket.go, line 72 at r6 (raw file):

Previously, ajwerner wrote…

nit I didn't catch on the main review, could be !bytes.Equal.

Is the compiler not smart enough to just compare bytes in this case? Anyway, I'll keep that in mind and incorporate it in one of the next changes.

Copy link
Copy Markdown
Contributor

@ajwerner ajwerner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 2 of 0 LGTMs obtained (waiting on @RaduBerinde)


pkg/ccl/multitenantccl/tenantcostserver/token_bucket.go, line 72 at r6 (raw file):

Previously, RaduBerinde wrote…

Is the compiler not smart enough to just compare bytes in this case? Anyway, I'll keep that in mind and incorporate it in one of the next changes.

Heh I got sniped and the answer is a good one. Nothing to do here.

// Equal reports whether a and b
// are the same length and contain the same bytes.
// A nil argument is equivalent to an empty slice.
func Equal(a, b []byte) bool {
	// Neither cmd/compile nor gccgo allocates for these string conversions.
	return string(a) == string(b)
}

https://cs.opensource.google/go/go/+/refs/tags/go1.17.1:src/bytes/bytes.go;l=18-21

@RaduBerinde RaduBerinde merged commit 218fbf5 into cockroachdb:release-21.2 Sep 28, 2021
@RaduBerinde RaduBerinde deleted the backport21.2-70163-70520 branch September 30, 2021 01:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants