tenantcostclient: pass next live instance ID #70520
craig[bot] merged 4 commits into cockroachdb:master
Conversation
ajwerner
left a comment
mod getting the next instance ID, this didn't turn out to be so bad
Reviewed 11 of 11 files at r1, 12 of 12 files at r2, 10 of 10 files at r3, 1 of 4 files at r4.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @ajwerner, @andy-kimball, and @RaduBerinde)
pkg/ccl/multitenantccl/tenantcostclient/tenant_side.go, line 356 at r1 (raw file):
TenantID: c.tenantID.ToUint64(), InstanceID: uint32(c.instanceID), InstanceLease: []byte(c.sessionID),
UnsafeBytes() if you want it
pkg/ccl/multitenantccl/tenantcostserver/server_test.go, line 267 at r3 (raw file):
//testutils.SucceedsSoon(t, func() error { testutils.SucceedsWithin(t, func() error {
detritus?
pkg/ccl/multitenantccl/tenantcostserver/system_table.go, line 403 at r3 (raw file):
func (h *sysTableHelper) maybeCleanupStaleInstance(
	cutoff time.Time, instanceID base.SQLInstanceID,
) (ok bool, nextInstance base.SQLInstanceID, _ error) {
uber-nit: more descriptive name than ok, maybe deleted or cleanedUp?
pkg/ccl/multitenantccl/tenantcostserver/token_bucket.go, line 74 at r2 (raw file):
if string(instance.Lease) != string(in.InstanceLease) {
	// This must be a different incarnation of the same ID. Clear the sequence
	// number.
nit: consider noting also that the client is going to issue its first request at sequence 1, making the choice to set this to 0 work well with the below logic.
pkg/ccl/multitenantccl/tenantcostserver/token_bucket.go, line 83 at r2 (raw file):
// will still consume RUs; we rely on a higher level control loop that
// periodically reconfigures the token bucket to correct such errors.
if instance.Seq == 0 || instance.Seq < in.SeqNum {
do we need the zero check given the choice to start the client at 1?
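The sequence check under discussion can be sketched as follows; `instanceState` and `shouldAccount` are hypothetical names for illustration, not the actual server types:

```go
package main

import "fmt"

// instanceState is a hypothetical stand-in for the per-instance server
// state; Seq holds the last sequence number already accounted for, and
// stays 0 for pre-upgrade tenants that never send sequence numbers.
type instanceState struct {
	Seq int64
}

// shouldAccount reports whether a request with seqNum is new (not a
// duplicate we already counted). Clients issue their first request at
// sequence 1, so a stored Seq of 0 always accepts.
func shouldAccount(instance instanceState, seqNum int64) bool {
	return instance.Seq == 0 || instance.Seq < seqNum
}

func main() {
	fmt.Println(shouldAccount(instanceState{Seq: 0}, 1)) // new incarnation: accept
	fmt.Println(shouldAccount(instanceState{Seq: 2}, 2)) // duplicate: reject
}
```

This is why clearing the sequence to 0 on a lease change composes cleanly with the comparison below: the next request (at sequence 1 or higher) is always accepted.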
pkg/server/server_sql.go, line 133 at r1 (raw file):
sqlMemMetrics sql.MemoryMetrics
stmtDiagnosticsRegistry *stmtdiagnostics.Registry
sqlLivenessSessionID sqlliveness.SessionID
comment that this will only be populated with a non-zero value on secondary tenants?
pkg/server/tenant.go, line 490 at r4 (raw file):
}

func makeNextLiveInstanceIDFn(
cc @rimadeodhar for awareness and to highlight the value of the caching work (which I'll review next, sorry).
pkg/server/tenant.go, line 515 at r4 (raw file):
}
if now := timeutil.Now(); lastRetrieval.Before(now.Add(-interval)) {
bad things are going to happen if it takes more than interval to run the fetch, but still it seems like we don't want to kick off concurrent retrievals.
pkg/server/tenant.go, line 518 at r4 (raw file):
lastRetrieval = now
_ = stopper.RunAsyncTask(ctx, "get-next-live-instance-id", func(ctx context.Context) {
how do you feel about pulling this closure out above the returned closure. I don't think tying the context together is a particularly good thing. If anything, give it the server's outer context.
andy-kimball
left a comment
Reviewed 11 of 11 files at r1, 12 of 12 files at r2, 10 of 10 files at r3.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @ajwerner, @andy-kimball, and @RaduBerinde)
pkg/ccl/multitenantccl/tenantcostserver/server.go, line 37 at r3 (raw file):
"tenant_usage_instance_inactivity",
"instances that have not reported consumption for longer than this value are cleaned up; "+
"should be at least four times higher than the tenant_cost_control_period of any tenant",
Instead of making this independently settable, should we derive it from tenant_cost_control_period (since it sounds like it's very tied to it)? That'd prevent errors where this gets set to a value < 4x that value. It'd also be one less setting that we have to worry about.
pkg/ccl/multitenantccl/tenantcostserver/system_table.go, line 406 at r3 (raw file):
ts := tree.MustMakeDTimestamp(cutoff, time.Microsecond)
row, err := h.ex.QueryRowEx(
	h.ctx, "tenant-usage-upsert", h.txn,
Isn't this a delete, not an upsert?
pkg/ccl/multitenantccl/tenantcostserver/token_bucket.go, line 81 at r2 (raw file):
// Only update consumption if we are sure this is not a duplicate request
// that we already counted. Note that if this is a duplicate request, it
// will still consume RUs; we rely on a higher level control loop that
I'm a bit confused by "if this is a duplicate request, it will still consume RUs", and yet the if statement is not calling tenant.Consumption.Add, so how is it consuming RUs?
RaduBerinde
left a comment
Reviewable status:
complete! 1 of 0 LGTMs obtained (waiting on @ajwerner, @andy-kimball, and @RaduBerinde)
pkg/ccl/multitenantccl/tenantcostserver/server.go, line 37 at r3 (raw file):
Previously, andy-kimball (Andy Kimball) wrote…
Instead of making this independently settable, should we derive it from
tenant_cost_control_period (since it sounds like it's very tied to it)? That'd prevent errors where this gets set to a value < 4x that value. It'd also be one less setting that we have to worry about.
It's tricky - tenant_cost_control_period comes from the tenant side
pkg/ccl/multitenantccl/tenantcostserver/token_bucket.go, line 81 at r2 (raw file):
Previously, andy-kimball (Andy Kimball) wrote…
I'm a bit confused by "if this is a duplicate request, it will still consume RUs", and yet the if statement is not calling
tenant.Consumption.Add, so how is it consuming RUs?
It won't show up as consumption, but it consumes burst RUs. Nothing that wouldn't be corrected by the higher level closed loop though
Force-pushed from 20cf25c to 74700dd.
pkg/ccl/multitenantccl/tenantcostserver/server.go, line 37 at r3 (raw file):
Previously, RaduBerinde wrote…
Ah, makes sense.
RaduBerinde
left a comment
Reviewable status:
complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @ajwerner, @andy-kimball, and @RaduBerinde)
pkg/ccl/multitenantccl/tenantcostclient/tenant_side.go, line 356 at r1 (raw file):
Previously, ajwerner wrote…
UnsafeBytes() if you want it
Done
pkg/ccl/multitenantccl/tenantcostserver/system_table.go, line 403 at r3 (raw file):
Previously, ajwerner wrote…
uber-nit: more descriptive name than
ok, maybe deleted or cleanedUp?
Changed to deleted
pkg/ccl/multitenantccl/tenantcostserver/system_table.go, line 406 at r3 (raw file):
Previously, andy-kimball (Andy Kimball) wrote…
Isn't this a delete, not an upsert?
Done.
pkg/ccl/multitenantccl/tenantcostserver/token_bucket.go, line 83 at r2 (raw file):
Previously, ajwerner wrote…
do we need the zero check given the choice to start the client at 1?
I was thinking of reasonable behavior with tenants before this change, where the sequence number would stay 0 always. Probably doesn't matter but it was easy enough to do.
pkg/server/server_sql.go, line 133 at r1 (raw file):
Previously, ajwerner wrote…
comment that this will only be populated with a non-zero value on secondary tenants?
Done.
pkg/server/tenant.go, line 515 at r4 (raw file):
Previously, ajwerner wrote…
bad things are going to happen if it takes more than interval to run the fetch, but still it seems like we don't want to kick off concurrent retrievals.
Good catch, fixed.
pkg/server/tenant.go, line 518 at r4 (raw file):
Previously, ajwerner wrote…
how do you feel about pulling this closure out above the returned closure. I don't think tying the context together is a particularly good thing. If anything, give it the server's outer context.
I extracted most of it. Let me know if this is what you had in mind for the contexts.
ajwerner
left a comment
Reviewed 2 of 21 files at r5, 2 of 10 files at r7, 3 of 4 files at r8.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @ajwerner, @andy-kimball, and @RaduBerinde)
pkg/server/tenant.go, line 518 at r4 (raw file):
Previously, RaduBerinde wrote…
I extracted most of it. Let me know if this is what you had in mind for the contexts.
Yeah, if I had one more nit, it would be to annotate that server context with a log tag like serverCtx = logtags.AddTag(serverCtx, "get-next-live-instance-id", nil)
Force-pushed from 74700dd to a1970f8.
RaduBerinde
left a comment
Reviewable status:
complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @ajwerner, @andy-kimball, and @RaduBerinde)
pkg/server/tenant.go, line 518 at r4 (raw file):
Previously, ajwerner wrote…
Yeah, if I had one more nit, it would be to annotate that server context with a log tag like
serverCtx = logtags.AddTag(serverCtx, "get-next-live-instance-id", nil)
Good idea, done. Feels like RunAsyncTask should do that internally...
bors r+
Build succeeded.
Encountered an error creating backports. Some common things that can go wrong:
You might need to create your backport manually using the backport tool.
error creating merge commit from a4e1af5 to blathers/backport-release-21.2-70520: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []
you may need to manually resolve merge conflicts with the backport tool.
Backport to branch 21.2.x failed. See errors above.
🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.
This set of changes adds tracking instances on the server side. The main motivator at this stage is to detect duplicate requests and avoid double-charging in some cases. A secondary motivator is that doing this later will cause headaches during upgrade.
Informs #68479.
tenantcostclient: plumb SQL instance and session ID
This commit plumbs the SQL Instance ID and corresponding
sqlliveness.SessionID to the tenant cost controller.
Release note: None
Release justification: Necessary fix for the distributed rate limiting
functionality, which is vital for the upcoming Serverless MVP release.
It allows CRDB to throttle clusters that have run out of free or paid
request units (which measure CPU and I/O usage). This functionality is
only enabled in multi-tenant scenarios and should have no impact on
our dedicated customers.
tenantcost: implement request sequence number
This commit adds a monotonic request sequence number for the
TokenBucket API. This is used to detect duplicate requests and avoid
double-charging.
Release note: None
Release justification: Necessary fix for the distributed rate limiting
functionality, which is vital for the upcoming Serverless MVP release.
It allows CRDB to throttle clusters that have run out of free or paid
request units (which measure CPU and I/O usage). This functionality is
only enabled in multi-tenant scenarios and should have no impact on
our dedicated customers.
tenantcostserver: cleanup stale instances
This commit implements the server-side logic for cleaning up stale
instances from the tenant_usage table, according to the following
scheme:
- each tenant sends the ID of the next instance in circular order.
The live instance set is maintained on the tenant side by a
separate subsystem.
- the server uses this information as a "hint" that some instances
might be stale. When the next ID does not match the expected value,
a cleanup for a specific instance ID range is triggered. The
cleanup ultimately checks that the last update is stale, so that
stale information from the tenant-side doesn't cause incorrect
removals.
Instances are cleaned up one at a time, with a limit of 10 instance
removals per request. This solution avoids queries that scan ranges of
the table which may contain a lot of tombstones.
Release note: None
Release justification: Necessary fix for the distributed rate limiting
functionality, which is vital for the upcoming Serverless MVP release.
It allows CRDB to throttle clusters that have run out of free or paid
request units (which measure CPU and I/O usage). This functionality is
only enabled in multi-tenant scenarios and should have no impact on
our dedicated customers.
tenantcostclient: pass next live instance ID
This change adds reporting of the next live instance ID to the tenant
cost controller.
For now we query the sqlinstance.Provider at most once a minute. This
is temporary until the Provider is changed to a range-feed-driven
cache (#69976).
Release note: None
Release justification: Necessary fix for the distributed rate limiting
functionality, which is vital for the upcoming Serverless MVP release.
It allows CRDB to throttle clusters that have run out of free or paid
request units (which measure CPU and I/O usage). This functionality is
only enabled in multi-tenant scenarios and should have no impact on
our dedicated customers.