tenantcostclient: pass next live instance ID #70520
craig[bot] merged 4 commits into cockroachdb:master
Conversation
ajwerner
left a comment
mod getting the next instance ID, this didn't turn out to be so bad
Reviewed 11 of 11 files at r1, 12 of 12 files at r2, 10 of 10 files at r3, 1 of 4 files at r4.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @ajwerner, @andy-kimball, and @RaduBerinde)
pkg/ccl/multitenantccl/tenantcostclient/tenant_side.go, line 356 at r1 (raw file):
TenantID: c.tenantID.ToUint64(), InstanceID: uint32(c.instanceID), InstanceLease: []byte(c.sessionID),
UnsafeBytes() if you want it
pkg/ccl/multitenantccl/tenantcostserver/server_test.go, line 267 at r3 (raw file):
//testutils.SucceedsSoon(t, func() error { testutils.SucceedsWithin(t, func() error {
detritus?
pkg/ccl/multitenantccl/tenantcostserver/system_table.go, line 403 at r3 (raw file):
func (h *sysTableHelper) maybeCleanupStaleInstance(
	cutoff time.Time, instanceID base.SQLInstanceID,
) (ok bool, nextInstance base.SQLInstanceID, _ error) {
uber-nit: more descriptive name than ok, maybe deleted or cleanedUp?
pkg/ccl/multitenantccl/tenantcostserver/token_bucket.go, line 74 at r2 (raw file):
if string(instance.Lease) != string(in.InstanceLease) {
	// This must be a different incarnation of the same ID. Clear the sequence
	// number.
nit: consider noting also that the client is going to issue its first request at sequence 1, making the choice to set this to 0 work well with the below logic.
pkg/ccl/multitenantccl/tenantcostserver/token_bucket.go, line 83 at r2 (raw file):
// will still consume RUs; we rely on a higher level control loop that
// periodically reconfigures the token bucket to correct such errors.
if instance.Seq == 0 || instance.Seq < in.SeqNum {
do we need the zero check given the choice to start the client at 1?
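The sequence check under discussion can be sketched as follows; `instanceState` and `shouldAccount` are hypothetical names for illustration, not the actual server types:

```go
package main

import "fmt"

// instanceState is a hypothetical stand-in for the per-instance server
// state; Seq holds the last sequence number already accounted for, and
// stays 0 for pre-upgrade tenants that never send sequence numbers.
type instanceState struct {
	Seq int64
}

// shouldAccount reports whether a request with seqNum is new (not a
// duplicate we already counted). Clients issue their first request at
// sequence 1, so a stored Seq of 0 always accepts.
func shouldAccount(instance instanceState, seqNum int64) bool {
	return instance.Seq == 0 || instance.Seq < seqNum
}

func main() {
	fmt.Println(shouldAccount(instanceState{Seq: 0}, 1)) // new incarnation: accept
	fmt.Println(shouldAccount(instanceState{Seq: 2}, 2)) // duplicate: reject
}
```

This is why clearing the sequence to 0 on a lease change composes cleanly with the comparison below: the next request (at sequence 1 or higher) is always accepted.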
pkg/server/server_sql.go, line 133 at r1 (raw file):
sqlMemMetrics sql.MemoryMetrics
stmtDiagnosticsRegistry *stmtdiagnostics.Registry
sqlLivenessSessionID sqlliveness.SessionID
comment that this will only be populated with a non-zero value on secondary tenants?
pkg/server/tenant.go, line 490 at r4 (raw file):
}

func makeNextLiveInstanceIDFn(
cc @rimadeodhar for awareness and to highlight the value of the caching work (which I'll review next, sorry).
pkg/server/tenant.go, line 515 at r4 (raw file):
}
if now := timeutil.Now(); lastRetrieval.Before(now.Add(-interval)) {
bad things are going to happen if it takes more than interval to run the fetch, but still it seems like we don't want to kick off concurrent retrievals.
pkg/server/tenant.go, line 518 at r4 (raw file):
lastRetrieval = now
_ = stopper.RunAsyncTask(ctx, "get-next-live-instance-id", func(ctx context.Context) {
how do you feel about pulling this closure out above the returned closure. I don't think tying the context together is a particularly good thing. If anything, give it the server's outer context.
andy-kimball
left a comment
Reviewed 11 of 11 files at r1, 12 of 12 files at r2, 10 of 10 files at r3.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @ajwerner, @andy-kimball, and @RaduBerinde)
pkg/ccl/multitenantccl/tenantcostserver/server.go, line 37 at r3 (raw file):
"tenant_usage_instance_inactivity",
"instances that have not reported consumption for longer than this value are cleaned up; "+
"should be at least four times higher than the tenant_cost_control_period of any tenant",
Instead of making this independently settable, should we derive it from tenant_cost_control_period (since it sounds like it's very tied to it)? That'd prevent errors where this gets set to a value < 4x that value. It'd also be one less setting that we have to worry about.
pkg/ccl/multitenantccl/tenantcostserver/system_table.go, line 406 at r3 (raw file):
ts := tree.MustMakeDTimestamp(cutoff, time.Microsecond)
row, err := h.ex.QueryRowEx(
	h.ctx, "tenant-usage-upsert", h.txn,
Isn't this a delete, not an upsert?
pkg/ccl/multitenantccl/tenantcostserver/token_bucket.go, line 81 at r2 (raw file):
// Only update consumption if we are sure this is not a duplicate request
// that we already counted. Note that if this is a duplicate request, it
// will still consume RUs; we rely on a higher level control loop that
I'm a bit confused by "if this is a duplicate request, it will still consume RUs", and yet the if statement is not calling tenant.Consumption.Add, so how is it consuming RUs?
RaduBerinde
left a comment
Reviewable status:
complete! 1 of 0 LGTMs obtained (waiting on @ajwerner, @andy-kimball, and @RaduBerinde)
pkg/ccl/multitenantccl/tenantcostserver/server.go, line 37 at r3 (raw file):
Previously, andy-kimball (Andy Kimball) wrote…
Instead of making this independently settable, should we derive it from
tenant_cost_control_period (since it sounds like it's very tied to it)? That'd prevent errors where this gets set to a value < 4x that value. It'd also be one less setting that we have to worry about.
It's tricky - tenant_cost_control_period comes from the tenant side
pkg/ccl/multitenantccl/tenantcostserver/token_bucket.go, line 81 at r2 (raw file):
Previously, andy-kimball (Andy Kimball) wrote…
I'm a bit confused by "if this is a duplicate request, it will still consume RUs", and yet the if statement is not calling
tenant.Consumption.Add, so how is it consuming RUs?
It won't show up as consumption, but it consumes burst RUs. Nothing that wouldn't be corrected by the higher level closed loop though
Force-pushed from 20cf25c to 74700dd.
pkg/ccl/multitenantccl/tenantcostserver/server.go, line 37 at r3 (raw file):
Previously, RaduBerinde wrote…
Ah, makes sense.
RaduBerinde
left a comment
Reviewable status:
complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @ajwerner, @andy-kimball, and @RaduBerinde)
pkg/ccl/multitenantccl/tenantcostclient/tenant_side.go, line 356 at r1 (raw file):
Previously, ajwerner wrote…
UnsafeBytes() if you want it
Done
pkg/ccl/multitenantccl/tenantcostserver/system_table.go, line 403 at r3 (raw file):
Previously, ajwerner wrote…
uber-nit: more descriptive name than
ok, maybe deleted or cleanedUp?
Changed to deleted
pkg/ccl/multitenantccl/tenantcostserver/system_table.go, line 406 at r3 (raw file):
Previously, andy-kimball (Andy Kimball) wrote…
Isn't this a delete, not an upsert?
Done.
pkg/ccl/multitenantccl/tenantcostserver/token_bucket.go, line 83 at r2 (raw file):
Previously, ajwerner wrote…
do we need the zero check given the choice to start the client at 1?
I was thinking of reasonable behavior with tenants before this change, where the sequence number would stay 0 always. Probably doesn't matter but it was easy enough to do.
pkg/server/server_sql.go, line 133 at r1 (raw file):
Previously, ajwerner wrote…
comment that this will only be populated with a non-zero value on secondary tenants?
Done.
pkg/server/tenant.go, line 515 at r4 (raw file):
Previously, ajwerner wrote…
bad things are going to happen if it takes more than interval to run the fetch, but still it seems like we don't want to kick off concurrent retrievals.
Good catch, fixed.
pkg/server/tenant.go, line 518 at r4 (raw file):
Previously, ajwerner wrote…
how do you feel about pulling this closure out above the returned closure. I don't think tying the context together is a particularly good thing. If anything, give it the server's outer context.
I extracted most of it. Let me know if this is what you had in mind for the contexts.
ajwerner
left a comment
Reviewed 2 of 21 files at r5, 2 of 10 files at r7, 3 of 4 files at r8.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @ajwerner, @andy-kimball, and @RaduBerinde)
pkg/server/tenant.go, line 518 at r4 (raw file):
Previously, RaduBerinde wrote…
I extracted most of it. Let me know if this is what you had in mind for the contexts.
Yeah, if I had one more nit, it would be to annotate that server context with a log tag like serverCtx = logtags.AddTag(serverCtx, "get-next-live-instance-id", nil)
Force-pushed from 74700dd to a1970f8.
RaduBerinde
left a comment
Reviewable status:
complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @ajwerner, @andy-kimball, and @RaduBerinde)
pkg/server/tenant.go, line 518 at r4 (raw file):
Previously, ajwerner wrote…
Yeah, if I had one more nit, it would be to annotate that server context with a log tag like
serverCtx = logtags.AddTag(serverCtx, "get-next-live-instance-id", nil)
Good idea, done. Feels like RunAsyncTask should do that internally...
bors r+
Build succeeded.
Encountered an error creating backports. Some common things that can go wrong:
You might need to create your backport manually using the backport tool.
error creating merge commit from a4e1af5 to blathers/backport-release-21.2-70520: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []
you may need to manually resolve merge conflicts with the backport tool.
Backport to branch 21.2.x failed. See errors above.
🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.
This set of changes adds tracking instances on the server side. The main motivator at this stage is to detect duplicate requests and avoid double-charging in some cases. A secondary motivator is that doing this later will cause headaches during upgrade.
Informs #68479.
tenantcostclient: plumb SQL instance and session ID
This commit plumbs the SQL Instance ID and corresponding
sqlliveness.SessionID to the tenant cost controller.
Release note: None
Release justification: Necessary fix for the distributed rate limiting
functionality, which is vital for the upcoming Serverless MVP release.
It allows CRDB to throttle clusters that have run out of free or paid
request units (which measure CPU and I/O usage). This functionality is
only enabled in multi-tenant scenarios and should have no impact on
our dedicated customers.
tenantcost: implement request sequence number
This commit adds a monotonic request sequence number for the
TokenBucket API. This is used to detect duplicate requests and avoid
double-charging.
Release note: None
Release justification: Necessary fix for the distributed rate limiting
functionality, which is vital for the upcoming Serverless MVP release.
It allows CRDB to throttle clusters that have run out of free or paid
request units (which measure CPU and I/O usage). This functionality is
only enabled in multi-tenant scenarios and should have no impact on
our dedicated customers.
tenantcostserver: cleanup stale instances
This commit implements the server-side logic for cleaning up stale
instances from the tenant_usage table, according to the following
scheme:
- each tenant sends the ID of the next instance in circular order.
The live instance set is maintained on the tenant side by a
separate subsystem.
- the server uses this information as a "hint" that some instances
might be stale. When the next ID does not match the expected value,
a cleanup for a specific instance ID range is triggered. The
cleanup ultimately checks that the last update is stale, so that
stale information from the tenant-side doesn't cause incorrect
removals.
Instances are cleaned up one at a time, with a limit of 10 instance
removals per request. This solution avoids queries that scan ranges of
the table which may contain a lot of tombstones.
Release note: None
Release justification: Necessary fix for the distributed rate limiting
functionality, which is vital for the upcoming Serverless MVP release.
It allows CRDB to throttle clusters that have run out of free or paid
request units (which measure CPU and I/O usage). This functionality is
only enabled in multi-tenant scenarios and should have no impact on
our dedicated customers.
tenantcostclient: pass next live instance ID
This change adds reporting of the next live instance ID to the tenant
cost controller.
For now we query the sqlinstance.Provider at most once a minute. This
is temporary until the Provider is changed to a range-feed-driven
cache (#69976).
Release note: None
Release justification: Necessary fix for the distributed rate limiting
functionality, which is vital for the upcoming Serverless MVP release.
It allows CRDB to throttle clusters that have run out of free or paid
request units (which measure CPU and I/O usage). This functionality is
only enabled in multi-tenant scenarios and should have no impact on
our dedicated customers.