migrations: introduce clusterversion.SeedTenantSpanConfigs#73623
irfansharif wants to merge 3 commits into cockroachdb:master
Conversation
For newly created tenants, we want to ensure hard splits on tenant boundaries. The source of truth for split points in the span configs subsystem is the contents of `system.span_configurations`. To ensure hard splits, we insert a single-key record at the tenant prefix. In a future commit we'll introduce the spanconfig.Reconciler process, which runs per tenant and governs all config entries under each tenant's purview. This has implications for this initial record (it might get cleaned up, for example); we'll explore them in tests for the Reconciler itself.

Creating a single-key record is easy enough -- we could have written directly to `system.span_configurations`. When a tenant is GC-ed, however, we need to clear out all tenant state transactionally. To that end we plumb a txn-scoped KVAccessor into the planner where `crdb_internal.destroy_tenant` is executed. This lets us easily delete all abandoned tenant span config records.

Note: We get rid of `spanconfig.experimental_kvaccessor.enabled`. Access to the span configs infrastructure is already sufficiently gated through COCKROACH_EXPERIMENTAL_SPAN_CONFIGS. Now that crdb_internal.create_tenant attempts to write through the KVAccessor, it's cumbersome to have to enable the setting manually in every multi-tenant test (increasingly the default) that exercises some part of the span configs infrastructure.

---

This commit introduces the need for a migration: for existing clusters with secondary tenants, when upgrading we need to install this initial record at the tenant prefix for all extant tenants (and continue doing so for subsequently created tenants). This preserves the hard-split-on-tenant-boundary invariant we wish to provide. It's possible for an upgraded multi-tenant cluster to have dormant SQL pods that have never reconciled. If KV switches over to the span configs subsystem, splitting only on the contents of `system.span_configurations`, we'd fail to split on all tenant boundaries. We'll introduce this migration in a future PR (before enabling span configs by default).

Release note: None
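The seed record described above can be sketched as follows. This is a toy illustration, not the actual CockroachDB code: `next` mimics `roachpb.Key.Next()` semantics (append a zero byte), and `seedRecord`/`makeTenantSeedRecord` are hypothetical names for this sketch.

```go
package main

import "fmt"

// next returns the smallest key strictly greater than k, mimicking
// roachpb.Key.Next() semantics: append a zero byte.
func next(k []byte) []byte {
	return append(append([]byte{}, k...), 0)
}

// seedRecord models the single-key span config record installed at a
// tenant's prefix to force a hard split on the tenant boundary.
// (Field and type names here are illustrative assumptions.)
type seedRecord struct {
	startKey, endKey []byte
}

func makeTenantSeedRecord(tenantPrefix []byte) seedRecord {
	return seedRecord{
		startKey: tenantPrefix,
		endKey:   next(tenantPrefix),
	}
}

func main() {
	rec := makeTenantSeedRecord([]byte("/tenant/2"))
	fmt.Printf("[%q, %q)\n", rec.startKey, rec.endKey)
	// → ["/tenant/2", "/tenant/2\x00")
}
```

The span covers exactly one key, which is all that's needed to guarantee a range boundary at the tenant prefix.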
Reconciler is responsible for reconciling a tenant's zone configs (SQL
construct) with the cluster's span configs (KV construct). It's the
central engine for the span configs infrastructure; a single Reconciler
instance is active for every tenant in the system.
type Reconciler interface {
	// Reconcile starts the incremental reconciliation process from
	// the given checkpoint. If it does not find MVCC history going
	// far back enough[1], it falls back to a scan of all
	// descriptors and zone configs before being able to do more
	// incremental work. The provided callback is invoked with
	// timestamps that can be safely checkpointed. A future
	// reconciliation attempt can make use of this timestamp to
	// reduce the amount of necessary work (provided the MVCC
	// history is still available).
	//
	// [1]: It's possible for system.{zones,descriptor} to have been
	// GC-ed away; think suspended tenants.
	Reconcile(
		ctx context.Context,
		checkpoint hlc.Timestamp,
		callback func(checkpoint hlc.Timestamp) error,
	) error
}
Let's walk through what it does. At a high-level, we maintain an
in-memory data structure that's up-to-date with the contents of the KV
(at least the subset of spans we have access to, i.e. the keyspace
carved out for our tenant ID). We watch for changes to SQL state
(descriptors, zone configs), translate the SQL updates to the flattened
span+config form, "diff" the updates against our data structure to see
if there are any changes we need to inform KV of. If so, we do, and
ensure that our data structure is kept up-to-date. We continue watching
for future updates and repeat as necessary.
There's only a single instance of the Reconciler running for a given
tenant at a given point in time (mutual exclusion/leasing is provided by
the jobs subsystem). We needn't worry about contending writers, or the
KV state being changed from underneath us. What we do have to worry
about, however, is suspended tenants not reconciling while
suspended. It's possible for a suspended tenant's SQL state to be GC-ed
away at older MVCC timestamps; when watching for changes, we could fail
to observe tables/indexes/partitions getting deleted. Left as is, this
would result in us never issuing corresponding deletion requests for
the dropped span configs -- we'd be leaving orphaned span configs lying
around (taking up storage space and creating pointless empty ranges). A
"full reconciliation pass" is our attempt to find all these extraneous
entries in KV and to delete them.
We can use our span config data structure here too, one that's
pre-populated with the contents of KV. We translate the entire SQL state
into constituent spans and configs, diff against our data structure to
generate KV updates that we then apply. We follow this with clearing out
all these spans in our data structure, leaving behind all extraneous
entries to be found in KV -- entries we can then simply issue deletes
for.
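The full-reconciliation diff described above can be sketched with a toy model: keys are span start-keys, values are opaque configs, and `fullReconcile` is a hypothetical name (not the real CockroachDB data structure or API).

```go
package main

import (
	"fmt"
	"sort"
)

// fullReconcile diffs the desired span configs (derived from the full
// SQL state) against what KV currently has, returning the spans to
// upsert and the extraneous spans to delete.
func fullReconcile(kv, desired map[string]string) (upserts, deletes []string) {
	for span, cfg := range desired {
		if kv[span] != cfg {
			upserts = append(upserts, span)
		}
	}
	for span := range kv {
		if _, ok := desired[span]; !ok {
			// Extraneous entry in KV (e.g. a table dropped while the
			// tenant was suspended); issue a delete for it.
			deletes = append(deletes, span)
		}
	}
	sort.Strings(upserts)
	sort.Strings(deletes)
	return upserts, deletes
}

func main() {
	kv := map[string]string{"/t/1": "a", "/t/2": "b"} // contents of KV
	desired := map[string]string{"/t/1": "a2"}        // from SQL descriptors/zones
	up, del := fullReconcile(kv, desired)
	fmt.Println(up, del) // [/t/1] [/t/2]
}
```

This mirrors the description: translate SQL state to spans+configs, diff against the KV-populated structure, apply the updates, and whatever remains unmatched in KV is orphaned and gets deleted.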
Release note: None
ajwerner left a comment
Migration looks fine to me.
Reviewed 13 of 13 files at r1, 1 of 29 files at r2, 1 of 10 files at r3.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @arulajmani, @irfansharif, and @nvanbenschoten)
pkg/sql/tenant.go, line 155 at r1 (raw file):
Span: roachpb.Span{
	Key:    tenantPrefix,
	EndKey: tenantPrefix.Next(),
},
I think I prefer PrefixEnd() to Next(). Any problems that causes?
SeedTenantSpanConfigs populates system.span_configurations with seed data for secondary tenants. This state is what ensures that we always split on tenant boundaries when using the span configs infrastructure. This version also comes with a migration to populate the same seed data for existing tenants. Release note: None
Force-pushed from 40f430f to f148f39
irfansharif left a comment
I'll merge this after #71994 lands.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @arulajmani, and @nvanbenschoten)
pkg/sql/tenant.go, line 155 at r1 (raw file):
Previously, ajwerner wrote…
I think I prefer PrefixEnd() to Next(). Any problems that causes?
No problems per se (see [1] in the comment above), but it does mean that in the first ever reconciliation attempt by every SQL pod we'll:
1. Delete [tenantPrefix, tenantPrefix.PrefixEnd())
2. Add spans for actual tables ([/tenant/<id>/Table/ID, /tenant/<id>/Table/ID+1), etc.)
3. Add spans + configs for gaps between actual tables, containing the config found in (1)
4. Delete all the "extraneous" span configs added in (3).

Using Next() just avoids steps 3 and 4, shaving 1 RTT (steps 2 and 3 happen in the same request).
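For context on this trade-off, here is a toy sketch of the two key operations being compared. These are simplified stand-ins mimicking the documented semantics of `roachpb.Key`'s `Next` (append a zero byte) and `PrefixEnd` (increment the last incrementable byte and truncate), not the real implementation; `Next` yields a single-key span while `PrefixEnd` yields the whole tenant keyspace.

```go
package main

import "fmt"

// next mimics roachpb.Key.Next(): the smallest key strictly greater
// than k, obtained by appending a zero byte.
func next(k []byte) []byte {
	return append(append([]byte{}, k...), 0)
}

// prefixEnd mimics roachpb.Key.PrefixEnd(): the first key that does not
// have k as a prefix, obtained by incrementing the last byte below 0xff
// and truncating everything after it.
func prefixEnd(k []byte) []byte {
	out := append([]byte{}, k...)
	for i := len(out) - 1; i >= 0; i-- {
		if out[i] != 0xff {
			out[i]++
			return out[:i+1]
		}
	}
	return out // all bytes 0xff: no distinct prefix end (simplified)
}

func main() {
	tenantPrefix := []byte{0xfe, 0x8a} // hypothetical tenant prefix bytes
	fmt.Printf("Next:      %x\n", next(tenantPrefix))      // fe8a00
	fmt.Printf("PrefixEnd: %x\n", prefixEnd(tenantPrefix)) // fe8b
}
```

With Next() the seed record covers only [tenantPrefix, tenantPrefix+\x00), so the first reconciliation never has to carve real table spans out of it; with PrefixEnd() the seed record covers the entire tenant keyspace and must be split up and cleaned out, costing the extra steps above.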
Just merged this code into #71994.
First two commits are from #71994.