config/kv/storage: teach split queue / merge queue about tenant boundaries#49892
Conversation
tbg
left a comment
Have you considered the alternative way of implementing this that does not rely on the system.tenants table to be available on all nodes? I think you would only have had to modify shouldSplitRange with code that looks at the start and end key, and if the range is nonempty (as mvccstats can tell cheaply) we split at the first tenant boundary for which we see a key in the range. This would also handle the merge queue (which has a similar check), though we would want to avoid having to read data in the case in which the merge won't be allowed. That unfortunately seems hard, since the decision is made on the left but we would want to not merge if the RHS is empty. Hmm. Maybe we would merge only if the LHS is empty? Then we would end up with permanent "buffer ranges" between tenants, so at most double the number of empty ranges as tenants in the cluster. Actually the doubling goes away if we do allow a merge of an empty LHS with a nonempty RHS (in which case that tenant's range will "reach out to the empty left", which may be fine).
Anyway, just rambling. I think it's worthwhile to document the alternative and the decision in the commit message and perhaps even in the system config code.
Reviewed 23 of 23 files at r1, 8 of 8 files at r2, 4 of 4 files at r3, 4 of 4 files at r4, 2 of 2 files at r5.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten and @pbardea)
pkg/config/system.go, line 209 at r3 (raw file):
While you're here mind adding a comment to {low,high}Index?
`lowIndex` (in `s.Values`) is the first and `highIndex` one past the last KV pair in the descriptor table.
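As a toy sketch of how such an index bound can be found, the following uses `sort.Search` over string keys. The `kv` type and `getIndexBound` here are simplified stand-ins for the real `roachpb.KeyValue` entries and `SystemConfig.getIndexBound`, not the actual CockroachDB code:

```go
package main

import (
	"fmt"
	"sort"
)

// kv is a stand-in for the roachpb.KeyValue entries held in
// SystemConfig.Values (an assumption; the real type is binary-keyed).
type kv struct{ key string }

// getIndexBound returns the index of the first entry in values whose key is
// >= the given key, or len(values) if no such entry exists. This mirrors how
// lowIndex/highIndex bracket the descriptor-table KV pairs: lowIndex is the
// first pair in the table and highIndex is one past the last.
func getIndexBound(values []kv, key string) int {
	return sort.Search(len(values), func(i int) bool {
		return values[i].key >= key
	})
}

func main() {
	values := []kv{{"/Table/3/1"}, {"/Table/3/2"}, {"/Table/5/1"}}
	low := getIndexBound(values, "/Table/3")  // first pair in table 3
	high := getIndexBound(values, "/Table/4") // one past the last
	fmt.Println(low, high) // 0 2
}
```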
pkg/config/system.go, line 608 at r3 (raw file):
	return nil
}
// MakeTenantPrefix(lowTenIDExcl) is either the start key of this key
This comment would be helpful where lowTenIDExcl is defined. I stared at that a bit to understand the Excl and it also helps understand the special casing of MaxTenantID.
pkg/config/system.go, line 624 at r3 (raw file):
}
if keys.MakeTenantPrefix(highTenIDExcl).Equal(searchSpan.EndKey) {
	// MakeTenantPrefix(highTenIDExcl) is the end key of this key range.
Comment on why there isn't an exception for MinTenantID - are we splitting at /Tenant/1? That should not be necessary, right? /Tenant/1/* is empty due to the system tenant's encoding, so /Tenant/2 should be the first reasonable split point. I'm fine still splitting at /Tenant/1 to make sure that tenantID 2 looks "exactly" like any other tenant, though, which is perhaps what you were thinking in the first place.
pkg/config/system.go, line 643 at r3 (raw file):
lowBound := keys.SystemSQLCodec.TenantMetadataKey(lowTenID)
lowIndex := s.getIndexBound(lowBound)
// NOTE: we Next() the tenant key because highTenID is inclusive but the
Can't you do this in a simpler way? You just care about lowIndex and whether it points at a key with a tenant ID smaller than highTenID:
lowBound := keys.SystemSQLCodec.TenantMetadataKey(lowTenID)
lowIndex := s.getIndexBound(lowBound)
if lowIndex == len(s.Values) {
// No keys within range found.
return nil
}
// Choose the first key in this range. Extract its tenant ID.
splitKey := s.Values[lowIndex].Key
splitTenID, err := keys.SystemSQLCodec.DecodeTenantMetadataID(splitKey)
if err != nil {
log.Errorf(ctx, "unable to decode tenant ID from system config: %s", err)
return nil
}
if splitTenID > highTenID {
// No keys within range found.
return nil
}
return roachpb.RKey(keys.MakeTenantPrefix(splitTenID))

I'm fine keeping as-is as well, I just thought you'd maybe want to sidestep the Uint64 complications.
This commit adapts the logic introduced in cd11a75 for multi-tenancy. This is a critical precursor for later commits that create split points at tenant boundaries. Before those changes, logic tests were getting lucky that tenants were not splitting off ranges, so their system tables were still part of a single, cluster-wide range that met the requirements for enabling rangefeed. After the following commits began splitting off tenant ranges at tenant boundaries, logic tests that performed schema changes began to stall. To fix this, we need to enable rangefeed on tenant system ranges as well, which was always the intention.
This argument was added in 9234460 as a fix for cockroachdb#16008 and a short-term workaround for the cleanup in cockroachdb#16344. That cleanup was addressed in cockroachdb#16649, which removed the one place where spanKey and splitKey diverged (https://github.com/cockroachdb/cockroach/pull/16649/files#diff-6955fb0c3216b1d7d4292ccdf2e18dcf). They have been identical in all uses of AdminSplit ever since.
Straightforward refactor.
Fixes cockroachdb#48774. This commit updates `SystemConfig.NeedsSplit` and `SystemConfig.ComputeSplitKey` to take into account tenant rows in the `system.tenants` table and mandate split points at tenant keyspace boundaries. This ensures that the splitQueue will create splits between tenants and that the mergeQueue will avoid merging ranges for different tenants together. An alternative to this approach would be to have the split queue and merge queue detect tenant boundaries based on the start and end key of their range descriptor and then scan the key range for a tenant boundary. This has the benefit of not relying on the `system.tenants` table being available on all nodes. However, it was decided against because it would introduce new complexity (scanning a range's keys to determine necessary split points) instead of extending existing complexity (using the gossiped SystemConfig to determine necessary split points). Scanning keys in a range just to determine whether that range needs to split at all might also cause performance issues that we'd like to avoid.
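The SystemConfig-based approach the commit describes can be sketched in miniature: scan the sorted tenant entries for the first tenant ID inside the search range and split at its prefix. The code below is a hedged stand-in that models tenant metadata as plain `uint64` IDs rather than the real encoded `system.tenants` keys:

```go
package main

import "fmt"

// computeTenantSplitKey returns the first tenant ID found in the sorted
// slice tenantIDs (standing in for the system.tenants rows gossiped in the
// SystemConfig) that falls in [lowTenID, highTenID], or 0 if none does.
// The real code would instead return roachpb.RKey(keys.MakeTenantPrefix(id)).
func computeTenantSplitKey(tenantIDs []uint64, lowTenID, highTenID uint64) uint64 {
	for _, id := range tenantIDs {
		if id < lowTenID {
			continue // below the search range
		}
		if id > highTenID {
			break // sorted input: nothing later can qualify
		}
		return id // first tenant boundary inside the range
	}
	return 0 // no mandatory split point in this range
}

func main() {
	ids := []uint64{2, 5, 9}
	fmt.Println(computeTenantSplitKey(ids, 3, 10)) // 5: split at /Tenant/5
	fmt.Println(computeTenantSplitKey(ids, 6, 8))  // 0: no tenant in range
}
```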
The previous commit taught the split and merge queues about tenant boundaries, which means that split points will be created asynchronously when a new tenant is created. However, it is valuable from an isolation perspective to split off tenant ranges synchronously upon the creation of a new tenant. This commit adds in such synchronous range splitting to the `crdb_internal.create_tenant` builtin.
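A minimal sketch of the synchronous split, assuming a hypothetical `splitter` interface in place of the real internal DB's admin-split API (names here are illustrative, not CockroachDB's actual signatures):

```go
package main

import "fmt"

// splitter abstracts the KV admin-split RPC (hypothetical interface; the
// real builtin goes through the node's internal KV client).
type splitter interface {
	AdminSplit(key string) error
}

// fakeDB records requested split keys so the behavior can be observed.
type fakeDB struct{ splits []string }

func (db *fakeDB) AdminSplit(key string) error {
	db.splits = append(db.splits, key)
	return nil
}

// createTenant writes the tenant record and then synchronously splits off
// the new tenant's keyspace at its start key, rather than waiting for the
// split queue to notice the new boundary asynchronously.
func createTenant(db *fakeDB, tenID uint64) error {
	// ... write the system.tenants row here ...
	startKey := fmt.Sprintf("/Tenant/%d", tenID) // stand-in for keys.MakeTenantPrefix
	return db.AdminSplit(startKey)
}

func main() {
	db := &fakeDB{}
	if err := createTenant(db, 5); err != nil {
		panic(err)
	}
	fmt.Println(db.splits) // [/Tenant/5]
}
```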
Informs cockroachdb#48123. `keys.EnsureSafeSplitKey` itself already knew about tenant prefixes, but `storage.MVCCFindSplitKey` had a bit of extra logic that assumed the system tenant. This commit addresses this and removes the only use of `TODOSQLCodec` in `pkg/storage` while doing so.
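To illustrate the kind of invariant `EnsureSafeSplitKey` maintains, here is a toy version that models keys as /-separated components ending in a column-family ID; the real function operates on binary-encoded keys and tenant prefixes, so this is only an analogy:

```go
package main

import (
	"fmt"
	"strings"
)

// ensureSafeSplitKey is a toy stand-in for keys.EnsureSafeSplitKey: it trims
// the trailing column-family component from a candidate split key so that a
// split can never land between two column families of the same SQL row.
// Keys are modeled as /-separated components (an assumption; real keys are
// binary-encoded and carry tenant prefixes).
func ensureSafeSplitKey(key string) string {
	parts := strings.Split(key, "/")
	// Drop the trailing column-family component, keeping the row prefix.
	return strings.Join(parts[:len(parts)-1], "/")
}

func main() {
	// A candidate split key inside tenant 5's keyspace, pointing at
	// column family 0 of row 42: the safe split key is the row prefix.
	fmt.Println(ensureSafeSplitKey("/Tenant/5/Table/52/1/42/0"))
}
```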
Force-pushed from a4cd3ad to be002ac.
nvb
left a comment
TFTR!
Have you considered the alternative way of implementing this that does not rely on the system.tenants table to be available on all nodes?
Briefly, but only before I refreshed myself on how we enforce splits between table boundaries. After realizing that these table splits points are a pure function of the SystemConfig alone, I decided that it would be a mistake to do things differently and add brand new complexity for tenant keyspace boundaries. The two concepts are too closely analogous, especially given how we store tenant metadata, for the alternative to seem like the right approach. And if it was, then shouldn't we be doing the same thing with table boundaries?
That admittedly would not be a particularly strong argument if we were building this from scratch, but given that this needs to play nicely with the existing structure and there aren't any blatant downsides to the approach here, I think the consistency argument wins out for me.
Past all that, I also think there's some complication with such a proposal when performing the first tenant split. A descriptor will look like {/Table/50, /Max}. How do we decide when to add the first tenant split? Is the last range in the system scanning over all of its keys whenever it calls shouldSplitRange?
I think it's worthwhile to document the alternative and the decision in the commit message
Agreed. Added to the commit message.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @tbg)
pkg/config/system.go, line 209 at r3 (raw file):
Previously, tbg (Tobias Grieger) wrote…
While you're here mind adding a comment to `{low,high}Index`? `lowIndex` (in `s.Values`) is the first and `highIndex` one past the last KV pair in the descriptor table.
Done.
pkg/config/system.go, line 608 at r3 (raw file):
Previously, tbg (Tobias Grieger) wrote…
This comment would be helpful where `lowTenIDExcl` is defined. I stared at that a bit to understand the `Excl` and it also helps understand the special casing of `MaxTenantID`.
Done.
pkg/config/system.go, line 624 at r3 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Comment on why there isn't an exception for `MinTenantID` - are we splitting at `/Tenant/1`? That should not be necessary, right? `/Tenant/1/*` is empty due to the system tenant's encoding, so `/Tenant/2` should be the first reasonable split point. I'm fine still splitting at `/Tenant/1` to make sure that tenant ID 2 looks "exactly" like any other tenant, though, which is perhaps what you were thinking in the first place.
Done. It's actually a bit subtle why we don't need to special-case DecodeTenantPrefix(searchSpan.EndKey) == MinTenantID, so it's nice to point that out explicitly.
I think you're aware, but just so we're on the same page though, MinTenantID is 2. So we
pkg/config/system.go, line 643 at r3 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Can't you do this in a simpler way? You just care about `lowIndex` and whether it points at a key with a tenant ID smaller than `highTenID`:

lowBound := keys.SystemSQLCodec.TenantMetadataKey(lowTenID)
lowIndex := s.getIndexBound(lowBound)
if lowIndex == len(s.Values) {
	// No keys within range found.
	return nil
}
// Choose the first key in this range. Extract its tenant ID.
splitKey := s.Values[lowIndex].Key
splitTenID, err := keys.SystemSQLCodec.DecodeTenantMetadataID(splitKey)
if err != nil {
	log.Errorf(ctx, "unable to decode tenant ID from system config: %s", err)
	return nil
}
if splitTenID > highTenID {
	// No keys within range found.
	return nil
}
return roachpb.RKey(keys.MakeTenantPrefix(splitTenID))

I'm fine keeping as-is as well, I just thought you'd maybe want to sidestep the Uint64 complications.
I like that much better! It also avoids a key encoding. Done.
bors r+

Build succeeded
Fixes #48774.
This PR updates `SystemConfig.NeedsSplit` and `SystemConfig.ComputeSplitKey` to take into account tenant rows in the `system.tenants` table and mandate split points at tenant keyspace boundaries. This ensures that the splitQueue will create splits between tenants and that the mergeQueue will avoid merging ranges for different tenants together.

It then updates the `crdb_internal.create_tenant` builtin function so that it splits ranges synchronously when creating a new tenant. The prior commit taught the split and merge queues about tenant boundaries, which means that split points will be created asynchronously when a new tenant is created. However, it is valuable from an isolation perspective to split off tenant ranges synchronously upon the creation of a new tenant.

Finally, the PR teaches `MVCCFindSplitKey` about tenant prefixes, ensuring that all split points that are chosen within secondary tenant keyspaces obey the same rules as those chosen within the system tenant keyspace. Namely, it ensures that splits are never performed between two column families in the same SQL row (this was already the case) and that we never split off an empty LHS range due to a first row that exceeds half of the range size (this is new).
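The empty-LHS rule can be illustrated with a toy size-based split-key search (simplified string keys and byte counts, not the real MVCC stats machinery):

```go
package main

import "fmt"

// kv pairs a key with the number of bytes it contributes to the range.
type kv struct {
	key  string
	size int64
}

// findSplitKey returns the first key at or past the size midpoint of the
// range, but never the very first key: splitting there would leave an empty
// left-hand side, which the PR says MVCCFindSplitKey now avoids even when
// the first row alone exceeds half of the range size.
func findSplitKey(kvs []kv) string {
	var total int64
	for _, e := range kvs {
		total += e.size
	}
	var sum int64
	for i, e := range kvs {
		sum += e.size
		if sum >= total/2 && i > 0 {
			return e.key
		}
	}
	return "" // nothing to split
}

func main() {
	// The first row holds 90% of the data, yet we refuse to split before it.
	kvs := []kv{
		{"/Tenant/5/Table/52/1/1/0", 900},
		{"/Tenant/5/Table/52/1/2/0", 50},
		{"/Tenant/5/Table/52/1/3/0", 50},
	}
	fmt.Println(findSplitKey(kvs)) // /Tenant/5/Table/52/1/2/0
}
```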