Skip to content

config/kv/storage: teach split queue / merge queue about tenant boundaries#49892

Merged
craig[bot] merged 6 commits intocockroachdb:masterfrom
nvb:nvanbenschoten/tenantRealSplit
Jun 9, 2020
Merged

config/kv/storage: teach split queue / merge queue about tenant boundaries#49892
craig[bot] merged 6 commits intocockroachdb:masterfrom
nvb:nvanbenschoten/tenantRealSplit

Conversation

@nvb
Copy link
Copy Markdown
Contributor

@nvb nvb commented Jun 5, 2020

Fixes #48774.

This PR updates SystemConfig.NeedsSplit and SystemConfig.ComputeSplitKey to take into account tenant rows in the system.tenants table and mandate split points at tenant keyspace boundaries. This ensures that the splitQueue will create splits between tenants and that the mergeQueue will avoid merging ranges for different tenants together.

It then updates the crdb_internal.create_tenant builtin function so that it splits ranges synchronously when creating a new tenant. The prior commit taught the split and merge queues about tenant boundaries, which means that split points will be created asynchronously when a new tenant is created. However, it is valuable from an isolation perspective to split off tenant ranges synchronously upon the creation of a new tenant.

Finally, the PR teaches MVCCFindSplitKey about tenant prefixes, ensuring that all split points that are chosen within secondary tenant keyspaces obey the same rules as those chosen within the system tenant keyspace. Namely, it ensures that splits are never performed between two column families in the same SQL row (this was already the case) and that we never split off an empty LHS range due to a first row that exceeds half of the range size (this is new).

@nvb nvb requested review from a team, pbardea and tbg and removed request for a team June 5, 2020 02:29
@cockroach-teamcity
Copy link
Copy Markdown
Member

This change is Reviewable

Copy link
Copy Markdown
Member

@tbg tbg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have you considered the alternative way of implementing this that does not rely on the system.tenants table to be available on all nodes? I think you would only have had to modify shouldSplitRange with code that looks at the start and end key, and if the range is nonempty (as mvccstats can tell cheaply) we split at the first tenant boundary for which we see a key in the range. This would also handle the merge queue (which has a similar check), though we would want to avoid having to read data in the case in which the merge won't be allowed. That unfortunately seems hard, since the decision is made on the left but we would want to not merge if the RHS is empty. Hmm. Maybe we would merge only if the LHS is empty? Then we would end up with permanent "buffer ranges" between tenants, so at most double the number of empty ranges as tenants in the cluster. Actually the doubling goes away if we do allow a merge of an empty LHS with a nonempty RHS (in which case that tenant's range will "reach out to the empty left", which may be fine).

Anyway, just rambling. I think it's worthwhile to document the alternative and the decision in the commit message and perhaps even in the system config code.

Reviewed 23 of 23 files at r1, 8 of 8 files at r2, 4 of 4 files at r3, 4 of 4 files at r4, 2 of 2 files at r5.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten and @pbardea)


pkg/config/system.go, line 209 at r3 (raw file):
While you're here mind adding a comment to {low,high}Index?

lowIndex (in s.Values) is the first and highIndex one past the last KV pair in the descriptor table.


pkg/config/system.go, line 608 at r3 (raw file):

			return nil
		}
		// MakeTenantPrefix(lowTenIDExcl) is either the start key of this key

This comment would be helpful where lowTenIDExcl is defined. I stared at that a bit to understand the Excl and it also helps understand the special casing of MaxTenantID.


pkg/config/system.go, line 624 at r3 (raw file):

		}
		if keys.MakeTenantPrefix(highTenIDExcl).Equal(searchSpan.EndKey) {
			// MakeTenantPrefix(highTenIDExcl) is the end key of this key range.

Comment on why there isn't an exception for MinTenantID - are we splitting at /Tenant/1? That should not be necessary, right? /Tenant/1/* is empty due to the system tenant's encoding, so /Tenant/2 should be the first reasonable split point. I'm fine still splitting at /Tenant/1 to make sure that tenantID 2 looks "exactly" like any other tenant, though, which is perhaps what you were thinking in the first place.


pkg/config/system.go, line 643 at r3 (raw file):

	lowBound := keys.SystemSQLCodec.TenantMetadataKey(lowTenID)
	lowIndex := s.getIndexBound(lowBound)
	// NOTE: we Next() the tenant key because highTenID is inclusive but the

Can't you do this in a simpler way? You just care about lowIndex and whether it points at a key with a tenant ID smaller than highTenID:

lowBound := keys.SystemSQLCodec.TenantMetadataKey(lowTenID)
lowIndex := s.getIndexBound(lowBound)
if lowIndex == len(s.Values) {
   // No keys within range found.
   return nil
}
// Choose the first key in this range. Extract its tenant ID.
splitKey := s.Values[lowIndex].Key
splitTenID, err := keys.SystemSQLCodec.DecodeTenantMetadataID(splitKey)
if err != nil {
  log.Errorf(ctx, "unable to decode tenant ID from system config: %s", err)
  return nil
}
if splitTenID > highBound {
  // No keys within range found.
  return nil
}
return roachpb.RKey(keys.MakeTenantPrefix(splitTenID))

I'm fine keeping as-is as well, I just thought you'd maybe want to sidestep the Uint64 complications.

@pbardea pbardea removed their request for review June 5, 2020 21:11
nvb added 6 commits June 8, 2020 23:26
This commit adapts the logic introduced in cd11a75 for multi-tenancy.
This is a critical precursor for later commits that create split points
at tenant boundaries. Before those changes, logic tests were getting
lucky that tenants were not splitting off ranges, so their system tables
were still part of a single, cluster-wide range that met the
requirements for enabling rangefeed. After the following commits began
splitting off tenant ranges at tenant boundaries, logic tests that
performed schema changes began to stall. To fix this, we need to enable
rangefeed on tenant system ranges as well, which was always the intention.
This argument was added in 9234460 as a fix for cockroachdb#16008 and a short-term
workaround for the cleanup in cockroachdb#16344. That cleanup was addressed in cockroachdb#16649,
which removed the one place where spanKey and splitKey diverged
(https://github.com/cockroachdb/cockroach/pull/16649/files#diff-6955fb0c3216b1d7d4292ccdf2e18dcf).
They have been identical in all uses of AdminSplit ever since.
Fixes cockroachdb#48774.

This commit updates `SystemConfig.NeedsSplit` and
`SystemConfig.ComputeSplitKey` to take into account tenant rows in the
`system.tenants` table and mandate split points at tenant keyspace
boundaries. This ensures that the splitQueue will create splits between
tenants and that the mergeQueue will avoid merging ranges for different
tenants together.

An alternative to this approach would be to have the split queue and
merge queue detect tenant boundaries based on the start and end key
of their range descriptor and then scan the key range for a tenant
boundary. This has the benefit of not rely on the `system.tenants`
table to be available on all nodes. However, it was decided against
because it would introduce new complexity (scanning a range's keys
to determine necessary split points) instead of extending existing
complexity (using the gossiped SystemConfig to determine necessary
split points). Scanning keys in a range just to determine if that
range needs to split at all might also cause performance issues
that we'd like to avoid.
The previous commit taught the split and merge queues about tenant
boundaries, which means that split points will be created asynchronously
when a new tenant is created. However, it is valuable from an isolation
perspective to split off tenant ranges synchronously upon the creation
of a new tenant. This commit adds in such synchronous range splitting to
the `crdb_internal.create_tenant` builtin.
Informs cockroachdb#48123.

`keys.EnsureSafeSplitKey` itself already knew about tenant prefixes, but
`storage.MVCCFindSplitKey` had a bit of extra logic that assumed the system
tenant. This commit addresses this and removes the only use of `TODOSQLCodec`
in `pkg/storage` while doing so.
@nvb nvb force-pushed the nvanbenschoten/tenantRealSplit branch from a4cd3ad to be002ac Compare June 9, 2020 04:02
Copy link
Copy Markdown
Contributor Author

@nvb nvb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TFTR!

Have you considered the alternative way of implementing this that does not rely on the system.tenants table to be available on all nodes?

Briefly, but only before I refreshed myself on how we enforce splits between table boundaries. After realizing that these table splits points are a pure function of the SystemConfig alone, I decided that it would be a mistake to do things differently and add brand new complexity for tenant keyspace boundaries. The two concepts are too closely analogous, especially given how we store tenant metadata, for the alternative to seem like the right approach. And if it was, then shouldn't we be doing the same thing with table boundaries?

That admittedly would not be a particularly strong argument if we were building this from scratch, but given that this needs to play nicely with the existing structure and there aren't any blatant downsides to the approach here, I think the consistency argument wins out for me.

Past all that, I also think there's some complication with such a proposal when performing the first tenant split. A descriptor will look like {/Table/50, /Max}. How do we decide when to add the first tenant split? Is the last range in the system scanning over all of its keys whenever it calls shouldSplitRange?

I think it's worthwhile to document the alternative and the decision in the commit message

Agreed. Added to the commit message.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @tbg)


pkg/config/system.go, line 209 at r3 (raw file):

Previously, tbg (Tobias Grieger) wrote…

While you're here mind adding a comment to {low,high}Index?

lowIndex (in s.Values) is the first and highIndex one past the last KV pair in the descriptor table.

Done.


pkg/config/system.go, line 608 at r3 (raw file):

Previously, tbg (Tobias Grieger) wrote…

This comment would be helpful where lowTenIDExcl is defined. I stared at that a bit to understand the Excl and it also helps understand the special casing of MaxTenantID.

Done.


pkg/config/system.go, line 624 at r3 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Comment on why there isn't an exception for MinTenantID - are we splitting at /Tenant/1? That should not be necessary, right? /Tenant/1/* is empty due to the system tenant's encoding, so /Tenant/2 should be the first reasonable split point. I'm fine still splitting at /Tenant/1 to make sure that tenantID 2 looks "exactly" like any other tenant, though, which is perhaps what you were thinking in the first place.

Done. It's actually a bit subtle why we don't need to special-case DecodeTenantPrefix(searchSpan.EndKey) == MinTenantID, so it's nice to point that out explicitly.

I think you're aware, but just so we're on the same page though, MinTenantID is 2. So we


pkg/config/system.go, line 643 at r3 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Can't you do this in a simpler way? You just care about lowIndex and whether it points at a key with a tenant ID smaller than highTenID:

lowBound := keys.SystemSQLCodec.TenantMetadataKey(lowTenID)
lowIndex := s.getIndexBound(lowBound)
if lowIndex == len(s.Values) {
   // No keys within range found.
   return nil
}
// Choose the first key in this range. Extract its tenant ID.
splitKey := s.Values[lowIndex].Key
splitTenID, err := keys.SystemSQLCodec.DecodeTenantMetadataID(splitKey)
if err != nil {
  log.Errorf(ctx, "unable to decode tenant ID from system config: %s", err)
  return nil
}
if splitTenID > highBound {
  // No keys within range found.
  return nil
}
return roachpb.RKey(keys.MakeTenantPrefix(splitTenID))

I'm fine keeping as-is as well, I just thought you'd maybe want to sidestep the Uint64 complications.

I like that much better! It also avoids a key encoding. Done.

@lunevalex lunevalex added the A-multitenancy Related to multi-tenancy label Jun 9, 2020
@nvb
Copy link
Copy Markdown
Contributor Author

nvb commented Jun 9, 2020

bors r+

@craig
Copy link
Copy Markdown
Contributor

craig bot commented Jun 9, 2020

Build succeeded

@craig craig bot merged commit 86c4fbd into cockroachdb:master Jun 9, 2020
@nvb nvb deleted the nvanbenschoten/tenantRealSplit branch June 9, 2020 22:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

A-multitenancy Related to multi-tenancy

Projects

None yet

Development

Successfully merging this pull request may close these issues.

kv: teach split queue / merge queue about tenant boundaries and tables

4 participants