storage: reconcile manual Range splits and automatic Range merges #37487

@nvb

Background reading:

CockroachDB's distributed keyspace is split into a series of contiguous Ranges, each of which attempts to remain between 32 MB and 64 MB in size. These Ranges are automatically split by the splitQueue based on their size and foreground load. To complement this process, the recent 19.1 release introduced automatic Range merging, where previously split Ranges are merged back together by the mergeQueue if they fall below a size and load threshold.

In addition to these automated processes, CockroachDB has long supported manual hooks into data distribution. The ALTER TABLE ... SPLIT AT statement allows an operator to manually override Range splitting decisions and inject their own split point into the keyspace. There are a number of reasons why an operator might want to do this, which are discussed here.

The current design of automatic Range merges doesn't play well with these manual Range splits. The mergeQueue is not aware of which Ranges were automatically split and which were manually split, so it will happily merge away a manual split. For this reason, using manual splits while automatic merges are enabled is currently disallowed. An operator who attempts to perform a manual Range split without disabling automatic Range merging will be greeted with the following error:

splits would be immediately discarded by merge queue; disable the merge queue first by running 'SET CLUSTER SETTING kv.range_merge.queue_enabled = false'

Manual Range splits and automatic Range merges should be fixed to work with one another.

The proposed solution to this was originally discussed in the Range Merges RFC. The RFC mentions:

To preserve these user-created split points, we need to store state. Every range
to the right of a split point created by an ALTER TABLE ... SPLIT AT command
will have a special range local key set that indicates it is not to be merged
into the range to its left. I'll hereafter refer to this key as the "sticky"
bit.

This "sticky" bit is the first part of the solution. A manual split will need to denote somewhere that a Range was split manually. The best discussion about this is in the range merges RFC pull request. It's still not clear to me whether a range local key is a better solution than marking the bit directly on the Range descriptor. Perhaps @bdarnell has an opinion on this.

Once manual splits begin writing this sticky bit, we'll make the mergeQueue search for the sticky bit before deciding on whether to merge a Range into its left neighbor.
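As a rough illustration of the check described above, here is a minimal sketch in Go. Everything in it is hypothetical: `rangeDesc`, `shouldMergeIntoLeft`, and the size threshold are invented for this example and are not CockroachDB's actual types or API. The sticky bit is modeled as a plain struct field only for convenience, which takes no position on the range-local-key versus Range-descriptor question, and the load-based part of the merge decision is left out.

```go
package main

import "fmt"

// rangeDesc is a hypothetical, simplified stand-in for a Range's metadata.
type rangeDesc struct {
	StartKey  string
	EndKey    string
	SizeBytes int64
	// Sticky is true when the split at StartKey was created manually
	// (e.g. by ALTER TABLE ... SPLIT AT) and must not be merged away.
	Sticky bool
}

// shouldMergeIntoLeft reports whether rhs may be merged into its left
// neighbor lhs. The sticky bit is consulted before any size check.
// Load is ignored in this sketch.
func shouldMergeIntoLeft(lhs, rhs rangeDesc, maxMergedBytes int64) bool {
	if lhs.EndKey != rhs.StartKey {
		return false // not adjacent, nothing to merge
	}
	if rhs.Sticky {
		return false // manual split: leave it alone
	}
	return lhs.SizeBytes+rhs.SizeBytes <= maxMergedBytes
}

func main() {
	const threshold = 32 << 20 // assumed threshold, for illustration only

	lhs := rangeDesc{StartKey: "a", EndKey: "m", SizeBytes: 4 << 20}
	auto := rangeDesc{StartKey: "m", EndKey: "z", SizeBytes: 2 << 20}
	manual := auto
	manual.Sticky = true

	fmt.Println(shouldMergeIntoLeft(lhs, auto, threshold))   // true
	fmt.Println(shouldMergeIntoLeft(lhs, manual, threshold)) // false
}
```

Checking the sticky bit first means a manually split Range is never a merge candidate, no matter how small it is.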

Once this is complete, we can allow manual Range splits while automatic Range merges are enabled.

Finally, we'll want to introduce a new ALTER TABLE ... UNSPLIT AT command that simply removes a sticky bit if one exists. This again was discussed most thoroughly in the range merges RFC pull request.
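The intended lifecycle of a sticky bit (set by a manual split, cleared by an unsplit) can be sketched with a toy model in Go. The `keyspace` type and its methods are assumptions made up for this example and say nothing about how CockroachDB stores the bit or implements either statement. Note that the unsplit only clears the bit. It does not merge anything itself, it just makes the split point eligible for the merge queue again.

```go
package main

import (
	"fmt"
	"sort"
)

// keyspace is a hypothetical toy model: a sorted list of split keys plus
// the set of keys whose split carries a sticky bit.
type keyspace struct {
	splits []string
	sticky map[string]bool
}

func newKeyspace() *keyspace {
	return &keyspace{sticky: map[string]bool{}}
}

// splitAt records a split point at key. A manual split also sets the
// sticky bit; an automatic split does not.
func (k *keyspace) splitAt(key string, manual bool) {
	i := sort.SearchStrings(k.splits, key)
	if i == len(k.splits) || k.splits[i] != key {
		// Insert key at position i, keeping the slice sorted.
		k.splits = append(k.splits, "")
		copy(k.splits[i+1:], k.splits[i:])
		k.splits[i] = key
	}
	if manual {
		k.sticky[key] = true
	}
}

// unsplitAt removes the sticky bit at key if one exists and reports
// whether it removed anything. The split point itself is left in place.
func (k *keyspace) unsplitAt(key string) bool {
	if !k.sticky[key] {
		return false
	}
	delete(k.sticky, key)
	return true
}

// mergeEligible lists the split points that a merge queue would be
// allowed to remove, i.e. those without a sticky bit.
func (k *keyspace) mergeEligible() []string {
	var out []string
	for _, s := range k.splits {
		if !k.sticky[s] {
			out = append(out, s)
		}
	}
	return out
}

func main() {
	ks := newKeyspace()
	ks.splitAt("m", false) // automatic split
	ks.splitAt("g", true)  // manual split, sets the sticky bit

	fmt.Println(ks.mergeEligible()) // [m]
	ks.unsplitAt("g")
	fmt.Println(ks.mergeEligible()) // [g m]
}
```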

Stages:

  • make manual splits write sticky bit
  • make merge queue respect sticky bit
  • remove "splits would be immediately discarded by merge queue" error
  • expose functionality to check which splits are manually split (crdb_internal.ranges and SHOW EXPERIMENTAL_RANGES)
  • introduce ALTER TABLE ... UNSPLIT AT syntax to manually remove sticky bit

Labels: A-kv-distribution (relating to rebalancing and leasing), C-enhancement (solution expected to add code/behavior and preserve backward compatibility; pg compat issues are the exception)
