kv,*: non-contiguous ranges #65726
Description
NOTE: this issue is a speculative exploration that I do not expect to be pursued for a long time.
Is your feature request related to a problem? Please describe.
Cockroach was designed to be a scalable database which can handle large volumes of data. However, there was an assumption that the data would be scattered over a number of ranges which was (much) larger than the number of tables (or, say, partitions which carry their own configuration).
The problem at hand is that a client can force the system to carry different configurations for different spans of data. For example, imagine a client creates 1000 tables, each of which is partitioned using REGIONAL BY ROW syntax, each of which contains 1 secondary index (on average), and the database contains 5 regions. In that setting, the schema would require at least 10000 (1000 * 2 * 5) ranges. This is true because each table has two indexes, each of which has five partitions which carry different constraints. Any adjacent key-span with a different configuration implies a need to split. This is true even if these tables contain zero bytes. The overhead of having a range is non-trivial; each range requires background work and memory. See Additional context for why this scenario is actually quite common. In practice, there are only 5 unique configurations in use throughout the entire keyspace.
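The arithmetic above can be made concrete with a small sketch. The numbers are the illustrative ones from the example, not measurements:

```go
package main

import "fmt"

func main() {
	// Hypothetical workload from the example above.
	const (
		tables          = 1000
		indexesPerTable = 2 // primary index + one secondary index
		regions         = 5 // REGIONAL BY ROW partitions per index
	)
	// Each (table, index, partition) triple carries its own
	// configuration, and any two adjacent key spans with different
	// configurations force a split, so each triple needs its own range.
	minRanges := tables * indexesPerTable * regions
	fmt.Println(minRanges) // 10000
	// Yet only `regions` distinct configurations exist keyspace-wide.
	fmt.Println(regions) // 5
}
```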
Describe the solution you'd like
The proposal in this issue is that we extend the range structure to allow for a range to consist
of multiple, discontiguous spans. This proposal is not without precedent; Spanner allows precisely
this (even if motivated differently, see Additional context).
A directory is the unit of data placement. All data in a directory has the same replication configuration. When data is moved between Paxos groups, it is moved directory by directory, as shown in Figure 3. Spanner might move a directory to shed load from a Paxos group; to put directories that are frequently accessed together into the same group; or to move a directory into a group that is closer to its accessors.
The fact that a Paxos group may contain multiple directories implies that a Spanner tablet is different from a Bigtable tablet: the former is not necessarily a single lexicographically contiguous partition of the row space. Instead, a Spanner tablet is a container that may encapsulate multiple partitions of the row space. We made this decision so that it would be possible to colocate multiple directories that are frequently accessed together.
Meta layout and KV changes
The changes required to support non-contiguous ranges are surprisingly small:
- Change the `RangeDescriptor` to have repeated `start_key` and `end_key` fields as parallel arrays (or change it to have a repeated spans field).
- Change the meta encoding.
Currently range descriptors are stored in two places: meta2 and a range-local replicated key. These descriptors are addressed by the start key in the range-local space and (for horrible reasons) by the end key in meta2. The value in both places is updated transactionally. If we made ranges non-contiguous, we would not want to store another copy of the descriptor to keep in sync (it would also bloat the space). Instead, we'd store in meta, at the end key of each subsequent span (not the lowest), a pointer to the start key of the lowest span.
This would mean that resolving range addressing may require one more hop, but that's fine.
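To make the lookup path concrete, here is a minimal sketch of the proposed descriptor shape and the extra-hop resolution. All type and field names are illustrative, not the actual `roachpb` definitions:

```go
package main

import (
	"bytes"
	"fmt"
	"sort"
)

// Span is a [Key, EndKey) interval.
type Span struct{ Key, EndKey []byte }

// RangeDescriptor with a repeated spans field, per the proposal.
type RangeDescriptor struct {
	RangeID int64
	Spans   []Span // sorted, non-overlapping, possibly discontiguous
}

// metaEntry models one meta2 row, addressed (as today) by an end key.
// Exactly one field is set: the lowest span's end key carries the
// descriptor; every other span's end key carries a pointer to the
// lowest span's start key.
type metaEntry struct {
	endKey  []byte
	desc    *RangeDescriptor
	pointer []byte
}

// meta is kept sorted by endKey to mimic a meta2 scan.
type meta []metaEntry

// resolve finds the descriptor covering key, following at most one
// extra hop through a pointer entry.
func (m meta) resolve(key []byte) *RangeDescriptor {
	// Entries are addressed by end key, so the covering entry is the
	// first one whose end key sorts strictly after the lookup key.
	i := sort.Search(len(m), func(i int) bool {
		return bytes.Compare(m[i].endKey, key) > 0
	})
	if i == len(m) {
		return nil
	}
	if m[i].desc != nil {
		return m[i].desc // addressed directly
	}
	// Pointer entry: re-resolve at the lowest span's start key, which
	// lands on the entry that holds the real descriptor.
	return m.resolve(m[i].pointer)
}

func main() {
	// One range with two discontiguous spans: [a, b) and [x, y).
	d := &RangeDescriptor{RangeID: 7, Spans: []Span{
		{Key: []byte("a"), EndKey: []byte("b")},
		{Key: []byte("x"), EndKey: []byte("y")},
	}}
	m := meta{
		{endKey: []byte("b"), desc: d},              // lowest span: descriptor
		{endKey: []byte("y"), pointer: []byte("a")}, // later span: pointer
	}
	fmt.Println(m.resolve([]byte("x1")).RangeID) // 7, via one extra hop
	fmt.Println(m.resolve([]byte("a5")).RangeID) // 7, direct
}
```

The point of the pointer entries is that only one authoritative copy of the descriptor exists per range; the other spans' meta rows are cheap forwarding records.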
I'm not sure there's much more to it. There's some bounds checking which would need to be dealt with for certain data structures (think rditer).
Allocation Decisions
The biggest problem with this whole scheme is choosing when to merge non-contiguous spans. The merge queue today only ever considers the immediately adjacent range for merging. Under this proposal, the merge queue could consider all ranges whose left-most span is earlier than the current range's and which carry the same configuration.
Importantly, there would be zero involvement of the SQL layer in dictating or controlling this merging; it would be constrained to KV entirely.
There's a lot of sophistication that could be applied to make very good decisions here and to achieve the goal laid out by Spanner. Fortunately, I believe there's some hope that dumb solutions will lead to good results. Namely, if all of the tables in a given region would fit into a single range, then a greedy merge solution would co-locate them. That would mean the greedy solution might co-locate secondary index partitions with their primary index partitions. That'd be wonderful.
One risk is that we create a fragmented keyspace with lots of spans spread over very few ranges, which, if shuffled around, could instead be a similarly small number of contiguous ranges. I don't have intuition on the ways in which this fragmentation is likely to arise. Perhaps some limit on the number of spans per range would help.
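The greedy policy and the span-count limit above can be sketched together. Everything here is hypothetical (the names, the limits, and the one-pass structure); the real merge queue would work incrementally, not over a snapshot:

```go
package main

import "fmt"

// rng is a toy stand-in for a range: its configuration identity,
// size, and current span count.
type rng struct {
	configID string
	sizeMB   int
	spans    int
}

// Hypothetical caps: range size, plus a span-count limit to guard
// against the fragmentation risk described above.
const (
	maxRangeMB       = 512
	maxSpansPerRange = 32
)

// greedyMerge walks ranges in key order and folds each one into the
// earliest prior range carrying the same configuration, as long as
// the size and span-count caps allow it.
func greedyMerge(ranges []rng) []rng {
	var out []rng
	target := map[string]int{} // configID -> earliest index in out
	for _, r := range ranges {
		if i, ok := target[r.configID]; ok &&
			out[i].sizeMB+r.sizeMB <= maxRangeMB &&
			out[i].spans+r.spans <= maxSpansPerRange {
			out[i].sizeMB += r.sizeMB
			out[i].spans += r.spans
			continue
		}
		out = append(out, r)
		if _, ok := target[r.configID]; !ok {
			target[r.configID] = len(out) - 1
		}
	}
	return out
}

func main() {
	// Five small partitions alternating between two region configs,
	// as in the REGIONAL BY ROW example.
	in := []rng{
		{"us-east", 10, 1}, {"us-west", 10, 1},
		{"us-east", 10, 1}, {"us-west", 10, 1},
		{"us-east", 10, 1},
	}
	fmt.Println(len(greedyMerge(in))) // 2: one multi-span range per config
}
```

Note that this sketch always merges into the earliest range with a matching config; once that target fills up, later ranges with the same config simply stay separate, which matches the "dumb but probably good enough" spirit of the text.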
Another problem is the split queue and split decision-making. Today we split far more eagerly than we need to. In cases where separate partitions or separate tables are contiguous, we split them even when they carry an identical config. We hope to address this as part of the project described in #62128.
Describe alternatives you've considered
Another way to reduce the number of ranges is to page out their raft groups entirely to some sort of external storage. This can work if the data is mostly cold and not being written to.
Additional context
Schema-per-customer (sometimes called multi-tenancy)
A common paradigm for building SaaS products is to isolate individual customers with their own set of tables. We've seen this come up a number of times. Today this approach is completely stymied by the bottlenecks in creating such a setup (#63206). Soon we'll hopefully fix these bottlenecks and unlock the possibility of clients actually telling the system to create these very large numbers of ranges. Once that happens, I anticipate we'll be exposed to a new class of problems due to having large numbers of ranges.
Interleaved tables and collocated data
In #52009 we decided to remove interleaved tables, a feature which allowed users to collocate data and, potentially, achieve serious access-locality benefits from that collocation. In practice, these optimizations didn't work out well for SQL and carried quite a bit of complexity. Nevertheless, access locality is an important property. If we had a smart (likely global) placement driver, we could place segments of indexes which are accessed together in the same range.
Jira issue: CRDB-7732