schema changes: pre-backfill without online traffic then incremental backfill after #36850
Description
TL;DR
Pre-backfill a new index before it gets any online traffic, then start online traffic and catch it up using a new "incremental backfill" that is cheaper, so online traffic mixes with less backfill traffic.
Background
Currently our schema change process follows the online schema change process documented in the F1 paper pretty closely: we advance through a series of states, and most importantly, we ensure all nodes have advanced to a state in which they update the new schema element during all online traffic before we start backfilling. We do this so we can be sure we correctly process every row: if every node is sure to update the new schema element for any added, changed, or removed row before the backfiller starts, we don't have to worry about the backfiller missing a row added or changed while it runs.
Motivation
However, a bulk operation like a backfill is often tuned differently and behaves differently from online traffic, and mixing the two can hurt both: backfill operations often want to be larger and like to get lots of work done during a given request (to amortize fixed per-request overhead), while online operations are small and tend to be concerned with returning quickly. Thus user-observed tail latencies may suffer when small, quick online requests are serialized with big, slow bulk operations. At the same time, having even just a few point writes from online traffic on disk in a range might mean all bulk ingestion to that range is forced to go into L0 instead of a lower level. All that data may then need to be completely rewritten during compaction, increasing the cost of the backfill significantly. Thus neither the backfill nor the online traffic is particularly happy with the other happening at the same time in the same range (TODO: measure this. see point (1) below).
We mix this traffic for the reason mentioned above: the fact that the online updates are happening while we backfill is how we ensure the backfill doesn't miss a new or changed row. However, we already have a built-in way to tell what, if anything, changed in a table since a given time, and indeed we have built incremental BACKUP on precisely this: MVCC iteration with a lower time-bound.
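To make the idea concrete, here is a toy model of that lower-time-bound iteration. This is purely illustrative: the `version` struct and `incrementalScan` function are invented for this sketch, not CockroachDB's actual MVCC iterator, which lives in the storage layer and uses real HLC timestamps.

```go
package main

import "fmt"

// version is a toy MVCC version: key, timestamp, value. Illustrative
// only; timestamps are simplified to ints.
type version struct {
	key string
	ts  int
	val string
}

// incrementalScan returns, for each key, its latest version with
// t0 < ts <= t1 -- i.e. the set of changes an incremental pass would
// need to replay into the new index.
func incrementalScan(versions []version, t0, t1 int) map[string]string {
	latestTS := map[string]int{}
	out := map[string]string{}
	for _, v := range versions {
		if v.ts <= t0 || v.ts > t1 {
			continue // unchanged since t0, or newer than our upper bound
		}
		if v.ts >= latestTS[v.key] {
			latestTS[v.key] = v.ts
			out[v.key] = v.val
		}
	}
	return out
}

func main() {
	history := []version{
		{"a", 1, "a@1"}, // before t0: already covered by the pre-backfill
		{"b", 3, "b@3"}, // during online traffic: needs catch-up
		{"b", 5, "b@5"}, // later rewrite of b within the window
		{"c", 9, "c@9"}, // after t1: out of scope for this pass
	}
	changed := incrementalScan(history, 2, 6)
	fmt.Println(changed["b"], len(changed))
}
```

Note that only the key `b` is picked up, and only its newest in-window version: rows untouched since t0 cost the incremental pass nothing, which is the whole point.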
Proposal
We could utilize this in schema changes as well: before altering a table to send online-traffic writes to our new index or column, we could instead just run the backfill of it into otherwise disused keyspace. After that "pre-backfill", we could use the same mechanism as incremental BACKUP, but instead of just writing what changed to files, feed it back into our backfiller, thus making an "incremental backfill" that can catch us up.
In practice, this could then look something like this:
a) run the pre-backfill, reading at time t0, writing to its own key space.
b) advance schema states to start online traffic writing to the new key space.
c) wait until schema states / leases ensure all online traffic is running updates; pick a time t1.
d) run the incremental backfill from time t0 to t1.
e) make the descriptors public.
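The steps above can be sketched end to end with the same toy model. Everything here (`write`, `buildIndex`, the int timestamps) is invented for illustration; in reality the pre-backfill and the incremental pass would be the chunk backfiller reading with and without a lower time-bound.

```go
package main

import "fmt"

// write is one versioned write to the table's primary keyspace.
type write struct {
	key string
	ts  int
	val string
}

// buildIndex replays writes with lo < ts <= hi into idx, last write
// wins. With lo=0 this is the full pre-backfill; with lo=t0 it is the
// incremental catch-up pass.
func buildIndex(idx map[string]string, log []write, lo, hi int) {
	latest := map[string]int{}
	for _, w := range log {
		if w.ts > lo && w.ts <= hi && w.ts >= latest[w.key] {
			latest[w.key] = w.ts
			idx[w.key] = w.val
		}
	}
}

func main() {
	log := []write{
		{"a", 1, "a1"}, {"b", 2, "b2"}, // before t0: the pre-backfill sees these
		{"b", 4, "b4"}, {"c", 5, "c5"}, // during the online phase: caught up later
	}
	idx := map[string]string{}
	t0, t1 := 3, 6

	// a) pre-backfill everything as of t0; the index gets no online traffic.
	buildIndex(idx, log, 0, t0)
	// b, c) schema states advance; once all leases have advanced, online
	//       traffic maintains the index itself, so writes after t1 need
	//       no further catch-up.
	// d) incremental backfill: replay only what changed in (t0, t1].
	buildIndex(idx, log, t0, t1)
	// e) descriptors go public.

	fmt.Println(idx["a"], idx["b"], idx["c"])
}
```

The invariant to convince yourself of: every write lands in the index exactly once via either the pre-backfill (ts <= t0), the incremental pass (t0 < ts <= t1), or online maintenance (ts > t1), provided step (c) really guarantees all nodes are writing before t1 is picked.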
Rough Implementation
Given that we have most of the pieces already in place, I think this could be a pretty straightforward project. I imagine the implementation plan looks something like this:
- get a speed-of-light number: actually measure this cost by just skipping the current index states and going straight to backfill, then adding the new element directly as public, with heavy traffic on the table. If it isn't a win, we should focus further efforts elsewhere -- the rest of these steps are just for getting correctness back after making such a change.
- Add a new state for pre-backfilling that doesn't trigger online writes.
- Add a lower-time-bound table reader (or teach the chunk backfiller to read tables a different way). This will require either adding a lower time-bound to the `Scan`s it runs or teaching `TableReader` to send `ExportRequest`s and polishing up `ExportRequest` some.
- Update the schema changer to start in pre-backfill and to run the backfiller there, with no time bound, before stepping through the current states, then add the time bound to the current backfill after `delete_and_write_only`.
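For the new state in particular, the progression might look like the sketch below. Only `delete_and_write_only` is named in this issue; the other identifiers follow the F1 paper's convention and are illustrative, not CockroachDB's exact internal names.

```go
package main

import "fmt"

// state models the proposed schema-element state progression. Names
// besides deleteAndWriteOnly are illustrative.
type state int

const (
	preBackfill        state = iota // new: full backfill runs; no online writes to the index
	deleteOnly                      // online traffic starts removing index entries
	deleteAndWriteOnly              // all online traffic maintains the index; pick t1 here
	public                          // index visible to reads
)

var names = []string{"pre_backfill", "delete_only", "delete_and_write_only", "public"}

// next returns the following state in the proposed progression. The
// time-bounded incremental backfill would run between
// deleteAndWriteOnly and public, once all leases have caught up.
func next(s state) state {
	if s < public {
		return s + 1
	}
	return public
}

func main() {
	for s := preBackfill; ; s = next(s) {
		fmt.Println(names[s])
		if s == public {
			break
		}
	}
}
```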
More random details/thoughts:
This looks a lot like CDC if you squint, but I suspect we don't actually want to use actual CDC: what we actually want -- no changes while we're working, then all the changes -- seems like it might actually be the opposite of what CDC wants to give you -- i.e. a nearly realtime feed of changes. We probably could subscribe to the whole table, but we'd just need to buffer everything during the potentially long and unbounded backfill. I think incremental backup is closer to what we're looking for.
If for some reason we a) see a big benefit to this and b) discover that under heavy traffic so much changes during the online phase that one catch-up pass isn't enough, we could potentially add more incremental passes that hopefully shrink in size as we catch up.
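A back-of-the-envelope sketch of why those passes should shrink, under the assumption that we can drain backlog faster than new writes arrive (the function and rates below are hypothetical, just to show the geometric decay):

```go
package main

import "fmt"

// catchUp models repeated incremental passes: draining a backlog of
// `backlog` changed rows at drainRate rows per time unit, while online
// traffic creates writeRate new changes per time unit. With
// drainRate > writeRate, each pass's backlog shrinks geometrically.
// Numbers and the pass cap are illustrative.
func catchUp(backlog, writeRate, drainRate int) (passes int) {
	for backlog > 0 && passes < 10 {
		passes++
		elapsed := backlog / drainRate // time to drain this pass's backlog
		backlog = writeRate * elapsed  // changes that accumulated meanwhile
	}
	return passes
}

func main() {
	// 1000-row backlog, 10 writes/unit arriving, 100 rows/unit drained:
	// each pass leaves a tenth of the previous backlog behind.
	fmt.Println(catchUp(1000, 10, 100))
}
```

If write traffic were ever to outpace the backfiller (writeRate >= drainRate), the passes would never converge, which is one more reason step (c)'s guarantee matters: the final pass is bounded by a fixed t1 rather than chasing a moving target.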
The reader+writer w/ incremental-reader might also be how we do "almost online" table rewrites (CREATE AS, change pk, repartition, etc).
This assumes we can run the whole change within the GC window. This is probably fine for a PoC, but we shouldn't assume this long-term -- if we're actively running a job that depends on MVCC revisions in a given window, we should prevent GC from removing those revisions.
Jira issue: CRDB-4482