
*: long-running migrations #39182

@danhhz

Description

Backward compatibility is a necessary part of storage infrastructure, but we'd like to minimize the associated technical debt. Initially our migrations were one-off, required complicated reasoning, and were under-tested (example: the format version stuff that was added for column families and interleaved tables). Over time, we've added frameworks to help with this, but there's one notable gap in our story: long-running data migrations.

An example that's come up a number of times is rewriting all table descriptors in kv. An extreme example that we may one day face is rewriting all data on disk.

Note that the format versions mentioned above do a mini-migration on each table descriptor as it enters the node (from kv, gossip, etc). Nothing guarantees that all table descriptors eventually get rewritten. So even though this has been around since before 1.0, the format version migrations have to stay around as permanent technical debt. The FK migration currently underway will have a similar problem.

The format version code is a small amount of debt, but it'd be nice to get rid of it. Other examples are not so simple. The learner replicas work in 19.2 replaces preemptive snapshots, but after we stop using preemptive snapshots, we need to completely flush them out of the system before the code can be deleted. One of these places is an interim state between when a preemptive snapshot has been written to rocksdb and when raft has caught up enough to make it into a real replica. To flush these out, after we stop sending preemptive snapshots, we'll need to iterate each range descriptor in kv and ensure that it is finished being added or is removed.

More examples from @tbg:

  • GenerationComparable (make sure that 20.1 can assume that there are no legacy generations out there) - not yet sure how exactly this is achieved but should be doable
  • the various Raft migrations (RangeAppliedState, unreplicated truncated state, etc) which all boil down to "run something through Raft until they're all active on all ranges but 99.99% sure they're all active already anyway"

We currently have two main frameworks for migrations. They go by different names in various places, but I'll call them startup migrations and cluster versions.

Startup Migrations

Startup migrations (in package pkg/sqlmigrations and called "Cluster Upgrade Tool" by the RFC) are used to ensure that some backwards-compatible hook is run before the code that needs it. An example of this is adding a new system table. A node that doesn't know about a system table simply doesn't use it, so it's safe to add one in a mixed version cluster.

When the first node of the new version starts up and tries to join the cluster, it notices that the migration hasn't run, runs it, and saves the result durably in kv. Any nodes that start after see that the migration has been run and skip it.

If a second node of the new version tries to join the cluster before the migration has completed, it blocks until the migration finishes. This means that startup migrations need to be relatively quick. In an ideal world, every user would do version upgrades with a rolling restart, waiting for each node to be healthy and settled before moving on. But it's not Making Data Easy if a non-rolling restart blocks every node on some extremely long startup migration, causing a total outage.

On the other hand, by running at startup, all code in the new version can assume that startup migrations have been run. This means there doesn't need to be a fallback path and that any old code can be deleted immediately.

Cluster Versions

Cluster versions are a special case of our gossip-based configuration settings. They allow for backward-incompatible behaviors to be gated by a promise from the user that they'll never downgrade. An example of this is adding a field to an existing rpc request protobuf. A node doesn't want to send such a request until it is sure that it will be handled correctly on the other end. Because of protobuf semantics, if it went to an old node, the field would be silently ignored.

Cluster versions tie together two concepts. First, a promise to the user that we don't have to worry about rpc or disk compatibility with old versions anymore. Second, a feature gate that is unlocked by that promise. There is an ordered list of these, one for each feature being gated.

Because the feature gate is initially off when a cluster rolls onto a new version, each check of the gate needs a fallback. New features can return an error informing the user to finish the upgrade and bump the version, but other uses need to keep the old code around for a release cycle.

Aside: To make it easier for users that don't want the control, the cluster version is automatically bumped some period of time after all nodes have been upgraded. A setting is available to disable this for users that want to keep the option to roll back until they've manually signed off on the new version.

Data Migrations

Summary: I propose a partial unification of these two systems. To avoid having three migration frameworks, the new system will be based on and will replace cluster versions. The idea is to separate the two ClusterVersion concepts described above so that we can execute potentially long-running hooks in between.

The interface presented to the user is essentially the same, but slightly expanded. After rolling onto a new version of cockroach, some features are not available until the cluster version is bumped, either manually by the user or by the auto-upgrade. Instead of the version bump finishing immediately, a system job is created to do the upgrade, enabling the feature gates as it progresses. This is modeled as a system job both to ensure only one coordinator is running and to expose progress to the user. Once the job finishes (technically, as each step finishes), the gated features become available to the user.

The interface presented to CockroachDB developers is also mostly the same. Each version in the list is now given the option of including a hook, which is allowed to take a while (TODO be more specific) to finish.

Details:

  • The hook, if given, runs after the feature gate is enabled. This simple building block can be used to build more complex migrations by using multiple gate+hook features.
  • To make it simpler to reason about the code that goes in the hook, the system guarantees that the gate has been pushed to every node in the system. That is, every IsActive(feature) check will return true on every node from now on. This was not previously guaranteed; the tricky part of the implementation is handling nodes that are unavailable when the gate is pushed.
  • Like startup migrations, the hook must be idempotent.
  • Complications such as flushing caches are left to the hooks. If any compelling patterns emerge, they could be baked into the framework in the future.
  • Startup migrations then conceptually become very similar to this. They currently implement their own coordinator leasing system and there is an opportunity for code unification by moving them to this new system job type.

Side note: This is all very nearly the long-term vision laid out in https://github.com/cockroachdb/cockroach/blob/master/docs/RFCS/20161024_cluster_upgrade_tool.md

Note to self: https://reviewable.io/reviews/cockroachdb/cockroach/38932#-LlaULyp9sd2JS_tyaKi:-LlaULyp9sd2JS_tyaKj:bfh2kib

Labels

  • A-kv: Anything in KV that doesn't belong in a more specific category.
  • C-enhancement: Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception).
