kvserver: improve below-Raft migrations #72931

@erikgrinaker

Description

Long-running migrations can send a MigrateRequest for migrations that must be applied below Raft. This request is special in that it only succeeds once it has been applied to all known replicas of the range -- it is not sufficient simply to commit it to the Raft log following acknowledgement from a quorum of replicas.

applicationErr := waitForApplication(
    ctx, r.store.cfg.NodeDialer, desc.RangeID, desc.Replicas().Descriptors(),
    // We wait for an index >= that of the migration command.
    r.GetLeaseAppliedIndex())

This requirement guarantees that no replica's state machine relies on legacy, unmigrated state. However, it requires all replicas of all ranges in the cluster to be available and up to date, with a 5-second application timeout before giving up. Any retries are currently left to the migration code itself. For example, postSeparatedIntentsMigration retries a given range up to 5 times and then fails the entire migration, which must restart from the beginning:

err := retry.WithMaxAttempts(ctx, base.DefaultRetryOptions(), 5, func() error {
    err := deps.DB.Migrate(ctx, start, end, cv.Version)
    if err != nil {
        log.Infof(ctx, "[batch %d/??] error when running no-op Migrate on range r%d: %s",
            batchIdx, desc.RangeID, err)
    }
    return err
})

This could be improved in several ways:

  • Consider whether the requirement that all replicas for all ranges are available and up-to-date is necessary, or even viable in large clusters.
  • Introduce migration helpers that handle batching, checkpointing, and retries of such migrations across ranges automatically, to avoid reimplementing this in each migration. The helper should also optimistically continue migrating the remaining ranges even when one fails.
  • Introduce a knob to control fan-out, i.e. how many ranges are migrated at a time, so operators can speed up migrations at an acceptable cost to foreground traffic.
  • Use smaller txns (with low priority) when iterating over the full set of range descriptors, to reduce contention on meta2. Since these migrations run for a long time, locking up meta2 for the entire duration is not ideal.
  • Make sure the migration jobs can be paused as necessary, and that they use the regular exponential backoff for job retries.
  • Make the application timeout configurable, or simply rely on the passed client context (which would be controlled by the new migration infrastructure mentioned above).
  • Improve the UX by failing with an informative error message explaining that the migration cannot be completed because range X's replicas Y and Z are unavailable/behind/uninitialized/etc., and why all replicas must go through the migration before the upgrade can be finalized.

Jira issue: CRDB-11351

Metadata

Labels

A-kv-replication: Relating to Raft, consensus, and coordination.
C-enhancement: Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception).
O-support: Would prevent or help troubleshoot a customer escalation - bugs, missing observability/tooling, docs.
P-3: Issues/test failures with no fix SLA.
S-3-ux-surprise: Issue leaves users wondering whether CRDB is behaving properly. Likely to hurt reputation/adoption.
T-kv: KV Team.
