kvserver: improve below-Raft migrations #72931

@erikgrinaker

Description

Long-running migrations can send a MigrateRequest for migrations that must be applied below Raft. This request is special in that it only succeeds once it has been applied to all known replicas of the range -- it is not sufficient simply to commit it to the Raft log following acknowledgement from a quorum of replicas.

applicationErr := waitForApplication(
    ctx, r.store.cfg.NodeDialer, desc.RangeID, desc.Replicas().Descriptors(),
    // We wait for an index >= that of the migration command.
    r.GetLeaseAppliedIndex())

This requirement guarantees that no replica's state machine relies on legacy, unmigrated state. However, it requires all replicas of all ranges in the cluster to be available and up to date, with a 5-second application timeout before giving up. Any retries are currently left to the migration code itself. For example, postSeparatedIntentsMigration retries a given range up to 5 times and then fails the entire migration, which must restart from the beginning:

err := retry.WithMaxAttempts(ctx, base.DefaultRetryOptions(), 5, func() error {
    err := deps.DB.Migrate(ctx, start, end, cv.Version)
    if err != nil {
        log.Infof(ctx, "[batch %d/??] error when running no-op Migrate on range r%d: %s",
            batchIdx, desc.RangeID, err)
    }
    return err
})

This could be improved in several ways:

  • Consider whether the requirement that all replicas for all ranges are available and up-to-date is necessary, or even viable in large clusters.
  • Introduce migration helpers that handle batching, checkpointing, and retries of such migrations across ranges automatically, to avoid reimplementing this in each migration. The helper should also optimistically continue migrating the remaining ranges even when one fails.
  • Introduce a knob to control fan-out, i.e. how many ranges are migrated at a time, so operators can speed up migrations at an acceptable cost to foreground traffic.
  • Use smaller txns (with low priority) when iterating over the full set of range descriptors, to reduce contention on meta2. Since these migrations run for a long time, locking up meta2 for the entire duration is not ideal.
  • Make sure the migration jobs can be paused as necessary, and that they use the regular exponential backoff for job retries.
  • Make the application timeout configurable, or simply rely on the passed client context (which would be controlled by the new migration infrastructure mentioned above).
  • Improve the UX by failing with an informative error message explaining that the migration cannot be completed because range X's replicas Y and Z are unavailable/behind/uninitialized/etc., and why all replicas must go through the migration before the upgrade can be finalized.

Jira issue: CRDB-11351

Metadata

Labels

A-kv-replication: Relating to Raft, consensus, and coordination.
C-enhancement: Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception).
O-support: Would prevent or help troubleshoot a customer escalation - bugs, missing observability/tooling, docs.
P-3: Issues/test failures with no fix SLA.
S-3-ux-surprise: Issue leaves users wondering whether CRDB is behaving properly. Likely to hurt reputation/adoption.
T-kv: KV Team.
