allocator/mmaprototype: add repair orchestration and AddVoter #165423
tbg wants to merge 7 commits into cockroachdb:master
Conversation
Your pull request contains more than 1000 changes. It is strongly encouraged to split big PRs into smaller chunks. 🦉 Hoot! I am Blathers, a bot for CockroachDB. My owner is dev-inf.
Force-pushed 523176c to 78c22d3
tbg
left a comment
@tbg reviewed 18 files and all commit messages, and made 16 comments.
Reviewable status: complete! 0 of 0 LGTMs obtained.
-- commits line 182 at r6:
Can you update the commit message to contrast this with how the legacy allocator works and to highlight any differences?
Let me engage interactively first before pushing the suggested update.
pkg/kv/kvserver/allocator/mmaprototype/cluster_state_repair.go line 306 at r5 (raw file):
) []ExternalRangeChange {
	re.mmaid++
	ctx = logtags.AddTag(ctx, "mmaid", re.mmaid)
now we have more than one place that does this. Couldn't this be lifted to the caller so that we have a ctx that both repair and rebalance can use that already has the mmaid in it? Do this as a new commit in the end (i.e. it won't be squashed).
pkg/kv/kvserver/allocator/mmaprototype/cluster_state_repair.go line 309 at r5 (raw file):
// Iterate repair actions in priority order (lower enum = higher priority).
for action := FinalizeAtomicReplicationChange; action < NoRepairNeeded; action++ {
- can you start at 1 (so that we don't think there's anything special about FinalizeAtomicReplicationChange)
- mention that 0 is not a valid action by design, in a comment (so nobody wonders if there's an off-by-one here)
- make a const numActions and make sure it doesn't rot. One way to do this is a unit test that validates that Action(numActions) still stringifies properly but Action(numActions+1) doesn't. You might be able to think of a better way too.
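The suggested rot-proofing can be sketched as a self-contained example (hypothetical, simplified names: the real enum is much larger and its String method is generated by stringer; only the invariant check matters here):

```go
package main

import (
	"fmt"
	"strings"
)

type RepairAction int

const (
	_ RepairAction = iota // 0 is deliberately not a valid action
	FinalizeAtomicReplicationChange
	RemoveLearner
	AddVoter
	// numActions is the value of the last valid action above.
	numActions = iota - 1
)

// repairActionNames maps each valid action to its name; unknown values
// fall through to the default "RepairAction(N)" format below.
var repairActionNames = [...]string{
	FinalizeAtomicReplicationChange: "FinalizeAtomicReplicationChange",
	RemoveLearner:                   "RemoveLearner",
	AddVoter:                        "AddVoter",
}

func (a RepairAction) String() string {
	if a >= 1 && int(a) < len(repairActionNames) && repairActionNames[a] != "" {
		return repairActionNames[a]
	}
	return fmt.Sprintf("RepairAction(%d)", int(a))
}

// numActionsIsCurrent implements the suggested invariant: the last valid
// action stringifies to a real name, while numActions+1 does not.
func numActionsIsCurrent() bool {
	return !strings.HasPrefix(RepairAction(numActions).String(), "RepairAction(") &&
		strings.HasPrefix(RepairAction(numActions+1).String(), "RepairAction(")
}

func main() {
	fmt.Println(numActionsIsCurrent()) // true while the const is in sync
}
```

If someone appends a new action without bumping numActions, the first condition fails and a unit test wrapping numActionsIsCurrent would catch it.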
pkg/kv/kvserver/allocator/mmaprototype/cluster_state_repair.go line 314 at r5 (raw file):
	continue
}
// Sort range IDs for deterministic iteration order.
See #165284 - this isn't merged but add a sync.Pool for this following that pattern regardless, so that we don't allocate slices in the common case.
I'm also worried by this determinism. We don't attempt to be entirely random, but we should also not be entirely non-random. Keep the sorting, but then make the range ID iteration order deterministically random (using the rng on rebalanceEnv).
pkg/kv/kvserver/allocator/mmaprototype/cluster_state_repair.go line 298 at r6 (raw file):
// replicaStateForStore returns the ReplicaState of the replica on the given
// store, and whether it was found.
func replicaStateForStore(rs *rangeState, storeID roachpb.StoreID) (ReplicaState, bool) {
can you make this a method on rangeState (and pick a good location)?
pkg/kv/kvserver/allocator/mmaprototype/cluster_state_repair.go line 318 at r6 (raw file):
}

// filterAddCandidates filters candidateStores down to stores that are ready
For performance reasons, I would like the returned storeSet to be backed by the same memory as the incoming candidateStores set. This should be part of the commented contract and the caller needs to be careful when using pooled memory. See #165284 for examples of this pattern.
pkg/kv/kvserver/allocator/mmaprototype/cluster_state_repair.go line 322 at r6 (raw file):
// range at the node level. excludeStoreID, if non-zero, is excluded from the
// existing-replica set (used when a replica on that store is being
// concurrently removed as part of the same change).
add ", such as during non-voter promotions"
pkg/kv/kvserver/allocator/mmaprototype/cluster_state_repair.go line 374 at r6 (raw file):
// for diversity scoring: voterLocalityTiers for voter operations,
// replicaLocalityTiers for non-voter operations.
func (re *rebalanceEnv) pickStoreByDiversity(
Does rebalancing (not repair) have an equivalent of this or does it not care about diversity yet?
This is a discussion item.
pkg/kv/kvserver/allocator/mmaprototype/cluster_state_repair.go line 404 at r6 (raw file):
// repairAddVoter attempts to add a voter to an under-replicated range.
// It follows the decision tree from constraint.go: first try to promote a
the reference to constraint.go isn't going to age well, remove it.
pkg/kv/kvserver/allocator/mmaprototype/cluster_state_repair.go line 441 at r6 (raw file):
re.constraintMatcher.constrainStoresForExpr(constrDisj, &candidateStores)
validCandidates := re.filterAddCandidates(ctx, rs, candidateStores, 0)
use a local const for the 0.
pkg/kv/kvserver/allocator/mmaprototype/cluster_state_repair.go line 443 at r6 (raw file):
validCandidates := re.filterAddCandidates(ctx, rs, candidateStores, 0)
if len(validCandidates) == 0 {
	log.KvDistribution.Warningf(ctx,
Is this really a warning? I'd see this as a verbosity 1 Infof. This can happen routinely, and to many ranges, during outages or when constraints are misconfigured. Yes, worth changing, but hardly worth logging at high density or Warning.
pkg/kv/kvserver/allocator/mmaprototype/cluster_state_repair.go line 465 at r6 (raw file):
rangeChange := MakePendingRangeChange(rangeID, []ReplicaChange{addChange})
if err := re.preCheckOnApplyReplicaChanges(rangeChange); err != nil {
	log.KvDistribution.Warningf(ctx,
we don't expect to hit this, right? Then Warningf is ok but please check that this can't be routinely hit (and possibly for many ranges in one go).
pkg/kv/kvserver/allocator/mmaprototype/cluster_state_repair.go line 470 at r6 (raw file):
}
re.enactRepair(ctx, localStoreID, rangeChange)
log.KvDistribution.Infof(ctx,
we don't want to log at Info for routine stuff except in the aggregate. Put this behind verbosity 1.
pkg/kv/kvserver/allocator/mmaprototype/cluster_state_repair.go line 490 at r6 (raw file):
(*existingReplicaLocalities).getScoreChangeForNewReplica)
if bestStoreID == 0 {
	log.KvDistribution.Warningf(ctx,
ditto
pkg/kv/kvserver/allocator/mmaprototype/cluster_state_repair.go line 498 at r6 (raw file):
prevState, found := replicaStateForStore(rs, bestStoreID)
if !found {
	log.KvDistribution.Warningf(ctx,
This can remain a warning, right? Since we went down this path and so there really ought to be a non-voter we could promote and in particular its store should be known?
pkg/kv/kvserver/allocator/mmaprototype/cluster_state_repair.go line 522 at r6 (raw file):
}
re.enactRepair(ctx, localStoreID, rangeChange)
log.KvDistribution.Infof(ctx,
ditto
Move 7 constraint analysis methods from `constraint_unused_test.go` to `constraint.go`:
- candidatesToConvertFromNonVoterToVoter
- constraintsForAddingVoter
- candidatesToConvertFromVoterToNonVoter
- constraintsForAddingNonVoter
- candidatesForRoleSwapForConstraints
- candidatesVoterConstraintsUnsatisfied
- candidatesNonVoterConstraintsUnsatisfied

Pure mechanical move with improved doc comments from the prototype. These methods are prerequisites for the per-action repair functions in later PRs (AddVoter, RemoveVoter, constraint swaps).

Informs cockroachdb#164658.

Co-Authored-By: roachdev-claude <roachdev-claude-bot@cockroachlabs.com>
Add the RepairAction enum and computeRepairAction() decision tree. These
establish the action space and priority ordering for MMA repair.
RepairAction has 15 values (12 actionable + 3 terminal states), ordered by
priority via iota. computeRepairAction() maps range state to the
highest-priority repair action needed, using a straightforward if/else
cascade examining joint configs, quorum, replica counts, and constraint
satisfaction.
No callers yet — the wiring to clusterState comes in the next commit.
Comparison with legacy Allocator.ComputeAction (allocatorimpl/allocator.go):
The legacy allocator has two separate orderings that sometimes disagree:
1. The Priority() ordering (used to rank ranges in the replicate queue):
FinalizeAtomicReplicationChange 12002
RemoveLearner 12001
ReplaceDeadVoter 12000
AddVoter 10000
ReplaceDecommissioningVoter 5000
RemoveDeadVoter 1000
RemoveDecommissioningVoter 900
RemoveVoter 800
ReplaceDeadNonVoter 700
AddNonVoter 600
ReplaceDecommissioningNonVoter 500
RemoveDeadNonVoter 400
RemoveDecommissioningNonVoter 300
RemoveNonVoter 200
2. The computeAction() if/else cascade (used to pick which action to take
for a single range):
AddVoter ← checked before quorum!
[quorum check → RangeUnavailable]
ReplaceDeadVoter
ReplaceDecommissioningVoter
RemoveDeadVoter ← separate from ReplaceDeadVoter
RemoveDecommissioningVoter ← separate from ReplaceDecomVoter
RemoveVoter
AddNonVoter
ReplaceDeadNonVoter
ReplaceDecommissioningNonVoter
RemoveDeadNonVoter ← separate from ReplaceDeadNonVoter
RemoveDecommissioningNonVoter ← separate from ReplaceDecomNonVoter
RemoveNonVoter
MMA's RepairAction unifies both orderings into a single iota sequence:
FinalizeAtomicReplicationChange (1)
RemoveLearner (2)
AddVoter (3)
ReplaceDeadVoter (4)
ReplaceDecommissioningVoter (5)
RemoveVoter (6)
AddNonVoter (7)
ReplaceDeadNonVoter (8)
ReplaceDecommissioningNonVoter (9)
RemoveNonVoter (10)
SwapVoterForConstraints (11) ← new, legacy has no equivalent
SwapNonVoterForConstraints (12) ← new, legacy has no equivalent
RepairSkipped (13)
RepairPending (14)
NoRepairNeeded (15)
Key differences from legacy:
- Quorum check gates all actions: In the legacy code, AddVoter is checked
before the quorum gate, meaning it can be attempted even without quorum
(with a TODO noting this). MMA checks quorum first (step 4) and skips
repair entirely if quorum is lost, since all replication changes require
raft consensus.
- No separate Remove{Dead,Decommissioning}{Voter,NonVoter}: The legacy
code distinguishes "replace dead voter" (count matches, add-then-remove)
from "remove dead voter" (over-replicated, just remove). MMA collapses
these — RemoveVoter handles all over-replication cases, with candidate
selection preferring dead > decommissioning > healthy replicas.
- Constraint swaps are new: Legacy doesn't have repair actions for
constraint violations — those are handled as rebalancing. MMA treats
them as repair because a range with correct counts but wrong placement
is not fully conformant.
Informs cockroachdb#164658.
Wire the repair action computation into clusterState so that each range's
repair action is eagerly tracked and indexed.
Structural changes:
- Add `repairAction RepairAction` field to `rangeState`
- Add `repairRanges map[RepairAction]map[RangeID]struct{}` to `clusterState`
- Add `updateRepairAction()` and `removeFromRepairRanges()` to maintain the
index
Trigger points (where updateRepairAction is called):
1. End of processRangeMsg (replicas/config may have changed)
2. pendingChangeEnacted when all pending changes complete
3. End of undoPendingChange
4. End of addPendingRangeChange (sets RepairPending)
5. updateStoreStatuses when health/disposition changes (recomputes for
all ranges on the affected store)
Range GC calls removeFromRepairRanges before deleting the range.
Test infrastructure:
- `repair-needed` DSL command: iterates repairRanges by priority, prints
action-to-ranges mapping; scans separately for RepairPending
- `repair` DSL command: stub (pending changes only, no execution yet)
- Parser: nextReplicaID auto-assignment, quiet=true on set-store, relaxed
field count for replica lines, repair recomputation on update-store-status
6 new testdata files exercise the repair tracking across priority ordering,
config changes, constraint changes, multi-range scenarios, pending change
lifecycle, and store status transitions.
Informs cockroachdb#164658.
Co-Authored-By: roachdev-claude <roachdev-claude-bot@cockroachlabs.com>
Force-pushed 78c22d3 to 301ba8e
tbg
left a comment
Review of commits after "START REVIEWING HERE"
Two commits add repair orchestration and the first concrete repair action (AddVoter with non-voter promotion) to the MMA prototype. The code is well-structured, the commit split is clean, and the core algorithms (reservoir sampling, priority iteration, constraint filtering) are correct.
Verified correct:
- Concurrency: repair() runs under a.mu.Lock() via ComputeChanges. No concurrent access issues.
- Map mutation during iteration: repair() snapshots range IDs into ids before iterating, so enactRepair -> addPendingRangeChange -> updateRepairAction mutations are safe.
- Repair-rebalance interaction: repair pending changes correctly prevent the rebalancer from touching the same ranges.
- Reservoir sampling in pickStoreByDiversity is correct standard k=1 reservoir sampling.

Strengths:
- Clean two-commit structure: orchestration skeleton first, concrete action second.
- The diversityScorer function type is a clean abstraction using idiomatic Go method value expressions.
- Non-voter promotion avoids unnecessary data movement.
- Thorough multi-round DSL tests demonstrating the full repair lifecycle.

Test coverage gaps (not blocking, but worth adding):
- Leaseholder-only filtering in repair() is untested — both test files always call `repair store-id=1`, which is the leaseholder. Add a test where the local store is not the leaseholder to verify skipping.
- No tests for error/edge cases: no valid candidates, constraint analysis failure, pre-check failure.
- Diversity scoring with clearly unequal candidates is not tested (only tied or single-candidate scenarios).

Other notes:
- The mmaid field comment on rebalanceEnv says "a counter for rebalanceStores calls" but repair() now also increments it. Worth updating.
- The repair() return value is discarded at allocator_state.go:349 since rebalanceStores() returns the same accumulated re.changes. Consider removing the return or adding a comment.
(made with /review-crdb)
	}
}
// Add to new bucket.
if newAction != NoRepairNeeded && newAction != RepairPending {
/review-crdb(suggestion): RepairSkipped ranges (quorum loss, nil config) are added to repairRanges here but should be excluded like RepairPending. When repair() iterates, it hits the default case and logs "repair action RepairSkipped for rN not yet implemented" at Info level — misleading (these are intentionally skipped, not unimplemented) and potentially noisy in stressed clusters.
Suggested change:
- if newAction != NoRepairNeeded && newAction != RepairPending {
+ if newAction != NoRepairNeeded && newAction != RepairPending && newAction != RepairSkipped {
(*existingReplicaLocalities).getScoreChangeForNewReplica)

// Create the pending change.
targetSS := re.stores[bestStoreID]
/review-crdb(suggestion): Missing bestStoreID == 0 guard. pickStoreByDiversity documents returning 0 for "no valid candidate." While currently safe because filterAddCandidates guarantees non-nil storeStates, promoteNonVoterToVoter has this guard (line 489) and this path should too for defensive consistency.
Suggested change:
- targetSS := re.stores[bestStoreID]
+ if bestStoreID == 0 {
+ 	log.KvDistribution.Warningf(ctx,
+ 		"skipping AddVoter repair for r%d: no valid target after diversity scoring", rangeID)
+ 	return
+ }
+ targetSS := re.stores[bestStoreID]
// Find the existing replica state for the non-voter being promoted.
prevState, found := replicaStateForStore(rs, bestStoreID)
if !found {
/review-crdb(suggestion): This "replica not found" case indicates an internal invariant violation — bestStoreID was just returned by candidatesToConvertFromNonVoterToVoter() from the same rs.replicas data within the same single-threaded repair() call. Consider making this an assertion failure (matching the pattern in updateRepairAction line 256):
err := errors.AssertionFailedf(
"non-voter on s%d not found in replicas for r%d", bestStoreID, rangeID)
if buildutil.CrdbTestBuild {
panic(err)
}
log.KvDistribution.Warningf(ctx,
"skipping AddVoter repair for r%d: %v", rangeID, err)
return
Done. Made it an AssertionFailedf with panic in test builds, warning in production.
case AddVoter:
	re.repairAddVoter(ctx, localStoreID, rangeID, rs)
default:
	log.KvDistribution.Infof(ctx,
/review-crdb(suggestion): With IncludeRepair running on every ComputeChanges pass, this Info-level log fires for every range needing an unimplemented action (FinalizeAtomicReplicationChange, RemoveLearner, ReplaceDeadVoter, etc.) on every pass. Consider VEventf to reduce noise:
Suggested change:
- log.KvDistribution.Infof(ctx,
+ log.KvDistribution.VEventf(ctx, 2,
}
re.enactRepair(ctx, localStoreID, rangeChange)
log.KvDistribution.Infof(ctx,
	"result(success): AddVoter repair for r%v, adding voter on s%v",
/review-crdb(nit): Uses r%v and s%v while promoteNonVoterToVoter (line 523) uses r%d and s%d for the same types. Minor inconsistency.
Suggested change:
- "result(success): AddVoter repair for r%v, adding voter on s%v",
+ "result(success): AddVoter repair for r%d, adding voter on s%d",
Addressed all review comments. Here's a summary of responses:
- Commit message update — will discuss interactively (see below).
- Repair loop iteration (line 309) — Done.
- sync.Pool + randomized iteration (line 314) — Done.
- Comment: "such as during non-voter promotions" (line 322) — Done.
- Local const for the 0 (line 441) — Done.
- Warning → VEventf(1) for no valid candidates (line 443) — Done. Agreed this can happen routinely.
- preCheck warning (line 465) — Kept as Warningf.
- Success log → VEventf(1) (lines 470, 522) — Done for both.
- "not found" in promoteNonVoter (line 498) — Made it an AssertionFailedf (panic in test builds, warning in production).
tbg
left a comment
@tbg reviewed 35 files and all commit messages, made 3 comments, and resolved 16 discussions.
Reviewable status: complete! 0 of 0 LGTMs obtained.
-- commits line 243 at r15:
Oh. Well I like this mmaid change. But I don't like the churn in the testdata files. Can you update the datadriven DSL directive so that it maintains an mmaid counter in the context separately in the test? Basically, make the testdata diff go away. Add a comment explaining why this is currently non-unified.
pkg/kv/kvserver/allocator/mmaprototype/cluster_state_repair.go line 255 at r13 (raw file):
}
// Remove from old bucket.
if oldAction != NoRepairNeeded && oldAction != RepairPending && oldAction != 0 {
The contract about what is or isn't in repairRanges is a little obfuscated. Looks like RepairPending and RepairSkipped are special in that they aren't tracked. But why not, wouldn't that be more consistent and also more helpful for further metrics (not in this PR), where we will want to expose a gauge that has the count by state? I think we should explicitly track all states, except NoRepairNeeded.
pkg/kv/kvserver/allocator/mmaprototype/cluster_state_repair.go line 555 at r13 (raw file):
// Collect and sort range IDs, then shuffle deterministically so that
// iteration is not systematically biased toward any range.
idsPtr := rangeIDSlicePool.Get().(*[]roachpb.RangeID)
can you lift this pool outside the for loop, use defer to return the slices to the pool, and reset the slices at the top of each loop?
Addressed the three items from the latest review:
tbg
left a comment
@tbg reviewed 21 files and all commit messages, made 2 comments, and resolved 3 discussions.
Reviewable status: complete! 0 of 0 LGTMs obtained.
pkg/kv/kvserver/allocator/mmaprototype/cluster_state_repair.go line 569 at r16 (raw file):
	ids[i], ids[j] = ids[j], ids[i]
})
*idsPtr = ids // preserve any grown capacity for next iteration
this pattern seems cleaner:
idsPtr := ...
ids := (*idsPtr)[:0]
defer func() {
// Preserve any grown capacity for next iteration.
ids = ids[:0]
*idsPtr = ids
rangeIDSlicePool.Put(idsPtr)
}()
for ... {
ids = ids[:0]
ids = append(ids, ...)
}

WDYT?
pkg/kv/kvserver/allocator/mmaprototype/cluster_state_repair.go line 581 at r16 (raw file):
case AddVoter:
	re.repairAddVoter(ctx, localStoreID, rangeID, rs)
case RepairSkipped, RepairPending:
can you move this into the len(ranges) == 0 check at the top so we don't needlessly iterate over all the individual ranges in cases where we're going to not act on any?
tbg
left a comment
@tbg reviewed 1 file and all commit messages, and resolved 2 discussions.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on tbg).
Add the repair() method on rebalanceEnv — the main entry point for MMA repair. It iterates repairRanges in priority order, filters to ranges where the local store is the leaseholder, and dispatches to per-action repair functions. No repair actions are implemented yet (the switch default logs "not yet implemented"); AddVoter comes in the next commit.

Wire repair into ComputeChanges via the IncludeRepair field on ChangeOptions. When set, repair() runs before rebalanceStores(), and its pending changes prevent the rebalancer from touching the same ranges.

Add originMMARepair to the ChangeOrigin enum so that repair-originated changes can be tracked through AdjustPendingChangeDisposition. For now repair changes share the rebalance metric counters; dedicated repair metrics come in a follow-up PR.

Add the "repair" DSL command to the test harness. It creates a rebalanceEnv with a deterministic random seed and calls repair().

Informs cockroachdb#164658.

Co-Authored-By: roachdev-claude <roachdev-claude-bot@cockroachlabs.com>
…tion

Add the first concrete repair action (AddVoter) and all helper functions it needs, proving the repair pipeline works end-to-end at the unit/DSL level.

Two code paths for adding a voter:
1. **Promote existing non-voter**: when a non-voter already satisfies voter constraints, promote it via MakeReplicaTypeChange (NON_VOTER → VOTER_FULL). The best candidate is chosen by diversity scoring with reservoir sampling for ties.
2. **Add new voter**: find constraint-satisfying stores via the constraint disjunction, filter to ready stores not already hosting a replica at the node level, then pick the store maximizing diversity.

Helper functions introduced:
- `enactRepair`: records a pending change and appends to `re.changes`
- `filterAddCandidates`: filters to ready stores without existing replicas at node level
- `replicaStateForStore`: finds replica state for a given store
- `pickStoreByDiversity`: diversity-based store selection with reservoir sampling for tie-breaking
- `diversityScorer` type: function abstraction over diversity score computation

DSL test coverage:
- `repair_add_voter`: basic voter addition with diversity selection
- `repair_promote_nonvoter`: promotion path when a non-voter exists

Co-Authored-By: roachdev-claude <roachdev-claude-bot@cockroachlabs.com>
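The two-path decision described in the commit message can be sketched with heavily simplified, hypothetical types (the real repairAddVoter consults constraint analysis, diversity scores, and pending-change pre-checks before enacting anything):

```go
package main

import "fmt"

type StoreID int32

// rangeInfo is a stand-in for the per-range state repairAddVoter inspects.
type rangeInfo struct {
	promotableNonVoters []StoreID // non-voters already satisfying voter constraints
	addCandidates       []StoreID // ready stores with no replica on the same node
}

type voterChange struct {
	target  StoreID
	promote bool // true: NON_VOTER -> VOTER_FULL; false: brand-new voter
}

// addVoter prefers promoting an existing non-voter (no data movement)
// and falls back to adding a new voter; it returns false when no valid
// candidate exists, in which case the repair is skipped.
func addVoter(ri rangeInfo) (voterChange, bool) {
	if len(ri.promotableNonVoters) > 0 {
		// In the real code the best candidate is chosen by diversity
		// scoring with reservoir sampling; here we just take the first.
		return voterChange{target: ri.promotableNonVoters[0], promote: true}, true
	}
	if len(ri.addCandidates) > 0 {
		return voterChange{target: ri.addCandidates[0], promote: false}, true
	}
	return voterChange{}, false
}

func main() {
	c, ok := addVoter(rangeInfo{promotableNonVoters: []StoreID{4}})
	fmt.Println(ok, c.promote, c.target) // true true 4
}
```

The ordering matters: promotion wins even when add candidates exist, because promoting an up-to-date non-voter avoids sending a snapshot.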
Previously both repair() and rebalanceStores() independently incremented mmaid and added a logtag. Now the increment and logtag are set once in ComputeChanges, so both repair and rebalance phases share the same mmaid context within a single allocator pass.

Release note: None

Co-Authored-By: roachdev-claude <roachdev-claude-bot@cockroachlabs.com>
Force-pushed c5d3b47 to cc1aa7e
tbg
left a comment
@tbg reviewed 10 files and all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on tbg).
Mark PRs 1 and 2 as completed with links to cockroachdb#165413 and cockroachdb#165423. Update PR 3 description to include ASIM wiring and reflect prototype discoveries. Fix PR 4/5 helper lists to account for what already shipped.

Co-Authored-By: roachdev-claude <roachdev-claude-bot@cockroachlabs.com>
Second PR in the MMA repair productionization stack (see #164658 for the prototype and `how-to-productionize.md` for the full plan). Builds on the repair foundation from #165413.

This PR adds:

Repair orchestration loop: `repair()` on `rebalanceEnv` iterates `repairRanges` in priority order, filters to ranges where the local store is leaseholder, and dispatches to per-action repair functions. Integrated into `ComputeChanges` via the `IncludeRepair` option on `ChangeOptions`.

`repairAddVoter`: the first concrete repair action. Two code paths: (1) promote an existing non-voter to voter when one satisfies voter constraints; (2) add a new voter on a constraint-satisfying, diversity-maximizing store.

Repair helpers: `pickStoreByDiversity` (diversity-based store selection with reservoir sampling for tie-breaking), `filterAddCandidates` (filters to ready stores not already hosting a replica at the node level), `enactRepair` (records pending change and appends to change list), `isLeaseholderOnStore`, `replicaStateForStore`.

DSL test infrastructure: `repair` command that creates a `rebalanceEnv` and runs `repair()`, with deterministic random seed for reproducible output.

The remaining 11 repair actions (RemoveVoter, ReplaceDeadVoter, etc.) are wired as stubs logging "not yet implemented" — they come in later PRs.

Commits:
1. add repair orchestration loop — infrastructure: `repair()` loop, `IncludeRepair` wiring, `originMMARepair` enum, DSL `repair` command. No actions implemented (all hit the default "not yet implemented" log).
2. implement repairAddVoter with non-voter promotion — first action + all helpers: `enactRepair`, `filterAddCandidates`, `pickStoreByDiversity`, `repairAddVoter`, `promoteNonVoterToVoter`. Two new DSL tests.

Stacked on #165413. Informs #164658.