
allocator/mmaprototype: add repair orchestration and AddVoter#165423

Closed
tbg wants to merge 7 commits into cockroachdb:master from tbg:prod-mma-repair-step2

Conversation


@tbg tbg commented Mar 11, 2026

Second PR in the MMA repair productionization stack (see #164658 for the
prototype and how-to-productionize.md for the full plan). Builds on the
repair foundation from #165413.

This PR adds:

  • Repair orchestration loop: repair() on rebalanceEnv iterates
    repairRanges in priority order, filters to ranges where the local store
    is leaseholder, and dispatches to per-action repair functions. Integrated
    into ComputeChanges via the IncludeRepair option on ChangeOptions.

  • repairAddVoter: the first concrete repair action. Two code paths:
    (1) promote an existing non-voter to voter when one satisfies voter
    constraints; (2) add a new voter on a constraint-satisfying,
    diversity-maximizing store.

  • Repair helpers: pickStoreByDiversity (diversity-based store selection
    with reservoir sampling for tie-breaking), filterAddCandidates (filters
    to ready stores not already hosting a replica at the node level),
    enactRepair (records pending change and appends to change list),
    isLeaseholderOnStore, replicaStateForStore.

  • DSL test infrastructure: repair command that creates a rebalanceEnv
    and runs repair(), with deterministic random seed for reproducible output.

The remaining 11 repair actions (RemoveVoter, ReplaceDeadVoter, etc.) are
wired as stubs logging "not yet implemented" — they come in later PRs.

Commits

  1. add repair orchestration loop — infrastructure: repair() loop,
    IncludeRepair wiring, originMMARepair enum, DSL repair command.
    No actions implemented (all hit default "not yet implemented" log).

  2. implement repairAddVoter with non-voter promotion — first action +
    all helpers: enactRepair, filterAddCandidates, pickStoreByDiversity,
    repairAddVoter, promoteNonVoterToVoter. Two new DSL tests.

Stacked on #165413. Informs #164658.


trunk-io bot commented Mar 11, 2026

Merging to master in this repository is managed by Trunk.

  • To merge this pull request, check the box to the left or comment /trunk merge below.


blathers-crl bot commented Mar 11, 2026

Your pull request contains more than 1000 changes. It is strongly encouraged to split big PRs into smaller chunks.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@cockroach-teamcity

This change is Reviewable

@tbg tbg force-pushed the prod-mma-repair-step2 branch from 523176c to 78c22d3 on March 11, 2026 14:53
@tbg tbg left a comment

@tbg reviewed 18 files and all commit messages, and made 16 comments.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained.


-- commits line 182 at r6:
Can you update the commit message to contrast this with how the legacy allocator works and to highlight any differences?
Let me engage interactively first before pushing the suggested update.


pkg/kv/kvserver/allocator/mmaprototype/cluster_state_repair.go line 306 at r5 (raw file):

) []ExternalRangeChange {
	re.mmaid++
	ctx = logtags.AddTag(ctx, "mmaid", re.mmaid)

Now we have more than one place that does this. Couldn't it be lifted to the caller, so that both repair and rebalance use a ctx that already has the mmaid in it? Do this as a new commit at the end (i.e. one that won't be squashed).


pkg/kv/kvserver/allocator/mmaprototype/cluster_state_repair.go line 309 at r5 (raw file):

	// Iterate repair actions in priority order (lower enum = higher priority).
	for action := FinalizeAtomicReplicationChange; action < NoRepairNeeded; action++ {
  • can you start at 1 (so that we don't think there's anything special about FinalizeAtomicReplicationChange)
  • mention that 0 is not a valid action by design, in a comment (so nobody wonders if there's an off by one here)
  • make a const numActions and make sure it doesn't rot. One way to do this is a unit test that validates that Action(numActions) still stringifies properly but numActions+1 doesn't. You might be able to think of a better way too.

pkg/kv/kvserver/allocator/mmaprototype/cluster_state_repair.go line 314 at r5 (raw file):

			continue
		}
		// Sort range IDs for deterministic iteration order.

See #165284 - this isn't merged but add a sync.Pool for this following that pattern regardless, so that we don't allocate slices in the common case.

I'm also worried by this determinism. We don't attempt to be entirely random, but we also shouldn't be entirely deterministic. Keep the sorting, but then make the range ID iteration order deterministically random (using the rng on rebalanceEnv).


pkg/kv/kvserver/allocator/mmaprototype/cluster_state_repair.go line 298 at r6 (raw file):

// replicaStateForStore returns the ReplicaState of the replica on the given
// store, and whether it was found.
func replicaStateForStore(rs *rangeState, storeID roachpb.StoreID) (ReplicaState, bool) {

can you make this a method on rangeState (and pick a good location)?


pkg/kv/kvserver/allocator/mmaprototype/cluster_state_repair.go line 318 at r6 (raw file):

}

// filterAddCandidates filters candidateStores down to stores that are ready

For performance reasons, I would like the returned storeSet to be backed by the same memory as the incoming candidateStores set. This should be part of the commented contract and the caller needs to be careful when using pooled memory. See #165284 for examples of this pattern.


pkg/kv/kvserver/allocator/mmaprototype/cluster_state_repair.go line 322 at r6 (raw file):

// range at the node level. excludeStoreID, if non-zero, is excluded from the
// existing-replica set (used when a replica on that store is being
// concurrently removed as part of the same change).

add: , such as during non-voter promotions


pkg/kv/kvserver/allocator/mmaprototype/cluster_state_repair.go line 374 at r6 (raw file):

// for diversity scoring: voterLocalityTiers for voter operations,
// replicaLocalityTiers for non-voter operations.
func (re *rebalanceEnv) pickStoreByDiversity(

Does rebalancing (not repair) have an equivalent of this or does it not care about diversity yet?
This is a discussion item.


pkg/kv/kvserver/allocator/mmaprototype/cluster_state_repair.go line 404 at r6 (raw file):

// repairAddVoter attempts to add a voter to an under-replicated range.
// It follows the decision tree from constraint.go: first try to promote a

the reference to constraint.go isn't going to age well, remove it.


pkg/kv/kvserver/allocator/mmaprototype/cluster_state_repair.go line 441 at r6 (raw file):

	re.constraintMatcher.constrainStoresForExpr(constrDisj, &candidateStores)

	validCandidates := re.filterAddCandidates(ctx, rs, candidateStores, 0)

use a local const for the 0.


pkg/kv/kvserver/allocator/mmaprototype/cluster_state_repair.go line 443 at r6 (raw file):

	validCandidates := re.filterAddCandidates(ctx, rs, candidateStores, 0)
	if len(validCandidates) == 0 {
		log.KvDistribution.Warningf(ctx,

Is this really a warning? I'd see this as a verbosity 1 Infof. This can happen routinely, and to many ranges, during outages or when constraints are misconfigured. Yes, worth changing, but hardly worth logging at high density or Warning.


pkg/kv/kvserver/allocator/mmaprototype/cluster_state_repair.go line 465 at r6 (raw file):

	rangeChange := MakePendingRangeChange(rangeID, []ReplicaChange{addChange})
	if err := re.preCheckOnApplyReplicaChanges(rangeChange); err != nil {
		log.KvDistribution.Warningf(ctx,

we don't expect to hit this, right? Then Warningf is ok but please check that this can't be routinely hit (and possibly for many ranges in one go).


pkg/kv/kvserver/allocator/mmaprototype/cluster_state_repair.go line 470 at r6 (raw file):

	}
	re.enactRepair(ctx, localStoreID, rangeChange)
	log.KvDistribution.Infof(ctx,

we don't want to log at Info for routine stuff except in the aggregate. Put this behind verbosity 1.


pkg/kv/kvserver/allocator/mmaprototype/cluster_state_repair.go line 490 at r6 (raw file):

		(*existingReplicaLocalities).getScoreChangeForNewReplica)
	if bestStoreID == 0 {
		log.KvDistribution.Warningf(ctx,

ditto


pkg/kv/kvserver/allocator/mmaprototype/cluster_state_repair.go line 498 at r6 (raw file):

	prevState, found := replicaStateForStore(rs, bestStoreID)
	if !found {
		log.KvDistribution.Warningf(ctx,

This can remain a warning, right? Since we went down this path and so there really ought to be a non-voter we could promote and in particular its store should be known?


pkg/kv/kvserver/allocator/mmaprototype/cluster_state_repair.go line 522 at r6 (raw file):

	}
	re.enactRepair(ctx, localStoreID, rangeChange)
	log.KvDistribution.Infof(ctx,

ditto

tbg and others added 4 commits March 11, 2026 16:36
Move 7 constraint analysis methods from `constraint_unused_test.go` to
`constraint.go`:

- candidatesToConvertFromNonVoterToVoter
- constraintsForAddingVoter
- candidatesToConvertFromVoterToNonVoter
- constraintsForAddingNonVoter
- candidatesForRoleSwapForConstraints
- candidatesVoterConstraintsUnsatisfied
- candidatesNonVoterConstraintsUnsatisfied

Pure mechanical move with improved doc comments from the prototype. These
methods are prerequisites for the per-action repair functions in later PRs
(AddVoter, RemoveVoter, constraint swaps).

Informs cockroachdb#164658.

Co-Authored-By: roachdev-claude <roachdev-claude-bot@cockroachlabs.com>
Add the RepairAction enum and computeRepairAction() decision tree. These
establish the action space and priority ordering for MMA repair.

RepairAction has 15 values (12 actionable + 3 terminal states), ordered by
priority via iota. computeRepairAction() maps range state to the
highest-priority repair action needed, using a straightforward if/else
cascade examining joint configs, quorum, replica counts, and constraint
satisfaction.

No callers yet — the wiring to clusterState comes in the next commit.

Comparison with legacy Allocator.ComputeAction (allocatorimpl/allocator.go):

The legacy allocator has two separate orderings that sometimes disagree:

1. The Priority() ordering (used to rank ranges in the replicate queue):
   FinalizeAtomicReplicationChange  12002
   RemoveLearner                    12001
   ReplaceDeadVoter                 12000
   AddVoter                         10000
   ReplaceDecommissioningVoter       5000
   RemoveDeadVoter                   1000
   RemoveDecommissioningVoter         900
   RemoveVoter                        800
   ReplaceDeadNonVoter                700
   AddNonVoter                        600
   ReplaceDecommissioningNonVoter     500
   RemoveDeadNonVoter                 400
   RemoveDecommissioningNonVoter      300
   RemoveNonVoter                     200

2. The computeAction() if/else cascade (used to pick which action to take
   for a single range):
   AddVoter                         ← checked before quorum!
   [quorum check → RangeUnavailable]
   ReplaceDeadVoter
   ReplaceDecommissioningVoter
   RemoveDeadVoter                  ← separate from ReplaceDeadVoter
   RemoveDecommissioningVoter       ← separate from ReplaceDecomVoter
   RemoveVoter
   AddNonVoter
   ReplaceDeadNonVoter
   ReplaceDecommissioningNonVoter
   RemoveDeadNonVoter               ← separate from ReplaceDeadNonVoter
   RemoveDecommissioningNonVoter    ← separate from ReplaceDecomNonVoter
   RemoveNonVoter

MMA's RepairAction unifies both orderings into a single iota sequence:
   FinalizeAtomicReplicationChange   (1)
   RemoveLearner                     (2)
   AddVoter                          (3)
   ReplaceDeadVoter                  (4)
   ReplaceDecommissioningVoter       (5)
   RemoveVoter                       (6)
   AddNonVoter                       (7)
   ReplaceDeadNonVoter               (8)
   ReplaceDecommissioningNonVoter    (9)
   RemoveNonVoter                   (10)
   SwapVoterForConstraints          (11)  ← new, legacy has no equivalent
   SwapNonVoterForConstraints       (12)  ← new, legacy has no equivalent
   RepairSkipped                    (13)
   RepairPending                    (14)
   NoRepairNeeded                   (15)

Key differences from legacy:

- Quorum check gates all actions: In the legacy code, AddVoter is checked
  before the quorum gate, meaning it can be attempted even without quorum
  (with a TODO noting this). MMA checks quorum first (step 4) and skips
  repair entirely if quorum is lost, since all replication changes require
  raft consensus.

- No separate Remove{Dead,Decommissioning}{Voter,NonVoter}: The legacy
  code distinguishes "replace dead voter" (count matches, add-then-remove)
  from "remove dead voter" (over-replicated, just remove). MMA collapses
  these — RemoveVoter handles all over-replication cases, with candidate
  selection preferring dead > decommissioning > healthy replicas.

- Constraint swaps are new: Legacy doesn't have repair actions for
  constraint violations — those are handled as rebalancing. MMA treats
  them as repair because a range with correct counts but wrong placement
  is not fully conformant.

Informs cockroachdb#164658.
Wire the repair action computation into clusterState so that each range's
repair action is eagerly tracked and indexed.

Structural changes:
- Add `repairAction RepairAction` field to `rangeState`
- Add `repairRanges map[RepairAction]map[RangeID]struct{}` to `clusterState`
- Add `updateRepairAction()` and `removeFromRepairRanges()` to maintain the
  index

Trigger points (where updateRepairAction is called):
1. End of processRangeMsg (replicas/config may have changed)
2. pendingChangeEnacted when all pending changes complete
3. End of undoPendingChange
4. End of addPendingRangeChange (sets RepairPending)
5. updateStoreStatuses when health/disposition changes (recomputes for
   all ranges on the affected store)

Range GC calls removeFromRepairRanges before deleting the range.

Test infrastructure:
- `repair-needed` DSL command: iterates repairRanges by priority, prints
  action-to-ranges mapping; scans separately for RepairPending
- `repair` DSL command: stub (pending changes only, no execution yet)
- Parser: nextReplicaID auto-assignment, quiet=true on set-store, relaxed
  field count for replica lines, repair recomputation on update-store-status

6 new testdata files exercise the repair tracking across priority ordering,
config changes, constraint changes, multi-range scenarios, pending change
lifecycle, and store status transitions.

Informs cockroachdb#164658.

Co-Authored-By: roachdev-claude <roachdev-claude-bot@cockroachlabs.com>
@tbg tbg force-pushed the prod-mma-repair-step2 branch from 78c22d3 to 301ba8e on March 11, 2026 15:36
@tbg tbg left a comment

Review of commits after "START REVIEWING HERE"

Two commits add repair orchestration and the first concrete repair action (AddVoter with non-voter promotion) to the MMA prototype. The code is well-structured, the commit split is clean, and the core algorithms (reservoir sampling, priority iteration, constraint filtering) are correct.

Verified correct:

  • Concurrency: repair() runs under a.mu.Lock() via ComputeChanges. No concurrent access issues.
  • Map mutation during iteration: repair() snapshots range IDs into ids before iterating, so enactRepair -> addPendingRangeChange -> updateRepairAction mutations are safe.
  • Repair-rebalance interaction: repair pending changes correctly prevent rebalancer from touching the same ranges.
  • Reservoir sampling in pickStoreByDiversity is correct standard k=1 reservoir sampling.

Strengths:

  • Clean two-commit structure: orchestration skeleton first, concrete action second.
  • diversityScorer function type is a clean abstraction using idiomatic Go method value expressions.
  • Non-voter promotion avoids unnecessary data movement.
  • Thorough multi-round DSL tests demonstrating the full repair lifecycle.

Test coverage gaps (not blocking, but worth adding):

  • Leaseholder-only filtering in repair() is untested — both test files always call repair store-id=1 which is the leaseholder. Add a test where the local store is not the leaseholder to verify skipping.
  • No tests for error/edge cases: no valid candidates, constraint analysis failure, pre-check failure.
  • Diversity scoring with clearly unequal candidates not tested (only tied or single-candidate scenarios).

Other notes:

  • The mmaid field comment on rebalanceEnv says "a counter for rebalanceStores calls" but repair() now also increments it. Worth updating.
  • The repair() return value is discarded at allocator_state.go:349 since rebalanceStores() returns the same accumulated re.changes. Consider removing the return or adding a comment.

(made with /review-crdb)

}
}
// Add to new bucket.
if newAction != NoRepairNeeded && newAction != RepairPending {

/review-crdb(suggestion): RepairSkipped ranges (quorum loss, nil config) are added to repairRanges here but should be excluded like RepairPending. When repair() iterates, it hits the default case and logs "repair action RepairSkipped for rN not yet implemented" at Info level — misleading (these are intentionally skipped, not unimplemented) and potentially noisy in stressed clusters.

Suggested change
if newAction != NoRepairNeeded && newAction != RepairPending {
if newAction != NoRepairNeeded && newAction != RepairPending && newAction != RepairSkipped {


Done.

(*existingReplicaLocalities).getScoreChangeForNewReplica)

// Create the pending change.
targetSS := re.stores[bestStoreID]

/review-crdb(suggestion): Missing bestStoreID == 0 guard. pickStoreByDiversity documents returning 0 for "no valid candidate." While currently safe because filterAddCandidates guarantees non-nil storeStates, promoteNonVoterToVoter has this guard (line 489) and this path should too for defensive consistency.

Suggested change
targetSS := re.stores[bestStoreID]
if bestStoreID == 0 {
log.KvDistribution.Warningf(ctx,
"skipping AddVoter repair for r%d: no valid target after diversity scoring", rangeID)
return
}
targetSS := re.stores[bestStoreID]


Done.


// Find the existing replica state for the non-voter being promoted.
prevState, found := replicaStateForStore(rs, bestStoreID)
if !found {

/review-crdb(suggestion): This "replica not found" case indicates an internal invariant violation — bestStoreID was just returned by candidatesToConvertFromNonVoterToVoter() from the same rs.replicas data within the same single-threaded repair() call. Consider making this an assertion failure (matching the pattern in updateRepairAction line 256):

err := errors.AssertionFailedf(
    "non-voter on s%d not found in replicas for r%d", bestStoreID, rangeID)
if buildutil.CrdbTestBuild {
    panic(err)
}
log.KvDistribution.Warningf(ctx,
    "skipping AddVoter repair for r%d: %v", rangeID, err)
return


Done. Made it an AssertionFailedf with panic in test builds, warning in production.

case AddVoter:
re.repairAddVoter(ctx, localStoreID, rangeID, rs)
default:
log.KvDistribution.Infof(ctx,

/review-crdb(suggestion): With IncludeRepair running on every ComputeChanges pass, this Info-level log fires for every range needing an unimplemented action (FinalizeAtomicReplicationChange, RemoveLearner, ReplaceDeadVoter, etc.) on every pass. Consider VEventf to reduce noise:

Suggested change
log.KvDistribution.Infof(ctx,
log.KvDistribution.VEventf(ctx, 2,


Done.

}
re.enactRepair(ctx, localStoreID, rangeChange)
log.KvDistribution.Infof(ctx,
"result(success): AddVoter repair for r%v, adding voter on s%v",

/review-crdb(nit): Uses r%v and s%v while promoteNonVoterToVoter (line 523) uses r%d and s%d for the same types. Minor inconsistency.

Suggested change
"result(success): AddVoter repair for r%v, adding voter on s%v",
"result(success): AddVoter repair for r%d, adding voter on s%d",


Done.


tbg commented Mar 11, 2026

Addressed all review comments. Here's a summary of responses:

Commit message update — will discuss interactively (see below).

mmaid lift to caller (line 306) — Done in a separate non-fixup commit: allocator/mmaprototype: lift mmaid tagging to ComputeChanges. Both repair() and rebalanceStores() now share the same mmaid context set once in ComputeChanges. Also updated the mmaid field comment to mention repair.

Repair loop iteration (line 309) — Done:

  • Start at RepairAction(1) with a comment explaining 0 is invalid by design.
  • Added numRepairActions sentinel const at the end of the iota.
  • Added TestNumRepairActions that validates NoRepairNeeded+1 == numRepairActions and that all actions in [1, numRepairActions) have known String() values.

sync.Pool + randomized iteration (line 314) — Done. Added rangeIDSlicePool with sync.Pool for the []roachpb.RangeID slice. Range IDs are sorted then shuffled using re.rng.Shuffle for deterministic-random iteration.

replicaStateForStore as method (line 298) — Done. Moved to a method on *rangeState in cluster_state.go, next to the other rangeState methods (setReplica, removeReplica, etc).

filterAddCandidates reuse memory (line 318) — Done. Changed to valid := candidateStores[:0] for in-place filtering, matching the retainReadyReplicaTargetStoresOnly pattern. Documented in the method comment that the returned slice reuses the backing memory.

Comment: "such as during non-voter promotions" (line 322) — Done.

pickStoreByDiversity in rebalancing (line 374) — Rebalancing already has diversity scoring via getScoreChangeForRebalance() (line 578 of cluster_state_rebalance_stores.go), used in sortTargetCandidateSetAndPick() as the primary sort key — the code explicitly refuses to reduce diversity ("we are not willing to reduce diversity when rebalancing ranges"). The repair path correctly uses the simpler getScoreChangeForNewReplica since it's purely adding a replica (no "old store" to swap out), while rebalancing uses getScoreChangeForRebalance which computes the net diversity delta of the swap.

Remove constraint.go reference (line 404) — Done.

Local const for 0 (line 441) — Done. Named noExcludedStore.

Warning → VEventf(1) for no valid candidates (line 443) — Done. Agreed this can happen routinely.

preCheck warning (line 465) — Kept as Warningf. This shouldn't be routinely hit: we already checked for no pending changes during computeRepairAction, and the range was just in the repairRanges index. If preCheck fails here it indicates something unexpected changed between action computation and enactment within the same pass — worth a warning.

Success log → VEventf(1) (lines 470, 522) — Done for both addVoterFresh and promoteNonVoterToVoter.

bestStoreID == 0 in promoteNonVoter (line 490) — Changed to VEventf(1). Can happen if all promotion candidates have nil storeState.

"not found" in promoteNonVoter (line 498) — Made it an AssertionFailedf with panic in test builds, warning in production. This is an internal invariant violation: bestStoreID was just returned from candidatesToConvertFromNonVoterToVoter() which reads the same rs.replicas within a single-threaded repair() call.

repair() return value discarded — Added a comment at the call site explaining that repair() appends to re.changes and rebalanceStores() returns the same accumulated slice.

@tbg tbg left a comment

@tbg reviewed 35 files and all commit messages, made 3 comments, and resolved 16 discussions.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained.


-- commits line 243 at r15:
Oh. Well I like this mmaid change. But I don't like the churn in the testdata files. Can you update the datadriven DSL directive so that it maintains an mmaid counter in the context separately in the test? Basically, make the testdata diff go away. Add a comment explaining why this is currently non-unified.


pkg/kv/kvserver/allocator/mmaprototype/cluster_state_repair.go line 255 at r13 (raw file):

	}
	// Remove from old bucket.
	if oldAction != NoRepairNeeded && oldAction != RepairPending && oldAction != 0 {

The contract about what is or isn't in repairRanges is a little obfuscated. Looks like RepairPending and RepairSkipped are special in that they aren't tracked. But why not, wouldn't that be more consistent and also more helpful for further metrics (not in this PR), where we will want to expose a gauge that has the count by state? I think we should explicitly track all states, except NoRepairNeeded.


pkg/kv/kvserver/allocator/mmaprototype/cluster_state_repair.go line 555 at r13 (raw file):

		// Collect and sort range IDs, then shuffle deterministically so that
		// iteration is not systematically biased toward any range.
		idsPtr := rangeIDSlicePool.Get().(*[]roachpb.RangeID)

can you lift this pool outside the for loop, use defer to return the slices to the pool, and reset the slices at the top of each loop?


tbg commented Mar 12, 2026

Addressed the three items from the latest review:

  1. repairRanges now tracks all states except NoRepairNeeded — RepairPending and RepairSkipped are in the index (for future metrics). repair() has explicit cases to skip them.
  2. Lifted rangeIDSlicePool outside the for loop with defer; slice reset at top of each iteration.
  3. Test DSL maintains its own mmaid counter — both repair and rebalance-stores directives increment it and tag the context, so testdata files keep their [mmaid=N] tags. Net zero testdata diff when squashed into the mmaid lift commit. Removed the separate RepairPending scan from repair-needed since it's now in the index.

@tbg tbg left a comment

@tbg reviewed 21 files and all commit messages, made 2 comments, and resolved 3 discussions.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained.


pkg/kv/kvserver/allocator/mmaprototype/cluster_state_repair.go line 569 at r16 (raw file):

			ids[i], ids[j] = ids[j], ids[i]
		})
		*idsPtr = ids // preserve any grown capacity for next iteration

this pattern seems cleaner:

idsPtr := ...
ids := (*idsPtr)[:0]
defer func() {
  // Preserve any grown capacity for next iteration.
  ids = ids[:0]
  *idsPtr = ids
  rangeIDSlicePool.Put(idsPtr)
}()
for ... {
  ids = ids[:0]

  ids = append(ids, ...)
}

WDYT?


pkg/kv/kvserver/allocator/mmaprototype/cluster_state_repair.go line 581 at r16 (raw file):

			case AddVoter:
				re.repairAddVoter(ctx, localStoreID, rangeID, rs)
			case RepairSkipped, RepairPending:

can you move this into the len(ranges) == 0 check at the top so we don't needlessly iterate over all the individual ranges in cases where we're going to not act on any?

@tbg tbg left a comment

@tbg reviewed 1 file and all commit messages, and resolved 2 discussions.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on tbg).

tbg and others added 3 commits March 12, 2026 09:46
Add the repair() method on rebalanceEnv — the main entry point for MMA
repair. It iterates repairRanges in priority order, filters to ranges
where the local store is the leaseholder, and dispatches to per-action
repair functions. No repair actions are implemented yet (the switch
default logs "not yet implemented"); AddVoter comes in the next commit.

Wire repair into ComputeChanges via the IncludeRepair field on
ChangeOptions. When set, repair() runs before rebalanceStores(), and its
pending changes prevent the rebalancer from touching the same ranges.

Add originMMARepair to the ChangeOrigin enum so that repair-originated
changes can be tracked through AdjustPendingChangeDisposition. For now
repair changes share the rebalance metric counters; dedicated repair
metrics come in a follow-up PR.

Add the "repair" DSL command to the test harness. It creates a
rebalanceEnv with a deterministic random seed and calls repair().

Informs cockroachdb#164658.

Co-Authored-By: roachdev-claude <roachdev-claude-bot@cockroachlabs.com>
…tion

Add the first concrete repair action (AddVoter) and all helper functions it
needs, proving the repair pipeline works end-to-end at the unit/DSL level.

Two code paths for adding a voter:

1. **Promote existing non-voter**: when a non-voter already satisfies voter
   constraints, promote it via MakeReplicaTypeChange (NON_VOTER → VOTER_FULL).
   The best candidate is chosen by diversity scoring with reservoir sampling
   for ties.

2. **Add new voter**: find constraint-satisfying stores via the constraint
   disjunction, filter to ready stores not already hosting a replica at the
   node level, then pick the store maximizing diversity.

Helper functions introduced:
- `enactRepair`: records a pending change and appends to `re.changes`
- `filterAddCandidates`: filters to ready stores without existing replicas
  at node level
- `replicaStateForStore`: finds replica state for a given store
- `pickStoreByDiversity`: diversity-based store selection with reservoir
  sampling for tie-breaking
- `diversityScorer` type: function abstraction over diversity score computation

DSL test coverage:
- `repair_add_voter`: basic voter addition with diversity selection
- `repair_promote_nonvoter`: promotion path when a non-voter exists

Co-Authored-By: roachdev-claude <roachdev-claude-bot@cockroachlabs.com>
Previously both repair() and rebalanceStores() independently incremented
mmaid and added a logtag. Now the increment and logtag are set once in
ComputeChanges, so both repair and rebalance phases share the same
mmaid context within a single allocator pass.

Release note: None

Co-Authored-By: roachdev-claude <roachdev-claude-bot@cockroachlabs.com>
@tbg tbg force-pushed the prod-mma-repair-step2 branch from c5d3b47 to cc1aa7e on March 12, 2026 08:46
@tbg tbg left a comment

@tbg reviewed 10 files and all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on tbg).

tbg added a commit to tbg/cockroach that referenced this pull request Mar 12, 2026
Mark PRs 1 and 2 as completed with links to cockroachdb#165413 and cockroachdb#165423.
Update PR 3 description to include ASIM wiring and reflect prototype
discoveries. Fix PR 4/5 helper lists to account for what already shipped.

Co-Authored-By: roachdev-claude <roachdev-claude-bot@cockroachlabs.com>
@tbg tbg closed this Apr 1, 2026