allocator/mmaprototype: add repair orchestration and AddVoter #165423
tbg wants to merge 7 commits into cockroachdb:master
Conversation
Your pull request contains more than 1000 changes. It is strongly encouraged to split big PRs into smaller chunks. 🦉 Hoot! I am Blathers, a bot for CockroachDB. My owner is dev-inf.
Force-pushed 523176c to 78c22d3
tbg
left a comment
@tbg reviewed 18 files and all commit messages, and made 16 comments.
Reviewable status: complete! 0 of 0 LGTMs obtained.
-- commits line 182 at r6:
Can you update the commit message to contrast this with how the legacy allocator works and to highlight any differences?
Let me engage interactively first before pushing the suggested update.
pkg/kv/kvserver/allocator/mmaprototype/cluster_state_repair.go line 306 at r5 (raw file):
) []ExternalRangeChange {
	re.mmaid++
	ctx = logtags.AddTag(ctx, "mmaid", re.mmaid)
now we have more than one place that does this. Couldn't this be lifted to the caller so that we have a ctx that both repair and rebalance can use that already has the mmaid in it? Do this as a new commit in the end (i.e. it won't be squashed).
pkg/kv/kvserver/allocator/mmaprototype/cluster_state_repair.go line 309 at r5 (raw file):
// Iterate repair actions in priority order (lower enum = higher priority).
for action := FinalizeAtomicReplicationChange; action < NoRepairNeeded; action++ {
- can you start at 1 (so that we don't think there's anything special about FinalizeAtomicReplicationChange)
- mention that 0 is not a valid action by design, in a comment (so nobody wonders if there's an off-by-one here)
- make a const numActions and make sure it doesn't rot. One way to do this is a unit test that validates that Action(numActions) still stringifies properly but Action(numActions+1) doesn't. You might be able to think of a better way too.
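The suggested rot-proofing can be sketched as a self-contained example (hypothetical, simplified names: the real enum is much larger and its String method is generated by stringer; only the invariant check matters here):

```go
package main

import (
	"fmt"
	"strings"
)

type RepairAction int

const (
	_ RepairAction = iota // 0 is deliberately not a valid action
	FinalizeAtomicReplicationChange
	RemoveLearner
	AddVoter
	// numActions is the value of the last valid action above.
	numActions = iota - 1
)

// repairActionNames maps each valid action to its name; unknown values
// fall through to the default "RepairAction(N)" format below.
var repairActionNames = [...]string{
	FinalizeAtomicReplicationChange: "FinalizeAtomicReplicationChange",
	RemoveLearner:                   "RemoveLearner",
	AddVoter:                        "AddVoter",
}

func (a RepairAction) String() string {
	if a >= 1 && int(a) < len(repairActionNames) && repairActionNames[a] != "" {
		return repairActionNames[a]
	}
	return fmt.Sprintf("RepairAction(%d)", int(a))
}

// numActionsIsCurrent implements the suggested invariant: the last valid
// action stringifies to a real name, while numActions+1 does not.
func numActionsIsCurrent() bool {
	return !strings.HasPrefix(RepairAction(numActions).String(), "RepairAction(") &&
		strings.HasPrefix(RepairAction(numActions+1).String(), "RepairAction(")
}

func main() {
	fmt.Println(numActionsIsCurrent()) // true while the const is in sync
}
```

If someone appends a new action without bumping numActions, the first condition fails and a unit test wrapping numActionsIsCurrent would catch it.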
pkg/kv/kvserver/allocator/mmaprototype/cluster_state_repair.go line 314 at r5 (raw file):
	continue
}
// Sort range IDs for deterministic iteration order.
See #165284 - this isn't merged but add a sync.Pool for this following that pattern regardless, so that we don't allocate slices in the common case.
I'm also worried by this determinism. We don't attempt to be entirely random, but we should also not be entirely non-random. Keep the sorting, but then make the range ID iteration order deterministically random (using the rng on rebalanceEnv).
pkg/kv/kvserver/allocator/mmaprototype/cluster_state_repair.go line 298 at r6 (raw file):
// replicaStateForStore returns the ReplicaState of the replica on the given
// store, and whether it was found.
func replicaStateForStore(rs *rangeState, storeID roachpb.StoreID) (ReplicaState, bool) {
can you make this a method on rangeState (and pick a good location)?
pkg/kv/kvserver/allocator/mmaprototype/cluster_state_repair.go line 318 at r6 (raw file):
}

// filterAddCandidates filters candidateStores down to stores that are ready
For performance reasons, I would like the returned storeSet to be backed by the same memory as the incoming candidateStores set. This should be part of the commented contract and the caller needs to be careful when using pooled memory. See #165284 for examples of this pattern.
pkg/kv/kvserver/allocator/mmaprototype/cluster_state_repair.go line 322 at r6 (raw file):
// range at the node level. excludeStoreID, if non-zero, is excluded from the
// existing-replica set (used when a replica on that store is being
// concurrently removed as part of the same change).
add ", such as during non-voter promotions"
pkg/kv/kvserver/allocator/mmaprototype/cluster_state_repair.go line 374 at r6 (raw file):
// for diversity scoring: voterLocalityTiers for voter operations,
// replicaLocalityTiers for non-voter operations.
func (re *rebalanceEnv) pickStoreByDiversity(
Does rebalancing (not repair) have an equivalent of this or does it not care about diversity yet?
This is a discussion item.
pkg/kv/kvserver/allocator/mmaprototype/cluster_state_repair.go line 404 at r6 (raw file):
// repairAddVoter attempts to add a voter to an under-replicated range.
// It follows the decision tree from constraint.go: first try to promote a
the reference to constraint.go isn't going to age well, remove it.
pkg/kv/kvserver/allocator/mmaprototype/cluster_state_repair.go line 441 at r6 (raw file):
re.constraintMatcher.constrainStoresForExpr(constrDisj, &candidateStores)
validCandidates := re.filterAddCandidates(ctx, rs, candidateStores, 0)
use a local const for the 0.
pkg/kv/kvserver/allocator/mmaprototype/cluster_state_repair.go line 443 at r6 (raw file):
validCandidates := re.filterAddCandidates(ctx, rs, candidateStores, 0)
if len(validCandidates) == 0 {
	log.KvDistribution.Warningf(ctx,
Is this really a warning? I'd see this as a verbosity 1 Infof. This can happen routinely, and to many ranges, during outages or when constraints are misconfigured. Yes, worth changing, but hardly worth logging at high density or Warning.
pkg/kv/kvserver/allocator/mmaprototype/cluster_state_repair.go line 465 at r6 (raw file):
rangeChange := MakePendingRangeChange(rangeID, []ReplicaChange{addChange})
if err := re.preCheckOnApplyReplicaChanges(rangeChange); err != nil {
	log.KvDistribution.Warningf(ctx,
we don't expect to hit this, right? Then Warningf is ok but please check that this can't be routinely hit (and possibly for many ranges in one go).
pkg/kv/kvserver/allocator/mmaprototype/cluster_state_repair.go line 470 at r6 (raw file):
}
re.enactRepair(ctx, localStoreID, rangeChange)
log.KvDistribution.Infof(ctx,
we don't want to log at Info for routine stuff except in the aggregate. Put this behind verbosity 1.
pkg/kv/kvserver/allocator/mmaprototype/cluster_state_repair.go line 490 at r6 (raw file):
(*existingReplicaLocalities).getScoreChangeForNewReplica)
if bestStoreID == 0 {
	log.KvDistribution.Warningf(ctx,
ditto
pkg/kv/kvserver/allocator/mmaprototype/cluster_state_repair.go line 498 at r6 (raw file):
prevState, found := replicaStateForStore(rs, bestStoreID)
if !found {
	log.KvDistribution.Warningf(ctx,
This can remain a warning, right? Since we went down this path and so there really ought to be a non-voter we could promote and in particular its store should be known?
pkg/kv/kvserver/allocator/mmaprototype/cluster_state_repair.go line 522 at r6 (raw file):
}
re.enactRepair(ctx, localStoreID, rangeChange)
log.KvDistribution.Infof(ctx,
ditto
Move 7 constraint analysis methods from `constraint_unused_test.go` to `constraint.go`:
- candidatesToConvertFromNonVoterToVoter
- constraintsForAddingVoter
- candidatesToConvertFromVoterToNonVoter
- constraintsForAddingNonVoter
- candidatesForRoleSwapForConstraints
- candidatesVoterConstraintsUnsatisfied
- candidatesNonVoterConstraintsUnsatisfied

Pure mechanical move with improved doc comments from the prototype. These methods are prerequisites for the per-action repair functions in later PRs (AddVoter, RemoveVoter, constraint swaps).

Informs cockroachdb#164658.

Co-Authored-By: roachdev-claude <roachdev-claude-bot@cockroachlabs.com>
Add the RepairAction enum and computeRepairAction() decision tree. These
establish the action space and priority ordering for MMA repair.
RepairAction has 15 values (12 actionable + 3 terminal states), ordered by
priority via iota. computeRepairAction() maps range state to the
highest-priority repair action needed, using a straightforward if/else
cascade examining joint configs, quorum, replica counts, and constraint
satisfaction.
No callers yet — the wiring to clusterState comes in the next commit.
Comparison with legacy Allocator.ComputeAction (allocatorimpl/allocator.go):
The legacy allocator has two separate orderings that sometimes disagree:
1. The Priority() ordering (used to rank ranges in the replicate queue):
FinalizeAtomicReplicationChange 12002
RemoveLearner 12001
ReplaceDeadVoter 12000
AddVoter 10000
ReplaceDecommissioningVoter 5000
RemoveDeadVoter 1000
RemoveDecommissioningVoter 900
RemoveVoter 800
ReplaceDeadNonVoter 700
AddNonVoter 600
ReplaceDecommissioningNonVoter 500
RemoveDeadNonVoter 400
RemoveDecommissioningNonVoter 300
RemoveNonVoter 200
2. The computeAction() if/else cascade (used to pick which action to take
for a single range):
AddVoter ← checked before quorum!
[quorum check → RangeUnavailable]
ReplaceDeadVoter
ReplaceDecommissioningVoter
RemoveDeadVoter ← separate from ReplaceDeadVoter
RemoveDecommissioningVoter ← separate from ReplaceDecomVoter
RemoveVoter
AddNonVoter
ReplaceDeadNonVoter
ReplaceDecommissioningNonVoter
RemoveDeadNonVoter ← separate from ReplaceDeadNonVoter
RemoveDecommissioningNonVoter ← separate from ReplaceDecomNonVoter
RemoveNonVoter
MMA's RepairAction unifies both orderings into a single iota sequence:
FinalizeAtomicReplicationChange (1)
RemoveLearner (2)
AddVoter (3)
ReplaceDeadVoter (4)
ReplaceDecommissioningVoter (5)
RemoveVoter (6)
AddNonVoter (7)
ReplaceDeadNonVoter (8)
ReplaceDecommissioningNonVoter (9)
RemoveNonVoter (10)
SwapVoterForConstraints (11) ← new, legacy has no equivalent
SwapNonVoterForConstraints (12) ← new, legacy has no equivalent
RepairSkipped (13)
RepairPending (14)
NoRepairNeeded (15)
Key differences from legacy:
- Quorum check gates all actions: In the legacy code, AddVoter is checked
before the quorum gate, meaning it can be attempted even without quorum
(with a TODO noting this). MMA checks quorum first (step 4) and skips
repair entirely if quorum is lost, since all replication changes require
raft consensus.
- No separate Remove{Dead,Decommissioning}{Voter,NonVoter}: The legacy
code distinguishes "replace dead voter" (count matches, add-then-remove)
from "remove dead voter" (over-replicated, just remove). MMA collapses
these — RemoveVoter handles all over-replication cases, with candidate
selection preferring dead > decommissioning > healthy replicas.
- Constraint swaps are new: Legacy doesn't have repair actions for
constraint violations — those are handled as rebalancing. MMA treats
them as repair because a range with correct counts but wrong placement
is not fully conformant.
Informs cockroachdb#164658.
Wire the repair action computation into clusterState so that each range's
repair action is eagerly tracked and indexed.
Structural changes:
- Add `repairAction RepairAction` field to `rangeState`
- Add `repairRanges map[RepairAction]map[RangeID]struct{}` to `clusterState`
- Add `updateRepairAction()` and `removeFromRepairRanges()` to maintain the
index
Trigger points (where updateRepairAction is called):
1. End of processRangeMsg (replicas/config may have changed)
2. pendingChangeEnacted when all pending changes complete
3. End of undoPendingChange
4. End of addPendingRangeChange (sets RepairPending)
5. updateStoreStatuses when health/disposition changes (recomputes for
all ranges on the affected store)
Range GC calls removeFromRepairRanges before deleting the range.
Test infrastructure:
- `repair-needed` DSL command: iterates repairRanges by priority, prints
action-to-ranges mapping; scans separately for RepairPending
- `repair` DSL command: stub (pending changes only, no execution yet)
- Parser: nextReplicaID auto-assignment, quiet=true on set-store, relaxed
field count for replica lines, repair recomputation on update-store-status
6 new testdata files exercise the repair tracking across priority ordering,
config changes, constraint changes, multi-range scenarios, pending change
lifecycle, and store status transitions.
Informs cockroachdb#164658.
Co-Authored-By: roachdev-claude <roachdev-claude-bot@cockroachlabs.com>
Force-pushed 78c22d3 to 301ba8e
tbg
left a comment
Review of commits after "START REVIEWING HERE"
Two commits add repair orchestration and the first concrete repair action (AddVoter with non-voter promotion) to the MMA prototype. The code is well-structured, the commit split is clean, and the core algorithms (reservoir sampling, priority iteration, constraint filtering) are correct.
Verified correct:
- Concurrency: repair() runs under a.mu.Lock() via ComputeChanges. No concurrent access issues.
- Map mutation during iteration: repair() snapshots range IDs into ids before iterating, so enactRepair -> addPendingRangeChange -> updateRepairAction mutations are safe.
- Repair-rebalance interaction: repair pending changes correctly prevent the rebalancer from touching the same ranges.
- Reservoir sampling in pickStoreByDiversity is correct standard k=1 reservoir sampling.

Strengths:
- Clean two-commit structure: orchestration skeleton first, concrete action second.
- The diversityScorer function type is a clean abstraction using idiomatic Go method value expressions.
- Non-voter promotion avoids unnecessary data movement.
- Thorough multi-round DSL tests demonstrating the full repair lifecycle.

Test coverage gaps (not blocking, but worth adding):
- Leaseholder-only filtering in repair() is untested — both test files always call `repair store-id=1`, which is the leaseholder. Add a test where the local store is not the leaseholder to verify skipping.
- No tests for error/edge cases: no valid candidates, constraint analysis failure, pre-check failure.
- Diversity scoring with clearly unequal candidates is not tested (only tied or single-candidate scenarios).

Other notes:
- The mmaid field comment on rebalanceEnv says "a counter for rebalanceStores calls" but repair() now also increments it. Worth updating.
- The repair() return value is discarded at allocator_state.go:349 since rebalanceStores() returns the same accumulated re.changes. Consider removing the return or adding a comment.
(made with /review-crdb)
	}
}
// Add to new bucket.
if newAction != NoRepairNeeded && newAction != RepairPending {
/review-crdb(suggestion): RepairSkipped ranges (quorum loss, nil config) are added to repairRanges here but should be excluded like RepairPending. When repair() iterates, it hits the default case and logs "repair action RepairSkipped for rN not yet implemented" at Info level — misleading (these are intentionally skipped, not unimplemented) and potentially noisy in stressed clusters.
Suggested change:
- if newAction != NoRepairNeeded && newAction != RepairPending {
+ if newAction != NoRepairNeeded && newAction != RepairPending && newAction != RepairSkipped {
(*existingReplicaLocalities).getScoreChangeForNewReplica)

// Create the pending change.
targetSS := re.stores[bestStoreID]
/review-crdb(suggestion): Missing bestStoreID == 0 guard. pickStoreByDiversity documents returning 0 for "no valid candidate." While currently safe because filterAddCandidates guarantees non-nil storeStates, promoteNonVoterToVoter has this guard (line 489) and this path should too for defensive consistency.
Suggested change:
- targetSS := re.stores[bestStoreID]
+ if bestStoreID == 0 {
+ 	log.KvDistribution.Warningf(ctx,
+ 		"skipping AddVoter repair for r%d: no valid target after diversity scoring", rangeID)
+ 	return
+ }
+ targetSS := re.stores[bestStoreID]
// Find the existing replica state for the non-voter being promoted.
prevState, found := replicaStateForStore(rs, bestStoreID)
if !found {
/review-crdb(suggestion): This "replica not found" case indicates an internal invariant violation — bestStoreID was just returned by candidatesToConvertFromNonVoterToVoter() from the same rs.replicas data within the same single-threaded repair() call. Consider making this an assertion failure (matching the pattern in updateRepairAction line 256):
err := errors.AssertionFailedf(
"non-voter on s%d not found in replicas for r%d", bestStoreID, rangeID)
if buildutil.CrdbTestBuild {
panic(err)
}
log.KvDistribution.Warningf(ctx,
"skipping AddVoter repair for r%d: %v", rangeID, err)
return
Done. Made it an AssertionFailedf with panic in test builds, warning in production.
case AddVoter:
	re.repairAddVoter(ctx, localStoreID, rangeID, rs)
default:
	log.KvDistribution.Infof(ctx,
/review-crdb(suggestion): With IncludeRepair running on every ComputeChanges pass, this Info-level log fires for every range needing an unimplemented action (FinalizeAtomicReplicationChange, RemoveLearner, ReplaceDeadVoter, etc.) on every pass. Consider VEventf to reduce noise:
Suggested change:
- log.KvDistribution.Infof(ctx,
+ log.KvDistribution.VEventf(ctx, 2,
}
re.enactRepair(ctx, localStoreID, rangeChange)
log.KvDistribution.Infof(ctx,
	"result(success): AddVoter repair for r%v, adding voter on s%v",
/review-crdb(nit): Uses r%v and s%v while promoteNonVoterToVoter (line 523) uses r%d and s%d for the same types. Minor inconsistency.
Suggested change:
- "result(success): AddVoter repair for r%v, adding voter on s%v",
+ "result(success): AddVoter repair for r%d, adding voter on s%d",
Addressed all review comments. Here's a summary of responses:
- Commit message update — will discuss interactively (see below).
- Repair loop iteration (line 309) — Done.
- sync.Pool + randomized iteration (line 314) — Done.
- Comment: "such as during non-voter promotions" (line 322) — Done.
- Local const for the 0 (line 441) — Done.
- Warning → VEventf(1) for no valid candidates (line 443) — Done. Agreed this can happen routinely.
- preCheck warning (line 465) — Kept as Warningf.
- Success log → VEventf(1) (lines 470, 522) — Done for both.
- "not found" in promoteNonVoter (line 498) — Made it an AssertionFailedf (panic in test builds, warning in production).
tbg
left a comment
@tbg reviewed 35 files and all commit messages, made 3 comments, and resolved 16 discussions.
Reviewable status: complete! 0 of 0 LGTMs obtained.
-- commits line 243 at r15:
Oh. Well I like this mmaid change. But I don't like the churn in the testdata files. Can you update the datadriven DSL directive so that it maintains an mmaid counter in the context separately in the test? Basically, make the testdata diff go away. Add a comment explaining why this is currently non-unified.
pkg/kv/kvserver/allocator/mmaprototype/cluster_state_repair.go line 255 at r13 (raw file):
}
// Remove from old bucket.
if oldAction != NoRepairNeeded && oldAction != RepairPending && oldAction != 0 {
The contract about what is or isn't in repairRanges is a little obfuscated. Looks like RepairPending and RepairSkipped are special in that they aren't tracked. But why not, wouldn't that be more consistent and also more helpful for further metrics (not in this PR), where we will want to expose a gauge that has the count by state? I think we should explicitly track all states, except NoRepairNeeded.
pkg/kv/kvserver/allocator/mmaprototype/cluster_state_repair.go line 555 at r13 (raw file):
// Collect and sort range IDs, then shuffle deterministically so that
// iteration is not systematically biased toward any range.
idsPtr := rangeIDSlicePool.Get().(*[]roachpb.RangeID)
can you lift this pool outside the for loop, use defer to return the slices to the pool, and reset the slices at the top of each loop?
Addressed the three items from the latest review:
tbg
left a comment
@tbg reviewed 21 files and all commit messages, made 2 comments, and resolved 3 discussions.
Reviewable status: complete! 0 of 0 LGTMs obtained.
pkg/kv/kvserver/allocator/mmaprototype/cluster_state_repair.go line 569 at r16 (raw file):
	ids[i], ids[j] = ids[j], ids[i]
})
*idsPtr = ids // preserve any grown capacity for next iteration
this pattern seems cleaner:
idsPtr := ...
ids := (*idsPtr)[:0]
defer func() {
// Preserve any grown capacity for next iteration.
ids = ids[:0]
*idsPtr = ids
rangeIDSlicePool.Put(idsPtr)
}()
for ... {
ids = ids[:0]
ids = append(ids, ...)
}

WDYT?
pkg/kv/kvserver/allocator/mmaprototype/cluster_state_repair.go line 581 at r16 (raw file):
case AddVoter:
	re.repairAddVoter(ctx, localStoreID, rangeID, rs)
case RepairSkipped, RepairPending:
can you move this into the len(ranges) == 0 check at the top so we don't needlessly iterate over all the individual ranges in cases where we're going to not act on any?
tbg
left a comment
@tbg reviewed 1 file and all commit messages, and resolved 2 discussions.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on tbg).
Add the repair() method on rebalanceEnv — the main entry point for MMA repair. It iterates repairRanges in priority order, filters to ranges where the local store is the leaseholder, and dispatches to per-action repair functions. No repair actions are implemented yet (the switch default logs "not yet implemented"); AddVoter comes in the next commit.

Wire repair into ComputeChanges via the IncludeRepair field on ChangeOptions. When set, repair() runs before rebalanceStores(), and its pending changes prevent the rebalancer from touching the same ranges.

Add originMMARepair to the ChangeOrigin enum so that repair-originated changes can be tracked through AdjustPendingChangeDisposition. For now repair changes share the rebalance metric counters; dedicated repair metrics come in a follow-up PR.

Add the "repair" DSL command to the test harness. It creates a rebalanceEnv with a deterministic random seed and calls repair().

Informs cockroachdb#164658.

Co-Authored-By: roachdev-claude <roachdev-claude-bot@cockroachlabs.com>
…tion

Add the first concrete repair action (AddVoter) and all helper functions it needs, proving the repair pipeline works end-to-end at the unit/DSL level.

Two code paths for adding a voter:
1. **Promote existing non-voter**: when a non-voter already satisfies voter constraints, promote it via MakeReplicaTypeChange (NON_VOTER → VOTER_FULL). The best candidate is chosen by diversity scoring with reservoir sampling for ties.
2. **Add new voter**: find constraint-satisfying stores via the constraint disjunction, filter to ready stores not already hosting a replica at the node level, then pick the store maximizing diversity.

Helper functions introduced:
- `enactRepair`: records a pending change and appends to `re.changes`
- `filterAddCandidates`: filters to ready stores without existing replicas at node level
- `replicaStateForStore`: finds replica state for a given store
- `pickStoreByDiversity`: diversity-based store selection with reservoir sampling for tie-breaking
- `diversityScorer` type: function abstraction over diversity score computation

DSL test coverage:
- `repair_add_voter`: basic voter addition with diversity selection
- `repair_promote_nonvoter`: promotion path when a non-voter exists

Co-Authored-By: roachdev-claude <roachdev-claude-bot@cockroachlabs.com>
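The two-path decision described in the commit message can be sketched with heavily simplified, hypothetical types (the real repairAddVoter consults constraint analysis, diversity scores, and pending-change pre-checks before enacting anything):

```go
package main

import "fmt"

type StoreID int32

// rangeInfo is a stand-in for the per-range state repairAddVoter inspects.
type rangeInfo struct {
	promotableNonVoters []StoreID // non-voters already satisfying voter constraints
	addCandidates       []StoreID // ready stores with no replica on the same node
}

type voterChange struct {
	target  StoreID
	promote bool // true: NON_VOTER -> VOTER_FULL; false: brand-new voter
}

// addVoter prefers promoting an existing non-voter (no data movement)
// and falls back to adding a new voter; it returns false when no valid
// candidate exists, in which case the repair is skipped.
func addVoter(ri rangeInfo) (voterChange, bool) {
	if len(ri.promotableNonVoters) > 0 {
		// In the real code the best candidate is chosen by diversity
		// scoring with reservoir sampling; here we just take the first.
		return voterChange{target: ri.promotableNonVoters[0], promote: true}, true
	}
	if len(ri.addCandidates) > 0 {
		return voterChange{target: ri.addCandidates[0], promote: false}, true
	}
	return voterChange{}, false
}

func main() {
	c, ok := addVoter(rangeInfo{promotableNonVoters: []StoreID{4}})
	fmt.Println(ok, c.promote, c.target) // true true 4
}
```

The ordering matters: promotion wins even when add candidates exist, because promoting an up-to-date non-voter avoids sending a snapshot.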
Previously both repair() and rebalanceStores() independently incremented mmaid and added a logtag. Now the increment and logtag are set once in ComputeChanges, so both repair and rebalance phases share the same mmaid context within a single allocator pass.

Release note: None

Co-Authored-By: roachdev-claude <roachdev-claude-bot@cockroachlabs.com>
Force-pushed c5d3b47 to cc1aa7e
tbg
left a comment
@tbg reviewed 10 files and all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on tbg).
Mark PRs 1 and 2 as completed with links to cockroachdb#165413 and cockroachdb#165423. Update PR 3 description to include ASIM wiring and reflect prototype discoveries. Fix PR 4/5 helper lists to account for what already shipped.

Co-Authored-By: roachdev-claude <roachdev-claude-bot@cockroachlabs.com>
Second PR in the MMA repair productionization stack (see #164658 for the prototype and `how-to-productionize.md` for the full plan). Builds on the repair foundation from #165413.

This PR adds:

Repair orchestration loop: `repair()` on `rebalanceEnv` iterates `repairRanges` in priority order, filters to ranges where the local store is leaseholder, and dispatches to per-action repair functions. Integrated into `ComputeChanges` via the `IncludeRepair` option on `ChangeOptions`.

`repairAddVoter`: the first concrete repair action. Two code paths: (1) promote an existing non-voter to voter when one satisfies voter constraints; (2) add a new voter on a constraint-satisfying, diversity-maximizing store.

Repair helpers: `pickStoreByDiversity` (diversity-based store selection with reservoir sampling for tie-breaking), `filterAddCandidates` (filters to ready stores not already hosting a replica at the node level), `enactRepair` (records pending change and appends to change list), `isLeaseholderOnStore`, `replicaStateForStore`.

DSL test infrastructure: `repair` command that creates a `rebalanceEnv` and runs `repair()`, with deterministic random seed for reproducible output.

The remaining 11 repair actions (RemoveVoter, ReplaceDeadVoter, etc.) are wired as stubs logging "not yet implemented" — they come in later PRs.

Commits:
1. add repair orchestration loop — infrastructure: `repair()` loop, `IncludeRepair` wiring, `originMMARepair` enum, DSL `repair` command. No actions implemented (all hit the default "not yet implemented" log).
2. implement repairAddVoter with non-voter promotion — first action + all helpers: `enactRepair`, `filterAddCandidates`, `pickStoreByDiversity`, `repairAddVoter`, `promoteNonVoterToVoter`. Two new DSL tests.

Stacked on #165413. Informs #164658.