allocator/mmaprototype: add repair action foundation #165413
tbg wants to merge 3 commits into cockroachdb:master
Conversation
Your pull request contains more than 1000 changes. It is strongly encouraged to split big PRs into smaller chunks.
🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.
tbg
left a comment
@tbg reviewed 13 files and all commit messages, and made 9 comments.
Reviewable status: complete! 0 of 0 LGTMs obtained.
-- commits line 27 at r2:
Can you find where the corresponding sequence for the legacy allocator lives (see Allocator.ComputeAction) and, in the commit message, draw a comparison between the two and their priority orderings, with the goal of explaining to the reviewer what, if anything, changes when transitioning from legacy to MMA? Iterate on it with me before pushing the result.
pkg/kv/kvserver/allocator/mmaprototype/cluster_state.go line 1315 at r3 (raw file):
// repairRanges indexes ranges by their RepairAction. Ranges with
// NoRepairNeeded or RepairPending are NOT stored. This allows repair()
// to iterate ranges needing repair without scanning all ranges.
Mention that this is always kept consistent with the rangeState.repairAction fields of all tracked ranges.
Is there a convenient place to assert this (maybe we have other integrity checks already)?
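A minimal sketch of what such an integrity check could look like. The `clusterState`, `rangeState`, and `repairRanges` names follow the PR; `assertConsistentRepairIndex` and the simplified types are hypothetical stand-ins, not the real implementation:

```go
package main

import "fmt"

type RepairAction int
type RangeID int

const (
	NoRepairNeeded RepairAction = iota
	RepairPending
	AddVoter
)

type rangeState struct{ repairAction RepairAction }

type clusterState struct {
	ranges       map[RangeID]*rangeState
	repairRanges map[RepairAction]map[RangeID]struct{}
}

// assertConsistentRepairIndex (hypothetical) checks that repairRanges is
// exactly the set of tracked ranges whose repairAction is actionable.
func (cs *clusterState) assertConsistentRepairIndex() error {
	for id, rs := range cs.ranges {
		actionable := rs.repairAction != NoRepairNeeded && rs.repairAction != RepairPending
		_, indexed := cs.repairRanges[rs.repairAction][id]
		if actionable != indexed {
			return fmt.Errorf("r%d: repairAction=%d actionable=%t indexed=%t",
				id, rs.repairAction, actionable, indexed)
		}
	}
	// Every indexed range must exist and agree on its action (no stale entries).
	for action, m := range cs.repairRanges {
		for id := range m {
			if rs, ok := cs.ranges[id]; !ok || rs.repairAction != action {
				return fmt.Errorf("stale index entry for r%d under action %d", id, action)
			}
		}
	}
	return nil
}

func main() {
	cs := &clusterState{
		ranges:       map[RangeID]*rangeState{1: {repairAction: AddVoter}},
		repairRanges: map[RepairAction]map[RangeID]struct{}{AddVoter: {1: {}}},
	}
	fmt.Println(cs.assertConsistentRepairIndex()) // <nil>
}
```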
pkg/kv/kvserver/allocator/mmaprototype/cluster_state.go line 1966 at r3 (raw file):
delete(cs.pendingChanges, change.changeID)
// Recompute repair action if all pending changes for this range have
// been enacted.
Add that this is correct (we don't have to re-compute if partially enacted) because as long as there are ANY pending changes, the repair action is RepairPending anyway.
(.. and check that I'm not saying something wrong here..)
Or should we just call this unconditionally? Performance is not a concern in this path. I'm asking because undoPendingChange unconditionally calls updateRepairAction so it would be more consistent to do this here, too.
pkg/kv/kvserver/allocator/mmaprototype/cluster_state.go line 2344 at r3 (raw file):
// If the store's status changed, recompute repair actions for all
// ranges on this store. The cached rs.constraints is still valid here
// (store status doesn't affect constraint satisfaction).
Log at verbosity one about the status change and the need to do a full recomputation.
pkg/kv/kvserver/allocator/mmaprototype/cluster_state_repair.go line 1 at r2 (raw file):
// Copyright 2025 The Cockroach Authors.
2026
pkg/kv/kvserver/allocator/mmaprototype/cluster_state_repair.go line 85 at r2 (raw file):
)

var repairActionNames = [...]string{
Can you make this use go:generate stringer, as done in other places? There might be some Bazel plumbing to get right, but you can copy from other types that already have this.
pkg/kv/kvserver/allocator/mmaprototype/cluster_state_repair.go line 197 at r2 (raw file):
aliveVoters := numVoters - deadVoters
quorum := numVoters/2 + 1
if aliveVoters < quorum {
This should be swapped with step 4. If we don't have quorum, we also shouldn't try to finalize replication changes or remove learners, because that too requires quorum.
pkg/kv/kvserver/allocator/mmaprototype/cluster_state_repair.go line 269 at r3 (raw file):
// Remove from old bucket.
if oldAction != NoRepairNeeded && oldAction != RepairPending && oldAction != 0 {
	if m, ok := cs.repairRanges[oldAction]; ok {
It would be an invariant violation to hit !ok, right? Add an assertion (only when buildutil.CrdbTestBuild).
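A sketch of what the suggested assertion could look like. `removeFromRepairRanges` follows the PR; the `crdbTestBuild` constant stands in for `buildutil.CrdbTestBuild`, and the simplified types are assumptions:

```go
package main

import "fmt"

// crdbTestBuild stands in for buildutil.CrdbTestBuild here.
const crdbTestBuild = true

type RepairAction int
type RangeID int

const (
	NoRepairNeeded RepairAction = iota
	RepairPending
	AddVoter
)

type clusterState struct {
	repairRanges map[RepairAction]map[RangeID]struct{}
}

// removeFromRepairRanges (sketch) drops id from the oldAction bucket.
// A missing bucket would mean the index and rangeState disagree, so
// test builds assert instead of silently ignoring it.
func (cs *clusterState) removeFromRepairRanges(oldAction RepairAction, id RangeID) {
	if oldAction == NoRepairNeeded || oldAction == RepairPending || oldAction == 0 {
		return
	}
	m, ok := cs.repairRanges[oldAction]
	if !ok {
		if crdbTestBuild {
			panic(fmt.Sprintf("r%d: no repairRanges bucket for action %d", id, oldAction))
		}
		return
	}
	delete(m, id)
	if len(m) == 0 {
		delete(cs.repairRanges, oldAction)
	}
}

func main() {
	cs := &clusterState{repairRanges: map[RepairAction]map[RangeID]struct{}{AddVoter: {1: {}}}}
	cs.removeFromRepairRanges(AddVoter, 1)
	fmt.Println(len(cs.repairRanges)) // 0
}
```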
pkg/kv/kvserver/allocator/mmaprototype/cluster_state_test.go line 731 at r3 (raw file):
return fmt.Sprintf("%s%v\n", safeTrace(t, &sb), out)
case "repair":
I don't think this is used anywhere, so back this out for now. It can come back when it's actually wired up to do something useful and exercised.
Force-pushed from 15f2aa2 to e58ee9f.
tbg
left a comment
@tbg reviewed 12 files and all commit messages, and resolved 5 discussions.
Reviewable status: complete! 0 of 0 LGTMs obtained.
Force-pushed from e58ee9f to 6dea171.
Move 7 constraint analysis methods from `constraint_unused_test.go` to
`constraint.go`:

- candidatesToConvertFromNonVoterToVoter
- constraintsForAddingVoter
- candidatesToConvertFromVoterToNonVoter
- constraintsForAddingNonVoter
- candidatesForRoleSwapForConstraints
- candidatesVoterConstraintsUnsatisfied
- candidatesNonVoterConstraintsUnsatisfied

Pure mechanical move with improved doc comments from the prototype. These
methods are prerequisites for the per-action repair functions in later PRs
(AddVoter, RemoveVoter, constraint swaps).

Informs cockroachdb#164658.

Co-Authored-By: roachdev-claude <roachdev-claude-bot@cockroachlabs.com>
Add the RepairAction enum and computeRepairAction() decision tree. These
establish the action space and priority ordering for MMA repair.
RepairAction has 15 values (12 actionable + 3 terminal states), ordered by
priority via iota. computeRepairAction() maps range state to the
highest-priority repair action needed, using a straightforward if/else
cascade examining joint configs, quorum, replica counts, and constraint
satisfaction.
No callers yet — the wiring to clusterState comes in the next commit.
Comparison with legacy Allocator.ComputeAction (allocatorimpl/allocator.go):
The legacy allocator has two separate orderings that sometimes disagree:
1. The Priority() ordering (used to rank ranges in the replicate queue):
FinalizeAtomicReplicationChange 12002
RemoveLearner 12001
ReplaceDeadVoter 12000
AddVoter 10000
ReplaceDecommissioningVoter 5000
RemoveDeadVoter 1000
RemoveDecommissioningVoter 900
RemoveVoter 800
ReplaceDeadNonVoter 700
AddNonVoter 600
ReplaceDecommissioningNonVoter 500
RemoveDeadNonVoter 400
RemoveDecommissioningNonVoter 300
RemoveNonVoter 200
2. The computeAction() if/else cascade (used to pick which action to take
for a single range):
AddVoter ← checked before quorum!
[quorum check → RangeUnavailable]
ReplaceDeadVoter
ReplaceDecommissioningVoter
RemoveDeadVoter ← separate from ReplaceDeadVoter
RemoveDecommissioningVoter ← separate from ReplaceDecomVoter
RemoveVoter
AddNonVoter
ReplaceDeadNonVoter
ReplaceDecommissioningNonVoter
RemoveDeadNonVoter ← separate from ReplaceDeadNonVoter
RemoveDecommissioningNonVoter ← separate from ReplaceDecomNonVoter
RemoveNonVoter
MMA's RepairAction unifies both orderings into a single iota sequence:
FinalizeAtomicReplicationChange (1)
RemoveLearner (2)
AddVoter (3)
ReplaceDeadVoter (4)
ReplaceDecommissioningVoter (5)
RemoveVoter (6)
AddNonVoter (7)
ReplaceDeadNonVoter (8)
ReplaceDecommissioningNonVoter (9)
RemoveNonVoter (10)
SwapVoterForConstraints (11) ← new, legacy has no equivalent
SwapNonVoterForConstraints (12) ← new, legacy has no equivalent
RepairSkipped (13)
RepairPending (14)
NoRepairNeeded (15)
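The unified ordering above falls out of a single Go iota block. This is a sketch using the names and 1-based values from the list; reserving the zero value as "unset" is an assumption:

```go
package main

import "fmt"

// RepairAction is priority-ordered via iota: a smaller value means a
// more urgent repair. Terminal states sort last.
type RepairAction int

const (
	_ RepairAction = iota // zero value reserved as "unset" (assumption)
	FinalizeAtomicReplicationChange
	RemoveLearner
	AddVoter
	ReplaceDeadVoter
	ReplaceDecommissioningVoter
	RemoveVoter
	AddNonVoter
	ReplaceDeadNonVoter
	ReplaceDecommissioningNonVoter
	RemoveNonVoter
	SwapVoterForConstraints
	SwapNonVoterForConstraints
	RepairSkipped
	RepairPending
	NoRepairNeeded
)

func main() {
	// Priority comparisons fall out of the ordering for free.
	fmt.Println(AddVoter < RemoveVoter)         // true
	fmt.Println(RepairPending < NoRepairNeeded) // true
}
```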
Key differences from legacy:
- Quorum check gates all actions: In the legacy code, AddVoter is checked
before the quorum gate, meaning it can be attempted even without quorum
(with a TODO noting this). MMA checks quorum first (step 4) and skips
repair entirely if quorum is lost, since all replication changes require
raft consensus.
- No separate Remove{Dead,Decommissioning}{Voter,NonVoter}: The legacy
code distinguishes "replace dead voter" (count matches, add-then-remove)
from "remove dead voter" (over-replicated, just remove). MMA collapses
these — RemoveVoter handles all over-replication cases, with candidate
selection preferring dead > decommissioning > healthy replicas.
- Constraint swaps are new: Legacy doesn't have repair actions for
constraint violations — those are handled as rebalancing. MMA treats
them as repair because a range with correct counts but wrong placement
is not fully conformant.
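The shape of the cascade, with the quorum gate ahead of everything else, can be sketched as follows. Only the ordering comes from the commit message; `rangeInfo` and its fields are hypothetical and the real computeRepairAction examines much more state:

```go
package main

import "fmt"

type RepairAction int

const (
	FinalizeAtomicReplicationChange RepairAction = iota + 1
	RemoveLearner
	AddVoter
	RepairSkipped
	NoRepairNeeded
)

// rangeInfo is a stand-in for the range state the real cascade inspects.
type rangeInfo struct {
	inJointConfig bool
	numLearners   int
	numVoters     int
	deadVoters    int
	neededVoters  int
}

func computeRepairAction(ri rangeInfo) RepairAction {
	// Quorum gates everything: all replication changes, including
	// finalizing a joint config or removing a learner, need raft consensus.
	aliveVoters := ri.numVoters - ri.deadVoters
	quorum := ri.numVoters/2 + 1
	if aliveVoters < quorum {
		return RepairSkipped
	}
	if ri.inJointConfig {
		return FinalizeAtomicReplicationChange
	}
	if ri.numLearners > 0 {
		return RemoveLearner
	}
	if ri.numVoters < ri.neededVoters {
		return AddVoter
	}
	// ... remaining steps elided ...
	return NoRepairNeeded
}

func main() {
	// 2 of 3 voters dead: below quorum, so even a pending joint config
	// is not finalized.
	down := rangeInfo{inJointConfig: true, numVoters: 3, deadVoters: 2, neededVoters: 3}
	fmt.Println(computeRepairAction(down) == RepairSkipped) // true
}
```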
Informs cockroachdb#164658.
Wire the repair action computation into clusterState so that each range's
repair action is eagerly tracked and indexed.
Structural changes:
- Add `repairAction RepairAction` field to `rangeState`
- Add `repairRanges map[RepairAction]map[RangeID]struct{}` to `clusterState`
- Add `updateRepairAction()` and `removeFromRepairRanges()` to maintain the
index
Trigger points (where updateRepairAction is called):
1. End of processRangeMsg (replicas/config may have changed)
2. pendingChangeEnacted when all pending changes complete
3. End of undoPendingChange
4. End of addPendingRangeChange (sets RepairPending)
5. updateStoreStatuses when health/disposition changes (recomputes for
all ranges on the affected store)
Range GC calls removeFromRepairRanges before deleting the range.
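A sketch of how updateRepairAction could keep rangeState.repairAction and the repairRanges index in sync at those trigger points. The names follow the commit message, but the signature and simplified types are assumptions:

```go
package main

import "fmt"

type RepairAction int
type RangeID int

const (
	NoRepairNeeded RepairAction = iota
	RepairPending
	AddVoter
)

type rangeState struct{ repairAction RepairAction }

type clusterState struct {
	ranges       map[RangeID]*rangeState
	repairRanges map[RepairAction]map[RangeID]struct{}
}

// updateRepairAction (sketch) moves a range between index buckets when
// its computed action changes; terminal states are not indexed.
func (cs *clusterState) updateRepairAction(id RangeID, newAction RepairAction) {
	rs := cs.ranges[id]
	old := rs.repairAction
	if old == newAction {
		return
	}
	if old != NoRepairNeeded && old != RepairPending {
		delete(cs.repairRanges[old], id)
		if len(cs.repairRanges[old]) == 0 {
			delete(cs.repairRanges, old)
		}
	}
	rs.repairAction = newAction
	if newAction != NoRepairNeeded && newAction != RepairPending {
		if cs.repairRanges[newAction] == nil {
			cs.repairRanges[newAction] = map[RangeID]struct{}{}
		}
		cs.repairRanges[newAction][id] = struct{}{}
	}
}

func main() {
	cs := &clusterState{
		ranges:       map[RangeID]*rangeState{1: {repairAction: AddVoter}},
		repairRanges: map[RepairAction]map[RangeID]struct{}{AddVoter: {1: {}}},
	}
	// A pending change arrives: the range becomes RepairPending and
	// leaves the index (trigger point 4 above).
	cs.updateRepairAction(1, RepairPending)
	fmt.Println(len(cs.repairRanges)) // 0
}
```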
Test infrastructure:
- `repair-needed` DSL command: iterates repairRanges by priority, prints
action-to-ranges mapping; scans separately for RepairPending
- `repair` DSL command: stub (pending changes only, no execution yet)
- Parser: nextReplicaID auto-assignment, quiet=true on set-store, relaxed
field count for replica lines, repair recomputation on update-store-status
6 new testdata files exercise the repair tracking across priority ordering,
config changes, constraint changes, multi-range scenarios, pending change
lifecycle, and store status transitions.
Informs cockroachdb#164658.
Co-Authored-By: roachdev-claude <roachdev-claude-bot@cockroachlabs.com>
Force-pushed from 6dea171 to 9cf44a5.
tbg
left a comment
@tbg reviewed 13 files and all commit messages, and resolved 4 discussions.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on tbg).
Mark PRs 1 and 2 as completed with links to cockroachdb#165413 and
cockroachdb#165423. Update PR 3 description to include ASIM wiring and
reflect prototype discoveries. Fix PR 4/5 helper lists to account for what
already shipped.

Co-Authored-By: roachdev-claude <roachdev-claude-bot@cockroachlabs.com>
First PR in the MMA repair productionization stack (see #164658 for the
prototype and how-to-productionize.md for the full plan).
This PR adds the foundation for MMA repair:
RepairAction enum: 12 actionable repair actions plus 3 terminal states,
priority-ordered. The ordering determines both which problem to fix first for
a single range and which ranges get repaired first across the cluster.
computeRepairAction() decision tree: maps range state (replica counts,
store health, constraint satisfaction) to the highest-priority repair action
needed. This is a straightforward if/else cascade — no numerical scoring.
Repair index on clusterState: an eagerly computed `repairAction` per range,
with a `repairRanges` map indexing ranges by action for efficient iteration.
Recomputed at every trigger point (range messages, pending changes, store
status changes).
Constraint helper methods: moved 7 methods from test-only code to
production code. These are prerequisites for the per-action repair functions
in later PRs.
DSL test infrastructure: a `repair-needed` command to inspect the repair
index, plus parser enhancements for compact repair test scenarios.
No repair actions are executed yet — that comes in PR 2 (AddVoter +
orchestration). This PR establishes the action space and priority ordering
for early review feedback.
Informs #164658.