allocator/mmaprototype: implement MMA repair pipeline #164658

Closed

tbg wants to merge 69 commits into cockroachdb:master from
Conversation
Contributor
Your pull request contains more than 1000 changes. It is strongly encouraged to split big PRs into smaller chunks.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.
Member
Add a reference document mapping the responsibilities of the replicate queue, lease queue, store rebalancer, and MMA. This serves as the foundation for planning the absorption of the replicate and lease queues into MMA.

Co-Authored-By: roachdev-claude <roachdev-claude-bot@cockroachlabs.com>

Document the lexicographic priority hierarchy used by the legacy allocator's candidate scoring (candidate.compare()), the different scorer variants, and how MMA's approach differs.

Co-Authored-By: roachdev-claude <roachdev-claude-bot@cockroachlabs.com>

MMA checks constraints but won't actively fix violations — it skips non-conformant ranges. Update the table to reflect this.

Co-Authored-By: roachdev-claude <roachdev-claude-bot@cockroachlabs.com>

Document the approach for teaching MMA to detect and repair constraint violations: eager constraint analysis at StoreLeaseholderMsg processing time with caching, and a repair set processed with higher priority than load rebalancing.

Co-Authored-By: roachdev-claude <roachdev-claude-bot@cockroachlabs.com>
Replace the TODOs in mma-constraint-repair.md with resolved design decisions:
- Repair action ordering: plain enum replacing legacy numerical priorities, with simplified Remove actions and explicit constraint swap actions.
- Repair operations: reuse existing MMA add/remove/replace primitives.
- Pending change interaction: skip ranges with in-flight changes.
- Store health/disposition interaction: targets require ReplicaDispositionOK; removal prefers Dead > Unknown > Unhealthy > Shedding > Refusing > healthy.

Co-Authored-By: roachdev-claude <roachdev-claude-bot@cockroachlabs.com>
Three improvements to the cluster state test DSL to reduce boilerplate in upcoming repair tests:
1. `set-store quiet=true`: suppress the verbose node/store listing output.
2. Auto-assign replica IDs when `replica-id=` is omitted from replica lines in `store-leaseholder-msg`. A per-range counter starts at 1 and advances with each replica; explicit values update the counter to stay above them.
3. Relax the minimum field count for replica lines from 3 to 2 (store-id + type is sufficient when replica-id and leaseholder are omitted).

Co-Authored-By: roachdev-claude <roachdev-claude-bot@cockroachlabs.com>
Add a `repair` method on `rebalanceEnv` that currently logs "not yet implemented" and returns nil. This will be filled in as the constraint repair logic is built out. Add a corresponding `repair` command to the TestClusterState datadriven DSL, following the same pattern as `rebalance-stores`: it creates a rebalanceEnv, calls repair, and outputs the trace plus pending changes.

Co-Authored-By: roachdev-claude <roachdev-claude-bot@cockroachlabs.com>

Add 10 datadriven test files exercising count-based repair scenarios. Each test sets up stores and ranges with specific replica configurations, then invokes `repair` and expects the stub "not yet implemented" output. These tests will be rewritten with `-rewrite` as the repair logic is implemented. Tests cover: finalizing atomic replication changes, removing learners, adding/removing voters and non-voters, and replacing dead or decommissioning replicas.

Co-Authored-By: roachdev-claude <roachdev-claude-bot@cockroachlabs.com>
Add 4 datadriven test files for constraint-based and interaction repair scenarios:
- repair_swap_voter: voter misplaced relative to zone constraints
- repair_swap_nonvoter: non-voter misplaced relative to zone constraints
- repair_pending_skip: range with existing pending change is skipped
- repair_range_unavailable: range without quorum (2 of 3 voters dead)

Like the count-based tests, these currently expect the stub output and will be rewritten as repair logic is implemented.

Co-Authored-By: roachdev-claude <roachdev-claude-bot@cockroachlabs.com>
Add the RepairAction enum and its core computation method. RepairAction represents the highest-priority repair needed for a range, ordered so that lower enum values have higher priority. The zero value is intentionally invalid (iota + 1) to catch uninitialized fields. computeRepairAction inspects a range's replicas, store statuses, and constraint satisfaction to determine the single highest-priority action: joint config finalization, learner removal, voter/non-voter count adjustments (add, remove, replace dead/decommissioning), and constraint swaps. Ranges that have lost quorum or have pending changes return NoRepairNeeded (can't repair or already being repaired).

Also adds updateRepairAction and removeFromRepairRanges helpers that maintain the repairRanges index (wired in the next commit).

Epic: none
Release note: None

Co-Authored-By: roachdev-claude <roachdev-claude-bot@cockroachlabs.com>
Add a repairAction field to rangeState and a repairRanges index to clusterState that maps RepairAction → set of range IDs. This allows repair() to iterate ranges needing repair by priority without scanning all ranges.

Wire updateRepairAction calls at all trigger points where a range's repair status may change:
- processRangeMsg (replicas or config changed)
- updateStoreStatuses (store health changed)
- addPendingRangeChange (pending change suppresses repair)
- undoPendingChange (pending change removed)
- pendingChangeEnacted (pending change completed)
- range GC (range removed from tracking)

The pendingChangeEnacted signature gains a context.Context parameter to support the updateRepairAction call, updated at all 3 call sites.

Epic: none
Release note: None

Co-Authored-By: roachdev-claude <roachdev-claude-bot@cockroachlabs.com>
…king tests

Add the `repair-needed` DSL command to TestClusterState, which dumps the repairRanges index in priority order, showing which ranges need which repair action. Update all 14 existing repair test files to assert repair tracking after each state mutation.

Add 6 new datadriven tests exercising the tracking lifecycle:
- repair_tracking_status_change: store health transitions
- repair_tracking_pending_lifecycle: pending change add/reject/enact
- repair_tracking_config_change_with_pending: config change during pending
- repair_tracking_multi_range: multiple ranges with different actions
- repair_tracking_constraint_change: constraint satisfaction changes
- repair_tracking_action_priority: action priority transitions

Epic: none
Release note: None

Co-Authored-By: roachdev-claude <roachdev-claude-bot@cockroachlabs.com>
Move `candidatesToConvertFromNonVoterToVoter()` and `constraintsForAddingVoter()` from `constraint_unused_test.go` to `constraint.go`. These methods are needed for the upcoming AddVoter repair action implementation. Also add `originMMARepair` to `ChangeOrigin` for tracking repair-originated changes separately from rebalance-originated ones.

Co-Authored-By: roachdev-claude <roachdev-claude-bot@cockroachlabs.com>
Implement the `repair()` dispatch loop in `rebalanceEnv` and the first concrete repair action: `AddVoter`. When a range has fewer voters than its span config requires, repair selects a target store based on constraint satisfaction and diversity scoring, then creates a pending change to add a voter there.

The repair loop iterates `repairRanges` in priority order (matching `RepairAction` enum ordering) and only repairs ranges where the local store is the leaseholder. Unimplemented actions log a message identifying the specific action.

The `repairAddVoter` flow:
1. Analyze constraints for the range
2. Check for non-voter promotion candidates (TODO: implement promotion)
3. Find constraint-satisfying candidate stores
4. Filter by disposition, existing replicas, and node-level diversity
5. Pick the target with the best voter diversity score
6. Create and register the pending change

The `repair_add_voter.txt` test is extended to verify the full lifecycle: repair creates a pending change, `repair-needed` confirms suppression, and after enactment via `store-leaseholder-msg` the range is healthy.

Co-Authored-By: roachdev-claude <roachdev-claude-bot@cockroachlabs.com>
Start with a single voter (config says 3) so repair must add voters in two successive rounds. This exercises the full cycle twice: repair picks the best-diversity candidate, creates a pending change, the pending change suppresses further repair, enactment re-enables repair for the next round, and the second addition completes the range. Round 1 picks s2 over s3 (equal diversity, lower StoreID wins). Round 2 picks s3 as the only remaining candidate.

Co-Authored-By: roachdev-claude <roachdev-claude-bot@cockroachlabs.com>

…Voter repair

When an AddVoter repair is needed and there are existing non-voters that could satisfy the voter constraint, promote one instead of adding a new replica on a fresh store. The best promotion candidate is chosen by voter diversity score (highest wins, ties broken by lower StoreID). Extract `pickBestStoreByVoterDiversity` helper to avoid duplicating the diversity-scoring loop between the add-new-voter and promote paths.

Co-Authored-By: roachdev-claude <roachdev-claude-bot@cockroachlabs.com>
…pickBestStoreByVoterDiversity

When multiple candidate stores have equal voter diversity scores, use reservoir sampling (via the existing rebalanceEnv RNG) to choose uniformly at random instead of deterministically preferring the lowest StoreID. This avoids systematically biasing placement toward low-numbered stores in symmetric clusters.

Co-Authored-By: roachdev-claude <roachdev-claude-bot@cockroachlabs.com>
Rename `pickBestStoreByVoterDiversity` to `pickStoreByDiversity` with a `diversityScorer` function parameter. This allows the same picker to be used with both `getScoreChangeForNewReplica` (for additions) and `getScoreChangeForReplicaRemoval` (for removals). Update existing call sites in `repairAddVoter` and `promoteNonVoterToVoter`. Pure refactor, no behavior change.

Co-Authored-By: roachdev-claude <roachdev-claude-bot@cockroachlabs.com>
Add `repairRemoveVoter()` which handles over-replicated ranges by removing a voter. Candidate selection uses a health-based priority ordering (dead > unknown > unhealthy > shedding > refusing > healthy), taking the worst-health bucket first. Within that bucket, diversity-based tiebreaking picks the most redundant voter (least diversity loss on removal). The leaseholder is never considered for removal.

Wire `RemoveVoter` into the `repair()` dispatch loop. Update the `repair_remove_voter.txt` test with the full lifecycle (repair, pending suppression, confirm, healthy). Add a new `repair_remove_voter_healthy.txt` test that verifies diversity-based selection when all stores are healthy.

Co-Authored-By: roachdev-claude <roachdev-claude-bot@cockroachlabs.com>
…irs from healthy

Previously, `computeRepairAction` returned `NoRepairNeeded` for ranges with pending changes in flight, conflating "range is healthy" with "range needs repair but a change is already in flight." Add a new `RepairPending` enum value so these states are distinguishable, making it possible to observe how many ranges have outstanding repair actions. `RepairPending` ranges are excluded from the `repairRanges` index (same as `NoRepairNeeded`) so they are not acted on during repair, but they are surfaced in the `repair-needed` test command output via a separate scan.

Co-Authored-By: roachdev-claude <roachdev-claude-bot@cockroachlabs.com>

…on code

Move `candidatesToConvertFromVoterToNonVoter` and `constraintsForAddingNonVoter` from `constraint_unused_test.go` to `constraint.go`. These methods are needed by the upcoming `AddNonVoter` repair action: the first finds voters that could be demoted to non-voter, and the second returns the constraint disjunction for placing a new non-voter.

Co-Authored-By: roachdev-claude <roachdev-claude-bot@cockroachlabs.com>

… tiers

Add a `replicasLocalityTiers` parameter to `pickStoreByDiversity` so that non-voter operations can pass `replicaLocalityTiers` (all replicas) instead of the previously hardcoded `voterLocalityTiers` (voters only). This is needed because non-voter diversity should be scored against all replicas, not just voters. The three existing call sites (repairAddVoter, promoteNonVoterToVoter, repairRemoveVoter) are updated to explicitly pass `voterLocalityTiers`, preserving their existing behavior.

Co-Authored-By: roachdev-claude <roachdev-claude-bot@cockroachlabs.com>
Add `repairRemoveNonVoter`, which removes an over-replicated non-voter. Candidate selection follows the same priority as `repairRemoveVoter` (dead > unknown > unhealthy > shedding > refusing > healthy), but does not need to exclude the leaseholder since non-voters cannot hold leases. Within the worst-health bucket, the non-voter whose removal hurts diversity the least is chosen using `replicaLocalityTiers`. The test exercises the full lifecycle: detect over-replication, remove one non-voter, confirm via leaseholder message, verify healthy.

Co-Authored-By: roachdev-claude <roachdev-claude-bot@cockroachlabs.com>

Add `repairAddNonVoter`, which adds a non-voter to an under-replicated range. Like `repairAddVoter`, it first checks for a type-change shortcut: if there are extra voters that could be demoted to non-voter (via `candidatesToConvertFromVoterToNonVoter`), it uses `demoteVoterToNonVoter` to change the type in place. Otherwise, it finds a new store using the constraint disjunction from `constraintsForAddingNonVoter`, filters candidates, and picks by replica diversity. The `demoteVoterToNonVoter` helper mirrors `promoteNonVoterToVoter` but excludes the leaseholder and creates a VOTER_FULL -> NON_VOTER type change. The test exercises the full two-round lifecycle: add first non-voter, confirm, add second, confirm, verify healthy.

Co-Authored-By: roachdev-claude <roachdev-claude-bot@cockroachlabs.com>
Co-Authored-By: roachdev-claude <roachdev-claude-bot@cockroachlabs.com>
tbg added a commit to tbg/cockroach that referenced this pull request on Mar 11, 2026
Move 7 constraint analysis methods from `constraint_unused_test.go` to `constraint.go`:
- candidatesToConvertFromNonVoterToVoter
- constraintsForAddingVoter
- candidatesToConvertFromVoterToNonVoter
- constraintsForAddingNonVoter
- candidatesForRoleSwapForConstraints
- candidatesVoterConstraintsUnsatisfied
- candidatesNonVoterConstraintsUnsatisfied

Pure mechanical move with improved doc comments from the prototype. These methods are prerequisites for the per-action repair functions in later PRs (AddVoter, RemoveVoter, constraint swaps).

Informs cockroachdb#164658.

Co-Authored-By: roachdev-claude <roachdev-claude-bot@cockroachlabs.com>
tbg added a commit to tbg/cockroach that referenced this pull request on Mar 11, 2026
Add the RepairAction enum and computeRepairAction() decision tree. These establish the action space and priority ordering for MMA repair.

RepairAction has 15 values (12 actionable + 3 terminal states), ordered by priority via iota. computeRepairAction() maps range state to the highest-priority repair action needed, using a straightforward if/else cascade examining joint configs, quorum, replica counts, and constraint satisfaction.

No callers yet — the wiring to clusterState comes in the next commit.

Informs cockroachdb#164658.

Co-Authored-By: roachdev-claude <roachdev-claude-bot@cockroachlabs.com>
tbg added a commit to tbg/cockroach that referenced this pull request on Mar 11, 2026
Wire the repair action computation into clusterState so that each range's
repair action is eagerly tracked and indexed.
Structural changes:
- Add `repairAction RepairAction` field to `rangeState`
- Add `repairRanges map[RepairAction]map[RangeID]struct{}` to `clusterState`
- Add `updateRepairAction()` and `removeFromRepairRanges()` to maintain the
index
Trigger points (where updateRepairAction is called):
1. End of processRangeMsg (replicas/config may have changed)
2. pendingChangeEnacted when all pending changes complete
3. End of undoPendingChange
4. End of addPendingRangeChange (sets RepairPending)
5. updateStoreStatuses when health/disposition changes (recomputes for
all ranges on the affected store)
Range GC calls removeFromRepairRanges before deleting the range.
Test infrastructure:
- `repair-needed` DSL command: iterates repairRanges by priority, prints
action-to-ranges mapping; scans separately for RepairPending
- `repair` DSL command: stub (pending changes only, no execution yet)
- Parser: nextReplicaID auto-assignment, quiet=true on set-store, relaxed
field count for replica lines, repair recomputation on update-store-status
6 new testdata files exercise the repair tracking across priority ordering,
config changes, constraint changes, multi-range scenarios, pending change
lifecycle, and store status transitions.
Informs cockroachdb#164658.
Co-Authored-By: roachdev-claude <roachdev-claude-bot@cockroachlabs.com>
tbg added a commit to tbg/cockroach that referenced this pull request on Mar 11, 2026
Add the RepairAction enum and computeRepairAction() decision tree. These
establish the action space and priority ordering for MMA repair.
RepairAction has 15 values (12 actionable + 3 terminal states), ordered by
priority via iota. computeRepairAction() maps range state to the
highest-priority repair action needed, using a straightforward if/else
cascade examining joint configs, quorum, replica counts, and constraint
satisfaction.
No callers yet — the wiring to clusterState comes in the next commit.
Comparison with legacy Allocator.ComputeAction (allocatorimpl/allocator.go):
The legacy allocator has two separate orderings that sometimes disagree:
1. The Priority() ordering (used to rank ranges in the replicate queue):
FinalizeAtomicReplicationChange 12002
RemoveLearner 12001
ReplaceDeadVoter 12000
AddVoter 10000
ReplaceDecommissioningVoter 5000
RemoveDeadVoter 1000
RemoveDecommissioningVoter 900
RemoveVoter 800
ReplaceDeadNonVoter 700
AddNonVoter 600
ReplaceDecommissioningNonVoter 500
RemoveDeadNonVoter 400
RemoveDecommissioningNonVoter 300
RemoveNonVoter 200
2. The computeAction() if/else cascade (used to pick which action to take
for a single range):
AddVoter ← checked before quorum!
[quorum check → RangeUnavailable]
ReplaceDeadVoter
ReplaceDecommissioningVoter
RemoveDeadVoter ← separate from ReplaceDeadVoter
RemoveDecommissioningVoter ← separate from ReplaceDecomVoter
RemoveVoter
AddNonVoter
ReplaceDeadNonVoter
ReplaceDecommissioningNonVoter
RemoveDeadNonVoter ← separate from ReplaceDeadNonVoter
RemoveDecommissioningNonVoter ← separate from ReplaceDecomNonVoter
RemoveNonVoter
MMA's RepairAction unifies both orderings into a single iota sequence:
FinalizeAtomicReplicationChange (1)
RemoveLearner (2)
AddVoter (3)
ReplaceDeadVoter (4)
ReplaceDecommissioningVoter (5)
RemoveVoter (6)
AddNonVoter (7)
ReplaceDeadNonVoter (8)
ReplaceDecommissioningNonVoter (9)
RemoveNonVoter (10)
SwapVoterForConstraints (11) ← new, legacy has no equivalent
SwapNonVoterForConstraints (12) ← new, legacy has no equivalent
RepairSkipped (13)
RepairPending (14)
NoRepairNeeded (15)
Key differences from legacy:
- Quorum check gates all actions: In the legacy code, AddVoter is checked
before the quorum gate, meaning it can be attempted even without quorum
(with a TODO noting this). MMA checks quorum first (step 4) and skips
repair entirely if quorum is lost, since all replication changes require
raft consensus.
- No separate Remove{Dead,Decommissioning}{Voter,NonVoter}: The legacy
code distinguishes "replace dead voter" (count matches, add-then-remove)
from "remove dead voter" (over-replicated, just remove). MMA collapses
these — RemoveVoter handles all over-replication cases, with candidate
selection preferring dead > decommissioning > healthy replicas.
- Constraint swaps are new: Legacy doesn't have repair actions for
constraint violations — those are handled as rebalancing. MMA treats
them as repair because a range with correct counts but wrong placement
is not fully conformant.
Informs cockroachdb#164658.
tbg added a commit to tbg/cockroach that referenced this pull request on Mar 11, 2026
Add the repair() method on rebalanceEnv — the main entry point for MMA repair. It iterates repairRanges in priority order, filters to ranges where the local store is the leaseholder, and dispatches to per-action repair functions. No repair actions are implemented yet (the switch default logs "not yet implemented"); AddVoter comes in the next commit.

Wire repair into ComputeChanges via the IncludeRepair field on ChangeOptions. When set, repair() runs before rebalanceStores(), and its pending changes prevent the rebalancer from touching the same ranges.

Add originMMARepair to the ChangeOrigin enum so that repair-originated changes can be tracked through AdjustPendingChangeDisposition. For now repair changes share the rebalance metric counters; dedicated repair metrics come in a follow-up PR.

Add the "repair" DSL command to the test harness. It creates a rebalanceEnv with a deterministic random seed and calls repair().

Informs cockroachdb#164658.

Co-Authored-By: roachdev-claude <roachdev-claude-bot@cockroachlabs.com>
tbg
added a commit
to tbg/cockroach
that referenced
this pull request
Mar 11, 2026
Add the repair() method on rebalanceEnv — the main entry point for MMA repair. It iterates repairRanges in priority order, filters to ranges where the local store is the leaseholder, and dispatches to per-action repair functions. No repair actions are implemented yet (the switch default logs "not yet implemented"); AddVoter comes in the next commit. Wire repair into ComputeChanges via the IncludeRepair field on ChangeOptions. When set, repair() runs before rebalanceStores(), and its pending changes prevent the rebalancer from touching the same ranges. Add originMMARepair to the ChangeOrigin enum so that repair-originated changes can be tracked through AdjustPendingChangeDisposition. For now repair changes share the rebalance metric counters; dedicated repair metrics come in a follow-up PR. Add the "repair" DSL command to the test harness. It creates a rebalanceEnv with a deterministic random seed and calls repair(). Informs cockroachdb#164658. Co-Authored-By: roachdev-claude <roachdev-claude-bot@cockroachlabs.com>
tbg
added a commit
to tbg/cockroach
that referenced
this pull request
Mar 11, 2026
Move 7 constraint analysis methods from `constraint_unused_test.go` to `constraint.go`: - candidatesToConvertFromNonVoterToVoter - constraintsForAddingVoter - candidatesToConvertFromVoterToNonVoter - constraintsForAddingNonVoter - candidatesForRoleSwapForConstraints - candidatesVoterConstraintsUnsatisfied - candidatesNonVoterConstraintsUnsatisfied Pure mechanical move with improved doc comments from the prototype. These methods are prerequisites for the per-action repair functions in later PRs (AddVoter, RemoveVoter, constraint swaps). Informs cockroachdb#164658. Co-Authored-By: roachdev-claude <roachdev-claude-bot@cockroachlabs.com>
tbg
added a commit
to tbg/cockroach
that referenced
this pull request
Mar 11, 2026
Add the RepairAction enum and computeRepairAction() decision tree. These
establish the action space and priority ordering for MMA repair.
RepairAction has 15 values (12 actionable + 3 terminal states), ordered by
priority via iota. computeRepairAction() maps range state to the
highest-priority repair action needed, using a straightforward if/else
cascade examining joint configs, quorum, replica counts, and constraint
satisfaction.
No callers yet — the wiring to clusterState comes in the next commit.
Comparison with legacy Allocator.ComputeAction (allocatorimpl/allocator.go):
The legacy allocator has two separate orderings that sometimes disagree:
1. The Priority() ordering (used to rank ranges in the replicate queue):
FinalizeAtomicReplicationChange 12002
RemoveLearner 12001
ReplaceDeadVoter 12000
AddVoter 10000
ReplaceDecommissioningVoter 5000
RemoveDeadVoter 1000
RemoveDecommissioningVoter 900
RemoveVoter 800
ReplaceDeadNonVoter 700
AddNonVoter 600
ReplaceDecommissioningNonVoter 500
RemoveDeadNonVoter 400
RemoveDecommissioningNonVoter 300
RemoveNonVoter 200
2. The computeAction() if/else cascade (used to pick which action to take
for a single range):
AddVoter ← checked before quorum!
[quorum check → RangeUnavailable]
ReplaceDeadVoter
ReplaceDecommissioningVoter
RemoveDeadVoter ← separate from ReplaceDeadVoter
RemoveDecommissioningVoter ← separate from ReplaceDecomVoter
RemoveVoter
AddNonVoter
ReplaceDeadNonVoter
ReplaceDecommissioningNonVoter
RemoveDeadNonVoter ← separate from ReplaceDeadNonVoter
RemoveDecommissioningNonVoter ← separate from ReplaceDecomNonVoter
RemoveNonVoter
MMA's RepairAction unifies both orderings into a single iota sequence:
FinalizeAtomicReplicationChange (1)
RemoveLearner (2)
AddVoter (3)
ReplaceDeadVoter (4)
ReplaceDecommissioningVoter (5)
RemoveVoter (6)
AddNonVoter (7)
ReplaceDeadNonVoter (8)
ReplaceDecommissioningNonVoter (9)
RemoveNonVoter (10)
SwapVoterForConstraints (11) ← new, legacy has no equivalent
SwapNonVoterForConstraints (12) ← new, legacy has no equivalent
RepairSkipped (13)
RepairPending (14)
NoRepairNeeded (15)
Key differences from legacy:
- Quorum check gates all actions: In the legacy code, AddVoter is checked
before the quorum gate, meaning it can be attempted even without quorum
(with a TODO noting this). MMA checks quorum first (step 4) and skips
repair entirely if quorum is lost, since all replication changes require
raft consensus.
- No separate Remove{Dead,Decommissioning}{Voter,NonVoter}: The legacy
code distinguishes "replace dead voter" (count matches, add-then-remove)
from "remove dead voter" (over-replicated, just remove). MMA collapses
these — RemoveVoter handles all over-replication cases, with candidate
selection preferring dead > decommissioning > healthy replicas.
- Constraint swaps are new: Legacy doesn't have repair actions for
constraint violations — those are handled as rebalancing. MMA treats
them as repair because a range with correct counts but wrong placement
is not fully conformant.
Informs cockroachdb#164658.
tbg added a commit to tbg/cockroach that referenced this pull request on Mar 11, 2026
Wire the repair action computation into clusterState so that each range's
repair action is eagerly tracked and indexed.
Structural changes:
- Add `repairAction RepairAction` field to `rangeState`
- Add `repairRanges map[RepairAction]map[RangeID]struct{}` to `clusterState`
- Add `updateRepairAction()` and `removeFromRepairRanges()` to maintain the
index
Trigger points (where updateRepairAction is called):
1. End of processRangeMsg (replicas/config may have changed)
2. pendingChangeEnacted when all pending changes complete
3. End of undoPendingChange
4. End of addPendingRangeChange (sets RepairPending)
5. updateStoreStatuses when health/disposition changes (recomputes for
all ranges on the affected store)
Range GC calls removeFromRepairRanges before deleting the range.
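The index maintenance described above can be sketched as follows. This is a hedged stand-in, not the production code: the field and method names come from the commit message, while the pared-down clusterState shape (a flat map in place of rangeState) is assumed for illustration:

```go
package main

import "fmt"

type RangeID int
type RepairAction int

const (
	AddVoter       RepairAction = 3
	NoRepairNeeded RepairAction = 15
)

// clusterState is a pared-down stand-in: rangeState is reduced to just the
// repairAction value this commit adds, plus the action-to-ranges index.
type clusterState struct {
	rangeRepair  map[RangeID]RepairAction
	repairRanges map[RepairAction]map[RangeID]struct{}
}

// updateRepairAction moves the range from its old bucket to the new one,
// keeping the index consistent with the per-range action.
func (cs *clusterState) updateRepairAction(id RangeID, next RepairAction) {
	if prev, ok := cs.rangeRepair[id]; ok {
		delete(cs.repairRanges[prev], id)
	}
	cs.rangeRepair[id] = next
	if cs.repairRanges[next] == nil {
		cs.repairRanges[next] = map[RangeID]struct{}{}
	}
	cs.repairRanges[next][id] = struct{}{}
}

// removeFromRepairRanges is called before range GC deletes the range.
func (cs *clusterState) removeFromRepairRanges(id RangeID) {
	if prev, ok := cs.rangeRepair[id]; ok {
		delete(cs.repairRanges[prev], id)
		delete(cs.rangeRepair, id)
	}
}

func main() {
	cs := &clusterState{
		rangeRepair:  map[RangeID]RepairAction{},
		repairRanges: map[RepairAction]map[RangeID]struct{}{},
	}
	cs.updateRepairAction(1, AddVoter)
	cs.updateRepairAction(1, NoRepairNeeded) // a recompute moves the range
	fmt.Println(len(cs.repairRanges[AddVoter]), len(cs.repairRanges[NoRepairNeeded]))
}
```

The five trigger points all funnel through the same move-between-buckets step, so the index stays an exact inverse of the per-range field.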
Test infrastructure:
- `repair-needed` DSL command: iterates repairRanges by priority, prints
action-to-ranges mapping; scans separately for RepairPending
- `repair` DSL command: stub (pending changes only, no execution yet)
- Parser: nextReplicaID auto-assignment, quiet=true on set-store, relaxed
field count for replica lines, repair recomputation on update-store-status
6 new testdata files exercise the repair tracking across priority ordering,
config changes, constraint changes, multi-range scenarios, pending change
lifecycle, and store status transitions.
Informs cockroachdb#164658.
Co-Authored-By: roachdev-claude <roachdev-claude-bot@cockroachlabs.com>
tbg added a commit to tbg/cockroach that referenced this pull request on Mar 11, 2026
Add the repair() method on rebalanceEnv — the main entry point for MMA repair. It iterates repairRanges in priority order, filters to ranges where the local store is the leaseholder, and dispatches to per-action repair functions. No repair actions are implemented yet (the switch default logs "not yet implemented"); AddVoter comes in the next commit.
Wire repair into ComputeChanges via the IncludeRepair field on ChangeOptions. When set, repair() runs before rebalanceStores(), and its pending changes prevent the rebalancer from touching the same ranges.
Add originMMARepair to the ChangeOrigin enum so that repair-originated changes can be tracked through AdjustPendingChangeDisposition. For now repair changes share the rebalance metric counters; dedicated repair metrics come in a follow-up PR.
Add the "repair" DSL command to the test harness. It creates a rebalanceEnv with a deterministic random seed and calls repair().
Informs cockroachdb#164658.
Co-Authored-By: roachdev-claude <roachdev-claude-bot@cockroachlabs.com>
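The entry-point loop might look roughly like this. Only the control flow (priority-ordered iteration, leaseholder filter, per-action dispatch with a "not yet implemented" default) reflects the commit message; the types and signatures are assumed stand-ins:

```go
package main

import "fmt"

type RangeID int
type RepairAction int

const (
	FinalizeAtomicReplicationChange RepairAction = iota + 1
	RemoveLearner
	AddVoter
	// ...remaining actionable values elided for brevity.
	lastActionable = AddVoter
)

// repair sketches the entry point: walk the repairRanges index bucket by
// bucket in priority order, skip ranges this store does not hold the lease
// for, and dispatch per action.
func repair(
	repairRanges map[RepairAction]map[RangeID]struct{},
	isLocalLeaseholder func(RangeID) bool,
) []string {
	var planned []string
	for a := FinalizeAtomicReplicationChange; a <= lastActionable; a++ {
		for rid := range repairRanges[a] {
			if !isLocalLeaseholder(rid) {
				continue
			}
			switch a {
			case AddVoter:
				planned = append(planned, fmt.Sprintf("repairAddVoter(r%d)", rid))
			default:
				planned = append(planned, fmt.Sprintf("action %d: not yet implemented", a))
			}
		}
	}
	return planned
}

func main() {
	idx := map[RepairAction]map[RangeID]struct{}{
		AddVoter:      {7: {}},
		RemoveLearner: {9: {}},
	}
	// Range 9 is leaseheld elsewhere and is skipped.
	fmt.Println(repair(idx, func(r RangeID) bool { return r == 7 }))
}
```

Because pending changes park a range in RepairPending, ranges already being acted on never show up in the actionable buckets this loop visits.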
tbg added a commit to tbg/cockroach that referenced this pull request on Mar 11, 2026
Move 7 constraint analysis methods from `constraint_unused_test.go` to `constraint.go`:
- candidatesToConvertFromNonVoterToVoter
- constraintsForAddingVoter
- candidatesToConvertFromVoterToNonVoter
- constraintsForAddingNonVoter
- candidatesForRoleSwapForConstraints
- candidatesVoterConstraintsUnsatisfied
- candidatesNonVoterConstraintsUnsatisfied
Pure mechanical move with improved doc comments from the prototype. These methods are prerequisites for the per-action repair functions in later PRs (AddVoter, RemoveVoter, constraint swaps).
Informs cockroachdb#164658.
Co-Authored-By: roachdev-claude <roachdev-claude-bot@cockroachlabs.com>
tbg added a commit to tbg/cockroach that referenced this pull request on Mar 11, 2026
Add the RepairAction enum and computeRepairAction() decision tree. These
establish the action space and priority ordering for MMA repair.
RepairAction has 15 values (12 actionable + 3 terminal states), ordered by
priority via iota. computeRepairAction() maps range state to the
highest-priority repair action needed, using a straightforward if/else
cascade examining joint configs, quorum, replica counts, and constraint
satisfaction.
Informs cockroachdb#164658.
Mark PRs 1 and 2 as completed with links to cockroachdb#165413 and cockroachdb#165423. Update PR 3 description to include ASIM wiring and reflect prototype discoveries. Fix PR 4/5 helper lists to account for what already shipped. Co-Authored-By: roachdev-claude <roachdev-claude-bot@cockroachlabs.com>
tbg added a commit to tbg/cockroach that referenced this pull request on Mar 12, 2026
Add the new LBRebalancingMultiMetricRepairAndRebalance enum constant to LBRebalancingMode. In this mode, MMA handles both rebalancing and repair; the replicate and lease queues are completely disabled.
The mode is deliberately NOT added to the LoadBasedRebalancingMode settings registration map, so it cannot be set via SET CLUSTER SETTING — only via Override() in tests. (EnumSetting.Override explicitly bypasses validation.)
Also expand LoadBasedRebalancingModeIsMMA to include the new mode and add the LoadBasedRebalancingModeIsMMARepairAndRebalance helper function.
No behavior change: nothing checks for this value yet.
Informs cockroachdb#164658.
Co-Authored-By: roachdev-claude <roachdev-claude-bot@cockroachlabs.com>
tbg added a commit to tbg/cockroach that referenced this pull request on Mar 12, 2026
When LBRebalancingMultiMetricRepairAndRebalance mode is active, short-circuit both shouldQueue and process on the replicate and lease queues. This prevents both enqueuing and processing, making MMA solely responsible for all replica placement decisions.
Also update CountBasedRebalancingDisabled to return true in the new mode, since count-based rebalancing should also be disabled when MMA handles repair.
All changes are gated by the new mode — no behavior change under existing modes.
Informs cockroachdb#164658.
Co-Authored-By: roachdev-claude <roachdev-claude-bot@cockroachlabs.com>
tbg added a commit to tbg/cockroach that referenced this pull request on Mar 12, 2026
Add RepairReplicaChange{Success,Failure} and RepairLeaseChange{Success,Failure}
counters, replacing the temporary routing of repair metrics through the
rebalance counters from the previous PR.
Split the combined `originMMARebalance, originMMARepair` case in
AdjustPendingChangeDisposition into separate cases, each incrementing its own
metric counters.
Informs cockroachdb#164658.
Co-Authored-By: roachdev-claude <roachdev-claude-bot@cockroachlabs.com>
tbg added a commit to tbg/cockroach that referenced this pull request on Mar 12, 2026
Wire the MMA repair-and-rebalance mode into the allocator simulator:
- Add early return guards in replicate queue and lease queue Tick() methods when the respective queue is disabled via SimulationSettings.
- Expand SetClusterSetting to disable both queues when the new mode is set.
- Wire IncludeRepair in the ASIM MMA store rebalancer's ComputeChanges call, gated on the repair-and-rebalance mode.
- Add "mma-repair" to knownConfigurations for datadriven tests.
- Add repair_add_voter and repair_promote_nonvoter testdata files that verify MMA repair upreplicates under-replicated ranges and promotes non-voters to voters, respectively.
Informs cockroachdb#164658.
Co-Authored-By: roachdev-claude <roachdev-claude-bot@cockroachlabs.com>
tbg added a commit to tbg/cockroach that referenced this pull request on Mar 12, 2026
Pure refactor: extract the tight rebalance loop from run() into a rebalanceUntilStable method for reuse by ForceReplicationScanAndProcess in the next commit. No behavior change.
Informs cockroachdb#164658.
Co-Authored-By: roachdev-claude <roachdev-claude-bot@cockroachlabs.com>
tbg added a commit to tbg/cockroach that referenced this pull request on Mar 12, 2026
Complete the production mma_store_rebalancer wiring for repair:
1. Add IsLeaveJoint() predicate on ExternalRangeChange that detects joint-config finalization changes. These cannot be expressed as kvpb.ReplicationChanges because the production code uses maybeLeaveAtomicChangeReplicas directly.
2. Wire IncludeRepair in rebalance(), gated on the LBRebalancingMultiMetricRepairAndRebalance mode check.
3. Add IsLeaveJoint routing in applyChange() between IsPureTransferLease and IsChangeReplicas, delegating to maybeLeaveAtomicChangeReplicas.
4. Expand the replicaToApplyChanges interface with maybeLeaveAtomicChangeReplicas.
Informs cockroachdb#164658.
Co-Authored-By: roachdev-claude <roachdev-claude-bot@cockroachlabs.com>
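The routing order matters: leave-joint changes must be detected before the generic change-replicas path. A sketch with hypothetical stand-in types (the real ExternalRangeChange carries far more state than three booleans):

```go
package main

import "fmt"

// ExternalRangeChange is a hypothetical stand-in; only the three
// predicates named in the commit message are modeled.
type ExternalRangeChange struct {
	transferLease, leaveJoint, changeReplicas bool
}

func (c ExternalRangeChange) IsPureTransferLease() bool { return c.transferLease }
func (c ExternalRangeChange) IsLeaveJoint() bool        { return c.leaveJoint }
func (c ExternalRangeChange) IsChangeReplicas() bool    { return c.changeReplicas }

// applyChange sketches the dispatch: leave-joint changes are routed to
// maybeLeaveAtomicChangeReplicas rather than the changeReplicasImpl path,
// which would fail on them.
func applyChange(c ExternalRangeChange) string {
	switch {
	case c.IsPureTransferLease():
		return "transfer lease"
	case c.IsLeaveJoint():
		return "maybeLeaveAtomicChangeReplicas"
	case c.IsChangeReplicas():
		return "changeReplicasImpl"
	default:
		return "no-op"
	}
}

func main() {
	fmt.Println(applyChange(ExternalRangeChange{leaveJoint: true}))
}
```

Placing the IsLeaveJoint case between the lease-transfer and change-replicas arms mirrors point 3 above.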
tbg added a commit to tbg/cockroach that referenced this pull request on Mar 12, 2026
…stMMAUpreplication
When MMA repair-and-rebalance mode is active, ForceReplicationScanAndProcess delegates to rebalanceUntilStable() instead of the replicate queue (whose shouldQueue/process are no-ops in that mode). This enables WaitForFullReplication and deterministic test driving.
Add TestMMAUpreplication, which starts a 3-node cluster with ReplicationAuto, creates a scratch range (1 replica), enables MMA repair-and-rebalance mode, and verifies it upreplicates to 3 voters entirely through MMA repair.
Informs cockroachdb#164658.
Co-Authored-By: roachdev-claude <roachdev-claude-bot@cockroachlabs.com>
Important
Start here: walkthrough.md — a linear, code-level walkthrough of the repair pipeline. Uses repairAddVoter as a representative example and traces the full data flow from mode gating through change application. Includes inline source snippets (verifiable via showboat verify).
Architecture & design brief: design-brief.md — how the repair pipeline works and how it fits into MMA.
Productionization plan: mma-repair-brief.md — proposed merge strategy (6 PRs) and what's in/out of scope.
Detailed PR split: how-to-productionize.md — per-PR breakdown with files, LOC estimates, and review focus areas.
To discuss:
computeRepairAction
Summary
This PR implements the complete repair pipeline for the Multi-Metric
Allocator (MMA) prototype. Today, range repair (upreplication,
decommissioning, dead-node replacement, constraint enforcement) is
handled by the replicate queue. This PR teaches MMA to perform all of
these repairs itself, behind a new cluster setting mode (LBRebalancingMultiMetricRepairAndRebalance). When that mode is active, the replicate and lease queues become no-ops and MMA takes full ownership of both rebalancing and repair.
Design & planning
The branch starts with design documents that map out the repair action
space and the constraint satisfaction logic, including how the legacy
allocator's scorer hierarchy maps onto MMA concepts.
Repair actions (12 total)
All twelve repair actions are implemented in cluster_state_repair.go (~1,600 lines), organized into three groups:
- Count-based: AddVoter (with non-voter promotion), RemoveVoter, AddNonVoter, RemoveNonVoter, RemoveLearner, FinalizeAtomicReplicationChange.
- Replacements: ReplaceDeadVoter, ReplaceDeadNonVoter, ReplaceDecommissioningVoter, ReplaceDecommissioningNonVoter.
- Constraint swaps: SwapVoterForConstraints, SwapNonVoterForConstraints.
Each action is tested via the MMA DSL (repair, repair-needed commands) with datadriven test files covering count-based, constraint, and interaction scenarios.
Diversity picker
pickStoreByDiversity is generalized to accept a diversityScorer function parameter, enabling reuse across add, remove, replace, and swap
paths. Random tiebreaking ensures non-deterministic selection among
equally-good candidates.
Integration
A new LBRebalancingMultiMetricRepairAndRebalance mode is added to the LoadBasedRebalancingMode cluster setting. When active:
- Replicate queue: shouldQueue/process return immediately.
- Lease queue: shouldQueue/process return immediately.
- IncludeRepair is passed to ComputeChanges, which calls repair() before computing rebalancing changes.
Production fixes
- IsLeaveJoint() on ExternalRangeChange routes leave-joint changes through maybeLeaveAtomicChangeReplicas instead of changeReplicasImpl, which would fail.
- ForceReplicationScanAndProcess: now delegates to mmaStoreRebalancer.rebalanceUntilStable() when MMA repair-and-rebalance mode is active, fixing WaitForFullReplication and enabling ReplicationAuto in MMA tests.
- TestMMAUpreplication: integration test verifying end-to-end upreplication from 1 to 3 voters under MMA.
ASIM tests
The allocator simulator (ASIM) is extended to support MMA repair:
queue disabling under the new mode, same-store voter type transitions in
ReplicaChange.Apply, and golden-output repair test scenarios covering upreplication, dead-node replacement, decommissioning, and constraint
enforcement.
Commits
Individual commits speak for themselves. They are grouped into phases
visible in the commit log: design docs, infrastructure, count-based
repairs, replacement repairs, constraint swaps, refactoring, integration,
ASIM tests, and production fixes.
Epic: CRDB-39508