mmaintegration: introduce physical capacity model#164792

Closed
wenyihu6 wants to merge 3 commits into cockroachdb:master from wenyihu6:first

Conversation

@wenyihu6
Contributor

@wenyihu6 wenyihu6 commented Mar 4, 2026

Rebased on top of #164760.
Epic: CRDB-55052
Release note: none


mmaintegration: introduce physical capacity model

This change introduces a physical capacity model that expresses MMA store
loads and capacities in physical resource units (CPU ns/s, disk bytes)
and wires it into all callers, replacing the logical (capped multiplier)
model.

The capped model expressed capacity in a synthetic "logical KV CPU" unit:

  load = CPUPerSecond
  capacity = (nodeCap - estimatedBg) / estimatedMult / numStores

The physical model factors the multiplier out of the capacity formula and
into a separate amplification factor applied at the range-load boundary:

  load = CPUPerSecond * ampFactor
  capacity = (nodeCap - estimatedBg) / numStores
  range load = logicalRangeLoad * ampFactor   (via MakePhysicalRangeLoad)

Both models use the same capped-multiplier logic for background
estimation (ampFactor = clamp(nodeUsage/storesCPU, 1, 3), background =
max(0, nodeUsage - storesCPU * ampFactor)). The ratio load/capacity is
algebraically identical, so all rebalancing decisions driven by
`fractionUsed` in `loadSummaryForDimension` are unchanged.
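The algebraic equivalence is easy to check with concrete numbers. A minimal sketch (hypothetical values; the variable names mirror the formulas above, not actual CRDB identifiers):

```python
def clamp(x, lo, hi):
    return max(lo, min(x, hi))

node_usage = 6e9   # physical node CPU usage, ns/s
stores_cpu = 2e9   # sum of per-store CPUPerSecond (logical KV CPU)
node_cap = 8e9     # node CPU capacity, ns/s
num_stores = 1

# Shared capped-multiplier background estimation.
amp = clamp(node_usage / stores_cpu, 1, 3)          # 3.0
background = max(0, node_usage - stores_cpu * amp)  # 0.0

# Capped (logical) model.
capped_load = stores_cpu
capped_capacity = (node_cap - background) / amp / num_stores

# Physical model: multiplier moves from capacity to load.
phys_load = stores_cpu * amp
phys_capacity = (node_cap - background) / num_stores

# Identical load/capacity ratio, hence identical fractionUsed.
assert abs(capped_load / capped_capacity - phys_load / phys_capacity) < 1e-12
print(phys_load / phys_capacity)  # 0.75 under both models
```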

The physical model is preferred because it separates two concerns that
the capped model conflates into a single capacity value. Capacity
answers "how much physical CPU is available for MMA work after
subtracting background" — a quantity grounded in the actual resource.
The amplification factor answers "how much physical CPU does each unit
of logical range work cost" and is applied only at the range-load
boundary via `MakePhysicalRangeLoad`.

This separation has two practical benefits:

1. It makes the `LoadVector` uniform across dimensions. Disk is
   inherently physical (`load=Used`, `capacity=Used+Available`). CPU
   now follows the same pattern with a separate amplification factor
   for per-range conversion.

2. It resolves the unit mismatch between physical `NodeCPULoad` and
   store-level pending deltas in `nodeState.adjustedCPU`. MMA uses
   `adjustedCPU` for node-level overload detection: it starts at
   `NodeCPULoad` (= `NodeCPURateUsage`, physical ns/s) and accumulates
   pending change deltas as ranges are scheduled to move
   (`applyChangeLoadDelta` adds `change.loadDelta[CPURate]`). The
   result is compared against `NodeCPUCapacity` in
   `loadSummaryForDimension`. Under the capped model, these deltas were
   in logical units (`RequestCPUNanos + RaftCPUNanos`, no amplification):
   a range using 0.5 cores of direct KV CPU produced a delta of
   0.5e9 ns/s, but its true physical impact on node CPU could be
   0.5 * ampFactor = 1.5e9 ns/s. Under the physical model,
   `MakePhysicalRangeLoad` applies the amplification factor, so the
   delta is already in physical ns/s and combines directly with
   `NodeCPULoad`.
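To make the unit fix concrete, here is a toy analogue of the conversion (the function name and numbers follow the description above and are illustrative, not the real API):

```python
AMP = 3.0  # this store's hypothetical CPU amplification factor

def make_physical_range_load(logical_cpu_ns_per_s: float) -> float:
    """Convert a range's logical KV CPU into physical node CPU (ns/s)."""
    return logical_cpu_ns_per_s * AMP

node_cpu_load = 6.0e9  # NodeCPULoad, already physical ns/s

# Capped model: the delta stayed logical (0.5e9), understating the
# range's true physical impact on the node by a factor of AMP.
logical_delta = 0.5e9

# Physical model: the delta is amplified up front, so it combines
# directly with NodeCPULoad in the same units.
physical_delta = make_physical_range_load(0.5e9)  # 1.5e9 ns/s
adjusted = node_cpu_load + physical_delta
print(adjusted)
```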

The trade-off is that both load and capacity now include estimation
error from the amplification factor. In the capped model, store load was
the directly measured `CPUPerSecond`, a ground-truth value requiring no
estimation. In the physical model, load is `CPUPerSecond * ampFactor`,
so diagnosing a store reporting "6 cores of load" requires knowing the
amplification factor to recover the underlying 2 cores of direct KV CPU.


kvserver: convert MMA store and range loads to physical units

This change wires the physical capacity model into all MMA callers.

`MakeStoreLoadMsg` now delegates to `computePhysicalStore`.
Range loads are converted via `MakePhysicalRangeLoad` using an
`AmpVector` from the new `Store.AmplificationFactors()` method,
threaded through `mmaRangeLoad`, `tryConstructMMARangeMsg`,
`NonMMAPreTransferLease`, and `NonMMAPreChangeReplicas`.

`Store.AmplificationFactors()` returns `IdentityAmpVector()` when
`CachedCapacity()` is zero (e.g. early startup). The simulator uses
`IdentityAmpVector` with a TODO for real factors.

Release note: None

wenyihu6 added 3 commits March 3, 2026 21:19
The capped multiplier model (`computeCPUCapacityWithCap`) subtracts
background CPU when deriving per-store capacity. This intentionally
breaks the invariant maintained by the naive model:

  sum(store loads) / sum(store capacities) = NodeCPURateUsage / NodeCPURateCapacity

When background CPU exceeds the cap, node-level overload detection can
no longer recover physical utilization by summing store-level values.

This change replaces the derived `NodeLoad` fields (`ReportedCPU` and
`CapacityCPU`, incremental sums of per-store values that relied on this
invariant) with explicit physical node-level CPU values (`NodeCPULoad`,
`NodeCPUCapacity`) carried in `StoreLoadMsg` from the `NodeCapacity`
fields.

Note that each store generates its descriptor independently, so
`NodeCPULoad` values may differ slightly across stores on the same
node due to sampling time differences. This is acceptable since
they should be roughly the same.
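A quick numeric sketch of the broken invariant (hypothetical values): once the multiplier is capped and estimated background is subtracted from per-store capacity, summing store-level values no longer recovers node-level utilization.

```python
def clamp(x, lo, hi):
    return max(lo, min(x, hi))

node_usage, node_cap = 7e9, 10e9  # physical node CPU, ns/s
stores_cpu, num_stores = 1e9, 2   # total logical store CPU

amp = clamp(node_usage / stores_cpu, 1, 3)  # implicit multiplier 7, capped at 3
background = node_usage - stores_cpu * amp  # 4e9 ns/s attributed to background

store_load = stores_cpu / num_stores
store_cap = (node_cap - background) / amp / num_stores

sum_ratio = (store_load * num_stores) / (store_cap * num_stores)  # 0.5
node_ratio = node_usage / node_cap                                # 0.7
assert sum_ratio != node_ratio  # invariant broken when background > 0
```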
@wenyihu6 wenyihu6 requested review from a team as code owners March 4, 2026 02:46
@trunk-io
Contributor

trunk-io bot commented Mar 4, 2026

Merging to master in this repository is managed by Trunk.

  • To merge this pull request, check the box to the left or comment /trunk merge below.

@wenyihu6 wenyihu6 requested review from sumeerbhola and tbg and removed request for a team March 4, 2026 02:46
@cockroach-teamcity
Member

This change is Reviewable

tbg added a commit to tbg/cockroach that referenced this pull request Mar 4, 2026
Add a project-level code review skill that orchestrates multiple review
aspects (redactability, commit structure, red/green testing, PR descriptions,
general quality) by referencing existing rules and the commit-arc skill
rather than duplicating their content. Includes support for posting
batched GitHub reviews with inline line comments via the API.

Also add a conventions rule covering PR descriptions, Bazel build files,
metrics, Go conventions, and CI checks.

Note: .github/workflows/pr-analyzer-threestage.yml already runs an
AI-powered bug-screening pipeline on PRs. That workflow is deliberately
narrow — it only flags likely bugs and uses a three-stage confidence
filter to minimize noise. This skill is broader (reviewability, commit
structure, redactability, etc.) and designed for interactive use. In the
future, the CI workflow could invoke this skill to get structured,
aspect-based review while keeping its conservative posting threshold.

Example reviews generated by this skill:
- PR cockroachdb#164658 (63-commit MMA repair): https://gist.github.com/tbg/ba4d528bcfa27cdae9ca587f91fb7178
- PR cockroachdb#164792 (physical modeling): https://gist.github.com/tbg/1d78a0c496069f18ec35a2b551483052
- PR cockroachdb#164677 (connection retry test): https://gist.github.com/tbg/69ec2eea0c4ab5d794a135fa62e551dc
- PR cockroachdb#79134 (SKIP LOCKED): https://gist.github.com/tbg/052dae1e3f6449811fca92e275905f1f
- PR cockroachdb#161454 (engine separation): https://gist.github.com/tbg/63ef958edd355fb166493b008a0291d5

Epic: none
Release note: None

Co-Authored-By: roachdev-claude <roachdev-claude-bot@cockroachlabs.com>
Member

@tbg tbg left a comment


Reviewing commits 2 and 3 only (commit 1 is from #164760).

Two clean commits: the first introduces the physical model types and functions with a comprehensive datadriven test; the second wires the new model into all callers. The algebraic equivalence argument is convincing (fractionUsed is unchanged), and the commit split follows a good "introduce abstraction, then wire" pattern.

A few stale comment references that are now out of date after commit 3's changes to MakeStoreLoadMsg — not blocking, but worth a follow-up or a fixup commit:

  • StoreLoadMsg.Capacity comment (messages.go:25) says "See computeCPUCapacityWithCap" but MakeStoreLoadMsg now uses computePhysicalStore.
  • NodeLoad comment (load.go:257) references computeCPUCapacityWithCap — same issue.
  • adjustedCPU comment (cluster_state.go:908) says "deltas are in store-level modeled CPU units while NodeCPULoad is physical." Under the physical model, range deltas are also in physical units (via MakePhysicalRangeLoad). The remaining mismatch is subtler: the source store's amplification factor is used for the delta, but the target node's actual amplification may differ.

(made with /review-crdb)

// jobs, etc. This is zero when the implicit multiplier is below the cap.
//
// 4. load = mmaAttributedLoad / numStores
// Per-store MMA-attributed physical CPU. This is what MMA can control.
Member


nit: Step 4 says "load = mmaAttributedLoad / numStores" but the implementation (line 99) uses desc.Capacity.CPUPerSecond * ampFactor when CPUPerSecond > 0, only falling back to the even split. The inline comment at line 96 explains this well, but the numbered summary should match (e.g., "load = storeCPU * ampFactor, where storeCPU is this store's CPUPerSecond or storesCPURate / numStores as fallback").

Comment on lines +125 to +128
// minCapacity is the floor for per-store physical capacity in any dimension.
// This prevents zero-capacity values that could arise during early node startup
// or on empty stores.
const minCapacity mmaprototype.LoadValue = 1.0
Member


suggestion: This floor is ~8 orders of magnitude lower than the old cpuCapacityFloorPerStore (0.1 cores = 1e8 ns/s). The algebraic equivalence doesn't hold at the floor: the old model bounded fractionUsed at ~20x while this allows astronomically large values. The rebalancing outcome is the same (overloadUrgent saturates at fractionUsed > 0.9), so this is fine — but worth documenting.

Suggested change
// minCapacity is the floor for per-store physical capacity in any dimension.
// This prevents zero-capacity values that could arise during early node startup
// or on empty stores.
const minCapacity mmaprototype.LoadValue = 1.0
// minCapacity is a division-by-zero guard for per-store physical capacity in
// any dimension. It is intentionally tiny (1 unit) rather than a meaningful
// capacity floor: when capacity is this small, fractionUsed can be very large,
// but loadSummaryForDimension saturates at overloadUrgent (fractionUsed > 0.9)
// regardless of magnitude.
const minCapacity mmaprototype.LoadValue = 1.0

NodeCPURateUsage: int64(nodeUsageCores * nsPerCore),
NodeCPURateCapacity: int64(nodeCapCores * nsPerCore),
},
Capacity: roachpb.StoreCapacity{
Member


suggestion: None of the test cases set CPUPerSecond, so the per-store CPU differentiation path (computePhysicalCPU line 99-102) is untested. That's the production path — each store reports its own CPUPerSecond. Consider adding a multi-store case where CPUPerSecond differs from the even split to exercise the load weighting.

@@ -22,23 +23,23 @@ type mmaReplica Replica
// The returned RangeLoad contains stats across multiple dimensions that MMA uses
// to determine top-k replicas and evaluate the load impact of rebalancing them.
func mmaRangeLoad(
Member


nit: The comment starts with "The returned RangeLoad..." without the conventional function-name lead sentence. The method variant at line 37 has the full doc. Consider: "mmaRangeLoad constructs a RangeLoad from replica load stats and MVCC stats, converting logical loads to physical units via the given amplification factors."

Member

@tbg tbg left a comment


@tbg partially reviewed 27 files and made 1 comment.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on sumeerbhola and wenyihu6).


pkg/kv/kvserver/allocator/mmaprototype/load.go line 151 at r2 (raw file):

		av[CPURate],
		av[WriteBandwidth],
		av[ByteSize],

Add something like `var _ [3 - NumLoadDimensions]struct{}` near here so that we are forced to update this should `NumLoadDimensions` grow.

Collaborator

@sumeerbhola sumeerbhola left a comment


@sumeerbhola made 1 comment.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on tbg and wenyihu6).


-- commits line 43 at r3:
(drive-by comment)

Can we make `capacity = nodeCap / numStores`, and just take `estimatedBg / numStores` and add it to each store's load? Then at least one number (the capacity) corresponds to the real world.

@wenyihu6
Contributor Author

wenyihu6 commented Mar 5, 2026

Closing in favor of #164900.
