mmaintegration: introduce physical capacity model #164900

trunk-io[bot] merged 12 commits into cockroachdb:master
Conversation

Merged directly without going through the merge queue, as the queue was empty and the PR was up to date with the target branch.
tbg left a comment

Review: mmaintegration: introduce physical capacity model
The design is sound — moving background CPU from capacity to load is the right call, and the design comments with worked examples are excellent. The separation between store-level physical computation and range-level amplification via MakePhysicalRangeLoad is clean. A few issues to address before merging.
Issues not tied to specific lines
[correctness] highDiskSpaceUtilization comment is now stale (capacity_model.go:703-724): The comment derives fractionUsed = LogicalBytes / (LogicalBytes / diskUtil) = diskUtil. Under the new model, load=Used, capacity=Used+Available — the math still recovers actual disk utilization, but the comment references the old LogicalBytes-based derivation and is now misleading. (Not in the diff, so noting here.)
[correctness] Old computeCPUCapacityWithCap and computeStoreByteSizeCapacity are now dead code (capacity_model.go): After commit 3, these are only referenced from tests. Either remove them or add a comment noting they're retained as test baselines. (Not in the diff, so noting here.)
[commits] Commit 1 message claims "wires it into all callers": The opening sentence says "wires it into all callers, replacing the logical (capped multiplier) model" but commit 1 only introduces the model. The wiring happens in commit 3.
[tests] MakePhysicalRangeLoad has no direct test: The existing mma_conversion_test.go only passes IdentityAmpVector() (amp=1.0), which is a no-op. A bug in dimension indexing (e.g., applying the wrong dimension's amp factor) would go undetected. Add a test with non-identity amp factors.
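To make the risk concrete, here is a self-contained sketch of the dimension-indexing property such a test should pin down. The `loadVector` type, the dimension constants, and the `applyAmp` helper are hypothetical stand-ins, not mmaprototype's actual API; the point is that distinct non-identity factors per dimension catch a swapped index, while identity factors cannot.

```go
package main

import "fmt"

// loadVector and the dimension indices below are hypothetical stand-ins
// for mmaprototype's types, used only to illustrate the property under test.
type loadVector [3]float64

const (
	cpuRate = iota
	writeBandwidth
	byteSize
)

// applyAmp scales each dimension by its own amplification factor, the
// behavior MakePhysicalRangeLoad is expected to have.
func applyAmp(logical, amp loadVector) loadVector {
	var out loadVector
	for d := range logical {
		out[d] = logical[d] * amp[d] // a swapped index here would fail the check in main
	}
	return out
}

func main() {
	logical := loadVector{cpuRate: 100, writeBandwidth: 200, byteSize: 300}
	amp := loadVector{cpuRate: 2.0, writeBandwidth: 0.5, byteSize: 5.0} // non-identity, all distinct
	fmt.Println(applyAmp(logical, amp)) // [200 100 1500]
}
```

With amp=1.0 in every dimension (the current test coverage), any permutation of the indices would produce the same output, which is why the identity case cannot detect this class of bug.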
[tests] maxDiskSpaceAmplification cap is never exercised in tests: The testdata covers amp=2.0 and amp=0.5 but never hits the 5.0 cap. Add a case like logical-gib=10, used-gib=100 to verify capping.
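A quick sketch of the arithmetic the missing case would exercise (`diskAmp` and the constant are illustrative helpers, not the PR's actual code): with logical-gib=10 and used-gib=100 the raw ratio is 10, which must clamp to 5.0.

```go
package main

import "fmt"

// maxDiskSpaceAmplification mirrors the 5.0 cap discussed above; diskAmp
// is a hypothetical helper, not the PR's actual implementation.
const maxDiskSpaceAmplification = 5.0

func diskAmp(logicalGiB, usedGiB float64) float64 {
	amp := usedGiB / logicalGiB
	if amp > maxDiskSpaceAmplification {
		amp = maxDiskSpaceAmplification
	}
	return amp
}

func main() {
	fmt.Printf("%.1f %.1f %.1f\n", diskAmp(10, 20), diskAmp(20, 10), diskAmp(10, 100))
	// prints: 2.0 0.5 5.0 (the last case is the uncovered cap)
}
```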
Strengths
- Excellent design documentation in computePhysicalCPU — the worked examples for "background in load, not capacity" make the tradeoffs auditable.
- Eliminates code duplication (two separate mmaRangeLoad implementations consolidated into MakePhysicalRangeLoad).
- Good use of data-driven testing with 12 well-commented scenarios.
- Correct math — sum of per-store loads recovers node usage (proved in the comment).
(made with /review-crdb)
// minCapacity is the floor for per-store physical capacity in any dimension.
// This prevents zero-capacity values that could arise during early node startup
// or on empty stores.
blocking: minCapacity = 1.0 means 1 ns/s — effectively zero CPU capacity. The old model had cpuCapacityFloorPerStore = 0.1 * 1e9 (0.1 cores), which existed specifically to prevent utilization from going to infinity on overloaded nodes (see its detailed comment in capacity_model.go). If a store has non-zero load and capacity=1 ns/s, utilization becomes astronomical.
Either the new floor should be comparable to the old one, or a comment should explain why the protection is no longer needed under the physical model.
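To illustrate why the old floor existed, here is a toy calculation with made-up numbers (10-core node, a single store with a fixed 1 core of KV load, multiplier 1) comparing the two models as background grows:

```go
package main

import "fmt"

// oldUtil models the capped formula, where background shrinks the
// denominator; newUtil models the physical formula, where background is
// folded into load and capacity stays fixed at the node capacity.
func oldUtil(nodeCap, background, storeLoad float64) float64 {
	return storeLoad / (nodeCap - background)
}

func newUtil(nodeCap, background, storeLoad float64) float64 {
	return (storeLoad + background) / nodeCap
}

func main() {
	const nodeCap, storeLoad = 10.0, 1.0 // cores
	for _, bg := range []float64{0, 5, 9.9} {
		fmt.Printf("bg=%.1f old=%5.2f new=%.2f\n",
			bg, oldUtil(nodeCap, bg, storeLoad), newUtil(nodeCap, bg, storeLoad))
	}
}
```

Without a meaningful floor, oldUtil blows up as background approaches nodeCap (10x utilization at bg=9.9), while newUtil stays bounded at just over 1.0; that divergence is what this blocking comment is about.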
(tbg here: a comment is preferable if we are indeed comfortable with the new lower floor).
Added a comment to clarify why this is okay with the new model.
Code snippet:

// Note: The old capacity model (computeCPUCapacityWithCap) needed a more
// meaningful floor (cpuCapacityFloorPerStore = 0.1 cores = 1e8 ns/s) because
// its capacity formula subtracted background load from the node's CPU capacity:
// capacity = (nodeCap - background) / mult. As background grew, capacity shrank
// toward zero while store load stayed constant, sending utilization
// (load/capacity) to infinity. The physical model avoids this by keeping
// capacity fixed at nodeCap/numStores and folding background into the load side
// instead, so capacity never shrinks under load pressure and a negligible floor
// suffices.

// and MMA may try to shed from it. We accept this because: with this model,
//
// 1. Operators see balanced total CPU across nodes — they don't need to
//    distinguish KV from non-KV usage to understand the cluster state.
nit: The numbered list skips item 2 (goes 1, 3, 4, 5). Renumber to 1, 2, 3, 4.
(bot here): pushed a commit for this
type Amp float64

func (af Amp) SafeFormat(w redact.SafePrinter, _ rune) {
suggestion: af is passed to SafePrinter.Printf without marking it safe. The established pattern in this file (e.g., meanStoreLoad.SafeFormat) uses redact.SafeFloat() for floats. Amplification factors contain no PII, so either approach works:
func (af Amp) SafeFormat(w redact.SafePrinter, _ rune) {
	w.Printf("%.2f", redact.SafeFloat(float64(af)))
}
Alternatively, implement SafeValue() on Amp since it's always safe.
(bot here): pushed a commit for this
	}
}

type Amp float64
suggestion: Missing doc comment. Every other named type in this file (LoadDimension, LoadValue, LoadVector) has one. Same for AmpVector below.
// Amp is a per-dimension amplification factor that converts logical per-range
// load measurements into physical units. For example, a CPU amplification
// factor of 2.0 means each nanosecond of tracked replica CPU corresponds to
// 2 nanoseconds of actual node CPU.
type Amp float64
(bot here): pushed a commit for this
// structured that adding additional resource dimensions is easy.
type LoadDimension uint8

var _ [3 - NumLoadDimensions]struct{}
suggestion: This assertion is one-directional (catches additions but not removals) and undocumented. Make it bidirectional and add a comment:
// NumLoadDimensions must be exactly 3. Update SafeFormat methods for
// LoadVector and AmpVector if a dimension is added or removed.
var _ [NumLoadDimensions - 3]struct{}
var _ [3 - NumLoadDimensions]struct{}
(bot here): pushed a commit for this
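For reference, a standalone sketch of the bidirectional zero-size-array assertion pattern the suggestion proposes (`numDims` is a stand-in for `NumLoadDimensions`):

```go
package main

import "fmt"

// numDims stands in for NumLoadDimensions. If it deviates from 3 in
// either direction, one of the two array lengths below becomes negative
// and the package stops compiling.
const numDims = 3

var _ [numDims - 3]struct{} // fails to compile if numDims < 3 (catches removals)
var _ [3 - numDims]struct{} // fails to compile if numDims > 3 (catches additions)

func main() {
	fmt.Println("numDims is exactly", numDims)
}
```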
// (CachedCapacity + GetNodeCapacity) are cheap and uncontended at the call
// rates involved (once/min for leaseholder msgs, once per rebalance for queue
// operations). See ComputeAmplificationFactors for discussion of consistency
// with the store-level load in StoreLoadMsg.
nit: cap shadows the Go builtin. Consider sc or storeCap.
(bot here): pushed a commit for this
	amp mmaprototype.AmpVector,
) mmaprototype.RangeLoad {
	var rl mmaprototype.RangeLoad
	cpuNanos := requestCPUNanos + raftCPUNanos
nit: Redundant float64() casts — cpuNanos, raftCPUNanos, and writeBytesPerSec are already float64. Only float64(logicalBytes) (int64) is needed. float64(amp[...]) is needed since Amp is a named type.
(bot here): pushed a commit for this
tbg left a comment
I posted the AI review, addressed the "obvious" fixes, did a personal review. Very nice! A couple of comments are still open for discussion.
@tbg partially reviewed 16 files and made 6 comments.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on sumeerbhola and wenyihu6).
pkg/kv/kvserver/mmaintegration/physical_model.go line 19 at r1 (raw file):
// load is the per-store physical load in each dimension's native unit
// (ns/s for CPURate, bytes/s for WriteBandwidth, bytes for ByteSize).
load mmaprototype.LoadVector
this could be a convenient place to point out that the load doesn't have to follow from what MMA tracks (i.e. originating from ranges) and could point to the CPU dimension as an example: background load is reflected here, but doesn't originate from ranges. So adding up the range CPU loads will generally yield less than the store load.
pkg/kv/kvserver/mmaintegration/physical_model.go line 43 at r1 (raw file):
// store work to the directly-tracked store CPU (CPUPerSecond). This is
// clamped to [1, cpuIndirectOverheadMultiplier]. When storesCPURate is 0,
// the factor defaults to the cap.
in this case, the factor has no implications, right? We don't have any load to give, so it doesn't matter how we amplify it - this is just to stay away from infinities, right? Other stores sending us load will use their own multiplier instead of ours?
pkg/kv/kvserver/mmaintegration/physical_model.go line 53 at r1 (raw file):
// the factor, plus an even share of background. This lets stores with more
// KV work report higher load while still accounting for node-level overhead.
// Falls back to storesCPU/numStores when CPUPerSecond is unavailable.
why would CPUPerSecond be unavailable?
pkg/kv/kvserver/mmaintegration/physical_model.go line 129 at r1 (raw file):
// Step 3: Compute per-store load and capacity.
storeCPU := desc.Capacity.CPUPerSecond
if storeCPU <= 0 {
Still unsure why this wouldn't always be provided. The field was added in #96127, so wouldn't this always be populated now?
pkg/kv/kvserver/mmaintegration/physical_model.go line 226 at r1 (raw file):
// replicas contribute CPU to the store total but don't report range-level
// load, and ranges may join or leave between snapshots. A small drift in
// amplification factors is negligible compared to these existing gaps.
and CPU load now reflects a share of background load which by design doesn't originate from range load. So it's not even just that it "tolerates" it, it is by design that they're not even trying to be equal in general.
wenyihu6 left a comment
Thanks for the commits!
@wenyihu6 made 7 comments and resolved 6 discussions.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on sumeerbhola and tbg).
pkg/kv/kvserver/mmaintegration/physical_model.go line 19 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
this could be a convenient place to point out that the load doesn't have to follow from what MMA tracks (i.e. originating from ranges) and could point to the CPU dimension as an example: background load is reflected here, but doesn't originate from ranges. So adding up the range CPU loads will generally yield less than the store load.
Added.
pkg/kv/kvserver/mmaintegration/physical_model.go line 43 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
in this case, the factor has no implications, right? We don't have any load to give, so it doesn't matter how we amplify it - this is just to stay away from infinities, right? Other stores sending us load will use their own multiplier instead of ours?
Yes, added a comment to emphasize.
pkg/kv/kvserver/mmaintegration/physical_model.go line 53 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
why would CPUPerSecond be unavailable?
It may be -1 if grunning is unsupported (https://github.com/kvoli/cockroach/blob/c28ed6b40bdc2d48184ab93e18c6b3ca28cd2eba/pkg/kv/kvserver/store.go#L2620-L2622) or 0 during node startup.
pkg/kv/kvserver/mmaintegration/physical_model.go line 129 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Still unsure why this wouldn't always be provided. The field was added in #96127, so wouldn't this always be populated now?
Replied above.
pkg/kv/kvserver/mmaintegration/physical_model.go line 226 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
and CPU load now reflects a share of background load which by design doesn't originate from range load. So it's not even just that it "tolerates" it, it is by design that they're not even trying to be equal in general.
Yes, updated.
Pushed a commit to rename cpuIndirectOverheadMultiplier to maxCPUAmplification to be consistent with maxSpaceAmplification if you wanna have a look.
Force-pushed from 9a5785f to 809cac2.
/trunk merge
This change introduces a physical capacity model that expresses MMA store
loads and capacities in physical resource units (CPU ns/s, disk bytes)
and wires it into all callers, replacing the logical (capped multiplier)
model.
The capped model expressed capacity in a synthetic "logical KV CPU" unit:
load = CPUPerSecond
capacity = (nodeCap - estimatedBg) / estimatedMult / numStores
The physical model factors the multiplier out of the capacity formula and
into a separate amplification factor applied at the range-load boundary,
and moves background from the capacity side to the load side:
load = CPUPerSecond * ampFactor + estimatedBg / numStores
capacity = nodeCap / numStores
range load = logicalRangeLoad * ampFactor (via MakePhysicalRangeLoad)
Both models use the same capped-multiplier logic for background
estimation (ampFactor = clamp(nodeUsage/storesCPU, 1, 3), background =
max(0, nodeUsage - storesCPU * ampFactor)).
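Plugging made-up numbers into the shared estimation formulas (say the node is using 10 cores while stores track 3): the raw ratio 10/3 ≈ 3.33 clamps to 3, leaving 1 core attributed to background. A sketch:

```go
package main

import (
	"fmt"
	"math"
)

// estimateBackground applies the capped-multiplier formulas quoted above
// to illustrative inputs; it is not the PR's actual implementation.
func estimateBackground(nodeUsage, storesCPU float64) (ampFactor, background float64) {
	ampFactor = math.Min(math.Max(nodeUsage/storesCPU, 1), 3)
	background = math.Max(0, nodeUsage-storesCPU*ampFactor)
	return ampFactor, background
}

func main() {
	amp, bg := estimateBackground(10e9, 3e9) // ns/s: 10 cores used, 3 tracked
	fmt.Printf("ampFactor=%.2f background=%.2f cores\n", amp, bg/1e9)
	// prints: ampFactor=3.00 background=1.00 cores
}
```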
Background is added to load rather than subtracted from capacity.
Capacity is now nodeCap/numStores — a real-world quantity (the store's
share of the node's CPU cores) that operators can directly interpret
without knowing the amplification factor. Subtracting background from
capacity hides pressure in the denominator: a node at 80% real CPU
(60% background + 20% KV) shows capacity = (10-6)/numStores = 4 and
load = 2, yielding 50% utilization. Adding background to load instead
yields load = 8, capacity = 10, utilization = 80% — matching reality.
This ensures MMA correctly identifies background-heavy nodes as shed
candidates.
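The arithmetic from the example above, as a runnable check (single store, multiplier 1):

```go
package main

import "fmt"

// Reproduces the worked example: node capacity 10 cores, 6 cores of
// background, 2 cores of KV load.
func main() {
	const nodeCap, background, kvLoad = 10.0, 6.0, 2.0
	oldUtil := kvLoad / (nodeCap - background) // capped model: 2/4
	newUtil := (kvLoad + background) / nodeCap // physical model: 8/10
	fmt.Printf("old=%.0f%% new=%.0f%%\n", oldUtil*100, newUtil*100)
	// prints: old=50% new=80%
}
```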
The physical model separates two concerns that the capped model
conflates into a single capacity value. Capacity answers "how many
physical CPU cores are available to this store" — a quantity grounded
in the actual resource. The amplification factor answers "how much
physical CPU does each unit of logical range work cost" and is applied
only at the range-load boundary via `MakePhysicalRangeLoad`.
This separation has two practical benefits:
1. It makes the `LoadVector` uniform across dimensions. Disk is
inherently physical (load=`Used`, capacity=`Used+Available`). CPU
now follows the same pattern with a separate amplification factor
for per-range conversion.
2. It resolves the unit mismatch between physical `NodeCPULoad` and
store-level pending deltas in `nodeState.adjustedCPU`. MMA uses
`adjustedCPU` for node-level overload detection: it starts at
`NodeCPULoad` (= `NodeCPURateUsage`, physical ns/s) and accumulates
pending change deltas as ranges are scheduled to move. Under the
capped model, these deltas were in logical units — a range using
0.5 cores of direct KV CPU produced a delta of 0.5e9 ns/s, but its
true physical impact could be 0.5 * ampFactor = 1.5e9 ns/s. Under
the physical model, `MakePhysicalRangeLoad` applies the amplification
factor, so the delta is already in physical ns/s and combines directly
with `NodeCPULoad`.
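The delta example in numbers (an ampFactor of 3 is assumed here, matching the 0.5 to 1.5 core figures above):

```go
package main

import "fmt"

// A range using 0.5 cores of direct KV CPU under an amplification factor
// of 3: the old delta was the logical 0.5e9 ns/s, while the physical
// delta is 1.5e9 ns/s, which combines directly with NodeCPULoad.
func main() {
	const logicalDelta = 0.5e9 // ns/s of direct KV CPU
	const ampFactor = 3.0
	physicalDelta := logicalDelta * ampFactor
	fmt.Printf("logical=%.1f cores physical=%.1f cores\n",
		logicalDelta/1e9, physicalDelta/1e9)
	// prints: logical=0.5 cores physical=1.5 cores
}
```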
mmaintegration: floor physical capacity at minCapacity

This commit floors per-store physical capacity at 1.0 (minCapacity) for
both CPU and disk dimensions to prevent zero-capacity values during
startup or on empty stores. The floor is applied via a defer on named
returns so it covers all code paths in computePhysicalCPU and
computePhysicalDisk.
kvserver: convert MMA store and range loads to physical units

This change wires the physical capacity model into all MMA callers.
`MakeStoreLoadMsg` now delegates to `computePhysicalStore`. Range loads
are converted via `MakePhysicalRangeLoad` using an `AmpVector` from the
new `Store.AmplificationFactors()` method, threaded through
`mmaRangeLoad`, `tryConstructMMARangeMsg`, `NonMMAPreTransferLease`, and
`NonMMAPreChangeReplicas`. `Store.AmplificationFactors()` returns
`IdentityAmpVector()` when `CachedCapacity()` is zero (e.g. early
startup). The simulator uses `IdentityAmpVector` with a TODO for real
factors.

Release note: None
Co-Authored-By: roachdev-claude <roachdev-claude-bot@cockroachlabs.com>
/trunk merge
Epic: CRDB-55052
Release note: none