Skip to content

mmaintegration: introduce physical capacity model#164900

Merged
trunk-io[bot] merged 12 commits intocockroachdb:masterfrom
wenyihu6:oldmodel2
Mar 6, 2026
Merged

mmaintegration: introduce physical capacity model#164900
trunk-io[bot] merged 12 commits intocockroachdb:masterfrom
wenyihu6:oldmodel2

Conversation

@wenyihu6
Copy link
Copy Markdown
Contributor

@wenyihu6 wenyihu6 commented Mar 5, 2026

Epic: CRDB-55052
Release note: none


mmaintegration: introduce physical capacity model

This change introduces a physical capacity model that expresses MMA store
loads and capacities in physical resource units (CPU ns/s, disk bytes)
and wires it into all callers, replacing the logical (capped multiplier)
model.

The capped model expressed capacity in a synthetic "logical KV CPU" unit:

load = CPUPerSecond
capacity = (nodeCap - estimatedBg) / estimatedMult / numStores

The physical model factors the multiplier out of the capacity formula and
into a separate amplification factor applied at the range-load boundary,
and moves background from the capacity side to the load side:

load = CPUPerSecond * ampFactor + estimatedBg / numStores
capacity = nodeCap / numStores
range load = logicalRangeLoad * ampFactor (via MakePhysicalRangeLoad)

Both models use the same capped-multiplier logic for background
estimation (ampFactor = clamp(nodeUsage/storesCPU, 1, 3), background =
max(0, nodeUsage - storesCPU * ampFactor)).

Background is added to load rather than subtracted from capacity.
Capacity is now nodeCap/numStores — a real-world quantity (the store's
share of the node's CPU cores) that operators can directly interpret
without knowing the amplification factor. Subtracting background from
capacity hides pressure in the denominator: a node at 80% real CPU
(60% background + 20% KV) shows capacity = (10-6)/numStores = 4 and
load = 2, yielding 50% utilization. Adding background to load instead
yields load = 8, capacity = 10, utilization = 80% — matching reality.
This ensures MMA correctly identifies background-heavy nodes as shed
candidates.

The physical model separates two concerns that the capped model
conflates into a single capacity value. Capacity answers "how many
physical CPU cores are available to this store" — a quantity grounded
in the actual resource. The amplification factor answers "how much
physical CPU does each unit of logical range work cost" and is applied
only at the range-load boundary via MakePhysicalRangeLoad.

This separation has two practical benefits:

  1. It makes the LoadVector uniform across dimensions. Disk is
    inherently physical (load=Used, capacity=Used+Available). CPU
    now follows the same pattern with a separate amplification factor
    for per-range conversion.

  2. It resolves the unit mismatch between physical NodeCPULoad and
    store-level pending deltas in nodeState.adjustedCPU. MMA uses
    adjustedCPU for node-level overload detection: it starts at
    NodeCPULoad (= NodeCPURateUsage, physical ns/s) and accumulates
    pending change deltas as ranges are scheduled to move. Under the
    capped model, these deltas were in logical units — a range using
    0.5 cores of direct KV CPU produced a delta of 0.5e9 ns/s, but its
    true physical impact could be 0.5 * ampFactor = 1.5e9 ns/s. Under
    the physical model, MakePhysicalRangeLoad applies the amplification
    factor, so the delta is already in physical ns/s and combines directly
    with NodeCPULoad.


mmaintegration: floor physical capacity at minCapacity

This commit floors per-store physical capacity at 1.0 (minCapacity)
for both CPU and disk dimensions to prevent zero-capacity during
startup or on empty stores. The floor is applied via a defer on named
returns so it covers all code paths in computePhysicalCPU and
computePhysicalDisk.


kvserver: convert MMA store and range loads to physical units

This change wires the physical capacity model into all MMA callers.

MakeStoreLoadMsg now delegates to computePhysicalStore.
Range loads are converted via MakePhysicalRangeLoad using an
AmpVector from the new Store.AmplificationFactors() method,
threaded through mmaRangeLoad, tryConstructMMARangeMsg,
NonMMAPreTransferLease, and NonMMAPreChangeReplicas.

Store.AmplificationFactors() returns IdentityAmpVector() when
CachedCapacity() is zero (e.g. early startup). The simulator uses
IdentityAmpVector with a TODO for real factors.

Release note: None

@wenyihu6 wenyihu6 requested review from a team as code owners March 5, 2026 05:26
@trunk-io
Copy link
Copy Markdown
Contributor

trunk-io bot commented Mar 5, 2026

😎 Merged directly without going through the merge queue, as the queue was empty and the PR was up to date with the target branch - details.

@wenyihu6 wenyihu6 requested review from sumeerbhola and tbg March 5, 2026 05:26
@cockroach-teamcity
Copy link
Copy Markdown
Member

This change is Reviewable

Copy link
Copy Markdown
Member

@tbg tbg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review: mmaintegration: introduce physical capacity model

The design is sound — moving background CPU from capacity to load is the right call, and the design comments with worked examples are excellent. The separation between store-level physical computation and range-level amplification via MakePhysicalRangeLoad is clean. A few issues to address before merging.

Issues not tied to specific lines

[correctness] highDiskSpaceUtilization comment is now stale (capacity_model.go:703-724): The comment derives fractionUsed = LogicalBytes / (LogicalBytes / diskUtil) = diskUtil. Under the new model, load=Used, capacity=Used+Available — the math still recovers actual disk utilization, but the comment references the old LogicalBytes-based derivation and is now misleading. (Not in the diff, so noting here.)

[correctness] Old computeCPUCapacityWithCap and computeStoreByteSizeCapacity are now dead code (capacity_model.go): After commit 3, these are only referenced from tests. Either remove them or add a comment noting they're retained as test baselines. (Not in the diff, so noting here.)

[commits] Commit 1 message claims "wires it into all callers": The opening sentence says "wires it into all callers, replacing the logical (capped multiplier) model" but commit 1 only introduces the model. The wiring happens in commit 3.

[tests] MakePhysicalRangeLoad has no direct test: The existing mma_conversion_test.go only passes IdentityAmpVector() (amp=1.0), which is a no-op. A bug in dimension indexing (e.g., applying the wrong dimension's amp factor) would go undetected. Add a test with non-identity amp factors.

[tests] maxDiskSpaceAmplification cap is never exercised in tests: The testdata covers amp=2.0 and amp=0.5 but never hits the 5.0 cap. Add a case like logical-gib=10, used-gib=100 to verify capping.

Strengths

  • Excellent design documentation in computePhysicalCPU — the worked examples for "background in load, not capacity" make the tradeoffs auditable.
  • Eliminates code duplication (two separate mmaRangeLoad implementations consolidated into MakePhysicalRangeLoad).
  • Good use of data-driven testing with 12 well-commented scenarios.
  • Correct math — sum of per-store loads recovers node usage (proved in the comment).

(made with /review-crdb)


// minCapacity is the floor for per-store physical capacity in any dimension.
// This prevents zero-capacity values that could arise during early node startup
// or on empty stores.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

blocking: minCapacity = 1.0 means 1 ns/s — effectively zero CPU capacity. The old model had cpuCapacityFloorPerStore = 0.1 * 1e9 (0.1 cores), which existed specifically to prevent utilization from going to infinity on overloaded nodes (see its detailed comment in capacity_model.go). If a store has non-zero load and capacity=1 ns/s, utilization becomes astronomical.

Either the new floor should be comparable to the old one, or a comment should explain why the protection is no longer needed under the physical model.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(tbg here: a comment is preferrable if we are indeed comfortable with the new lower floor).

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a comment to clarify why this is okay with the new model.

Code snippet:

// Note: The old capacity model(computeCPUCapacityWithCap) needed a more
// meaningful floor (cpuCapacityFloorPerStore = 0.1 cores = 1e8 ns/s) because
// its capacity formula subtracted background load from the node's CPU capacity:
// capacity = (nodeCap - background) / mult. As background grew, capacity shrank
// toward zero while store load stayed constant, sending utilization
// (load/capacity) to infinity. The physical model avoids this by keeping
// capacity fixed at nodeCap/numStores and folding background into the load side
// instead, so capacity never shrinks under load pressure and a negligible floor
// suffices.

// and MMA may try to shed from it. We accept this because: with this model,
//
// 1. Operators see balanced total CPU across nodes — they don't need to
// distinguish KV from non-KV usage to understand the cluster state.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: The numbered list skips item 2 (goes 1, 3, 4, 5). Renumber to 1, 2, 3, 4.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(bot here): pushed a commit for this


type Amp float64

func (af Amp) SafeFormat(w redact.SafePrinter, _ rune) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion: af is passed to SafePrinter.Printf without marking it safe. The established pattern in this file (e.g., meanStoreLoad.SafeFormat) uses redact.SafeFloat() for floats. Amplification factors contain no PII, so either approach works:

Suggested change
func (af Amp) SafeFormat(w redact.SafePrinter, _ rune) {
func (af Amp) SafeFormat(w redact.SafePrinter, _ rune) {
w.Printf("%.2f", redact.SafeFloat(float64(af)))

Alternatively, implement SafeValue() on Amp since it's always safe.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(bot here): pushed a commit for this

}
}

type Amp float64
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion: Missing doc comment. Every other named type in this file (LoadDimension, LoadValue, LoadVector) has one. Same for AmpVector below.

Suggested change
type Amp float64
// Amp is a per-dimension amplification factor that converts logical per-range
// load measurements into physical units. For example, a CPU amplification
// factor of 2.0 means each nanosecond of tracked replica CPU corresponds to
// 2 nanoseconds of actual node CPU.
type Amp float64

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(bot here): pushed a commit for this

// structured that adding additional resource dimensions is easy.
type LoadDimension uint8

var _ [3 - NumLoadDimensions]struct{}
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion: This assertion is one-directional (catches additions but not removals) and undocumented. Make it bidirectional and add a comment:

Suggested change
var _ [3 - NumLoadDimensions]struct{}
// NumLoadDimensions must be exactly 3. Update SafeFormat methods for
// LoadVector and AmpVector if a dimension is added or removed.
var _ [NumLoadDimensions - 3]struct{}
var _ [3 - NumLoadDimensions]struct{}

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(bot here): pushed a commit for this

// (CachedCapacity + GetNodeCapacity) are cheap and uncontended at the call
// rates involved (once/min for leaseholder msgs, once per rebalance for queue
// operations). See ComputeAmplificationFactors for discussion of consistency
// with the store-level load in StoreLoadMsg.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: cap shadows the Go builtin. Consider sc or storeCap.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(bot here): pushed a commit for this

amp mmaprototype.AmpVector,
) mmaprototype.RangeLoad {
var rl mmaprototype.RangeLoad
cpuNanos := requestCPUNanos + raftCPUNanos
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Redundant float64() casts — cpuNanos, raftCPUNanos, and writeBytesPerSec are already float64. Only float64(logicalBytes) (int64) is needed. float64(amp[...]) is needed since Amp is a named type.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(bot here): pushed a commit for this

Copy link
Copy Markdown
Member

@tbg tbg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm: I posted the AI review, addressed the "obvious" fixes, did a personal review. Very nice! A couple of comments are still open for discussion.

@tbg partially reviewed 16 files and made 6 comments.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on sumeerbhola and wenyihu6).


pkg/kv/kvserver/mmaintegration/physical_model.go line 19 at r1 (raw file):

	// load is the per-store physical load in each dimension's native unit
	// (ns/s for CPURate, bytes/s for WriteBandwidth, bytes for ByteSize).
	load mmaprototype.LoadVector

this could be a convenient place to point out that the load doesn't have to follow from what MMA tracks (i.e. originating from ranges) and could point to the CPU dimension as an example: background load is reflected here, but doesn't originate from ranges. So adding up the range CPU loads will generally yield less than the store load.


pkg/kv/kvserver/mmaintegration/physical_model.go line 43 at r1 (raw file):

//     store work to the directly-tracked store CPU (CPUPerSecond). This is
//     clamped to [1, cpuIndirectOverheadMultiplier]. When storesCPURate is 0,
//     the factor defaults to the cap.

in this case, the factor has no implications, right? We don't have any load to give, so it doesn't matter how we amplify it - this is just to stay away from infinities, right? Other stores sending us load will use their own multiplier instead of ours?


pkg/kv/kvserver/mmaintegration/physical_model.go line 53 at r1 (raw file):

//     the factor, plus an even share of background. This lets stores with more
//     KV work report higher load while still accounting for node-level overhead.
//     Falls back to storesCPU/numStores when CPUPerSecond is unavailable.

why would CPUPerSecond be unavailable?


pkg/kv/kvserver/mmaintegration/physical_model.go line 129 at r1 (raw file):

	// Step 3: Compute per-store load and capacity.
	storeCPU := desc.Capacity.CPUPerSecond
	if storeCPU <= 0 {

Still unsure why this wouldn't always be provided. The field was added in #96127, so wouldn't this always be populated now?


pkg/kv/kvserver/mmaintegration/physical_model.go line 226 at r1 (raw file):

//     replicas contribute CPU to the store total but don't report range-level
//     load, and ranges may join or leave between snapshots. A small drift in
//     amplification factors is negligible compared to these existing gaps.

and CPU load now reflects a share of background load which by design doesn't originate from range load. So it's not even just that it "tolerates" it, it is by design that they're not even trying to be equal in general.

Copy link
Copy Markdown
Contributor Author

@wenyihu6 wenyihu6 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the commits!

@wenyihu6 made 7 comments and resolved 6 discussions.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on sumeerbhola and tbg).


pkg/kv/kvserver/mmaintegration/physical_model.go line 19 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

this could be a convenient place to point out that the load doesn't have to follow from what MMA tracks (i.e. originating from ranges) and could point to the CPU dimension as an example: background load is reflected here, but doesn't originate from ranges. So adding up the range CPU loads will generally yield less than the store load.

Added.


pkg/kv/kvserver/mmaintegration/physical_model.go line 43 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

in this case, the factor has no implications, right? We don't have any load to give, so it doesn't matter how we amplify it - this is just to stay away from infinities, right? Other stores sending us load will use their own multiplier instead of ours?

Yes, added a comment to emphasize.


pkg/kv/kvserver/mmaintegration/physical_model.go line 53 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

why would CPUPerSecond be unavailable?

It may be -1 if grunning is unsupported https://github.com/kvoli/cockroach/blob/c28ed6b40bdc2d48184ab93e18c6b3ca28cd2eba/pkg/kv/kvserver/store.go#L2620-L2622 or 0 during node start up.


pkg/kv/kvserver/mmaintegration/physical_model.go line 129 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Still unsure why this wouldn't always be provided. The field was added in #96127, so wouldn't this always be populated now?

Replied above.


pkg/kv/kvserver/mmaintegration/physical_model.go line 226 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

and CPU load now reflects a share of background load which by design doesn't originate from range load. So it's not even just that it "tolerates" it, it is by design that they're not even trying to be equal in general.

Yes, updated.


// minCapacity is the floor for per-store physical capacity in any dimension.
// This prevents zero-capacity values that could arise during early node startup
// or on empty stores.
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a comment to clarify why this is okay with the new model.

Code snippet:

// Note: The old capacity model(computeCPUCapacityWithCap) needed a more
// meaningful floor (cpuCapacityFloorPerStore = 0.1 cores = 1e8 ns/s) because
// its capacity formula subtracted background load from the node's CPU capacity:
// capacity = (nodeCap - background) / mult. As background grew, capacity shrank
// toward zero while store load stayed constant, sending utilization
// (load/capacity) to infinity. The physical model avoids this by keeping
// capacity fixed at nodeCap/numStores and folding background into the load side
// instead, so capacity never shrinks under load pressure and a negligible floor
// suffices.

@wenyihu6
Copy link
Copy Markdown
Contributor Author

wenyihu6 commented Mar 5, 2026

Pushed a commit to rename cpuIndirectOverheadMultiplier to maxCPUAmplification to be consistent with maxSpaceAmplification if you wanna have a look.

@wenyihu6 wenyihu6 force-pushed the oldmodel2 branch 3 times, most recently from 9a5785f to 809cac2 Compare March 6, 2026 04:40
@wenyihu6
Copy link
Copy Markdown
Contributor Author

wenyihu6 commented Mar 6, 2026

/trunk merge

wenyihu6 and others added 12 commits March 6, 2026 00:17
This change introduces a physical capacity model that expresses MMA store
loads and capacities in physical resource units (CPU ns/s, disk bytes)
and wires it into all callers, replacing the logical (capped multiplier)
model.

The capped model expressed capacity in a synthetic "logical KV CPU" unit:

    load = CPUPerSecond
    capacity = (nodeCap - estimatedBg) / estimatedMult / numStores

The physical model factors the multiplier out of the capacity formula and
into a separate amplification factor applied at the range-load boundary,
and moves background from the capacity side to the load side:

    load = CPUPerSecond * ampFactor + estimatedBg / numStores
    capacity = nodeCap / numStores
    range load = logicalRangeLoad * ampFactor (via MakePhysicalRangeLoad)

Both models use the same capped-multiplier logic for background
estimation (ampFactor = clamp(nodeUsage/storesCPU, 1, 3), background =
max(0, nodeUsage - storesCPU * ampFactor)).

Background is added to load rather than subtracted from capacity.
Capacity is now nodeCap/numStores — a real-world quantity (the store's
share of the node's CPU cores) that operators can directly interpret
without knowing the amplification factor. Subtracting background from
capacity hides pressure in the denominator: a node at 80% real CPU
(60% background + 20% KV) shows capacity = (10-6)/numStores = 4 and
load = 2, yielding 50% utilization. Adding background to load instead
yields load = 8, capacity = 10, utilization = 80% — matching reality.
This ensures MMA correctly identifies background-heavy nodes as shed
candidates.

The physical model separates two concerns that the capped model
conflates into a single capacity value. Capacity answers "how many
physical CPU cores are available to this store" — a quantity grounded
in the actual resource. The amplification factor answers "how much
physical CPU does each unit of logical range work cost" and is applied
only at the range-load boundary via `MakePhysicalRangeLoad`.

This separation has two practical benefits:

1. It makes the `LoadVector` uniform across dimensions. Disk is
   inherently physical (load=`Used`, capacity=`Used+Available`). CPU
   now follows the same pattern with a separate amplification factor
   for per-range conversion.

2. It resolves the unit mismatch between physical `NodeCPULoad` and
   store-level pending deltas in `nodeState.adjustedCPU`. MMA uses
   `adjustedCPU` for node-level overload detection: it starts at
   `NodeCPULoad` (= `NodeCPURateUsage`, physical ns/s) and accumulates
   pending change deltas as ranges are scheduled to move. Under the
   capped model, these deltas were in logical units — a range using
   0.5 cores of direct KV CPU produced a delta of 0.5e9 ns/s, but its
   true physical impact could be 0.5 * ampFactor = 1.5e9 ns/s. Under
   the physical model, `MakePhysicalRangeLoad` applies the amplification
   factor, so the delta is already in physical ns/s and combines directly
   with `NodeCPULoad`.
This commit floors per-store physical capacity at 1.0 (minCapacity)
for both CPU and disk dimensions to prevent zero-capacity during
startup or on empty stores. The floor is applied via a defer on named
returns so it covers all code paths in computePhysicalCPU and
computePhysicalDisk.
This change wires the physical capacity model into all MMA callers.

`MakeStoreLoadMsg` now delegates to `computePhysicalStore`.
Range loads are converted via `MakePhysicalRangeLoad` using an
`AmpVector` from the new `Store.AmplificationFactors()` method,
threaded through `mmaRangeLoad`, `tryConstructMMARangeMsg`,
`NonMMAPreTransferLease`, and `NonMMAPreChangeReplicas`.

`Store.AmplificationFactors()` returns `IdentityAmpVector()` when
`CachedCapacity()` is zero (e.g. early startup). The simulator uses
`IdentityAmpVector` with a TODO for real factors.

Release note: None
Co-Authored-By: roachdev-claude <roachdev-claude-bot@cockroachlabs.com>
Co-Authored-By: roachdev-claude <roachdev-claude-bot@cockroachlabs.com>
Co-Authored-By: roachdev-claude <roachdev-claude-bot@cockroachlabs.com>
Co-Authored-By: roachdev-claude <roachdev-claude-bot@cockroachlabs.com>
Co-Authored-By: roachdev-claude <roachdev-claude-bot@cockroachlabs.com>
…Load

Co-Authored-By: roachdev-claude <roachdev-claude-bot@cockroachlabs.com>
@wenyihu6
Copy link
Copy Markdown
Contributor Author

wenyihu6 commented Mar 6, 2026

/trunk merge

@trunk-io trunk-io bot merged commit d85d4e6 into cockroachdb:master Mar 6, 2026
24 checks passed
@wenyihu6 wenyihu6 deleted the oldmodel2 branch March 6, 2026 07:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants