mmaintegration: introduce physical capacity model #164900

trunk-io[bot] merged 12 commits into cockroachdb:master
Conversation

Merged directly without going through the merge queue, as the queue was empty and the PR was up to date with the target branch.
tbg left a comment

Review: mmaintegration: introduce physical capacity model
The design is sound — moving background CPU from capacity to load is the right call, and the design comments with worked examples are excellent. The separation between store-level physical computation and range-level amplification via MakePhysicalRangeLoad is clean. A few issues to address before merging.
Issues not tied to specific lines
[correctness] highDiskSpaceUtilization comment is now stale (capacity_model.go:703-724): The comment derives fractionUsed = LogicalBytes / (LogicalBytes / diskUtil) = diskUtil. Under the new model, load=Used, capacity=Used+Available — the math still recovers actual disk utilization, but the comment references the old LogicalBytes-based derivation and is now misleading. (Not in the diff, so noting here.)
[correctness] Old computeCPUCapacityWithCap and computeStoreByteSizeCapacity are now dead code (capacity_model.go): After commit 3, these are only referenced from tests. Either remove them or add a comment noting they're retained as test baselines. (Not in the diff, so noting here.)
[commits] Commit 1 message claims "wires it into all callers": The opening sentence says "wires it into all callers, replacing the logical (capped multiplier) model" but commit 1 only introduces the model. The wiring happens in commit 3.
[tests] MakePhysicalRangeLoad has no direct test: The existing mma_conversion_test.go only passes IdentityAmpVector() (amp=1.0), which is a no-op. A bug in dimension indexing (e.g., applying the wrong dimension's amp factor) would go undetected. Add a test with non-identity amp factors.
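To make the risk concrete, here is a self-contained sketch of the dimension-indexing property such a test should pin down. The `loadVector` type, the dimension constants, and the `applyAmp` helper are hypothetical stand-ins, not mmaprototype's actual API; the point is that distinct non-identity factors per dimension catch a swapped index, while identity factors cannot.

```go
package main

import "fmt"

// loadVector and the dimension indices below are hypothetical stand-ins
// for mmaprototype's types, used only to illustrate the property under test.
type loadVector [3]float64

const (
	cpuRate = iota
	writeBandwidth
	byteSize
)

// applyAmp scales each dimension by its own amplification factor, the
// behavior MakePhysicalRangeLoad is expected to have.
func applyAmp(logical, amp loadVector) loadVector {
	var out loadVector
	for d := range logical {
		out[d] = logical[d] * amp[d] // a swapped index here would fail the check in main
	}
	return out
}

func main() {
	logical := loadVector{cpuRate: 100, writeBandwidth: 200, byteSize: 300}
	amp := loadVector{cpuRate: 2.0, writeBandwidth: 0.5, byteSize: 5.0} // non-identity, all distinct
	fmt.Println(applyAmp(logical, amp)) // [200 100 1500]
}
```

With amp=1.0 in every dimension (the current test coverage), any permutation of the indices would produce the same output, which is why the identity case cannot detect this class of bug.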
[tests] maxDiskSpaceAmplification cap is never exercised in tests: The testdata covers amp=2.0 and amp=0.5 but never hits the 5.0 cap. Add a case like logical-gib=10, used-gib=100 to verify capping.
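A quick sketch of the arithmetic the missing case would exercise (`diskAmp` and the constant are illustrative helpers, not the PR's actual code): with logical-gib=10 and used-gib=100 the raw ratio is 10, which must clamp to 5.0.

```go
package main

import "fmt"

// maxDiskSpaceAmplification mirrors the 5.0 cap discussed above; diskAmp
// is a hypothetical helper, not the PR's actual implementation.
const maxDiskSpaceAmplification = 5.0

func diskAmp(logicalGiB, usedGiB float64) float64 {
	amp := usedGiB / logicalGiB
	if amp > maxDiskSpaceAmplification {
		amp = maxDiskSpaceAmplification
	}
	return amp
}

func main() {
	fmt.Printf("%.1f %.1f %.1f\n", diskAmp(10, 20), diskAmp(20, 10), diskAmp(10, 100))
	// prints: 2.0 0.5 5.0 (the last case is the uncovered cap)
}
```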
Strengths
- Excellent design documentation in computePhysicalCPU — the worked examples for "background in load, not capacity" make the tradeoffs auditable.
- Eliminates code duplication (two separate mmaRangeLoad implementations consolidated into MakePhysicalRangeLoad).
- Good use of data-driven testing with 12 well-commented scenarios.
- Correct math — sum of per-store loads recovers node usage (proved in the comment).
(made with /review-crdb)
// minCapacity is the floor for per-store physical capacity in any dimension.
// This prevents zero-capacity values that could arise during early node startup
// or on empty stores.
blocking: minCapacity = 1.0 means 1 ns/s — effectively zero CPU capacity. The old model had cpuCapacityFloorPerStore = 0.1 * 1e9 (0.1 cores), which existed specifically to prevent utilization from going to infinity on overloaded nodes (see its detailed comment in capacity_model.go). If a store has non-zero load and capacity=1 ns/s, utilization becomes astronomical.
Either the new floor should be comparable to the old one, or a comment should explain why the protection is no longer needed under the physical model.
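To illustrate why the old floor existed, here is a toy calculation with made-up numbers (10-core node, a single store with a fixed 1 core of KV load, multiplier 1) comparing the two models as background grows:

```go
package main

import "fmt"

// oldUtil models the capped formula, where background shrinks the
// denominator; newUtil models the physical formula, where background is
// folded into load and capacity stays fixed at the node capacity.
func oldUtil(nodeCap, background, storeLoad float64) float64 {
	return storeLoad / (nodeCap - background)
}

func newUtil(nodeCap, background, storeLoad float64) float64 {
	return (storeLoad + background) / nodeCap
}

func main() {
	const nodeCap, storeLoad = 10.0, 1.0 // cores
	for _, bg := range []float64{0, 5, 9.9} {
		fmt.Printf("bg=%.1f old=%5.2f new=%.2f\n",
			bg, oldUtil(nodeCap, bg, storeLoad), newUtil(nodeCap, bg, storeLoad))
	}
}
```

Without a meaningful floor, oldUtil blows up as background approaches nodeCap (10x utilization at bg=9.9), while newUtil stays bounded at just over 1.0; that divergence is what this blocking comment is about.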
(tbg here: a comment is preferable if we are indeed comfortable with the new lower floor).
Added a comment to clarify why this is okay with the new model.
Code snippet:

// Note: The old capacity model (computeCPUCapacityWithCap) needed a more
// meaningful floor (cpuCapacityFloorPerStore = 0.1 cores = 1e8 ns/s) because
// its capacity formula subtracted background load from the node's CPU capacity:
// capacity = (nodeCap - background) / mult. As background grew, capacity shrank
// toward zero while store load stayed constant, sending utilization
// (load/capacity) to infinity. The physical model avoids this by keeping
// capacity fixed at nodeCap/numStores and folding background into the load side
// instead, so capacity never shrinks under load pressure and a negligible floor
// suffices.

// and MMA may try to shed from it. We accept this because: with this model,
//
// 1. Operators see balanced total CPU across nodes — they don't need to
//    distinguish KV from non-KV usage to understand the cluster state.
nit: The numbered list skips item 2 (goes 1, 3, 4, 5). Renumber to 1, 2, 3, 4.
(bot here): pushed a commit for this
type Amp float64

func (af Amp) SafeFormat(w redact.SafePrinter, _ rune) {
suggestion: af is passed to SafePrinter.Printf without marking it safe. The established pattern in this file (e.g., meanStoreLoad.SafeFormat) uses redact.SafeFloat() for floats. Amplification factors contain no PII, so either approach works:
func (af Amp) SafeFormat(w redact.SafePrinter, _ rune) {
	w.Printf("%.2f", redact.SafeFloat(float64(af)))
}
Alternatively, implement SafeValue() on Amp since it's always safe.
(bot here): pushed a commit for this
	}
}

type Amp float64
suggestion: Missing doc comment. Every other named type in this file (LoadDimension, LoadValue, LoadVector) has one. Same for AmpVector below.
// Amp is a per-dimension amplification factor that converts logical per-range
// load measurements into physical units. For example, a CPU amplification
// factor of 2.0 means each nanosecond of tracked replica CPU corresponds to
// 2 nanoseconds of actual node CPU.
type Amp float64
(bot here): pushed a commit for this
// structured that adding additional resource dimensions is easy.
type LoadDimension uint8

var _ [3 - NumLoadDimensions]struct{}
suggestion: This assertion is one-directional (catches additions but not removals) and undocumented. Make it bidirectional and add a comment:
// NumLoadDimensions must be exactly 3. Update SafeFormat methods for
// LoadVector and AmpVector if a dimension is added or removed.
var _ [NumLoadDimensions - 3]struct{}
var _ [3 - NumLoadDimensions]struct{}
(bot here): pushed a commit for this
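For reference, a standalone sketch of the bidirectional zero-size-array assertion pattern the suggestion proposes (`numDims` is a stand-in for `NumLoadDimensions`):

```go
package main

import "fmt"

// numDims stands in for NumLoadDimensions. If it deviates from 3 in
// either direction, one of the two array lengths below becomes negative
// and the package stops compiling.
const numDims = 3

var _ [numDims - 3]struct{} // fails to compile if numDims < 3 (catches removals)
var _ [3 - numDims]struct{} // fails to compile if numDims > 3 (catches additions)

func main() {
	fmt.Println("numDims is exactly", numDims)
}
```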
// (CachedCapacity + GetNodeCapacity) are cheap and uncontended at the call
// rates involved (once/min for leaseholder msgs, once per rebalance for queue
// operations). See ComputeAmplificationFactors for discussion of consistency
// with the store-level load in StoreLoadMsg.
nit: cap shadows the Go builtin. Consider sc or storeCap.
(bot here): pushed a commit for this
	amp mmaprototype.AmpVector,
) mmaprototype.RangeLoad {
	var rl mmaprototype.RangeLoad
	cpuNanos := requestCPUNanos + raftCPUNanos
nit: Redundant float64() casts — cpuNanos, raftCPUNanos, and writeBytesPerSec are already float64. Only float64(logicalBytes) (int64) is needed. float64(amp[...]) is needed since Amp is a named type.
(bot here): pushed a commit for this
tbg left a comment
I posted the AI review, addressed the "obvious" fixes, did a personal review. Very nice! A couple of comments are still open for discussion.
@tbg partially reviewed 16 files and made 6 comments.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on sumeerbhola and wenyihu6).
pkg/kv/kvserver/mmaintegration/physical_model.go line 19 at r1 (raw file):
// load is the per-store physical load in each dimension's native unit
// (ns/s for CPURate, bytes/s for WriteBandwidth, bytes for ByteSize).
load mmaprototype.LoadVector
this could be a convenient place to point out that the load doesn't have to follow from what MMA tracks (i.e. originating from ranges) and could point to the CPU dimension as an example: background load is reflected here, but doesn't originate from ranges. So adding up the range CPU loads will generally yield less than the store load.
pkg/kv/kvserver/mmaintegration/physical_model.go line 43 at r1 (raw file):
// store work to the directly-tracked store CPU (CPUPerSecond). This is
// clamped to [1, cpuIndirectOverheadMultiplier]. When storesCPURate is 0,
// the factor defaults to the cap.
in this case, the factor has no implications, right? We don't have any load to give, so it doesn't matter how we amplify it - this is just to stay away from infinities, right? Other stores sending us load will use their own multiplier instead of ours?
pkg/kv/kvserver/mmaintegration/physical_model.go line 53 at r1 (raw file):
// the factor, plus an even share of background. This lets stores with more
// KV work report higher load while still accounting for node-level overhead.
// Falls back to storesCPU/numStores when CPUPerSecond is unavailable.
why would CPUPerSecond be unavailable?
pkg/kv/kvserver/mmaintegration/physical_model.go line 129 at r1 (raw file):
// Step 3: Compute per-store load and capacity.
storeCPU := desc.Capacity.CPUPerSecond
if storeCPU <= 0 {
Still unsure why this wouldn't always be provided. The field was added in #96127, so wouldn't this always be populated now?
pkg/kv/kvserver/mmaintegration/physical_model.go line 226 at r1 (raw file):
// replicas contribute CPU to the store total but don't report range-level
// load, and ranges may join or leave between snapshots. A small drift in
// amplification factors is negligible compared to these existing gaps.
and CPU load now reflects a share of background load which by design doesn't originate from range load. So it's not even just that it "tolerates" it, it is by design that they're not even trying to be equal in general.
wenyihu6 left a comment
Thanks for the commits!
@wenyihu6 made 7 comments and resolved 6 discussions.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on sumeerbhola and tbg).
pkg/kv/kvserver/mmaintegration/physical_model.go line 19 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
this could be a convenient place to point out that the load doesn't have to follow from what MMA tracks (i.e. originating from ranges) and could point to the CPU dimension as an example: background load is reflected here, but doesn't originate from ranges. So adding up the range CPU loads will generally yield less than the store load.
Added.
pkg/kv/kvserver/mmaintegration/physical_model.go line 43 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
in this case, the factor has no implications, right? We don't have any load to give, so it doesn't matter how we amplify it - this is just to stay away from infinities, right? Other stores sending us load will use their own multiplier instead of ours?
Yes, added a comment to emphasize.
pkg/kv/kvserver/mmaintegration/physical_model.go line 53 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
why would CPUPerSecond be unavailable?
It may be -1 if grunning is unsupported (https://github.com/kvoli/cockroach/blob/c28ed6b40bdc2d48184ab93e18c6b3ca28cd2eba/pkg/kv/kvserver/store.go#L2620-L2622) or 0 during node startup.
pkg/kv/kvserver/mmaintegration/physical_model.go line 129 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Still unsure why this wouldn't always be provided. The field was added in #96127, so wouldn't this always be populated now?
Replied above.
pkg/kv/kvserver/mmaintegration/physical_model.go line 226 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
and CPU load now reflects a share of background load which by design doesn't originate from range load. So it's not even just that it "tolerates" it, it is by design that they're not even trying to be equal in general.
Yes, updated.
Pushed a commit to rename cpuIndirectOverheadMultiplier to maxCPUAmplification to be consistent with maxSpaceAmplification if you wanna have a look.
Force-pushed from 9a5785f to 809cac2.
/trunk merge
This change introduces a physical capacity model that expresses MMA store
loads and capacities in physical resource units (CPU ns/s, disk bytes)
and wires it into all callers, replacing the logical (capped multiplier)
model.
The capped model expressed capacity in a synthetic "logical KV CPU" unit:
load = CPUPerSecond
capacity = (nodeCap - estimatedBg) / estimatedMult / numStores
The physical model factors the multiplier out of the capacity formula and
into a separate amplification factor applied at the range-load boundary,
and moves background from the capacity side to the load side:
load = CPUPerSecond * ampFactor + estimatedBg / numStores
capacity = nodeCap / numStores
range load = logicalRangeLoad * ampFactor (via MakePhysicalRangeLoad)
Both models use the same capped-multiplier logic for background
estimation (ampFactor = clamp(nodeUsage/storesCPU, 1, 3), background =
max(0, nodeUsage - storesCPU * ampFactor)).
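Plugging made-up numbers into the shared estimation formulas (say the node is using 10 cores while stores track 3): the raw ratio 10/3 ≈ 3.33 clamps to 3, leaving 1 core attributed to background. A sketch:

```go
package main

import (
	"fmt"
	"math"
)

// estimateBackground applies the capped-multiplier formulas quoted above
// to illustrative inputs; it is not the PR's actual implementation.
func estimateBackground(nodeUsage, storesCPU float64) (ampFactor, background float64) {
	ampFactor = math.Min(math.Max(nodeUsage/storesCPU, 1), 3)
	background = math.Max(0, nodeUsage-storesCPU*ampFactor)
	return ampFactor, background
}

func main() {
	amp, bg := estimateBackground(10e9, 3e9) // ns/s: 10 cores used, 3 tracked
	fmt.Printf("ampFactor=%.2f background=%.2f cores\n", amp, bg/1e9)
	// prints: ampFactor=3.00 background=1.00 cores
}
```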
Background is added to load rather than subtracted from capacity.
Capacity is now nodeCap/numStores — a real-world quantity (the store's
share of the node's CPU cores) that operators can directly interpret
without knowing the amplification factor. Subtracting background from
capacity hides pressure in the denominator: a node at 80% real CPU
(60% background + 20% KV) shows capacity = (10-6)/numStores = 4 and
load = 2, yielding 50% utilization. Adding background to load instead
yields load = 8, capacity = 10, utilization = 80% — matching reality.
This ensures MMA correctly identifies background-heavy nodes as shed
candidates.
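The arithmetic from the example above, as a runnable check (single store, multiplier 1):

```go
package main

import "fmt"

// Reproduces the worked example: node capacity 10 cores, 6 cores of
// background, 2 cores of KV load.
func main() {
	const nodeCap, background, kvLoad = 10.0, 6.0, 2.0
	oldUtil := kvLoad / (nodeCap - background) // capped model: 2/4
	newUtil := (kvLoad + background) / nodeCap // physical model: 8/10
	fmt.Printf("old=%.0f%% new=%.0f%%\n", oldUtil*100, newUtil*100)
	// prints: old=50% new=80%
}
```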
The physical model separates two concerns that the capped model
conflates into a single capacity value. Capacity answers "how many
physical CPU cores are available to this store" — a quantity grounded
in the actual resource. The amplification factor answers "how much
physical CPU does each unit of logical range work cost" and is applied
only at the range-load boundary via `MakePhysicalRangeLoad`.
This separation has two practical benefits:
1. It makes the `LoadVector` uniform across dimensions. Disk is
inherently physical (load=`Used`, capacity=`Used+Available`). CPU
now follows the same pattern with a separate amplification factor
for per-range conversion.
2. It resolves the unit mismatch between physical `NodeCPULoad` and
store-level pending deltas in `nodeState.adjustedCPU`. MMA uses
`adjustedCPU` for node-level overload detection: it starts at
`NodeCPULoad` (= `NodeCPURateUsage`, physical ns/s) and accumulates
pending change deltas as ranges are scheduled to move. Under the
capped model, these deltas were in logical units — a range using
0.5 cores of direct KV CPU produced a delta of 0.5e9 ns/s, but its
true physical impact could be 0.5 * ampFactor = 1.5e9 ns/s. Under
the physical model, `MakePhysicalRangeLoad` applies the amplification
factor, so the delta is already in physical ns/s and combines directly
with `NodeCPULoad`.
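The delta example in numbers (an ampFactor of 3 is assumed here, matching the 0.5 to 1.5 core figures above):

```go
package main

import "fmt"

// A range using 0.5 cores of direct KV CPU under an amplification factor
// of 3: the old delta was the logical 0.5e9 ns/s, while the physical
// delta is 1.5e9 ns/s, which combines directly with NodeCPULoad.
func main() {
	const logicalDelta = 0.5e9 // ns/s of direct KV CPU
	const ampFactor = 3.0
	physicalDelta := logicalDelta * ampFactor
	fmt.Printf("logical=%.1f cores physical=%.1f cores\n",
		logicalDelta/1e9, physicalDelta/1e9)
	// prints: logical=0.5 cores physical=1.5 cores
}
```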
mmaintegration: floor physical capacity at minCapacity

This commit floors per-store physical capacity at 1.0 (minCapacity) for
both CPU and disk dimensions to prevent zero-capacity values during
startup or on empty stores. The floor is applied via a defer on named
returns so it covers all code paths in computePhysicalCPU and
computePhysicalDisk.
kvserver: convert MMA store and range loads to physical units

This change wires the physical capacity model into all MMA callers.
`MakeStoreLoadMsg` now delegates to `computePhysicalStore`. Range loads
are converted via `MakePhysicalRangeLoad` using an `AmpVector` from the
new `Store.AmplificationFactors()` method, threaded through
`mmaRangeLoad`, `tryConstructMMARangeMsg`, `NonMMAPreTransferLease`, and
`NonMMAPreChangeReplicas`. `Store.AmplificationFactors()` returns
`IdentityAmpVector()` when `CachedCapacity()` is zero (e.g. early
startup). The simulator uses `IdentityAmpVector` with a TODO for real
factors.

Release note: None
Co-Authored-By: roachdev-claude <roachdev-claude-bot@cockroachlabs.com>
/trunk merge
Epic: CRDB-55052
Release note: none