mmaintegration: introduce physical capacity model #164792
wenyihu6 wants to merge 3 commits into cockroachdb:master
Conversation
The capped multiplier model (`computeCPUCapacityWithCap`) subtracts background CPU when deriving per-store capacity. This intentionally breaks the invariant maintained by the naive model:

    sum(store loads) / sum(store capacities) = NodeCPURateUsage / NodeCPURateCapacity

When background CPU exceeds the cap, node-level overload detection can no longer recover physical utilization by summing store-level values.

This change replaces the derived `NodeLoad` fields (`ReportedCPU`, `CapacityCPU` — incremental sums of per-store values) which relied on this invariant with explicit physical node-level CPU values (`NodeCPULoad`, `NodeCPUCapacity`) carried through `StoreLoadMsg` from `NodeCapacity` fields.

Note that each store generates its descriptor independently, so `NodeCPULoad` values may differ slightly across stores on the same node due to sampling time differences. This is acceptable since they should be roughly the same.
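The invariant can be illustrated with a short sketch, assuming the even split that the naive model uses. Names follow the commit message; this is illustrative Go, not the actual MMA code.

```go
package main

import "fmt"

// summedRatio models the naive derivation: each store reports an even share
// of the node totals, so summing per-store values recovers node utilization.
func summedRatio(nodeCPURateUsage, nodeCPURateCapacity float64, numStores int) float64 {
	var sumLoad, sumCapacity float64
	for i := 0; i < numStores; i++ {
		sumLoad += nodeCPURateUsage / float64(numStores)
		sumCapacity += nodeCPURateCapacity / float64(numStores)
	}
	return sumLoad / sumCapacity
}

func main() {
	// 6 of 8 cores used (in ns/s), two stores.
	fmt.Println(summedRatio(6e9, 8e9, 2) == 6e9/8e9) // true
}
```

Once a capped background subtraction is applied per store, the summed ratio no longer equals the node-level ratio, which is the breakage described above.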
mmaintegration: introduce physical capacity model for MMA loads

This change introduces a physical capacity model that expresses MMA store loads and capacities in physical resource units (CPU ns/s, disk bytes) and wires it into all callers, replacing the logical (capped multiplier) model.

The capped model expressed capacity in a synthetic "logical KV CPU" unit:

    load     = CPUPerSecond
    capacity = (nodeCap - estimatedBg) / estimatedMult / numStores

The physical model factors the multiplier out of the capacity formula and into a separate amplification factor applied at the range-load boundary:

    load       = CPUPerSecond * ampFactor
    capacity   = (nodeCap - estimatedBg) / numStores
    range load = logicalRangeLoad * ampFactor   (via MakePhysicalRangeLoad)

Both models use the same capped-multiplier logic for background estimation (ampFactor = clamp(nodeUsage/storesCPU, 1, 3), background = max(0, nodeUsage - storesCPU * ampFactor)). The ratio load/capacity is algebraically identical, so all rebalancing decisions driven by `fractionUsed` in `loadSummaryForDimension` are unchanged.

The physical model is preferred because it separates two concerns that the capped model conflates into a single capacity value. Capacity answers "how much physical CPU is available for MMA work after subtracting background" — a quantity grounded in the actual resource. The amplification factor answers "how much physical CPU does each unit of logical range work cost" and is applied only at the range-load boundary via `MakePhysicalRangeLoad`.

This separation has two practical benefits:

1. It makes the `LoadVector` uniform across dimensions. Disk is inherently physical (`load=Used`, `capacity=Used+Available`). CPU now follows the same pattern with a separate amplification factor for per-range conversion.

2. It resolves the unit mismatch between physical `NodeCPULoad` and store-level pending deltas in `nodeState.adjustedCPU`.
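The two formula sets above can be sketched side by side to check the algebraic-equivalence claim. This is illustrative Go under the commit message's definitions, not the mmaprototype implementation.

```go
package main

import "fmt"

func clamp(x, lo, hi float64) float64 {
	if x < lo {
		return lo
	}
	if x > hi {
		return hi
	}
	return x
}

// estimate derives the amplification factor and background CPU; per the
// commit message, both models share this logic.
func estimate(nodeUsage, storesCPU float64) (amp, bg float64) {
	amp = clamp(nodeUsage/storesCPU, 1, 3)
	if bg = nodeUsage - storesCPU*amp; bg < 0 {
		bg = 0
	}
	return amp, bg
}

// cappedModel: logical load, capacity divided by the multiplier.
func cappedModel(storeCPU, nodeCap, nodeUsage, storesCPU float64, numStores int) (load, capacity float64) {
	amp, bg := estimate(nodeUsage, storesCPU)
	return storeCPU, (nodeCap - bg) / amp / float64(numStores)
}

// physicalModel: amplified load, physical capacity.
func physicalModel(storeCPU, nodeCap, nodeUsage, storesCPU float64, numStores int) (load, capacity float64) {
	amp, bg := estimate(nodeUsage, storesCPU)
	return storeCPU * amp, (nodeCap - bg) / float64(numStores)
}

func main() {
	// One store with 2 cores of direct KV CPU; node uses 6 of 8 cores (ns/s).
	l1, c1 := cappedModel(2e9, 8e9, 6e9, 2e9, 1)
	l2, c2 := physicalModel(2e9, 8e9, 6e9, 2e9, 1)
	fmt.Printf("capped:   fractionUsed = %.3f\n", l1/c1)
	fmt.Printf("physical: fractionUsed = %.3f\n", l2/c2)
}
```

Both calls print the same `fractionUsed`, matching the claim that rebalancing decisions are unchanged while the physical model's load and capacity values are individually different.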
MMA uses `adjustedCPU` for node-level overload detection: it starts at `NodeCPULoad` (= `NodeCPURateUsage`, physical ns/s) and accumulates pending change deltas as ranges are scheduled to move (`applyChangeLoadDelta` adds `change.loadDelta[CPURate]`). The result is compared against `NodeCPUCapacity` in `loadSummaryForDimension`.

Under the capped model, these deltas were in logical units (`RequestCPUNanos + RaftCPUNanos`, no amplification) — a range using 0.5 cores of direct KV CPU produced a delta of 0.5e9 ns/s, but its true physical impact on node CPU could be 0.5 * ampFactor = 1.5e9 ns/s. Under the physical model, `MakePhysicalRangeLoad` applies the amplification factor, so the delta is already in physical ns/s and combines directly with `NodeCPULoad`.

The trade-off is that both load and capacity now include estimation error from the amplification factor. In the capped model, store load was the directly-measured `CPUPerSecond` — a ground-truth value requiring no estimation. In the physical model, load is `CPUPerSecond * ampFactor`, so diagnosing a store reporting "6 cores of load" requires knowing the amplification factor to recover the underlying 2 cores of direct KV CPU.
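The delta arithmetic above, as a sketch. The function name here is illustrative; in the real code the conversion happens inside `MakePhysicalRangeLoad`.

```go
package main

import "fmt"

// physicalDelta converts a logical per-range CPU delta into physical ns/s
// by applying the amplification factor (illustrative name, not the API).
func physicalDelta(directKVCPU, ampFactor float64) float64 {
	return directKVCPU * ampFactor
}

func main() {
	const directKVCPU = 0.5e9 // RequestCPUNanos + RaftCPUNanos, ns/s
	const ampFactor = 3.0     // clamp(nodeUsage/storesCPU, 1, 3) at the cap

	fmt.Printf("logical delta:  %g ns/s\n", directKVCPU)                          // capped model: no amplification
	fmt.Printf("physical delta: %g ns/s\n", physicalDelta(directKVCPU, ampFactor)) // combines directly with NodeCPULoad
}
```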
This change wires the physical capacity model into all MMA callers. `MakeStoreLoadMsg` now delegates to `computePhysicalStore`. Range loads are converted via `MakePhysicalRangeLoad` using an `AmpVector` from the new `Store.AmplificationFactors()` method, threaded through `mmaRangeLoad`, `tryConstructMMARangeMsg`, `NonMMAPreTransferLease`, and `NonMMAPreChangeReplicas`.

`Store.AmplificationFactors()` returns `IdentityAmpVector()` when `CachedCapacity()` is zero (e.g. early startup). The simulator uses `IdentityAmpVector` with a TODO for real factors.

Release note: None
Add a project-level code review skill that orchestrates multiple review aspects (redactability, commit structure, red/green testing, PR descriptions, general quality) by referencing existing rules and the commit-arc skill rather than duplicating their content. Includes support for posting batched GitHub reviews with inline line comments via the API. Also add a conventions rule covering PR descriptions, Bazel build files, metrics, Go conventions, and CI checks.

Note: .github/workflows/pr-analyzer-threestage.yml already runs an AI-powered bug-screening pipeline on PRs. That workflow is deliberately narrow — it only flags likely bugs and uses a three-stage confidence filter to minimize noise. This skill is broader (reviewability, commit structure, redactability, etc.) and designed for interactive use. In the future, the CI workflow could invoke this skill to get structured, aspect-based review while keeping its conservative posting threshold.

Example reviews generated by this skill:

- PR cockroachdb#164658 (63-commit MMA repair): https://gist.github.com/tbg/ba4d528bcfa27cdae9ca587f91fb7178
- PR cockroachdb#164792 (physical modeling): https://gist.github.com/tbg/1d78a0c496069f18ec35a2b551483052
- PR cockroachdb#164677 (connection retry test): https://gist.github.com/tbg/69ec2eea0c4ab5d794a135fa62e551dc
- PR cockroachdb#79134 (SKIP LOCKED): https://gist.github.com/tbg/052dae1e3f6449811fca92e275905f1f
- PR cockroachdb#161454 (engine separation): https://gist.github.com/tbg/63ef958edd355fb166493b008a0291d5

Epic: none
Release note: None

Co-Authored-By: roachdev-claude <roachdev-claude-bot@cockroachlabs.com>
tbg
left a comment
Reviewing commits 2 and 3 only (commit 1 is from #164760).
Two clean commits: the first introduces the physical model types and functions with a comprehensive datadriven test; the second wires the new model into all callers. The algebraic equivalence argument is convincing (fractionUsed is unchanged), and the commit split follows a good "introduce abstraction, then wire" pattern.
A few stale comment references that are now out of date after commit 3's changes to MakeStoreLoadMsg — not blocking, but worth a follow-up or a fixup commit:
- `StoreLoadMsg.Capacity` comment (messages.go:25) says "See `computeCPUCapacityWithCap`" but `MakeStoreLoadMsg` now uses `computePhysicalStore`.
- `NodeLoad` comment (load.go:257) references `computeCPUCapacityWithCap` — same issue.
- `adjustedCPU` comment (cluster_state.go:908) says "deltas are in store-level modeled CPU units while `NodeCPULoad` is physical." Under the physical model, range deltas are also in physical units (via `MakePhysicalRangeLoad`). The remaining mismatch is subtler: the source store's amplification factor is used for the delta, but the target node's actual amplification may differ.
(made with /review-crdb)
// jobs, etc. This is zero when the implicit multiplier is below the cap.
//
// 4. load = mmaAttributedLoad / numStores
// Per-store MMA-attributed physical CPU. This is what MMA can control.
nit: Step 4 says "load = mmaAttributedLoad / numStores" but the implementation (line 99) uses desc.Capacity.CPUPerSecond * ampFactor when CPUPerSecond > 0, only falling back to the even split. The inline comment at line 96 explains this well, but the numbered summary should match (e.g., "load = storeCPU * ampFactor, where storeCPU is this store's CPUPerSecond or storesCPURate / numStores as fallback").
// minCapacity is the floor for per-store physical capacity in any dimension.
// This prevents zero-capacity values that could arise during early node startup
// or on empty stores.
const minCapacity mmaprototype.LoadValue = 1.0
suggestion: This floor is ~8 orders of magnitude lower than the old cpuCapacityFloorPerStore (0.1 cores = 1e8 ns/s). The algebraic equivalence doesn't hold at the floor: the old model bounded fractionUsed at ~20x while this allows astronomically large values. The rebalancing outcome is the same (overloadUrgent saturates at fractionUsed > 0.9), so this is fine — but worth documenting.
Suggested change:

// minCapacity is a division-by-zero guard for per-store physical capacity in
// any dimension. It is intentionally tiny (1 unit) rather than a meaningful
// capacity floor: when capacity is this small, fractionUsed can be very large,
// but loadSummaryForDimension saturates at overloadUrgent (fractionUsed > 0.9)
// regardless of magnitude.
const minCapacity mmaprototype.LoadValue = 1.0
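The saturation argument can be sketched in a few lines. The 0.9 threshold comes from the suggestion text; the `loadSummary` type and values here are illustrative, not the actual mmaprototype enum.

```go
package main

import "fmt"

type loadSummary int

const (
	loadNoChange loadSummary = iota
	overloadUrgent
)

// summaryForFraction: once fractionUsed exceeds the urgent threshold, the
// summary is the same regardless of magnitude (assumed threshold: 0.9).
func summaryForFraction(fractionUsed float64) loadSummary {
	if fractionUsed > 0.9 {
		return overloadUrgent
	}
	return loadNoChange
}

func main() {
	const minCapacity = 1.0 // new floor, ~8 orders of magnitude below 0.1 cores (1e8 ns/s)
	load := 2e9             // ns/s

	// The old floor bounded fractionUsed at ~20x; the new one allows 2e9x.
	// Both land on overloadUrgent, so the rebalancing outcome is identical.
	fmt.Println(summaryForFraction(load/minCapacity) == summaryForFraction(20)) // true
}
```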
NodeCPURateUsage:    int64(nodeUsageCores * nsPerCore),
NodeCPURateCapacity: int64(nodeCapCores * nsPerCore),
},
Capacity: roachpb.StoreCapacity{
suggestion: None of the test cases set CPUPerSecond, so the per-store CPU differentiation path (computePhysicalCPU line 99-102) is untested. That's the production path — each store reports its own CPUPerSecond. Consider adding a multi-store case where CPUPerSecond differs from the even split to exercise the load weighting.
@@ -22,23 +23,23 @@ type mmaReplica Replica
// The returned RangeLoad contains stats across multiple dimensions that MMA uses
// to determine top-k replicas and evaluate the load impact of rebalancing them.
func mmaRangeLoad(
nit: The comment starts with "The returned RangeLoad..." without the conventional function-name lead sentence. The method variant at line 37 has the full doc. Consider: "mmaRangeLoad constructs a RangeLoad from replica load stats and MVCC stats, converting logical loads to physical units via the given amplification factors."
tbg
left a comment
@tbg partially reviewed 27 files and made 1 comment.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on sumeerbhola and wenyihu6).
pkg/kv/kvserver/allocator/mmaprototype/load.go line 151 at r2 (raw file):
av[CPURate], av[WriteBandwidth], av[ByteSize],
Add something like var _ [3 - NumLoadDimensions]struct{} near here so that we are forced to update this should NumLoadDimensions grow.
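The suggested guard works because Go rejects negative array lengths at compile time. A standalone sketch (constant names mirror mmaprototype; the surrounding code is illustrative):

```go
package main

import "fmt"

const (
	CPURate = iota
	WriteBandwidth
	ByteSize
	NumLoadDimensions
)

// Compile-time assertion: if NumLoadDimensions ever grows past 3, the array
// length 3 - NumLoadDimensions goes negative and the build fails, forcing an
// update wherever the dimensions are enumerated by hand.
var _ [3 - NumLoadDimensions]struct{}

func main() {
	fmt.Println("NumLoadDimensions =", NumLoadDimensions) // 3
}
```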
sumeerbhola
left a comment
@sumeerbhola made 1 comment.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on tbg and wenyihu6).
-- commits line 43 at r3:
(drive-by comment)
can we make capacity = nodeCap / numStores?
and just take estimatedBg/numStores and add it to each store's load. Then at least one number (the capacity) corresponds to the real world.
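A hedged sketch of that alternative's arithmetic (the function name and the numbers are illustrative):

```go
package main

import "fmt"

// alternativeStore: capacity stays at the real-world nodeCap / numStores,
// and the background estimate is folded into each store's load instead.
func alternativeStore(storeLoad, nodeCap, estimatedBg float64, numStores int) (load, capacity float64) {
	n := float64(numStores)
	return storeLoad + estimatedBg/n, nodeCap / n
}

func main() {
	// 2 cores of store load, 8-core node, 1 core of estimated background,
	// 2 stores (all in ns/s).
	load, capacity := alternativeStore(2e9, 8e9, 1e9, 2)
	fmt.Printf("load=%g capacity=%g\n", load, capacity) // capacity corresponds to the physical node share
}
```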
Closing in favor of #164900.
Rebased on top of #164760.
Epic: CRDB-55052
Release note: none
mmaintegration: introduce physical capacity model
kvserver: convert MMA store and range loads to physical units