mmaintegration: introduce physical capacity model#164792

Closed
wenyihu6 wants to merge 3 commits into cockroachdb:master from wenyihu6:first

Conversation

@wenyihu6
Contributor

@wenyihu6 wenyihu6 commented Mar 4, 2026

Rebased on top of #164760.
Epic: CRDB-55052
Release note: none


mmaintegration: introduce physical capacity model

This change introduces a physical capacity model that expresses MMA store
loads and capacities in physical resource units (CPU ns/s, disk bytes)
and wires it into all callers, replacing the logical (capped multiplier)
model.

The capped model expressed capacity in a synthetic "logical KV CPU" unit:

  load = CPUPerSecond
  capacity = (nodeCap - estimatedBg) / estimatedMult / numStores

The physical model factors the multiplier out of the capacity formula and
into a separate amplification factor applied at the range-load boundary:

  load = CPUPerSecond * ampFactor
  capacity = (nodeCap - estimatedBg) / numStores
  range load = logicalRangeLoad * ampFactor   (via MakePhysicalRangeLoad)

Both models use the same capped-multiplier logic for background
estimation (ampFactor = clamp(nodeUsage/storesCPU, 1, 3), background =
max(0, nodeUsage - storesCPU * ampFactor)). The ratio load/capacity is
algebraically identical, so all rebalancing decisions driven by
`fractionUsed` in `loadSummaryForDimension` are unchanged.
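The algebraic equivalence is easy to check with concrete numbers. A minimal sketch (hypothetical values; the variable names mirror the formulas above, not actual CRDB identifiers):

```python
def clamp(x, lo, hi):
    return max(lo, min(x, hi))

node_usage = 6e9   # physical node CPU usage, ns/s
stores_cpu = 2e9   # sum of per-store CPUPerSecond (logical KV CPU)
node_cap = 8e9     # node CPU capacity, ns/s
num_stores = 1

# Shared capped-multiplier background estimation.
amp = clamp(node_usage / stores_cpu, 1, 3)          # 3.0
background = max(0, node_usage - stores_cpu * amp)  # 0.0

# Capped (logical) model.
capped_load = stores_cpu
capped_capacity = (node_cap - background) / amp / num_stores

# Physical model: multiplier moves from capacity to load.
phys_load = stores_cpu * amp
phys_capacity = (node_cap - background) / num_stores

# Identical load/capacity ratio, hence identical fractionUsed.
assert abs(capped_load / capped_capacity - phys_load / phys_capacity) < 1e-12
print(phys_load / phys_capacity)  # 0.75 under both models
```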

The physical model is preferred because it separates two concerns that
the capped model conflates into a single capacity value. Capacity
answers "how much physical CPU is available for MMA work after
subtracting background" — a quantity grounded in the actual resource.
The amplification factor answers "how much physical CPU does each unit
of logical range work cost" and is applied only at the range-load
boundary via `MakePhysicalRangeLoad`.

This separation has two practical benefits:

1. It makes the `LoadVector` uniform across dimensions. Disk is
   inherently physical (`load=Used`, `capacity=Used+Available`). CPU
   now follows the same pattern with a separate amplification factor
   for per-range conversion.

2. It resolves the unit mismatch between physical `NodeCPULoad` and
   store-level pending deltas in `nodeState.adjustedCPU`. MMA uses
   `adjustedCPU` for node-level overload detection: it starts at
   `NodeCPULoad` (= `NodeCPURateUsage`, physical ns/s) and accumulates
   pending change deltas as ranges are scheduled to move
   (`applyChangeLoadDelta` adds `change.loadDelta[CPURate]`). The
   result is compared against `NodeCPUCapacity` in
   `loadSummaryForDimension`. Under the capped model, these deltas were
   in logical units (`RequestCPUNanos + RaftCPUNanos`, no amplification):
   a range using 0.5 cores of direct KV CPU produced a delta of
   0.5e9 ns/s, but its true physical impact on node CPU could be
   0.5 * ampFactor = 1.5e9 ns/s. Under the physical model,
   `MakePhysicalRangeLoad` applies the amplification factor, so the
   delta is already in physical ns/s and combines directly with
   `NodeCPULoad`.
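To make the unit fix concrete, here is a toy analogue of the conversion (the function name and numbers follow the description above and are illustrative, not the real API):

```python
AMP = 3.0  # this store's hypothetical CPU amplification factor

def make_physical_range_load(logical_cpu_ns_per_s: float) -> float:
    """Convert a range's logical KV CPU into physical node CPU (ns/s)."""
    return logical_cpu_ns_per_s * AMP

node_cpu_load = 6.0e9  # NodeCPULoad, already physical ns/s

# Capped model: the delta stayed logical (0.5e9), understating the
# range's true physical impact on the node by a factor of AMP.
logical_delta = 0.5e9

# Physical model: the delta is amplified up front, so it combines
# directly with NodeCPULoad in the same units.
physical_delta = make_physical_range_load(0.5e9)  # 1.5e9 ns/s
adjusted = node_cpu_load + physical_delta
print(adjusted)
```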

The trade-off is that both load and capacity now include estimation
error from the amplification factor. In the capped model, store load was
the directly measured `CPUPerSecond`, a ground-truth value requiring no
estimation. In the physical model, load is `CPUPerSecond * ampFactor`,
so diagnosing a store reporting "6 cores of load" requires knowing the
amplification factor to recover the underlying 2 cores of direct KV CPU.


kvserver: convert MMA store and range loads to physical units

This change wires the physical capacity model into all MMA callers.

`MakeStoreLoadMsg` now delegates to `computePhysicalStore`.
Range loads are converted via `MakePhysicalRangeLoad` using an
`AmpVector` from the new `Store.AmplificationFactors()` method,
threaded through `mmaRangeLoad`, `tryConstructMMARangeMsg`,
`NonMMAPreTransferLease`, and `NonMMAPreChangeReplicas`.

`Store.AmplificationFactors()` returns `IdentityAmpVector()` when
`CachedCapacity()` is zero (e.g. early startup). The simulator uses
`IdentityAmpVector` with a TODO for real factors.

Release note: None

wenyihu6 added 3 commits March 3, 2026 21:19
The capped multiplier model (`computeCPUCapacityWithCap`) subtracts
background CPU when deriving per-store capacity. This intentionally
breaks the invariant maintained by the naive model:

  sum(store loads) / sum(store capacities) = NodeCPURateUsage / NodeCPURateCapacity

When background CPU exceeds the cap, node-level overload detection can
no longer recover physical utilization by summing store-level values.

This change replaces the derived `NodeLoad` fields (`ReportedCPU` and
`CapacityCPU`, incremental sums of per-store values that relied on this
invariant) with explicit physical node-level CPU values (`NodeCPULoad`,
`NodeCPUCapacity`) carried in `StoreLoadMsg` from the `NodeCapacity`
fields.

Note that each store generates its descriptor independently, so
`NodeCPULoad` values may differ slightly across stores on the same
node due to sampling time differences. This is acceptable since
they should be roughly the same.
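A quick numeric sketch of the broken invariant (hypothetical values): once the multiplier is capped and estimated background is subtracted from per-store capacity, summing store-level values no longer recovers node-level utilization.

```python
def clamp(x, lo, hi):
    return max(lo, min(x, hi))

node_usage, node_cap = 7e9, 10e9  # physical node CPU, ns/s
stores_cpu, num_stores = 1e9, 2   # total logical store CPU

amp = clamp(node_usage / stores_cpu, 1, 3)  # implicit multiplier 7, capped at 3
background = node_usage - stores_cpu * amp  # 4e9 ns/s attributed to background

store_load = stores_cpu / num_stores
store_cap = (node_cap - background) / amp / num_stores

sum_ratio = (store_load * num_stores) / (store_cap * num_stores)  # 0.5
node_ratio = node_usage / node_cap                                # 0.7
assert sum_ratio != node_ratio  # invariant broken when background > 0
```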
@wenyihu6 wenyihu6 requested review from a team as code owners March 4, 2026 02:46
@trunk-io
Contributor

trunk-io bot commented Mar 4, 2026

Merging to master in this repository is managed by Trunk.

  • To merge this pull request, check the box to the left or comment /trunk merge below.

@wenyihu6 wenyihu6 requested review from sumeerbhola and tbg and removed request for a team March 4, 2026 02:46
@cockroach-teamcity
Member

This change is Reviewable

tbg added a commit to tbg/cockroach that referenced this pull request Mar 4, 2026
Add a project-level code review skill that orchestrates multiple review
aspects (redactability, commit structure, red/green testing, PR descriptions,
general quality) by referencing existing rules and the commit-arc skill
rather than duplicating their content. Includes support for posting
batched GitHub reviews with inline line comments via the API.

Also add a conventions rule covering PR descriptions, Bazel build files,
metrics, Go conventions, and CI checks.

Note: .github/workflows/pr-analyzer-threestage.yml already runs an
AI-powered bug-screening pipeline on PRs. That workflow is deliberately
narrow — it only flags likely bugs and uses a three-stage confidence
filter to minimize noise. This skill is broader (reviewability, commit
structure, redactability, etc.) and designed for interactive use. In the
future, the CI workflow could invoke this skill to get structured,
aspect-based review while keeping its conservative posting threshold.

Example reviews generated by this skill:
- PR cockroachdb#164658 (63-commit MMA repair): https://gist.github.com/tbg/ba4d528bcfa27cdae9ca587f91fb7178
- PR cockroachdb#164792 (physical modeling): https://gist.github.com/tbg/1d78a0c496069f18ec35a2b551483052
- PR cockroachdb#164677 (connection retry test): https://gist.github.com/tbg/69ec2eea0c4ab5d794a135fa62e551dc
- PR cockroachdb#79134 (SKIP LOCKED): https://gist.github.com/tbg/052dae1e3f6449811fca92e275905f1f
- PR cockroachdb#161454 (engine separation): https://gist.github.com/tbg/63ef958edd355fb166493b008a0291d5

Epic: none
Release note: None

Co-Authored-By: roachdev-claude <roachdev-claude-bot@cockroachlabs.com>
Member

@tbg tbg left a comment


Reviewing commits 2 and 3 only (commit 1 is from #164760).

Two clean commits: the first introduces the physical model types and functions with a comprehensive datadriven test; the second wires the new model into all callers. The algebraic equivalence argument is convincing (fractionUsed is unchanged), and the commit split follows a good "introduce abstraction, then wire" pattern.

A few stale comment references that are now out of date after commit 3's changes to MakeStoreLoadMsg — not blocking, but worth a follow-up or a fixup commit:

  • StoreLoadMsg.Capacity comment (messages.go:25) says "See computeCPUCapacityWithCap" but MakeStoreLoadMsg now uses computePhysicalStore.
  • NodeLoad comment (load.go:257) references computeCPUCapacityWithCap — same issue.
  • adjustedCPU comment (cluster_state.go:908) says "deltas are in store-level modeled CPU units while NodeCPULoad is physical." Under the physical model, range deltas are also in physical units (via MakePhysicalRangeLoad). The remaining mismatch is subtler: the source store's amplification factor is used for the delta, but the target node's actual amplification may differ.

(made with /review-crdb)

// jobs, etc. This is zero when the implicit multiplier is below the cap.
//
// 4. load = mmaAttributedLoad / numStores
// Per-store MMA-attributed physical CPU. This is what MMA can control.
Member


nit: Step 4 says "load = mmaAttributedLoad / numStores" but the implementation (line 99) uses desc.Capacity.CPUPerSecond * ampFactor when CPUPerSecond > 0, only falling back to the even split. The inline comment at line 96 explains this well, but the numbered summary should match (e.g., "load = storeCPU * ampFactor, where storeCPU is this store's CPUPerSecond or storesCPURate / numStores as fallback").

Comment on lines +125 to +128
// minCapacity is the floor for per-store physical capacity in any dimension.
// This prevents zero-capacity values that could arise during early node startup
// or on empty stores.
const minCapacity mmaprototype.LoadValue = 1.0
Member


suggestion: This floor is ~8 orders of magnitude lower than the old cpuCapacityFloorPerStore (0.1 cores = 1e8 ns/s). The algebraic equivalence doesn't hold at the floor: the old model bounded fractionUsed at ~20x while this allows astronomically large values. The rebalancing outcome is the same (overloadUrgent saturates at fractionUsed > 0.9), so this is fine — but worth documenting.

Suggested change
// minCapacity is the floor for per-store physical capacity in any dimension.
// This prevents zero-capacity values that could arise during early node startup
// or on empty stores.
const minCapacity mmaprototype.LoadValue = 1.0
// minCapacity is a division-by-zero guard for per-store physical capacity in
// any dimension. It is intentionally tiny (1 unit) rather than a meaningful
// capacity floor: when capacity is this small, fractionUsed can be very large,
// but loadSummaryForDimension saturates at overloadUrgent (fractionUsed > 0.9)
// regardless of magnitude.
const minCapacity mmaprototype.LoadValue = 1.0

NodeCPURateUsage: int64(nodeUsageCores * nsPerCore),
NodeCPURateCapacity: int64(nodeCapCores * nsPerCore),
},
Capacity: roachpb.StoreCapacity{
Member


suggestion: None of the test cases set CPUPerSecond, so the per-store CPU differentiation path (computePhysicalCPU line 99-102) is untested. That's the production path — each store reports its own CPUPerSecond. Consider adding a multi-store case where CPUPerSecond differs from the even split to exercise the load weighting.

@@ -22,23 +23,23 @@ type mmaReplica Replica
// The returned RangeLoad contains stats across multiple dimensions that MMA uses
// to determine top-k replicas and evaluate the load impact of rebalancing them.
func mmaRangeLoad(
Member


nit: The comment starts with "The returned RangeLoad..." without the conventional function-name lead sentence. The method variant at line 37 has the full doc. Consider: "mmaRangeLoad constructs a RangeLoad from replica load stats and MVCC stats, converting logical loads to physical units via the given amplification factors."

Member

@tbg tbg left a comment


@tbg partially reviewed 27 files and made 1 comment.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on sumeerbhola and wenyihu6).


pkg/kv/kvserver/allocator/mmaprototype/load.go line 151 at r2 (raw file):

		av[CPURate],
		av[WriteBandwidth],
		av[ByteSize],

Add something like `var _ [3 - NumLoadDimensions]struct{}` near here so that we are forced to update this should `NumLoadDimensions` grow.

Collaborator

@sumeerbhola sumeerbhola left a comment


@sumeerbhola made 1 comment.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on tbg and wenyihu6).


-- commits line 43 at r3:
(drive-by comment)

Can we make `capacity = nodeCap / numStores`, and just take `estimatedBg / numStores` and add it to each store's load? Then at least one number (the capacity) corresponds to the real world.

@wenyihu6
Contributor Author

wenyihu6 commented Mar 5, 2026

Closing in favor of #164900.
