fix parity issues, memory, and 2.5pct bug #359
Merged
motatoes
approved these changes
Jun 8, 2026
Billing parity follow-ups: four fixes to close the residual drift before edge-authority cutover
The post-merge parity check (PR #352 deployed to prod) reported drift between cell and edge measurements on every checked bucket. Investigation traced the drift to
four distinct bugs in three different subsystems — none in the tick-attribution work itself, all in adjacent measurement and reporting code. This PR fixes all four.
Background
Hourly parity check on prod was flagging consistent patterns:
Drilling into each:
- `usage_samples` is pro-only by design (free orgs go through the CreditAccount DO `/debit` fan-out, a separate path), so comparing free-org cell GB·s against `usage_samples` is a category mismatch. It always returns -100% and drowns out real signal.
- `GetOrgUsage`'s SQL only clipped `ended_at` to the window end when it was NULL. A scale event whose `ended_at` extended past the bucket end was attributed its full duration to that bucket (e.g., 32 minutes credited when only 18 minutes fell inside the window). Triggered on any sandbox whose scale event boundaries don't align with the hour grid.
- `grpc_server.go:787` had a `TODO: get actual memory from sandbox state` and was hardcoded to record 1024 MB / 100% CPU in cell PG after every wake, regardless of what the virtio-mem hotplug actually restored. Every woken sandbox got billed at the smallest tier on the cell side.
- `int(elapsed.Seconds())` discarded each tick's fractional remainder. At 180 ticks/hour, an average of ~0.5s lost per tick = 90s lost per hour = exactly the observed 2.5% under-count on every long-running sandbox.
The fix in PR #352 was working as designed. None of the four bugs above are regressions of that work.
What this PR changes
Fix 1: `GetOrgUsage` clips `ended_at` to window end (`internal/db/usage.go`)
The `LEAST(..., $3)` now applies to both the non-NULL and NULL branches. Pre-fix, a scale event with non-NULL `ended_at` past the window end was attributed its full duration to the in-window total.
Fix 2: Wake handler reads actual memory from sandbox state (`internal/worker/grpc_server.go`)
Replaces the hardcoded `memMB := 1024; cpuPct := 100` with the same derivation `recordInitialScaleEvent` uses.
`sb.MemoryMB` is the manager's authoritative post-wake tier (the snapshot's plugged-back-in total via virtio-mem). The worker log line at wake, `pre-resume virtio-mem plug … total=N`, was already correct; only the cell-PG write was wrong.
Fix 3: Parity checker skips free orgs (`internal/controlplane/usage_parity.go` + `internal/db/store.go`)
- `Store.GetOrgPlan(ctx, orgID string) (string, error)`: single-column lookup, faster than `GetOrg`, designed for "skip if free" gating.
- `usageParitySource` interface extended with `GetOrgPlan`.
- In `tick()`, before computing per-org drift, look up the plan and skip free orgs. New log field `skipped_free=N` so the operator can see how many were filtered.
Free orgs' edge-side accounting still works; it just lives in the CreditAccount DO, not in `usage_samples`. A future enhancement could compare free orgs against the DO's debit total, but the immediate noise is gone.
Fix 4: Edge ticker carries fractional remainder across emits (`internal/worker/usage_ticker.go`)
The robust version: a `fracRemainder map[string]float64` carries each tick's leftover sub-second into the next, so cumulative drift is zero rather than `int()` truncating ~0.5s every tick.
Special case: when `intervalSecondsFor` deliberately caps the elapsed time (first-observation guess, or a gap of more than 2× the tick interval from a missed wake-detection), we zero the carried remainder. Otherwise the cap would be subverted: a prior carry-forward would let us bill back the very seconds we chose to drop.
`markEmitted` now takes a third arg for the new remainder. `dropState`/`pruneStateNotIn` also clean it up. Tests updated.
New tests
Three regression tests in `internal/worker/usage_ticker_test.go` pin Fix 4:
- `CarriesForwardNoDrift`: 200 ticks of 19.5s elapsed each must bill within 1s of 200 × 19.5 = 3900s exact. Pre-fix would have billed 200 × 19 = 3800s (100s lost ≈ 2.6% drift); post-fix bills 3899–3900s.
- `SubSecondTicksAccumulate`: three sub-second ticks (0.4s each) must cumulatively emit 1s by the third tick. Pre-fix would have silently dropped all three.
- `CapDiscardsRemainder`: when the elapsed time is capped, a previously stashed remainder must NOT carry through. Verifies the cap can't be subverted by a buffered fractional.
Verification
- `go test ./internal/worker/...`
- `go test ./internal/db/...`
- `TestSmartScaleDownTargetsLeastLoaded`, `TestScaleDownSkipsAlreadyDraining`, `TestDrainTimeoutCancelsDrainKeepsWorker`: verified failing on `main` before this PR; unrelated.
End-to-end verification of Fix 4 on prod data after deploy: a long-running sandbox with N ticks per hour should now report cell-vs-edge drift within ±1s, not the systematic -2.5%. The parity-check log line will show `skipped_free=N flagged=…` instead of every free org pulling the noise floor down.
Files touched
- `internal/db/usage.go`
- `internal/db/store.go`: `GetOrgPlan` method
- `internal/controlplane/usage_parity.go`
- `internal/worker/grpc_server.go`: `sb.MemoryMB` instead of hardcoded 1024
- `internal/worker/usage_ticker.go`: `fracRemainder` carry-forward + cap guard
- `internal/worker/usage_ticker_test.go`
Not touched: api-edge Worker, events-ingest Worker, D1 schema, agent, gRPC proto, SDKs. Pure CP/worker change.
Rollout
- `skipped_free=N` and `flagged` to drop to ~0 across all pro orgs.
- `PRO_BILLING_AUTHORITY=edge` to complete the cutover the original PR (Edge billing cutover #349) set up.
Migration: none; all changes are Go-only.
Safe to roll back: the four fixes are independent. Revert by reverting this PR; behavior returns to the pre-PR state on the next CP/worker restart.
Test plan
- `go test ./internal/db/... ./internal/worker/... ./internal/controlplane/...`
- `skipped_free=N` line + reduced drift on next hourly cycle
- `PRO_BILLING_AUTHORITY=edge`