Skip to content

fix parity issues, memory, and 2.5pct bug#359

Merged
breardon2011 merged 1 commit into
mainfrom
billing-parity-followups
Jun 8, 2026
Merged

fix parity issues, memory, and 2.5pct bug#359
breardon2011 merged 1 commit into
mainfrom
billing-parity-followups

Conversation

@breardon2011

Copy link
Copy Markdown
Contributor

Billing parity follow-ups: four fixes to close the residual drift before edge-authority cutover

The post-merge parity check (PR #352 deployed to prod) reported drift between cell and edge measurements on every checked bucket. Investigation traced the drift to
four distinct bugs in three different subsystems — none in the tick-attribution work itself, all in adjacent measurement and reporting code. This PR fixes all four.

Background

Hourly parity check on prod was flagging consistent patterns:

  • -100% drift on every free org — present in every bucket.
  • One pro org at -13%, otherwise steady.
  • Multiple pro orgs at ~-2.5% steady-state regardless of workload pattern.

Drilling into each:

  • The -100% is a parity-checker scope mistake — D1 usage_samples is pro-only by design (free orgs go through the CreditAccount DO /debit fan-out, separate path),
    so comparing free-org cell GB·s against usage_samples is a category mismatch. Always returns -100%, drowns out real signal.
  • The -13% is actually cell over-counting, in two distinct sub-bugs:
    • GetOrgUsage's SQL only clipped ended_at to the window end when it was NULL. A scale event whose ended_at extended past the bucket end was attributed its full
      duration to that bucket (e.g., 32 minutes credited when only 18 minutes fell inside the window). Triggered on any sandbox whose scale event boundaries don't align with
      the hour grid.
    • The wake handler in grpc_server.go:787 had a TODO: get actual memory from sandbox state and was hardcoded to record 1024 MB / 100% CPU in cell PG after every
      wake — regardless of what the virtio-mem hotplug actually restored. Every woken sandbox got billed at the smallest tier on the cell side.
  • The -2.5% is the steady-state edge truncation: int(elapsed.Seconds()) discarded each tick's fractional remainder. At 180 ticks/hour, average ~0.5s lost per tick
    = 90s lost per hour = exactly the observed 2.5% under-count on every long-running sandbox.

The fix in PR #352 was working as designed. None of the four bugs above are regressions of that work.

What this PR changes

Fix 1 — GetOrgUsage clips ended_at to window end (internal/db/usage.go)

-- before
COALESCE(ended_at, LEAST(now(), $3))

-- after
LEAST(COALESCE(ended_at, now()), $3)

The LEAST(..., $3) now applies to both the non-NULL and NULL branches. Pre-fix, a scale event with non-NULL ended_at past the window end was attributed its full
duration to the in-window total.

Fix 2 — Wake handler reads actual memory from sandbox state (internal/worker/grpc_server.go)

Replaces the hardcoded memMB := 1024; cpuPct := 100 with the same derivation recordInitialScaleEvent uses:

memMB := sb.MemoryMB
if memMB <= 0 { memMB = 1024 }
cpuPct := (memMB * 100) / 1024
if cpuPct < 100 { cpuPct = 100 }

sb.MemoryMB is the manager's authoritative post-wake tier (the snapshot's plugged-back-in total via virtio-mem). The worker log line at wake — pre-resume virtio-mem plug … total=N — was already correct; only the cell-PG write was wrong.

Fix 3 — Parity checker skips free orgs (internal/controlplane/usage_parity.go + internal/db/store.go)

  • New Store.GetOrgPlan(ctx, orgID string) (string, error) — single-column lookup, faster than GetOrg, designed for "skip if free" gating.
  • usageParitySource interface extended with GetOrgPlan.
  • In tick(), before computing per-org drift, look up the plan and skip free orgs. New log field skipped_free=N so the operator can see how many were filtered.

Free orgs' edge-side accounting still works — it just lives in the CreditAccount DO, not in usage_samples. A future enhancement could compare free orgs against the
DO's debit total, but the immediate noise is gone.

Fix 4 — Edge ticker carries fractional remainder across emits (internal/worker/usage_ticker.go)

The robust version: a fracRemainder map[string]float64 carries each tick's leftover sub-second into the next, so cumulative drift is zero rather than int()
truncating ~0.5s every tick.

totalSecs := elapsed.Seconds() + carried                                                 
emitSecs := int(totalSecs)              // floor
newRemainder := totalSecs - float64(emitSecs)
return emitSecs, newRemainder

Special-case: when intervalSecondsFor deliberately caps the elapsed time (first-observation guess, or > 2× tick interval gap from a missed wake-detection), we zero the
carried remainder. Otherwise the cap would be subverted — a prior carry-forward would let us bill back the very seconds we chose to drop.

markEmitted now takes a third arg for the new remainder. dropState / pruneStateNotIn also clean it up. Tests updated.

New tests

Three regression tests in internal/worker/usage_ticker_test.go pin Fix 4:

  • CarriesForwardNoDrift — 200 ticks of 19.5s elapsed each must bill within 1s of 200 × 19.5 = 3900s exact. Pre-fix would have billed 3800s (50s lost = 2.5%
    drift); post-fix bills 3899–3900s.
  • SubSecondTicksAccumulate — three sub-second ticks (0.4s each) must cumulatively emit 1s by the third tick. Pre-fix would have silently dropped all three.
  • CapDiscardsRemainder — when the elapsed is capped, a previously stashed remainder must NOT carry through. Verifies the cap can't be subverted by a buffered
    fractional.

Verification

  • All worker unit tests pass (go test ./internal/worker/...).
  • All db unit tests pass (go test ./internal/db/...).
  • All controlplane unit tests pass except three pre-existing scaler test failures (TestSmartScaleDownTargetsLeastLoaded, TestScaleDownSkipsAlreadyDraining,
    TestDrainTimeoutCancelsDrainKeepsWorker) — verified failing on main before this PR; unrelated.
  • Build clean on every affected package.

End-to-end verification of fix #4 on prod data after deploy: a long-running sandbox with N ticks per hour should now report cell-vs-edge drift within ±1s, not the
systematic -2.5%. The parity-check log line will show skipped_free=N flagged=… instead of every free org pulling the noise floor down.

Files touched

File Change
internal/db/usage.go SQL clip fix
internal/db/store.go New GetOrgPlan method
internal/controlplane/usage_parity.go Free-org skip + interface extension + log field
internal/worker/grpc_server.go Wake reads sb.MemoryMB instead of hardcoded 1024
internal/worker/usage_ticker.go fracRemainder carry-forward + cap guard
internal/worker/usage_ticker_test.go Three regression tests + helper update

Not touched: api-edge Worker, events-ingest Worker, D1 schema, agent, gRPC proto, SDKs. Pure CP/worker change.

Rollout

  1. Deploy CP + worker binaries to prod.
  2. Watch the next 1–2 hourly parity-check log lines: expect skipped_free=N and flagged to drop to ~0 across all pro orgs.
  3. Once parity is clean for a 24h soak, flip PRO_BILLING_AUTHORITY=edge to complete the cutover the original PR Edge billing cutover #349 set up.

Migration: none — all Go-only changes.

Safe to roll back: the four fixes are independent. Revert by reverting this PR; behavior returns to the pre-PR state on the next CP/worker restart.

Test plan

  • Unit tests pass (go test ./internal/db/... ./internal/worker/... ./internal/controlplane/...)
  • Three regression tests added for Fix 4
  • Deploy to dev, verify parity check shows expected skipped_free=N line + reduced drift on next hourly cycle
  • Deploy to prod
  • 24h parity soak — expect ~0% drift on pro orgs
  • Flip PRO_BILLING_AUTHORITY=edge

@breardon2011 breardon2011 marked this pull request as ready for review June 8, 2026 19:32
@breardon2011 breardon2011 merged commit b5249b3 into main Jun 8, 2026
3 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants