billing eventing improvement #352
Merged
motatoes
approved these changes
Jun 5, 2026
6 tasks
Why
While doing a 24h parity-history check on PR #349 (edge billing in shadow mode, PRO_BILLING_AUTHORITY=cell), we pulled the full org-level drift from the paritychecker and saw consistent skew between cell-PG and edge-D1 GB-seconds. Two distinct bugs were behind it; flipping PRO_BILLING_AUTHORITY=edge with either still in place would have biased the bill of record.
What
1. Tick-attribution fix (usage_ticker.go + lifecycle.go + qemu/manager hooks)

The edge measurement was a fixed-interval ticker (20s) that always emitted a full-tick cost per observation. Two structural mismatches with the cell-side scale_events measurement:
(destroy/hibernate) or mis-attributed to the wrong config (scale).
Fix:
- sandbox.LifecycleObserver interface (internal/sandbox/lifecycle.go): OnSandboxScale, OnSandboxDestroy, OnSandboxHibernate, OnSandboxWake. All carry startedAt as a fallback attribution point for sandboxes that have never been ticked.
- internal/qemu/manager.go fires these at the right moments (scale and hibernate fire BEFORE the change so the closing slice uses old config; wake fires AFTER loadvm so post-wake ticks measure forward from wake).
- usage_ticker.go tracks per-sandbox lastSeen, computes intervalSecondsFor as the actual elapsed time (capped at 2× the periodic interval to bound a stale-worker blast radius), and scales the cost by the slice duration.
- OnSandboxWake resets lastSeen=now so subsequent emits don't bill the hibernation window.
- cmd/worker/main.go wires the ticker as the observer.
- usage_ticker_test.go covers interval math, scaled cost, lifecycle paths, and short-sandbox regression.

2. Post-wake event-loss fix (sqlite.go + redis_event_publisher.go)

While investigating drift on prod we noticed
woke events were ~92% missing in D1 (21 lifetime vs 272 hibernated). Root cause:
- <sandbox>:<sqlite_id>. The determinism is required: a re-publish (XADD succeeded but MarkSynced didn't) must collide so events-ingest's ON CONFLICT(id) DO NOTHING dedups it.
- SandboxDBs.Remove() → os.RemoveAll on the per-sandbox SQLite directory. Wake calls Get() which recreates the file from scratch. The fresh INTEGER PRIMARY KEY AUTOINCREMENT restarts at 1.
- passes the old max.
For a sandbox with 40 pre-hibernate events (created + 39 ticks + hibernated), the next ~40 post-wake events (including woke and ~13 minutes of ticks) silently vanish.
Empirical confirmation on prod: the gap between hibernated and the next surviving event correlates with pre-hibernate event count × tick interval.

Fix: stamp each SandboxDB with a generation (UnixNano) on first open, stored in a new db_meta table. Re-opens read the same value back (retry-dedup still works); a recreated file gets a new generation. Envelope ID becomes <sandbox>:<generation>:<sqlite_id>. Old D1 rows (two-segment IDs) coexist fine; they're just strings.

3 new tests in sqlite_generation_test.go pin stability across re-opens, change across Remove+Get, and that GetAllUnsyncedEventsFlat carries the generation through to the publisher.
Verification on dev
Built + deployed to dev (RG opensandbox-prodwestus2, both workers on 1f0408d). Created a sandbox, slept 25s for a tick, hibernated, waited 5s, woke. D1 dev shows:

sb-8bf1e9b6:1780631482378984496:1 created
sb-8bf1e9b6:1780631482378984496:2 usage_tick
sb-8bf1e9b6:1780631482378984496:3 usage_tick
sb-8bf1e9b6:1780631482378984496:4 usage_tick
sb-8bf1e9b6:1780631482378984496:5 hibernated
sb-8bf1e9b6:1780631526135659700:1 woke ← would previously collide with :1 (created)
sb-8bf1e9b6:1780631526135659700:2 usage_tick ← would previously collide with :2
sb-8bf1e9b6:1780631526135659700:3 usage_tick ← would previously collide with :3
All 8 events present; the two generations namespace the IDs cleanly.
Risk / rollout
PRO_BILLING_AUTHORITY=edge. Until then we're still shadow-mode and cell remains authority.
MarkSynced failed" state at the moment we restart the worker will re-emit under the new envelope-id format and create a duplicate D1 row. Only happens on a worker crash mid-flush; acceptable.
Files touched
- internal/sandbox/lifecycle.go (new): LifecycleObserver interface
- internal/sandbox/sqlite.go: db_meta table + Generation(); SandboxEvent.Generation propagated through GetAllUnsyncedEventsFlat
- internal/sandbox/sqlite_generation_test.go (new): generation invariants
- internal/worker/usage_ticker.go: periodic ticker + lifecycle hooks + scaled cost
- internal/worker/usage_ticker_test.go (new): 17 tests
- internal/worker/redis_event_publisher.go: envelope ID now <sandbox>:<generation>:<sqlite_id> on both code paths
- internal/qemu/manager.go: SetLifecycleObserver + hooks in destroyVM, SetResourceLimits, Hibernate, Wake
- cmd/worker/main.go: wires the ticker as the observer

Test plan
- go test ./internal/sandbox/... ./internal/worker/...)
- PRO_BILLING_AUTHORITY=edge