billing eventing improvement by breardon2011 · Pull Request #352 · diggerhq/opencomputer

breardon2011 · 2026-06-05T03:55:52Z

Why

While doing a 24h parity-history check on PR #349 (edge billing in shadow mode, PRO_BILLING_AUTHORITY=cell), we pulled the full org-level drift from the parity
checker and saw consistent skew between cell-PG and edge-D1 GB-seconds. Two distinct bugs were behind it; flipping PRO_BILLING_AUTHORITY=edge with either still in
place would have biased the bill of record.

What

1. Tick-attribution fix (`usage_ticker.go` + `lifecycle.go` + qemu/manager hooks)

The edge measurement was a fixed-interval ticker (20s) that always emitted a full-tick cost per observation. That produced three structural mismatches with the
cell-side scale_events measurement:

Over-count short sandboxes. A sandbox that lived 4s and got destroyed was attributed a full 20s tick at next emit — billed 5× actual.
Under-count short-lived sandboxes that died before any tick. Sandboxes created and destroyed within a tick interval had no observation at all → un-billed.
Drift across lifecycle events. Scale, hibernate, destroy, wake all happened between periodic ticks; the slice from last-tick → lifecycle-event was either lost
(destroy/hibernate) or mis-attributed to the wrong config (scale).

Fix:

New sandbox.LifecycleObserver interface (internal/sandbox/lifecycle.go) — OnSandboxScale, OnSandboxDestroy, OnSandboxHibernate, OnSandboxWake. All carry
startedAt as a fallback attribution point for sandboxes that have never been ticked.
internal/qemu/manager.go fires these at the right moments (scale and hibernate fire BEFORE the change so the closing slice uses old config; wake fires AFTER
loadvm so post-wake ticks measure forward from wake).
usage_ticker.go tracks per-sandbox lastSeen, computes intervalSecondsFor as the actual elapsed time (capped at 2× the periodic interval to bound a stale-worker
blast radius), and scales the cost by the slice duration. OnSandboxWake resets lastSeen=now so subsequent emits don't bill the hibernation window.
cmd/worker/main.go wires the ticker as the observer.
17 new unit tests in usage_ticker_test.go cover interval math, scaled cost, lifecycle paths, and short-sandbox regression.

2. Post-wake event-loss fix (`sqlite.go` + `redis_event_publisher.go`)

While investigating drift on prod we noticed woke events were ~92% missing in D1 (21 lifetime vs 272 hibernated). Root cause:

Each upstream event has a deterministic envelope ID <sandbox>:<sqlite_id>. The determinism is required: a re-publish (XADD succeeded but MarkSynced didn't) must
collide so events-ingest's ON CONFLICT(id) DO NOTHING dedups it.
Hibernate calls SandboxDBs.Remove() → os.RemoveAll on the per-sandbox SQLite directory. Wake calls Get() which recreates the file from scratch. The fresh
INTEGER PRIMARY KEY AUTOINCREMENT restarts at 1.
So post-wake event ids 1..N collide with the pre-hibernate envelope IDs already in D1 and are silently dropped — for every hibernated sandbox, until the new counter
passes the old max.

For a sandbox with 40 pre-hibernate events (created + 38 ticks + hibernated), the next ~40 post-wake events (including woke and ~13 minutes of ticks) silently
vanish.
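The collision can be reproduced with an in-memory stand-in for the D1 table. Everything in this sketch is hypothetical (the `d1` map type, `insert`, `legacyEnvelopeID`, `publish`); it only models the two-segment ID format and the ON CONFLICT(id) DO NOTHING behavior described above.

```go
package main

import "fmt"

// d1 is an in-memory stand-in for the D1 events table, keyed by envelope ID.
type d1 map[string]string

// insert mimics INSERT ... ON CONFLICT(id) DO NOTHING and reports
// whether the row was actually stored.
func (d d1) insert(id, kind string) bool {
	if _, exists := d[id]; exists {
		return false
	}
	d[id] = kind
	return true
}

// legacyEnvelopeID builds the two-segment ID <sandbox>:<sqlite_id>.
func legacyEnvelopeID(sandbox string, sqliteID int64) string {
	return fmt.Sprintf("%s:%d", sandbox, sqliteID)
}

// publish sends events whose row IDs start at 1, as a freshly created
// AUTOINCREMENT table would assign them, and returns how many were dropped.
func publish(table d1, sandbox string, kinds []string) int {
	dropped := 0
	for i, kind := range kinds {
		if !table.insert(legacyEnvelopeID(sandbox, int64(i+1)), kind) {
			dropped++
		}
	}
	return dropped
}

func main() {
	table := d1{}
	pre := []string{"created", "usage_tick", "usage_tick", "hibernated"}
	post := []string{"woke", "usage_tick", "usage_tick", "usage_tick", "usage_tick"}

	fmt.Println(publish(table, "sb-demo", pre))  // nothing to collide with yet
	fmt.Println(publish(table, "sb-demo", post)) // ids 1..4 collide, id 5 survives
}
```

The second call drops four of the five post-wake events, including woke, because the recreated counter reuses IDs 1 through 4.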

Empirical confirmation on prod — gap between hibernated and next surviving event correlates with pre-hibernate event count × tick interval:

sandbox	pre-hibernate events	gap	post-wake survivors
sb-eb5478a6	40	933s	93
sb-95cc1643	21	4592s	1
sb-23d85e6c	4	145s	5

Fix: stamp each SandboxDB with a generation (UnixNano) on first open, stored in a new db_meta table. Re-opens read the same value back (retry-dedup still works); a
recreated file gets a new generation. Envelope ID becomes <sandbox>:<generation>:<sqlite_id>. Old D1 rows (two-segment IDs) coexist fine — they're just strings.

3 new tests in sqlite_generation_test.go pin stability across re-opens, change across Remove+Get, and that GetAllUnsyncedEventsFlat carries the generation
through to the publisher.

Verification on dev

Built + deployed to dev (RG opensandbox-prod westus2, both workers on 1f0408d). Created a sandbox, slept 25s for a tick, hibernated, waited 5s, woke. D1 dev shows:

sb-8bf1e9b6:1780631482378984496:1 created
sb-8bf1e9b6:1780631482378984496:2 usage_tick
sb-8bf1e9b6:1780631482378984496:3 usage_tick
sb-8bf1e9b6:1780631482378984496:4 usage_tick
sb-8bf1e9b6:1780631482378984496:5 hibernated
sb-8bf1e9b6:1780631526135659700:1 woke ← would previously collide with :1 (created)
sb-8bf1e9b6:1780631526135659700:2 usage_tick ← would previously collide with :2
sb-8bf1e9b6:1780631526135659700:3 usage_tick ← would previously collide with :3

All 8 events present; the two generations namespace the IDs cleanly.
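For anyone inspecting D1 rows after this change, a small parser that accepts both ID shapes may be useful. This is a sketch only: `Envelope` and `parseEnvelopeID` are hypothetical names, and it assumes sandbox IDs never contain a colon, which holds for the IDs shown above.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// Envelope is a hypothetical parsed form of an envelope ID.
type Envelope struct {
	Sandbox    string
	Generation int64 // zero for legacy two-segment IDs
	SQLiteID   int64
	Legacy     bool
}

// parseEnvelopeID accepts <sandbox>:<sqlite_id> (legacy) and
// <sandbox>:<generation>:<sqlite_id> (new). Assumes no ':' in sandbox IDs.
func parseEnvelopeID(id string) (Envelope, error) {
	parts := strings.Split(id, ":")
	if len(parts) != 2 && len(parts) != 3 {
		return Envelope{}, fmt.Errorf("unexpected envelope ID %q", id)
	}
	env := Envelope{Sandbox: parts[0], Legacy: len(parts) == 2}
	var err error
	if !env.Legacy {
		if env.Generation, err = strconv.ParseInt(parts[1], 10, 64); err != nil {
			return Envelope{}, fmt.Errorf("bad generation in %q: %w", id, err)
		}
	}
	last := parts[len(parts)-1]
	if env.SQLiteID, err = strconv.ParseInt(last, 10, 64); err != nil {
		return Envelope{}, fmt.Errorf("bad sqlite id in %q: %w", id, err)
	}
	return env, nil
}

func main() {
	ids := []string{
		"sb-8bf1e9b6:1780631482378984496:1", // created, first generation
		"sb-8bf1e9b6:1780631526135659700:1", // woke, second generation
		"sb-8bf1e9b6:1",                     // legacy two-segment form
	}
	for _, id := range ids {
		env, err := parseEnvelopeID(id)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("%+v\n", env)
	}
}
```

The first two IDs share a sandbox and SQLite row ID but parse to different generations, which is why they no longer collide.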

Risk / rollout

Tick fix is structural — needs a soak (24h+ parity history on prod) before flipping PRO_BILLING_AUTHORITY=edge. Until then we're still shadow-mode and cell
remains authority.
Event-loss fix is backwards-compatible at the data layer (existing two-segment D1 rows untouched, new rows are three-segment strings).
One narrow deploy-time race for the event-loss fix: an event already in the "XADD succeeded but MarkSynced failed" state at the moment we restart the worker will
re-emit under the new envelope-id format and create a duplicate D1 row. Only happens on a worker crash mid-flush; acceptable.

Files touched

internal/sandbox/lifecycle.go (new) — LifecycleObserver interface
internal/sandbox/sqlite.go — db_meta table + Generation(); SandboxEvent.Generation propagated through GetAllUnsyncedEventsFlat
internal/sandbox/sqlite_generation_test.go (new) — generation invariants
internal/worker/usage_ticker.go — periodic ticker + lifecycle hooks + scaled cost
internal/worker/usage_ticker_test.go (new) — 17 tests
internal/worker/redis_event_publisher.go — envelope ID now <sandbox>:<generation>:<sqlite_id> on both code paths
internal/qemu/manager.go — SetLifecycleObserver + hooks in destroyVM, SetResourceLimits, Hibernate, Wake
cmd/worker/main.go — wires the ticker as the observer

Test plan

Unit tests pass locally (go test ./internal/sandbox/... ./internal/worker/...)
Built + deployed to dev; manual hibernate/wake → all 8 events landed in D1 with distinct generation segments
24h parity-history check on prod after deploy, before flipping PRO_BILLING_AUTHORITY=edge

billing eventing improvement

affb86d

breardon2011 marked this pull request as ready for review June 5, 2026 03:58

motatoes approved these changes Jun 5, 2026


breardon2011 merged commit 3ca4e6c into main Jun 6, 2026
3 checks passed

breardon2011 mentioned this pull request Jun 8, 2026

fix parity issues, memory, and 2.5pct bug #359

Merged

6 tasks

breardon2011 merged 1 commit into main from billing-tick-adjustment
