Skip to content

Fix/terminal status publishes stopped#363

Merged
breardon2011 merged 2 commits into
mainfrom
fix/terminal-status-publishes-stopped
Jun 10, 2026
Merged

Fix/terminal status publishes stopped#363
breardon2011 merged 2 commits into
mainfrom
fix/terminal-status-publishes-stopped

Conversation

@breardon2011

Copy link
Copy Markdown
Contributor

Problem

The edge (D1 sandboxes_index + CreditAccount DO) bills a sandbox for as long as its D1 row says running. The cell flips a sandbox to a terminal state in many places, but only some of them
published the stopped lifecycle event that tells D1 to stop. The rest left D1 stuck at running → the edge billed dead sandboxes forever.

#361 fixed this for the migration / orphan-sweep / worker-prune paths. An audit of every terminal transition found 6 customer-facing paths still publishing nothing:

Path Trigger
api/sandbox.go:654 sandbox create failed
api/sandbox.go:2698 fork create failed
proxy/sandbox_api_proxy.go:404 worker gone, no hibernation
proxy/controlplane_proxy.go:223 worker gone
proxy/controlplane_proxy.go:323 not found on worker
controlplane/server.go:297 stop handler

The proxy ones are the systemic leak — they fire exactly when a worker disappears (the dead-worker case), mark the cell stopped, and never tell D1. Seen in prod: orgs billed on the edge for dead/zombie
sandboxes for days. Edge isn't authoritative yet, so no customer was charged — but parity was wrong, and a cutover would have made it real.

Fix — one structural chokepoint

Nearly all terminal transitions funnel through Store.UpdateSandboxSessionStatus, which already closes sandbox_scale_events on terminal status. Added a TerminalHook there: on any terminal transition
(stopped/error/failed) that affects a row, it fires — and the CP wires it to publish a stopped lifecycle event to the cell events stream → events-ingest → D1 → stop billing.

The publish is now a property of the transition, not of each caller — structurally impossible for a call site to forget. Covers all 6 gaps and any future path.

Also fixed a latent sub-bug: failed was excluded from the scale-event close (only stopped/error/hibernated closed), so failed-creates didn't even stop cell-side billing. failed is now terminal
everywhere. The terminal set is a pure predicate (isTerminalSessionStatus) so the classification has DB-free coverage.

What this does NOT double-publish

  • Raw-SQL terminal paths (FailMigrationPostQMP, MarkOrphaned*, ReconcileWorkerSessions) bypass UpdateSandboxSessionStatus, so the hook doesn't fire for them — they keep their existing serial migrations, in flight guard, prevent zombie ticks #361
    caller-publish. No duplication.
  • Worker-side terminal paths already pair the status write with sdb.LogEvent("stopped", …); the worker store is intentionally left without the hook.
  • Paths where the worker also emits stopped (normal destroy) may now emit a duplicate — harmless; stopped is idempotent on the edge.

Cleanup

Removed the now-redundant explicit publishStoppedLifecycleEvent call in reverse-reconcile (the hook covers it) plus the dead helper and its orphaned imports.

Full audit (every status transition)

Terminal → publishes stopped: the 6 paths above (now via hook) · create/fork-fail · image-builder cleanup · CP reverse-reconcile · FailMigrationPostQMP · MarkOrphanedOnWorker ·
MarkOrphanedSandboxes · ReconcileWorkerSessions stopped set · worker OnKill / hibernate-failed / DestroySandbox.
Hibernated → publishes hibernated: credit-halt (admin_org, enforcer) · reconcile hibernated set · worker hibernate paths.
Non-terminal → publishes nothing (correct): SetMigrating, CompleteMigration, FailMigration (reverts to running), wake paths.

All terminal paths now publish; verified end-to-end.

Tests

  • terminal_status_test.goDB-free unit test of isTerminalSessionStatus. Runs in every go test. Guards the classification (the easy regression: add a status, forget to mark it terminal).
  • terminal_hook_pgfixture_test.go (pgfixture) — integration test of the firing: fires exactly on stopped/error/failed, never on running/hibernated or a no-op transition, with the right org +
    reason, and is a safe no-op when unset.

@breardon2011 breardon2011 marked this pull request as ready for review June 10, 2026 19:32
@breardon2011 breardon2011 merged commit a6e60b1 into main Jun 10, 2026
3 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants