Fix/terminal status publishes stopped#363
Merged
Merged
Conversation
motatoes
approved these changes
Jun 10, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
The edge (D1
sandboxes_index+CreditAccountDO) bills a sandbox for as long as its D1 row saysrunning. The cell flips a sandbox to a terminal state in many places, but only some of thempublished the
stoppedlifecycle event that tells D1 to stop. The rest left D1 stuck atrunning→ the edge billed dead sandboxes forever.#361 fixed this for the migration / orphan-sweep / worker-prune paths. An audit of every terminal transition found 6 customer-facing paths still publishing nothing:
api/sandbox.go:654api/sandbox.go:2698proxy/sandbox_api_proxy.go:404proxy/controlplane_proxy.go:223proxy/controlplane_proxy.go:323controlplane/server.go:297The proxy ones are the systemic leak — they fire exactly when a worker disappears (the dead-worker case), mark the cell
stopped, and never tell D1. Seen in prod: orgs billed on the edge for dead/zombiesandboxes for days. Edge isn't authoritative yet, so no customer was charged — but parity was wrong, and a cutover would have made it real.
Fix — one structural chokepoint
Nearly all terminal transitions funnel through
Store.UpdateSandboxSessionStatus, which already closessandbox_scale_eventson terminal status. Added aTerminalHookthere: on any terminal transition(
stopped/error/failed) that affects a row, it fires — and the CP wires it to publish astoppedlifecycle event to the cell events stream → events-ingest → D1 → stop billing.The publish is now a property of the transition, not of each caller — structurally impossible for a call site to forget. Covers all 6 gaps and any future path.
Also fixed a latent sub-bug:
failedwas excluded from the scale-event close (onlystopped/error/hibernatedclosed), so failed-creates didn't even stop cell-side billing.failedis now terminaleverywhere. The terminal set is a pure predicate (
isTerminalSessionStatus) so the classification has DB-free coverage.What this does NOT double-publish
FailMigrationPostQMP,MarkOrphaned*,ReconcileWorkerSessions) bypassUpdateSandboxSessionStatus, so the hook doesn't fire for them — they keep their existing serial migrations, in flight guard, prevent zombie ticks #361caller-publish. No duplication.
sdb.LogEvent("stopped", …); the worker store is intentionally left without the hook.stopped(normal destroy) may now emit a duplicate — harmless;stoppedis idempotent on the edge.Cleanup
Removed the now-redundant explicit
publishStoppedLifecycleEventcall in reverse-reconcile (the hook covers it) plus the dead helper and its orphaned imports.Full audit (every status transition)
Terminal → publishes
stopped: the 6 paths above (now via hook) · create/fork-fail · image-builder cleanup · CP reverse-reconcile ·FailMigrationPostQMP·MarkOrphanedOnWorker·MarkOrphanedSandboxes·ReconcileWorkerSessionsstopped set · worker OnKill / hibernate-failed / DestroySandbox.Hibernated → publishes
hibernated: credit-halt (admin_org,enforcer) · reconcile hibernated set · worker hibernate paths.Non-terminal → publishes nothing (correct):
SetMigrating,CompleteMigration,FailMigration(reverts to running), wake paths.All terminal paths now publish; verified end-to-end.
Tests
terminal_status_test.go— DB-free unit test ofisTerminalSessionStatus. Runs in everygo test. Guards the classification (the easy regression: add a status, forget to mark it terminal).terminal_hook_pgfixture_test.go(pgfixture) — integration test of the firing: fires exactly onstopped/error/failed, never onrunning/hibernatedor a no-op transition, with the right org +reason, and is a safe no-op when unset.