serial migrations, in flight guard, prevent zombie ticks #361
Merged
motatoes
approved these changes
Jun 9, 2026
Fix billing drift and target OOM from failed live migrations
What we noticed
During the most recent CP deploy, several live migrations failed mid-flight under parallel load. The failures cascaded into three billing/state bugs visible as drift in `usage_parity`:

1. `sandbox_scale_events` rows left open after a migration aborted post-QMP: the usage ticker kept billing sandboxes that were no longer running anywhere.
2. Unreaped (zombie) QEMU processes still registered as alive in `vmAlive`, producing phantom ticks at the configured memory size for sandboxes that were effectively dead.
3. `PrepareMigrationIncoming` pre-allocates the full destination VM memory before any transfer starts, so N parallel receivers stacked N×mem_max against the worker cgroup. The resulting OOM crash on the target then produced the "client connection closing" errors on the source that triggered the (1) and (2) cleanup paths above.
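To make the N×mem_max stacking concrete, here is a back-of-the-envelope sketch. The function name and all numbers are hypothetical and not taken from the incident; it only assumes that each receiver reserves its full `mem_max` up front, as described above.

```go
package main

import "fmt"

// receiversThatFit returns how many concurrent incoming-migration receivers a
// worker can hold when each one reserves memMax bytes before any transfer.
// cgroupLimit and alreadyUsed are hypothetical inputs, in bytes.
func receiversThatFit(cgroupLimit, alreadyUsed, memMax uint64) uint64 {
	if memMax == 0 || alreadyUsed >= cgroupLimit {
		return 0
	}
	return (cgroupLimit - alreadyUsed) / memMax
}

func main() {
	const gib = uint64(1) << 30
	// Illustrative values only: 64 GiB cgroup, 40 GiB in use, 8 GiB per VM.
	limit, used, memMax := 64*gib, 40*gib, 8*gib
	fmt.Println(receiversThatFit(limit, used, memMax))
}
```

With these made-up values three receivers fit, and a fourth parallel pre-allocation would push the worker past its cgroup limit.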
Fixes
Per-target migration serialization (`internal/controlplane/scaler.go`)
- `findMigrationTarget` hard-rejects targets with `state.InFlight > 0`. One in-flight migration per target, ever. Multiple targets can still receive in parallel: the constraint is per-target, not global.
- `waitForMigrationTarget` polls (5s, then 15s after 1min) so a batch that briefly outpaces target availability waits for a slot instead of erroring out.

Failure-path cleanup (`scaler.go`, `internal/db/store.go`)
- `abortIncomingOnTarget` calls `DestroySandbox` on target gRPC after a failed migration so the orphan QEMU receiver releases its pre-allocation. Without this, a failed migration left a held memory reservation on the target until process restart.
- `FailMigrationPostQMP` + new `MarkOrphanedOnWorker`/`MarkOrphanedSandboxes` are now tx-wrapped and call a shared `closeOpenScaleEventsForSandboxes` helper, so any terminal status write also closes open scale-event rows. This is the source of the parity drift: terminal writes outside the tx-wrapped status update path were silently leaking open events.
- `replaceOneStale` no longer terminates a source worker whose drain didn't actually clear it: `countSandboxesOnWorker` → fall back to `hibernateAllOnWorker` → re-check before terminate.
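For reviewers who want the shape of the per-target serialization rule without opening `scaler.go`, here is a minimal sketch. It is not the code in this PR: `targetState`, `FreeMem`, `pickTarget`, and `pollInterval` are hypothetical names, and only the `InFlight > 0` rejection and the 5s/15s polling schedule come from the description above.

```go
package main

import (
	"fmt"
	"time"
)

// targetState is a hypothetical per-worker record; only InFlight is named in
// the PR description.
type targetState struct {
	ID       string
	InFlight int    // migrations currently being received by this worker
	FreeMem  uint64 // hypothetical capacity field, in bytes
}

// pickTarget returns the first target that has no in-flight migration and
// enough free memory. It reports false when every target is busy or too small.
func pickTarget(targets []targetState, need uint64) (string, bool) {
	for _, t := range targets {
		if t.InFlight > 0 {
			continue // hard reject: at most one in-flight migration per target
		}
		if t.FreeMem >= need {
			return t.ID, true
		}
	}
	return "", false
}

// pollInterval mirrors the described schedule: poll every 5s during the first
// minute of waiting, then every 15s.
func pollInterval(waited time.Duration) time.Duration {
	if waited < time.Minute {
		return 5 * time.Second
	}
	return 15 * time.Second
}

func main() {
	targets := []targetState{
		{ID: "w1", InFlight: 1, FreeMem: 64 << 30}, // busy, skipped despite capacity
		{ID: "w2", InFlight: 0, FreeMem: 16 << 30},
	}
	id, ok := pickTarget(targets, 8<<30)
	fmt.Println(id, ok)
	fmt.Println(pollInterval(30*time.Second), pollInterval(2*time.Minute))
}
```

A caller would loop on `pickTarget`, sleeping for `pollInterval(waited)` between attempts, which is how a batch waits for a slot instead of erroring out.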
Zombie process detection (`internal/qemu/ghost_reaper.go`)
- `vmAlive` now reads `/proc/<pid>/stat` and returns false for `Z` (zombie) and `X` (dying) states. Previously a kill -9'd QEMU whose parent hadn't reaped it would register as alive and the ticker would keep billing.
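A rough sketch of the state check described above, for illustration only. The function names (`procState`, `aliveFromStat`, `vmAliveSketch`) are hypothetical and this is not the code in `ghost_reaper.go`. One detail worth noting when parsing `/proc/<pid>/stat`: the command name is wrapped in parentheses and can itself contain spaces or parentheses, so the sketch looks for the state field after the last `)`.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// procState extracts the single-character state field from the contents of
// /proc/<pid>/stat. The state is the first field after the last ')'.
func procState(stat string) (byte, bool) {
	i := strings.LastIndexByte(stat, ')')
	if i < 0 {
		return 0, false
	}
	fields := strings.Fields(stat[i+1:])
	if len(fields) == 0 || len(fields[0]) != 1 {
		return 0, false
	}
	return fields[0][0], true
}

// aliveFromStat reports false for zombie (Z) and dead (X) processes, and for
// stat contents that cannot be parsed.
func aliveFromStat(stat string) bool {
	st, ok := procState(stat)
	if !ok {
		return false
	}
	return st != 'Z' && st != 'X'
}

// vmAliveSketch reads the stat file for pid. A read error (no such process,
// or no /proc on this platform) is treated as not alive.
func vmAliveSketch(pid int) bool {
	data, err := os.ReadFile(fmt.Sprintf("/proc/%d/stat", pid))
	if err != nil {
		return false
	}
	return aliveFromStat(string(data))
}

func main() {
	fmt.Println(aliveFromStat("4242 (qemu-system-x86) S 1 4242 4242 0 -1"))
	fmt.Println(aliveFromStat("4242 (qemu-system-x86) Z 1 4242 4242 0 -1"))
	fmt.Println(vmAliveSketch(os.Getpid()))
}
```

Keeping the parsing in a pure function (`aliveFromStat`) makes the zombie case easy to unit-test without spawning and killing a real process.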
Drain timeout (`scaler.go`)
- `drainTimeout`: 45min → 6h. With per-target serialization a fully-loaded worker draining to one healthy target needs the headroom, plus retries and hibernate fallback.
- `hibernateAllOnWorker`: 2min total → 30min (with 2min budget per sandbox).

API refactor (`internal/api/sandbox.go`, `internal/api/router.go`, `cmd/server/main.go`)
- `migrateSandbox` HTTP handler had ~170 lines of duplicated migration logic that bypassed the scaler's in-flight counter: it would happily pile parallel migrations on one target. Refactored to 52 lines that delegate through a new `MigrationOrchestrator` interface, so the API and scaler paths share the same `LiveMigrateSandbox` code (and the same serialization).
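The delegation pattern from the API refactor can be sketched as follows. Only the names `MigrationOrchestrator` and `LiveMigrateSandbox` come from this PR; the method signature, the `sandbox_id` query parameter, the `errNoTarget` sentinel, and the status codes are assumptions made for the example.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// MigrationOrchestrator is named in the PR; this method set is an assumption.
type MigrationOrchestrator interface {
	LiveMigrateSandbox(ctx context.Context, sandboxID string) error
}

// errNoTarget is a hypothetical sentinel for "every target is busy".
var errNoTarget = errors.New("no migration target available")

// migrateHandler is thin HTTP glue: it validates input and delegates, so the
// in-flight accounting lives in exactly one place behind the interface.
func migrateHandler(o MigrationOrchestrator) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		id := r.URL.Query().Get("sandbox_id")
		if id == "" {
			http.Error(w, "missing sandbox_id", http.StatusBadRequest)
			return
		}
		switch err := o.LiveMigrateSandbox(r.Context(), id); {
		case err == nil:
			w.WriteHeader(http.StatusAccepted)
		case errors.Is(err, errNoTarget):
			http.Error(w, err.Error(), http.StatusServiceUnavailable)
		default:
			http.Error(w, err.Error(), http.StatusInternalServerError)
		}
	}
}

// fakeOrchestrator stands in for the scaler and records which IDs it was asked
// to migrate.
type fakeOrchestrator struct {
	err   error
	calls []string
}

func (f *fakeOrchestrator) LiveMigrateSandbox(_ context.Context, id string) error {
	f.calls = append(f.calls, id)
	return f.err
}

// statusFor runs one request through the handler and returns the status code.
func statusFor(o MigrationOrchestrator, url string) int {
	rec := httptest.NewRecorder()
	migrateHandler(o).ServeHTTP(rec, httptest.NewRequest(http.MethodPost, url, nil))
	return rec.Code
}

func main() {
	fmt.Println(statusFor(&fakeOrchestrator{}, "/migrate?sandbox_id=sb-1"))
	fmt.Println(statusFor(&fakeOrchestrator{err: errNoTarget}, "/migrate?sandbox_id=sb-1"))
}
```

Because the handler only sees the interface, it cannot bypass the scaler's serialization, and a fake like the one above is enough to test the HTTP layer in isolation.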
Validation
Reproduced the failure on dev with a controlled test:
/dev/urandom)
Before fix: target OOM-killed on the 3rd parallel `PrepareMigrationIncoming`; cascade of `connection refused`/`client connection closing` errors; orphaned events and zombies on source after the dust settled.
After fix:
Performance impact
Per-target serialization only slows down single-target drains. With N healthy targets, drains still parallelize N-way. Worst case (1 target, fully-loaded worker, ~120 GB): drain time grows from "30s but crashes" to "~2-3min and completes." 6h `drainTimeout` covers this with 4× headroom.