serial migrations, in flight guard, prevent zombie ticks by breardon2011 · Pull Request #361 · diggerhq/opencomputer

breardon2011 · 2026-06-09T01:59:20Z

Fix billing drift and target OOM from failed live migrations

What we noticed

During the most recent CP deploy, several live migrations failed mid-flight under parallel load. The failures cascaded into three billing/state bugs visible as drift in usage_parity:

1. Source-side sandbox_scale_events rows were left open after a migration aborted post-QMP, so the usage ticker kept billing sandboxes that were no longer running anywhere.
2. Zombie QEMU processes were counted as alive by vmAlive, producing phantom ticks at the configured memory size for sandboxes that were effectively dead.
3. The target worker was OOM-killed when 3+ migrations landed on the same target in parallel. PrepareMigrationIncoming pre-allocates the full destination VM memory before any transfer starts, so N parallel receivers stacked N×mem_max against the worker cgroup. The crash itself then produced the "client connection closing" errors on the source that triggered the (1) and (2) cleanup paths above.

Fixes

Per-target migration serialization (`internal/controlplane/scaler.go`)

- findMigrationTarget hard-rejects targets with state.InFlight > 0. One in-flight migration per target, ever. Multiple targets can still receive in parallel; the constraint is per-target, not global.
- New waitForMigrationTarget polls (5s, then 15s after 1min) so a batch that briefly outpaces target availability waits for a slot instead of erroring out.

Failure-path cleanup (`scaler.go`, `internal/db/store.go`)

- New abortIncomingOnTarget calls DestroySandbox on target gRPC after a failed migration so the orphan QEMU receiver releases its pre-allocation. Without this, a failed migration left a held memory reservation on the target until process restart.
- FailMigrationPostQMP + new MarkOrphanedOnWorker / MarkOrphanedSandboxes are now tx-wrapped and call a shared closeOpenScaleEventsForSandboxes helper, so any terminal status write also closes open scale-event rows. This was the source of the parity drift: terminal writes outside the tx-wrapped status update path were silently leaking open events.
- replaceOneStale no longer terminates a source worker whose drain didn't actually clear it: countSandboxesOnWorker → fall back to hibernateAllOnWorker → re-check before terminate.

Zombie process detection (`internal/qemu/ghost_reaper.go`)

vmAlive now reads /proc/<pid>/stat and returns false for Z (zombie) and X (dead) states. Previously a kill -9'd QEMU whose parent hadn't reaped it would register as alive and the ticker would keep billing.
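
For reference, a minimal sketch of this kind of check, assuming a Linux /proc filesystem. `procState` and `isAlive` are hypothetical names, not the functions in ghost_reaper.go, and treating an unreadable stat file as "not alive" is an assumption of this sketch. The state is the third field of /proc/<pid>/stat, but the second field (comm) is parenthesised and may contain spaces or ')', so the sketch parses from the last ')'.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// procState extracts the single-character state from the contents of
// /proc/<pid>/stat. It takes the first field after the LAST ')' because
// the comm field may itself contain spaces and parentheses.
func procState(stat string) (byte, bool) {
	i := strings.LastIndexByte(stat, ')')
	if i < 0 {
		return 0, false
	}
	fields := strings.Fields(stat[i+1:])
	if len(fields) == 0 || len(fields[0]) != 1 {
		return 0, false
	}
	return fields[0][0], true
}

// isAlive reports whether pid exists and is neither a zombie (Z) nor dead (X).
func isAlive(pid int) bool {
	data, err := os.ReadFile(fmt.Sprintf("/proc/%d/stat", pid))
	if err != nil {
		return false // no such process, or /proc is unavailable (assumed dead)
	}
	st, ok := procState(string(data))
	return ok && st != 'Z' && st != 'X'
}

func main() {
	// An abbreviated, made-up stat line for a zombie QEMU process.
	st, _ := procState("4242 (qemu-system-x86) Z 1 4242 4242 0 -1 4227084 0 0")
	fmt.Printf("%c\n", st) // prints: Z

	fmt.Println(isAlive(os.Getpid()))
}
```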

Drain timeout (`scaler.go`)

- drainTimeout: 45min → 6h. With per-target serialization a fully-loaded worker draining to one healthy target needs the headroom, plus retries and hibernate fallback.
- hibernateAllOnWorker: 2min total → 30min (with 2min budget per sandbox).

API refactor (`internal/api/sandbox.go`, `internal/api/router.go`, `cmd/server/main.go`)

The migrateSandbox HTTP handler had ~170 lines of duplicated migration logic that bypassed the scaler's in-flight counter, so it would happily pile parallel migrations on one target. Refactored to 52 lines that delegate through a new MigrationOrchestrator interface, so the API and scaler paths share the same LiveMigrateSandbox code (and the same serialization).

Validation

Reproduced the failure on dev with a controlled test:

- 2 workers, cross-binary + cross-golden (target's golden bumped by appending 500MB of /dev/urandom)
- Source filled to ~90% with 9 sandboxes / 44 GB total (1×16GB + 2×8GB + 2×4GB + 4×1GB)
- Heavy memory workload (~70% RSS each)
- Parallel batch=3 to single target

Before fix: target OOM-killed on the 3rd parallel PrepareMigrationIncoming; cascade of connection refused / client connection closing errors; orphaned events and zombies on source after the dust settled.

After fix:

| Run | Result | Notes |
|---|---|---|
| Serial (1 at a time) | 5/5 (100%) | HTTP 200 throughout, no zombies, no orphan events |
| Parallel batch=3, fresh target | 9/9 (100%) | No OOM, target stayed alive, in-flight serialization held |

Performance impact

Per-target serialization only slows down single-target drains. With N healthy targets, drains still parallelize N-way. Worst case (1 target, fully-loaded worker, ~120 GB): drain time grows from "30s but crashes" to "~2-3min and completes." 6h drainTimeout covers this with 4× headroom.

serial migrations, in flight guard, prevent zombie ticks

213fc5f

breardon2011 marked this pull request as ready for review June 9, 2026 02:03

motatoes approved these changes Jun 9, 2026

breardon2011 merged commit 7954c07 into main Jun 9, 2026
3 checks passed

breardon2011 mentioned this pull request Jun 10, 2026

Fix/terminal status publishes stopped #363

Merged

breardon2011 merged 1 commit into main from error-state-cleanup
