
serial migrations, in flight guard, prevent zombie ticks#361

Merged: breardon2011 merged 1 commit into main from error-state-cleanup on Jun 9, 2026

breardon2011 commented Jun 9, 2026

Fix billing drift and target OOM from failed live migrations

What we noticed

During the most recent CP deploy, several live migrations failed mid-flight under parallel load. The failures cascaded into three billing/state bugs visible as drift in
usage_parity:

  1. Source-side sandbox_scale_events rows left open after migration aborted post-QMP — the usage ticker kept billing sandboxes that were no longer running anywhere.
  2. Zombie QEMU processes counted as alive by vmAlive, producing phantom ticks at the configured memory size for sandboxes that were effectively dead.
  3. Target worker OOM-killed when 3+ migrations landed on the same target in parallel. PrepareMigrationIncoming pre-allocates the full destination VM memory before
    any transfer starts, so N parallel receivers stacked N×mem_max against the worker cgroup. The crash itself then produced the "client connection closing" errors on
    the source, which sent those migrations down the failure paths behind (1) and (2).

Fixes

Per-target migration serialization (internal/controlplane/scaler.go)

  • findMigrationTarget hard-rejects targets with state.InFlight > 0. One in-flight migration per target, ever. Multiple targets can still receive in parallel — the
    constraint is per-target, not global.
  • New waitForMigrationTarget polls for a free target (every 5s, backing off to every 15s after 1min of waiting), so a batch that briefly outpaces target availability waits for a slot instead of erroring out.

Failure-path cleanup (scaler.go, internal/db/store.go)

  • New abortIncomingOnTarget calls DestroySandbox on target gRPC after a failed migration so the orphan QEMU receiver releases its pre-allocation. Without this, a
    failed migration left a held memory reservation on the target until process restart.
  • FailMigrationPostQMP + new MarkOrphanedOnWorker / MarkOrphanedSandboxes are now tx-wrapped and call a shared closeOpenScaleEventsForSandboxes helper, so any
    terminal status write also closes open scale-event rows. This is the source of the parity drift — terminal writes outside the tx-wrapped status update path were silently
    leaking open events.
  • replaceOneStale no longer terminates a source worker whose drain didn't actually clear it. It now checks countSandboxesOnWorker, falls back to hibernateAllOnWorker if
    sandboxes remain, and re-checks before terminating.

Zombie process detection (internal/qemu/ghost_reaper.go)

  • vmAlive now reads /proc/<pid>/stat and returns false for processes in state Z (zombie) or X (dead). Previously, a QEMU process killed with kill -9 whose parent
    had not yet reaped it registered as alive, and the ticker kept billing.
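
A minimal sketch of this kind of check, assuming a Linux /proc filesystem. The parseProcState helper and the vmAlive signature shown here are illustrative, not copied from ghost_reaper.go. The state field follows the parenthesized command name, which can itself contain spaces or parentheses, so the parser splits at the last closing parenthesis.

```go
package main

import (
	"errors"
	"os"
	"strconv"
	"strings"
)

// parseProcState extracts the single-character state field from the
// contents of /proc/<pid>/stat ("pid (comm) state ppid ...").
func parseProcState(stat string) (byte, error) {
	i := strings.LastIndexByte(stat, ')')
	if i < 0 {
		return 0, errors.New("malformed stat line: no closing paren")
	}
	fields := strings.Fields(stat[i+1:])
	if len(fields) == 0 || len(fields[0]) != 1 {
		return 0, errors.New("malformed stat line: missing state")
	}
	return fields[0][0], nil
}

// vmAlive reports whether pid refers to a process that is still running.
// Zombie (Z) and dead (X) processes count as not alive.
func vmAlive(pid int) bool {
	data, err := os.ReadFile("/proc/" + strconv.Itoa(pid) + "/stat")
	if err != nil {
		return false // process is gone, or /proc is unavailable
	}
	state, err := parseProcState(string(data))
	if err != nil {
		return false
	}
	return state != 'Z' && state != 'X'
}
```

Treating an unreadable stat file as "not alive" is a choice made for this sketch; a caller that must distinguish "gone" from "cannot tell" would return the error instead.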

Drain timeout (scaler.go)

  • drainTimeout: 45min → 6h. With per-target serialization a fully-loaded worker draining to one healthy target needs the headroom, plus retries and hibernate fallback.
  • hibernateAllOnWorker: 2min total → 30min (with 2min budget per sandbox).

API refactor (internal/api/sandbox.go, internal/api/router.go, cmd/server/main.go)

  • The migrateSandbox HTTP handler had ~170 lines of duplicated migration logic that bypassed the scaler's in-flight counter — it would happily pile parallel migrations
    on one target. Refactored to 52 lines that delegate through a new MigrationOrchestrator interface, so the API and scaler paths share the same LiveMigrateSandbox
    code (and the same serialization).

Validation

Reproduced the failure on dev with a controlled test:

  • 2 workers, cross-binary + cross-golden (target's golden bumped by appending 500MB of /dev/urandom)
  • Source filled to ~90% with 9 sandboxes / 44 GB total (1×16GB + 2×8GB + 2×4GB + 4×1GB)
  • Heavy memory workload (~70% RSS each)
  • Parallel batch=3 to single target

Before fix: target OOM-killed on the 3rd parallel PrepareMigrationIncoming; cascade of connection refused / client connection closing errors; orphaned events
and zombies on source after the dust settled.

After fix:

| Run | Result | Notes |
|---|---|---|
| Serial (1 at a time) | 5/5 (100%) | HTTP 200 throughout, no zombies, no orphan events |
| Parallel batch=3, fresh target | 9/9 (100%) | No OOM, target stayed alive, in-flight serialization held |

Performance impact

Per-target serialization only slows down single-target drains. With N healthy targets, drains still parallelize N-way. Worst case (1 target, fully-loaded worker, ~120
GB): drain time grows from "30s but crashes" to "~2-3min and completes." 6h drainTimeout covers this with 4× headroom.

breardon2011 marked this pull request as ready for review June 9, 2026 02:03
breardon2011 merged commit 7954c07 into main Jun 9, 2026
3 checks passed