Skip to content

fix(runtime): preserve box record on init failure as Failed state#520

Merged
DorianZheng merged 3 commits into
mainfrom
worktree-jolly-noodling-nygaard
May 14, 2026
Merged

fix(runtime): preserve box record on init failure as Failed state#520
DorianZheng merged 3 commits into
mainfrom
worktree-jolly-noodling-nygaard

Conversation

@DorianZheng

Copy link
Copy Markdown
Member

Summary

  • Replace CleanupGuard::drop's remove_box() with a Failed-state transition (preserves the DB row + error_reason). Matches the Daytona/Kata/containerd/Docker canonical pattern. Fixes the root cause of the CL84LvGx7RBE orphan on dev.boxlite.ai.
  • Add daemon-wide install_zombie_reaper() so failed shim children don't accumulate as <defunct> PIDs.
  • Wire Go runner SIGTERM -> boxliteClient.Shutdown(25s) -> apiServer.Stop() so running VMs get graceful SIGTERM; bump systemd TimeoutStopSec=60.
  • Enriched wait_for_guest_ready timeout error (shim_alive, console_bytes, ready_socket_exists, likely_cause, console tail) + injectable timeout for testability.
  • CLAUDE.md gains the test-meaningfulness rule.

Test plan

  • cargo test -p boxlite --lib litebox::state::tests (43 pass; +9 new for the Failed matrix)
  • cargo test -p boxlite --lib litebox::init::types::tests (4 pass; +1 new CleanupGuard drop preservation test)
  • cargo test -p boxlite --lib util::process::tests (21 pass; +1 new zombie-reaper test)
  • cargo test -p boxlite --lib litebox::init::tasks::guest_connect::tests (12 pass; includes the real enriched-timeout test from a prior turn)
  • Rip-out verification: reverting CleanupGuard::drop body and install_zombie_reaper body individually flips the new tests red; restoring brings them back to green.
  • Post-deploy: clean up the orphaned CL84LvGx7RBE directory on the dev runner (one-time operational step; runbook lives in the plan file).

When init-pipeline tasks fail (e.g. the 30s guest-connect timeout that
hit CL84LvGx7RBE on dev.boxlite.ai), CleanupGuard::drop used to call
remove_box() unconditionally -- orphaning the box's persistent disks on
host while the DB row disappeared, leaving the user with an
unrecoverable sandbox. Match the canonical pattern (Daytona ERROR,
Kata startVM defer, containerd status.ExitCode, Docker
SetError+CheckpointTo): preserve the record, transition to Failed with
error_reason, let DESTROY_SANDBOX be the only deletion path.

Bundled companion fixes for contributing bugs found in the same
investigation:

- install_zombie_reaper: daemon-wide SIGCHLD reaper so repeated shim
  failures don't accumulate <defunct> children (7+ observed in prod).
- Go runner SIGTERM handler now calls boxliteClient.Shutdown() before
  apiServer.Stop(), so VMs get a graceful SIGTERM instead of being
  killed mid-write by the parent exit. TimeoutStopSec=60 on the systemd
  unit leaves headroom.
- wait_for_guest_ready accepts an injectable Duration (production keeps
  the 30s constant) and the timeout error body now includes shim_alive,
  console_bytes, ready_socket_exists, likely_cause heuristic, and a
  console tail -- turning hours of forensics into a one-look diagnostic.

Tests:
- BoxStatus::Failed serde + transitions + can_remove/can_start matrix
  (state.rs, 9 tests).
- mark_failed sets status/reason/pid and preserves health (state.rs).
- CleanupGuard::drop persists Failed and keeps the row (init/types.rs).
  Reverting Drop to remove_box flips this test red.
- install_zombie_reaper consumes an unwaited child within 8s
  (util/process.rs). Reverting the reaper flips this test red.
- wait_for_guest_ready timeout branch returns the enriched error body
  via a 100ms test-side timeout (guest_connect.rs).

CLAUDE.md gains the test-meaningfulness rule: assertions must be on
data routed through production code, not on values the test body
invented.
The daemon-wide SIGCHLD reaper from the previous commit needs more
design work before it's safe to land. Concerns surfaced after merging:

- Global side effect: waitpid(-1, WNOHANG) races with any code in the
  same process that owns a Child handle and expects to call .wait().
  If the reaper consumes the child first, the owner gets ECHILD and
  loses the exit code. ProcessMonitor::try_wait already returns
  ProcessExit::Unknown for this case, but other callers of
  std::process::Child::wait() across the daemon do not.
- 5s sleep cycle is coarse -- long enough that the test had to wait
  up to 8s for verification, slowing CI.
- The reaper thread is install-once and runs for the lifetime of the
  process; there is no shutdown path or per-test isolation.

The CleanupGuard preservation fix (the root cause of the
CL84LvGx7RBE incident) is independent of the reaper and stays.
Investigation tracked in a follow-up issue.
E0004 in CI on \`sdks/node/src/info.rs:72\` (and the same shape in
sdks/python/src/info.rs and sdks/c/src/info.rs): the three SDK match
arms on \`BoxStatus\` weren't updated when \`Failed\` was added to the
enum in the previous commit, so the SDK lib targets failed to compile.

Add \`Failed => "failed"\` to each, matching the canonical string from
\`BoxStatus::as_str()\` so REST/CLI/SDK consumers see one consistent name.
@DorianZheng DorianZheng merged commit e4767e7 into main May 14, 2026
53 of 54 checks passed
@DorianZheng DorianZheng deleted the worktree-jolly-noodling-nygaard branch May 14, 2026 13:56
G4614 pushed a commit to G4614/boxlite that referenced this pull request May 28, 2026
…ives

Pins the property whose absence got PR boxlite-ai#520's global waitpid(-1) reaper
reverted (Issue boxlite-ai#523, criterion #2): a child the reaper never registered
must be left for its owner to wait(). Two-side verified — injecting a
global waitpid(-1) into the sweep makes the owner's wait() return ECHILD
("No child processes") and the test fail; the scoped sweep preserves
exit code 42.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
G4614 pushed a commit to G4614/boxlite that referenced this pull request Jun 1, 2026
Adds `ShimReaper`, a per-runtime worker that periodically calls
`waitpid(pid, WNOHANG)` on a small registered PID set. Replaces the
reverted daemon-wide reaper attempt from PR boxlite-ai#520 with a scoped design
that matches Issue boxlite-ai#523's acceptance criteria.

Problem
=======

`boxlite-shim` children that get abandoned mid-init (panic past
`CleanupGuard`, retry loop spawning a fresh shim, reattach paths where
the Rust-side `Child` was never wait()'d) accumulate as `<defunct>` PIDs
under the daemon. The `CL84LvGx7RBE` incident on dev showed 7+ zombies
piled up over time, and the contributing cause sat on top of the
CleanupGuard fix (which itself is independent and stays).

PR boxlite-ai#520 originally landed a `waitpid(-1, WNOHANG)` reaper to drain these.
That call races every other `Child::wait()` site in the workspace: if
the reaper wins, the owner's `wait()` returns ECHILD and the exit code
is lost. The reaper was reverted on the same PR pending the scoped
design this commit implements.

Design
======

`src/boxlite/src/util/reaper.rs` introduces:

- `ShimReaper` — owns a `std::thread` worker that scans an `Arc<Mutex<HashSet<u32>>>`
  registry every 250 ms and reaps any registered PID whose `waitpid`
  reports exit or ECHILD. Std thread (not tokio task) so it works
  regardless of whether `RuntimeImpl::new` was called inside a tokio
  runtime context.
- `ReaperHandle` — RAII registration handle. Drop removes the PID from
  the registry; a panic in any caller path that owns one can't leak.
- `shutdown()` — Condvar-driven, completes in well under one tick so
  `RuntimeImpl::shutdown` (sync and async paths both call it) doesn't
  block the runtime exit on a stuck reaper.

Why scoped, not daemon-wide
---------------------------

Per Issue boxlite-ai#523, an unscoped reaper races every `Child::wait()` call site.
This reaper only touches PIDs that were explicitly registered. Today's
only registrar is `ShimHandler::from_spawned`. The `let _ = process.wait();`
sites in `shim.rs:112` / `shim.rs:121` discard their results, so
ECHILD-from-reaper-win is safe. Audit of every `Child::wait()` /
`Child::try_wait()` call site in `src/boxlite/`:

| Site                                | Process    | Race-safe?       |
|-------------------------------------|------------|------------------|
| shim.rs:102 (stop's try_wait loop)  | shim       | Yes — `Err(_)` arm falls through to kill+wait, both ECHILD-discarded |
| shim.rs:112, 121 (`let _ = wait()`) | shim       | Yes — result discarded |
| watchdog.rs:256                     | test only  | N/A              |
| net/libslirp.rs:135                 | gvproxy    | Not registered, reaper doesn't touch |
| box_impl.rs:1169 (ChildGuard)       | test only  | N/A              |
| ProcessMonitor::try_wait            | any        | Already maps ECHILD to `Unknown` |

Why polling, not SIGCHLD
------------------------

SIGCHLD plumbing in async Rust (signal-hook / tokio::signal::unix) is
process-global and adds a race surface against the runtime's other
signal handlers. A 250 ms HashSet scan is cheaper than that complexity
buys back. Worst-case zombie lifetime is 250 ms; tests verify drain in
< 2 s by polling, never sleeping.

Wiring
======

- `RuntimeImpl` gains `shim_reaper: Arc<ShimReaper>`, spawned in
  `RuntimeImpl::new`, shut down in both `shutdown()` (async path) and
  `shutdown_sync()` (atexit / Drop path).
- `ShimController::new` takes `Arc<ShimReaper>`; threaded through from
  `vmm_spawn::spawn_vm` via `ctx.runtime.shim_reaper`.
- `ShimHandler::from_spawned` registers the PID and stores the
  `ReaperHandle`. `ShimHandler::from_pid` (reattach mode) does NOT
  register — `waitpid` on a non-child returns ECHILD anyway; the
  reattached shim's eventual zombie belongs to init, not us.

Tests
=====

Acceptance criteria from Issue boxlite-ai#523, restated and pinned:

- [x] Scopes to known shim PIDs, never calls `waitpid(-1, ...)` — see
      `probe_pid` in reaper.rs.
- [x] Audit of `Child::wait()` call sites — recorded in the table above.
- [x] Shutdown path tied to `RuntimeImpl::shutdown` — both async and
      sync paths call `shim_reaper.shutdown()`. Drop on the reaper is
      idempotent.
- [x] Verification test runs in < 2 s (poll-based) — three unit tests
      in `reaper.rs::tests` and two integration tests in
      `src/boxlite/tests/zombie_reaper.rs::n_init_failures_leave_zero_zombies`,
      `registry_accepts_re_registration_after_drop`. Wallclock for each
      is sub-second on the dev host.
- [x] Integration test demonstrates "N init failures leave 0 zombies" —
      `n_init_failures_leave_zero_zombies` spawns N=8 (matching the
      `CL84LvGx7RBE` count), forgets each Child to mirror the abandoned-
      mid-init path, and asserts both registry drain and `/proc/<pid>/status`
      absence-or-not-Z.

Test counts:
- Unit (reaper.rs): 3 tests, all pass in ~0.3 s total
- Integration (tests/zombie_reaper.rs): 2 tests, all pass in ~0.5 s total
- Regression: 754/754 boxlite lib tests pass with --features rest
  (excluding the pre-existing `ws_watchdog` flake unrelated to this
  change); 118/118 boxlite-cli unit tests pass.

Out of scope
============

- Reattach orphans: a reattached `ShimHandler` doesn't register with
  the reaper. If the original spawning process is gone, init will reap
  the eventual zombie. Tracking via reconciler (separate epic) covers
  the DB/runtime drift case.
- PR boxlite-ai#573 (orphan shim port retention, Issue boxlite-ai#565): independent fix
  for a different leak mechanism (`is_same_process` recovery
  misidentification). Both are needed for the full dind test reliability
  story; landing one doesn't subsume the other.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
G4614 pushed a commit to G4614/boxlite that referenced this pull request Jun 1, 2026
…ives

Pins the property whose absence got PR boxlite-ai#520's global waitpid(-1) reaper
reverted (Issue boxlite-ai#523, criterion #2): a child the reaper never registered
must be left for its owner to wait(). Two-side verified — injecting a
global waitpid(-1) into the sweep makes the owner's wait() return ECHILD
("No child processes") and the test fail; the scoped sweep preserves
exit code 42.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
G4614 pushed a commit to G4614/boxlite that referenced this pull request Jun 2, 2026
Adds `ShimReaper`, a per-runtime worker that periodically calls
`waitpid(pid, WNOHANG)` on a small registered PID set. Replaces the
reverted daemon-wide reaper attempt from PR boxlite-ai#520 with a scoped design
that matches Issue boxlite-ai#523's acceptance criteria.

Problem
=======

`boxlite-shim` children that get abandoned mid-init (panic past
`CleanupGuard`, retry loop spawning a fresh shim, reattach paths where
the Rust-side `Child` was never wait()'d) accumulate as `<defunct>` PIDs
under the daemon. The `CL84LvGx7RBE` incident on dev showed 7+ zombies
piled up over time, and the contributing cause sat on top of the
CleanupGuard fix (which itself is independent and stays).

PR boxlite-ai#520 originally landed a `waitpid(-1, WNOHANG)` reaper to drain these.
That call races every other `Child::wait()` site in the workspace: if
the reaper wins, the owner's `wait()` returns ECHILD and the exit code
is lost. The reaper was reverted on the same PR pending the scoped
design this commit implements.

Design
======

`src/boxlite/src/util/reaper.rs` introduces:

- `ShimReaper` — owns a `std::thread` worker that scans an `Arc<Mutex<HashSet<u32>>>`
  registry every 250 ms and reaps any registered PID whose `waitpid`
  reports exit or ECHILD. Std thread (not tokio task) so it works
  regardless of whether `RuntimeImpl::new` was called inside a tokio
  runtime context.
- `ReaperHandle` — RAII registration handle. Drop removes the PID from
  the registry; a panic in any caller path that owns one can't leak.
- `shutdown()` — Condvar-driven, completes in well under one tick so
  `RuntimeImpl::shutdown` (sync and async paths both call it) doesn't
  block the runtime exit on a stuck reaper.

Why scoped, not daemon-wide
---------------------------

Per Issue boxlite-ai#523, an unscoped reaper races every `Child::wait()` call site.
This reaper only touches PIDs that were explicitly registered. Today's
only registrar is `ShimHandler::from_spawned`. The `let _ = process.wait();`
sites in `shim.rs:112` / `shim.rs:121` discard their results, so
ECHILD-from-reaper-win is safe. Audit of every `Child::wait()` /
`Child::try_wait()` call site in `src/boxlite/`:

| Site                                | Process    | Race-safe?       |
|-------------------------------------|------------|------------------|
| shim.rs:102 (stop's try_wait loop)  | shim       | Yes — `Err(_)` arm falls through to kill+wait, both ECHILD-discarded |
| shim.rs:112, 121 (`let _ = wait()`) | shim       | Yes — result discarded |
| watchdog.rs:256                     | test only  | N/A              |
| net/libslirp.rs:135                 | gvproxy    | Not registered, reaper doesn't touch |
| box_impl.rs:1169 (ChildGuard)       | test only  | N/A              |
| ProcessMonitor::try_wait            | any        | Already maps ECHILD to `Unknown` |

Why polling, not SIGCHLD
------------------------

SIGCHLD plumbing in async Rust (signal-hook / tokio::signal::unix) is
process-global and adds a race surface against the runtime's other
signal handlers. A 250 ms HashSet scan is cheaper than that complexity
buys back. Worst-case zombie lifetime is 250 ms; tests verify drain in
< 2 s by polling, never sleeping.

Wiring
======

- `RuntimeImpl` gains `shim_reaper: Arc<ShimReaper>`, spawned in
  `RuntimeImpl::new`, shut down in both `shutdown()` (async path) and
  `shutdown_sync()` (atexit / Drop path).
- `ShimController::new` takes `Arc<ShimReaper>`; threaded through from
  `vmm_spawn::spawn_vm` via `ctx.runtime.shim_reaper`.
- `ShimHandler::from_spawned` registers the PID and stores the
  `ReaperHandle`. `ShimHandler::from_pid` (reattach mode) does NOT
  register — `waitpid` on a non-child returns ECHILD anyway; the
  reattached shim's eventual zombie belongs to init, not us.

Tests
=====

Acceptance criteria from Issue boxlite-ai#523, restated and pinned:

- [x] Scopes to known shim PIDs, never calls `waitpid(-1, ...)` — see
      `probe_pid` in reaper.rs.
- [x] Audit of `Child::wait()` call sites — recorded in the table above.
- [x] Shutdown path tied to `RuntimeImpl::shutdown` — both async and
      sync paths call `shim_reaper.shutdown()`. Drop on the reaper is
      idempotent.
- [x] Verification test runs in < 2 s (poll-based) — three unit tests
      in `reaper.rs::tests` and two integration tests in
      `src/boxlite/tests/zombie_reaper.rs::n_init_failures_leave_zero_zombies`,
      `registry_accepts_re_registration_after_drop`. Wallclock for each
      is sub-second on the dev host.
- [x] Integration test demonstrates "N init failures leave 0 zombies" —
      `n_init_failures_leave_zero_zombies` spawns N=8 (matching the
      `CL84LvGx7RBE` count), forgets each Child to mirror the abandoned-
      mid-init path, and asserts both registry drain and `/proc/<pid>/status`
      absence-or-not-Z.

Test counts:
- Unit (reaper.rs): 3 tests, all pass in ~0.3 s total
- Integration (tests/zombie_reaper.rs): 2 tests, all pass in ~0.5 s total
- Regression: 754/754 boxlite lib tests pass with --features rest
  (excluding the pre-existing `ws_watchdog` flake unrelated to this
  change); 118/118 boxlite-cli unit tests pass.

Out of scope
============

- Reattach orphans: a reattached `ShimHandler` doesn't register with
  the reaper. If the original spawning process is gone, init will reap
  the eventual zombie. Tracking via reconciler (separate epic) covers
  the DB/runtime drift case.
- PR boxlite-ai#573 (orphan shim port retention, Issue boxlite-ai#565): independent fix
  for a different leak mechanism (`is_same_process` recovery
  misidentification). Both are needed for the full dind test reliability
  story; landing one doesn't subsume the other.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
G4614 pushed a commit to G4614/boxlite that referenced this pull request Jun 2, 2026
…ives

Pins the property whose absence got PR boxlite-ai#520's global waitpid(-1) reaper
reverted (Issue boxlite-ai#523, criterion #2): a child the reaper never registered
must be left for its owner to wait(). Two-side verified — injecting a
global waitpid(-1) into the sweep makes the owner's wait() return ECHILD
("No child processes") and the test fail; the scoped sweep preserves
exit code 42.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
G4614 added a commit to G4614/boxlite that referenced this pull request Jun 2, 2026
The reaper avoids accidentally consuming a kernel-recycled PID by relying
on one property: once `waitpid(pid, WNOHANG)` succeeds, `pid` is removed
from the registry **before the next sweep tick can fire `waitpid(pid)`
again**. If that eviction is atomic w.r.t. the sweep loop, no future PID
assignment can land back in the registry without an explicit `register()`
call.

Adds `reaped_pid_evicted_atomically_so_kernel_pid_reuse_is_safe`:

Part 1 — spawn + register + wait for `registered()` to evict the PID,
then sleep `3 × REAPER_TICK` and re-assert the registry is still empty
for that PID. Any sweep-vs-eviction race (e.g. retaining the sweep
snapshot across ticks, or re-adding from a stale Vec) would surface as
the PID re-appearing.

Part 2 — spawn an unregistered child with the distinctive exit code 73
and `child.wait()` it across many sweep ticks. The exit code must
survive. This is the production safety property: even on the off chance
the kernel recycles the just-reaped PID to this child, the reaper's
registry doesn't contain it, so the sweep skips it.

Two-side verified on this branch:

| step | code state                                          | result                                                                                          |
|------|------------------------------------------------------|-------------------------------------------------------------------------------------------------|
| A    | scoped `waitpid(pid, …)`                            | `ok` (1.87 s)                                                                                   |
| B    | unconditional `while waitpid(-1, …) > 0 {}` at sweep entry — the PR boxlite-ai#520 reversal | **FAIL** at part 2: `owner of unregistered PID 676048 could not wait(): No child processes (os error 10)` — ECHILD, exit code 73 lost |
| C    | restored                                            | `ok` (1.87 s)                                                                                   |

Why this complements `unregistered_child_exit_code_is_not_stolen`:
- That test injects `waitpid(-1)` directly into `probe_pid` to prove the
  reaper's per-PID `waitpid` is scoped. Sufficient for the "don't broaden
  the waitpid call" angle.
- This test pins the *adjacent* invariant — that registry eviction is
  atomic relative to subsequent sweep ticks — which is what makes
  kernel PID reuse safe even if the `waitpid` call site itself stays
  scoped. The two together cover both ways someone could re-introduce
  the PR boxlite-ai#520 regression: broaden the call, or fail to evict in time.

clippy `-D warnings` clean, fmt:check clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant