fix(runtime): preserve box record on init failure as Failed state#520
Merged
Conversation
When init-pipeline tasks fail (e.g. the 30s guest-connect timeout that hit CL84LvGx7RBE on dev.boxlite.ai), CleanupGuard::drop used to call remove_box() unconditionally -- orphaning the box's persistent disks on host while the DB row disappeared, leaving the user with an unrecoverable sandbox. Match the canonical pattern (Daytona ERROR, Kata startVM defer, containerd status.ExitCode, Docker SetError+CheckpointTo): preserve the record, transition to Failed with error_reason, let DESTROY_SANDBOX be the only deletion path. Bundled companion fixes for contributing bugs found in the same investigation: - install_zombie_reaper: daemon-wide SIGCHLD reaper so repeated shim failures don't accumulate <defunct> children (7+ observed in prod). - Go runner SIGTERM handler now calls boxliteClient.Shutdown() before apiServer.Stop(), so VMs get a graceful SIGTERM instead of being killed mid-write by the parent exit. TimeoutStopSec=60 on the systemd unit leaves headroom. - wait_for_guest_ready accepts an injectable Duration (production keeps the 30s constant) and the timeout error body now includes shim_alive, console_bytes, ready_socket_exists, likely_cause heuristic, and a console tail -- turning hours of forensics into a one-look diagnostic. Tests: - BoxStatus::Failed serde + transitions + can_remove/can_start matrix (state.rs, 9 tests). - mark_failed sets status/reason/pid and preserves health (state.rs). - CleanupGuard::drop persists Failed and keeps the row (init/types.rs). Reverting Drop to remove_box flips this test red. - install_zombie_reaper consumes an unwaited child within 8s (util/process.rs). Reverting the reaper flips this test red. - wait_for_guest_ready timeout branch returns the enriched error body via a 100ms test-side timeout (guest_connect.rs). CLAUDE.md gains the test-meaningfulness rule: assertions must be on data routed through production code, not on values the test body invented.
The daemon-wide SIGCHLD reaper from the previous commit needs more design work before it's safe to land. Concerns surfaced after merging: - Global side effect: waitpid(-1, WNOHANG) races with any code in the same process that owns a Child handle and expects to call .wait(). If the reaper consumes the child first, the owner gets ECHILD and loses the exit code. ProcessMonitor::try_wait already returns ProcessExit::Unknown for this case, but other callers of std::process::Child::wait() across the daemon do not. - 5s sleep cycle is coarse -- long enough that the test had to wait up to 8s for verification, slowing CI. - The reaper thread is install-once and runs for the lifetime of the process; there is no shutdown path or per-test isolation. The CleanupGuard preservation fix (the root cause of the CL84LvGx7RBE incident) is independent of the reaper and stays. Investigation tracked in a follow-up issue.
5 tasks
E0004 in CI on \`sdks/node/src/info.rs:72\` (and the same shape in sdks/python/src/info.rs and sdks/c/src/info.rs): the three SDK match arms on \`BoxStatus\` weren't updated when \`Failed\` was added to the enum in the previous commit, so the SDK lib targets failed to compile. Add \`Failed => "failed"\` to each, matching the canonical string from \`BoxStatus::as_str()\` so REST/CLI/SDK consumers see one consistent name.
G4614
pushed a commit
to G4614/boxlite
that referenced
this pull request
May 28, 2026
…ives Pins the property whose absence got PR boxlite-ai#520's global waitpid(-1) reaper reverted (Issue boxlite-ai#523, criterion #2): a child the reaper never registered must be left for its owner to wait(). Two-side verified — injecting a global waitpid(-1) into the sweep makes the owner's wait() return ECHILD ("No child processes") and the test fail; the scoped sweep preserves exit code 42. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
G4614
pushed a commit
to G4614/boxlite
that referenced
this pull request
Jun 1, 2026
Adds `ShimReaper`, a per-runtime worker that periodically calls `waitpid(pid, WNOHANG)` on a small registered PID set. Replaces the reverted daemon-wide reaper attempt from PR boxlite-ai#520 with a scoped design that matches Issue boxlite-ai#523's acceptance criteria. Problem ======= `boxlite-shim` children that get abandoned mid-init (panic past `CleanupGuard`, retry loop spawning a fresh shim, reattach paths where the Rust-side `Child` was never wait()'d) accumulate as `<defunct>` PIDs under the daemon. The `CL84LvGx7RBE` incident on dev showed 7+ zombies piled up over time, and the contributing cause sat on top of the CleanupGuard fix (which itself is independent and stays). PR boxlite-ai#520 originally landed a `waitpid(-1, WNOHANG)` reaper to drain these. That call races every other `Child::wait()` site in the workspace: if the reaper wins, the owner's `wait()` returns ECHILD and the exit code is lost. The reaper was reverted on the same PR pending the scoped design this commit implements. Design ====== `src/boxlite/src/util/reaper.rs` introduces: - `ShimReaper` — owns a `std::thread` worker that scans an `Arc<Mutex<HashSet<u32>>>` registry every 250 ms and reaps any registered PID whose `waitpid` reports exit or ECHILD. Std thread (not tokio task) so it works regardless of whether `RuntimeImpl::new` was called inside a tokio runtime context. - `ReaperHandle` — RAII registration handle. Drop removes the PID from the registry; a panic in any caller path that owns one can't leak. - `shutdown()` — Condvar-driven, completes in well under one tick so `RuntimeImpl::shutdown` (sync and async paths both call it) doesn't block the runtime exit on a stuck reaper. Why scoped, not daemon-wide --------------------------- Per Issue boxlite-ai#523, an unscoped reaper races every `Child::wait()` call site. This reaper only touches PIDs that were explicitly registered. Today's only registrar is `ShimHandler::from_spawned`. The `let _ = process.wait();` sites in `shim.rs:112` / `shim.rs:121` discard their results, so ECHILD-from-reaper-win is safe. Audit of every `Child::wait()` / `Child::try_wait()` call site in `src/boxlite/`: | Site | Process | Race-safe? | |-------------------------------------|------------|------------------| | shim.rs:102 (stop's try_wait loop) | shim | Yes — `Err(_)` arm falls through to kill+wait, both ECHILD-discarded | | shim.rs:112, 121 (`let _ = wait()`) | shim | Yes — result discarded | | watchdog.rs:256 | test only | N/A | | net/libslirp.rs:135 | gvproxy | Not registered, reaper doesn't touch | | box_impl.rs:1169 (ChildGuard) | test only | N/A | | ProcessMonitor::try_wait | any | Already maps ECHILD to `Unknown` | Why polling, not SIGCHLD ------------------------ SIGCHLD plumbing in async Rust (signal-hook / tokio::signal::unix) is process-global and adds a race surface against the runtime's other signal handlers. A 250 ms HashSet scan is cheaper than that complexity buys back. Worst-case zombie lifetime is 250 ms; tests verify drain in < 2 s by polling, never sleeping. Wiring ====== - `RuntimeImpl` gains `shim_reaper: Arc<ShimReaper>`, spawned in `RuntimeImpl::new`, shut down in both `shutdown()` (async path) and `shutdown_sync()` (atexit / Drop path). - `ShimController::new` takes `Arc<ShimReaper>`; threaded through from `vmm_spawn::spawn_vm` via `ctx.runtime.shim_reaper`. - `ShimHandler::from_spawned` registers the PID and stores the `ReaperHandle`. `ShimHandler::from_pid` (reattach mode) does NOT register — `waitpid` on a non-child returns ECHILD anyway; the reattached shim's eventual zombie belongs to init, not us. Tests ===== Acceptance criteria from Issue boxlite-ai#523, restated and pinned: - [x] Scopes to known shim PIDs, never calls `waitpid(-1, ...)` — see `probe_pid` in reaper.rs. - [x] Audit of `Child::wait()` call sites — recorded in the table above. - [x] Shutdown path tied to `RuntimeImpl::shutdown` — both async and sync paths call `shim_reaper.shutdown()`. Drop on the reaper is idempotent. - [x] Verification test runs in < 2 s (poll-based) — three unit tests in `reaper.rs::tests` and two integration tests in `src/boxlite/tests/zombie_reaper.rs::n_init_failures_leave_zero_zombies`, `registry_accepts_re_registration_after_drop`. Wallclock for each is sub-second on the dev host. - [x] Integration test demonstrates "N init failures leave 0 zombies" — `n_init_failures_leave_zero_zombies` spawns N=8 (matching the `CL84LvGx7RBE` count), forgets each Child to mirror the abandoned- mid-init path, and asserts both registry drain and `/proc/<pid>/status` absence-or-not-Z. Test counts: - Unit (reaper.rs): 3 tests, all pass in ~0.3 s total - Integration (tests/zombie_reaper.rs): 2 tests, all pass in ~0.5 s total - Regression: 754/754 boxlite lib tests pass with --features rest (excluding the pre-existing `ws_watchdog` flake unrelated to this change); 118/118 boxlite-cli unit tests pass. Out of scope ============ - Reattach orphans: a reattached `ShimHandler` doesn't register with the reaper. If the original spawning process is gone, init will reap the eventual zombie. Tracking via reconciler (separate epic) covers the DB/runtime drift case. - PR boxlite-ai#573 (orphan shim port retention, Issue boxlite-ai#565): independent fix for a different leak mechanism (`is_same_process` recovery misidentification). Both are needed for the full dind test reliability story; landing one doesn't subsume the other. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
G4614
pushed a commit
to G4614/boxlite
that referenced
this pull request
Jun 1, 2026
…ives Pins the property whose absence got PR boxlite-ai#520's global waitpid(-1) reaper reverted (Issue boxlite-ai#523, criterion #2): a child the reaper never registered must be left for its owner to wait(). Two-side verified — injecting a global waitpid(-1) into the sweep makes the owner's wait() return ECHILD ("No child processes") and the test fail; the scoped sweep preserves exit code 42. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
G4614
pushed a commit
to G4614/boxlite
that referenced
this pull request
Jun 2, 2026
Adds `ShimReaper`, a per-runtime worker that periodically calls `waitpid(pid, WNOHANG)` on a small registered PID set. Replaces the reverted daemon-wide reaper attempt from PR boxlite-ai#520 with a scoped design that matches Issue boxlite-ai#523's acceptance criteria. Problem ======= `boxlite-shim` children that get abandoned mid-init (panic past `CleanupGuard`, retry loop spawning a fresh shim, reattach paths where the Rust-side `Child` was never wait()'d) accumulate as `<defunct>` PIDs under the daemon. The `CL84LvGx7RBE` incident on dev showed 7+ zombies piled up over time, and the contributing cause sat on top of the CleanupGuard fix (which itself is independent and stays). PR boxlite-ai#520 originally landed a `waitpid(-1, WNOHANG)` reaper to drain these. That call races every other `Child::wait()` site in the workspace: if the reaper wins, the owner's `wait()` returns ECHILD and the exit code is lost. The reaper was reverted on the same PR pending the scoped design this commit implements. Design ====== `src/boxlite/src/util/reaper.rs` introduces: - `ShimReaper` — owns a `std::thread` worker that scans an `Arc<Mutex<HashSet<u32>>>` registry every 250 ms and reaps any registered PID whose `waitpid` reports exit or ECHILD. Std thread (not tokio task) so it works regardless of whether `RuntimeImpl::new` was called inside a tokio runtime context. - `ReaperHandle` — RAII registration handle. Drop removes the PID from the registry; a panic in any caller path that owns one can't leak. - `shutdown()` — Condvar-driven, completes in well under one tick so `RuntimeImpl::shutdown` (sync and async paths both call it) doesn't block the runtime exit on a stuck reaper. Why scoped, not daemon-wide --------------------------- Per Issue boxlite-ai#523, an unscoped reaper races every `Child::wait()` call site. This reaper only touches PIDs that were explicitly registered. Today's only registrar is `ShimHandler::from_spawned`. The `let _ = process.wait();` sites in `shim.rs:112` / `shim.rs:121` discard their results, so ECHILD-from-reaper-win is safe. Audit of every `Child::wait()` / `Child::try_wait()` call site in `src/boxlite/`: | Site | Process | Race-safe? | |-------------------------------------|------------|------------------| | shim.rs:102 (stop's try_wait loop) | shim | Yes — `Err(_)` arm falls through to kill+wait, both ECHILD-discarded | | shim.rs:112, 121 (`let _ = wait()`) | shim | Yes — result discarded | | watchdog.rs:256 | test only | N/A | | net/libslirp.rs:135 | gvproxy | Not registered, reaper doesn't touch | | box_impl.rs:1169 (ChildGuard) | test only | N/A | | ProcessMonitor::try_wait | any | Already maps ECHILD to `Unknown` | Why polling, not SIGCHLD ------------------------ SIGCHLD plumbing in async Rust (signal-hook / tokio::signal::unix) is process-global and adds a race surface against the runtime's other signal handlers. A 250 ms HashSet scan is cheaper than that complexity buys back. Worst-case zombie lifetime is 250 ms; tests verify drain in < 2 s by polling, never sleeping. Wiring ====== - `RuntimeImpl` gains `shim_reaper: Arc<ShimReaper>`, spawned in `RuntimeImpl::new`, shut down in both `shutdown()` (async path) and `shutdown_sync()` (atexit / Drop path). - `ShimController::new` takes `Arc<ShimReaper>`; threaded through from `vmm_spawn::spawn_vm` via `ctx.runtime.shim_reaper`. - `ShimHandler::from_spawned` registers the PID and stores the `ReaperHandle`. `ShimHandler::from_pid` (reattach mode) does NOT register — `waitpid` on a non-child returns ECHILD anyway; the reattached shim's eventual zombie belongs to init, not us. Tests ===== Acceptance criteria from Issue boxlite-ai#523, restated and pinned: - [x] Scopes to known shim PIDs, never calls `waitpid(-1, ...)` — see `probe_pid` in reaper.rs. - [x] Audit of `Child::wait()` call sites — recorded in the table above. - [x] Shutdown path tied to `RuntimeImpl::shutdown` — both async and sync paths call `shim_reaper.shutdown()`. Drop on the reaper is idempotent. - [x] Verification test runs in < 2 s (poll-based) — three unit tests in `reaper.rs::tests` and two integration tests in `src/boxlite/tests/zombie_reaper.rs::n_init_failures_leave_zero_zombies`, `registry_accepts_re_registration_after_drop`. Wallclock for each is sub-second on the dev host. - [x] Integration test demonstrates "N init failures leave 0 zombies" — `n_init_failures_leave_zero_zombies` spawns N=8 (matching the `CL84LvGx7RBE` count), forgets each Child to mirror the abandoned- mid-init path, and asserts both registry drain and `/proc/<pid>/status` absence-or-not-Z. Test counts: - Unit (reaper.rs): 3 tests, all pass in ~0.3 s total - Integration (tests/zombie_reaper.rs): 2 tests, all pass in ~0.5 s total - Regression: 754/754 boxlite lib tests pass with --features rest (excluding the pre-existing `ws_watchdog` flake unrelated to this change); 118/118 boxlite-cli unit tests pass. Out of scope ============ - Reattach orphans: a reattached `ShimHandler` doesn't register with the reaper. If the original spawning process is gone, init will reap the eventual zombie. Tracking via reconciler (separate epic) covers the DB/runtime drift case. - PR boxlite-ai#573 (orphan shim port retention, Issue boxlite-ai#565): independent fix for a different leak mechanism (`is_same_process` recovery misidentification). Both are needed for the full dind test reliability story; landing one doesn't subsume the other. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
G4614
pushed a commit
to G4614/boxlite
that referenced
this pull request
Jun 2, 2026
…ives Pins the property whose absence got PR boxlite-ai#520's global waitpid(-1) reaper reverted (Issue boxlite-ai#523, criterion #2): a child the reaper never registered must be left for its owner to wait(). Two-side verified — injecting a global waitpid(-1) into the sweep makes the owner's wait() return ECHILD ("No child processes") and the test fail; the scoped sweep preserves exit code 42. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
G4614
added a commit
to G4614/boxlite
that referenced
this pull request
Jun 2, 2026
The reaper avoids accidentally consuming a kernel-recycled PID by relying
on one property: once `waitpid(pid, WNOHANG)` succeeds, `pid` is removed
from the registry **before the next sweep tick can fire `waitpid(pid)`
again**. If that eviction is atomic w.r.t. the sweep loop, no future PID
assignment can land back in the registry without an explicit `register()`
call.
Adds `reaped_pid_evicted_atomically_so_kernel_pid_reuse_is_safe`:
Part 1 — spawn + register + wait for `registered()` to evict the PID,
then sleep `3 × REAPER_TICK` and re-assert the registry is still empty
for that PID. Any sweep-vs-eviction race (e.g. retaining the sweep
snapshot across ticks, or re-adding from a stale Vec) would surface as
the PID re-appearing.
Part 2 — spawn an unregistered child with the distinctive exit code 73
and `child.wait()` it across many sweep ticks. The exit code must
survive. This is the production safety property: even on the off chance
the kernel recycles the just-reaped PID to this child, the reaper's
registry doesn't contain it, so the sweep skips it.
Two-side verified on this branch:
| step | code state | result |
|------|------------------------------------------------------|-------------------------------------------------------------------------------------------------|
| A | scoped `waitpid(pid, …)` | `ok` (1.87 s) |
| B | unconditional `while waitpid(-1, …) > 0 {}` at sweep entry — the PR boxlite-ai#520 reversal | **FAIL** at part 2: `owner of unregistered PID 676048 could not wait(): No child processes (os error 10)` — ECHILD, exit code 73 lost |
| C | restored | `ok` (1.87 s) |
Why this complements `unregistered_child_exit_code_is_not_stolen`:
- That test injects `waitpid(-1)` directly into `probe_pid` to prove the
reaper's per-PID `waitpid` is scoped. Sufficient for the "don't broaden
the waitpid call" angle.
- This test pins the *adjacent* invariant — that registry eviction is
atomic relative to subsequent sweep ticks — which is what makes
kernel PID reuse safe even if the `waitpid` call site itself stays
scoped. The two together cover both ways someone could re-introduce
the PR boxlite-ai#520 regression: broaden the call, or fail to evict in time.
clippy `-D warnings` clean, fmt:check clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
CleanupGuard::drop'sremove_box()with aFailed-state transition (preserves the DB row +error_reason). Matches the Daytona/Kata/containerd/Docker canonical pattern. Fixes the root cause of theCL84LvGx7RBEorphan ondev.boxlite.ai.install_zombie_reaper()so failed shim children don't accumulate as<defunct>PIDs.boxliteClient.Shutdown(25s)->apiServer.Stop()so running VMs get graceful SIGTERM; bump systemdTimeoutStopSec=60.wait_for_guest_readytimeout error (shim_alive, console_bytes, ready_socket_exists, likely_cause, console tail) + injectable timeout for testability.Test plan
cargo test -p boxlite --lib litebox::state::tests(43 pass; +9 new for the Failed matrix)cargo test -p boxlite --lib litebox::init::types::tests(4 pass; +1 new CleanupGuard drop preservation test)cargo test -p boxlite --lib util::process::tests(21 pass; +1 new zombie-reaper test)cargo test -p boxlite --lib litebox::init::tasks::guest_connect::tests(12 pass; includes the real enriched-timeout test from a prior turn)CleanupGuard::dropbody andinstall_zombie_reaperbody individually flips the new tests red; restoring brings them back to green.CL84LvGx7RBEdirectory on the dev runner (one-time operational step; runbook lives in the plan file).