[high] fix(vmm): reap shim after kill in rm-force + async-death paths (Issue #523)#613
[high] fix(vmm): reap shim after kill in rm-force + async-death paths (Issue #523)#613G4614 wants to merge 12 commits into
Conversation
…eak a running shim (#615) health_check_transitions_to_healthy_after_startup ends with the box still Running. RuntimeImpl::Drop's safety-net shutdown_sync only sends SIGTERM and returns without waiting, so the shim is still mid-shutdown when PerTestBoxHome::Drop (#604) runs its leak assertion — `1 live shim(s)`. Adding t.bx.stop() (which stops and waits) clears it. The reaper (#613) does not help here: it reaps dead shims, not a still-running one. Co-authored-by: Ubuntu <ubuntu@ip-172-31-20-149.ap-southeast-2.compute.internal> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
90ce4eb to
d7c509d
Compare
d7c509d to
560841c
Compare
`recovery_with_dead_process` sleeps 100 ms after SIGKILL — inside
`REAPER_TICK` (250 ms), so the killed shim may still be a zombie
when the new runtime's recovery scan reads `/proc/<pid>`. That test
proves recovery handles the zombie state, but says nothing about
what happens *after* the scoped `ShimReaper` consumes the zombie
and `/proc/<pid>` is fully gone.
Add `recovery_after_reaper_consumes_killed_shim` that pins the
post-reap state explicitly:
1. create box, capture shim PID, SIGKILL it
2. sleep 800 ms (>= 3 * REAPER_TICK) so the reaper definitely
sweeps and consumes the zombie; the first runtime is also
dropped at the end of this block, so init becomes the
belt-and-braces reap source
3. precondition assert: `/proc/<killed_pid>` must not exist
(otherwise the test isn't exercising the post-reap state)
4. spin up a fresh runtime; recovery scan must still transition
the box to `Stopped`, drop the stale PID file, and report
`info.pid == None`
This complements rather than replaces `recovery_with_dead_process`:
together they cover both sides of the zombie/vanished split that
boxlite-ai#613 introduces by accelerating reap from "indefinite zombie until
init" to "<= 250 ms via scoped reaper".
Two-side verified: replace the body of `PidFileReader::process_identity`
with `ProcessIdentity::Verified(record.pid)` (i.e. drop both the
`is_process_alive` and start-time-fingerprint checks). Test fails on
`assertion left == right failed: reaped-shim box must be marked
Stopped, got Running, left: Running, right: Stopped` — proving the
test catches a regression where recovery would silently believe a
reaped shim is still live. Restored, test passes again in 33.65 s.
The test passes both with and without the scoped `ShimReaper`
itself (init reaps the orphaned shim once the first runtime drops),
which is the property being asserted: boxlite-ai#613 is recovery-neutral.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ertion The `kill -9` and `kill -TERM` variants of `health_check_becomes_unhealthy_when_shim_killed` *simulate* the signal an OOM-killer sends. This test drives the actual `mm_oom_kill_process` codepath: create a cgroup v2 subtree with `memory.max = 20M`, move a `python3 bytearray(100 MB)` memory hog into it, let the kernel pick the victim. No userspace `kill` involved. Phase 1 — *kernel-level production fact*: the OOM-killed process IS a zombie. `std::mem::forget`s the `Child` so the Rust side never `wait()`s on it (the production analogue of boxlite's daemon not reaching a waitpid call site on the OOM path). Asserts `/proc/<pid>/status` contains `State:\tZ`. No reaper involvement — this is the kernel behavior that has to hold for PR boxlite-ai#613 to even be necessary. Phase 2 — *PR boxlite-ai#613 invariant*: spawn a real `ShimReaper`, register the PID, expect both `reaper.registered()` to evict it AND `/proc/<pid>/status` to disappear within 5 s (the PCB getting collected by the reaper's `waitpid(WNOHANG)` sweep). The second of those two conditions is what makes the test load-bearing under a `register()`-as-no-op regression: the registry stays trivially empty, but the zombie still lurks in `/proc`. Two-side verified on this branch: | step | code state | result | |------|----------------------------------------------|-------------------------------------------------------------------------------| | A | reaper enabled | `ok` (2.06 s) — phase 1 sees `State: Z`, phase 2 sees the reaper drain | | B | `register()` patched to early-return | **FAIL** at phase 2: `still_registered=false, proc_gone=false` after 5 s | | C | restored | `ok` (2.09 s) | Skips cleanly when cgroup v2 user delegation isn't available (CI sandbox, older systemd, etc.): `writable_cgroup_v2_root` walks up from `/proc/self/cgroup` until it finds an ancestor that lists `memory` in `cgroup.subtree_control` and accepts a `mkdir` probe, returning `None` otherwise. The bottom-most systemd `*.scope` typically drops the memory controller for *its* children, which is why the walk-up to `user@<uid>.service` is needed on the dev host. Closes the empirical gap the previous test plan called out where OOM killer / cgroup OOM were claimed by signal-mechanism analogy but not directly reproduced. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… zombie Closes the empirical gap the rest of PR boxlite-ai#613's test plan left open: every prior test exercised `BoxliteRuntime` from inside a long-lived test process (which mirrors daemon behavior but isn't literally `boxlite serve`). This test spawns the actual `boxlite serve` binary on an ephemeral port and drives it through `boxlite --url … rm --force` so the SIGKILL path runs inside the production daemon, exactly as it does in prod. Captures both the bug *and* the production-distinguishing detail: who is the zombie's `PPid`. The failure-path panic includes both `State:` and `PPid:` lines — the PPid check is what proves *control was not handed to init*. When the test fails, PPid equals the daemon's PID (not 1), so the kernel still considers the daemon the shim's parent and only the daemon can `waitpid()` it. Flow: 1. Spawn `boxlite serve --port <ephemeral>` and record its PID. 2. CLI in `--url` mode: `create` + `start` + `inspect --format '{{.State.Pid}}'` to grab the shim PID assigned by the daemon's spawn path. 3. `boxlite --url … rm --force <id>` — the daemon's REST handler runs `runtime.remove(box_id, true)` which calls `kill_process(shim_pid)` (SIGKILL) and never `waitpid()`s. 4. From outside the daemon, poll `/proc/<shim_pid>/status` for 5 s. The reaper-enabled path observes the entry disappear within ~500 ms (REAPER_TICK × 2). Two-side verified locally: | step | code state | result | |------|----------------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| | A | reaper enabled | `ok` (8.46 s, /proc disappeared within ~500 ms) | | B | `register()` patched to early-return | **FAIL** at 5 s timeout: `boxlite serve daemon (PID 666169) left zombie shim PID 666209 … reports "State: Z" "PPid: 666169" past 5 s` — PPid==daemon PID confirms no reparenting to init | | C | restored | `ok` (4.02 s) | `ServeGuard::Drop` sends SIGINT (not SIGTERM) so the daemon runs `runtime.shutdown(timeout)`. Without that the box's shim would leak through the daemon-kill path itself, double-counting against the PerTestBoxHome leak check. This is the most expensive of the zombie tests in PR boxlite-ai#613 (~10 s wall-clock) and the most expensive to rebuild (touches the CLI binary + boxlite-shim embedding), so the cheaper library-level tests still carry the bulk of the regression surface; this one closes the literal "is it actually broken via REST too" question that the others only answered by analogy. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…oxlite-ai#523) Two production code paths spawn shim subprocesses, send them SIGKILL, and never `waitpid` — so a dead shim sits in `/proc` as `State: Z` indefinitely, holding the PID slot and (in `boxlite serve` mode) blocking init from reparenting because the daemon is still its parent. **#1: async death (OOM / external SIGKILL/SIGTERM / internal panic).** The shim dies without anyone in the runtime asking it to. The only loop already polling shim liveness is health check (`box_impl.rs:: spawn_health_check`), but its `shim_died` branch did just one thing: mark `Stopped` / `Unhealthy` in the DB and break. No `waitpid`. The `is_process_alive` detector correctly treats `State: Z` as dead, so the branch *fires* on a zombie — it just doesn't reap. Fix: in the `shim_died` arm, `libc::waitpid(pid, _, WNOHANG)` in a short bounded retry loop. SIGKILL → zombie transition is not always synchronous with the wait queue: WNOHANG can briefly return 0 before the kernel posts SIGCHLD/wait status, even when `/proc` already shows `State: Z`. 500 ms ceiling with 10 ms backoff covers the race without ever stalling the health-check loop. The fix only takes effect if a health check task is *running*, so: - `AdvancedBoxOptions::default()` now spawns a health check with `HealthCheckOptions::default()` instead of `None`. Default interval 30 s, 60 s start period, 10 s ping timeout — the pre-existing defaults; no operator-visible cost beyond one ping per box per 30 s. - The decision is documented inline on the `health_check` field: health check is the *only* always-on shim-state watcher; explicit `health_check: None` opts out of both pings and zombie reaping (operator takes responsibility). **#2: `boxlite serve` + REST `rm --force`.** Daemon's `rt_impl:: remove_box(force=true)` calls `kill_process(pid)` (SIGKILL by PID, bypassing the canonical `stop()` path that *does* reap via `Child::wait()`). The handler returns immediately and the box is removed from the DB. The shim's `PPid` stays pointed at the daemon (still alive), so init can't help. Fix: after `kill_process`, run the same `WNOHANG`-with-deadline polling loop already used in `vmm/controller/shim.rs::ShimHandler:: stop`'s SIGTERM path. 2 s deadline matches the existing `GRACEFUL_SHUTDOWN_TIMEOUT_MS`; if SIGKILL doesn't yield a reapable state in 2 s, the kernel is wedged and a longer wait won't help. Three reproducer tests, each two-side verified (revert the relevant production change → test goes red with the exact zombie signal; restore → test goes green): - `boxlite::tests::health_check:: health_check_becomes_unhealthy_when_shim_killed` (existing test on main, now actually passes — the `PerTestBoxHome::drop` zombie-leak panic is the load-bearing failure; this PR adds a pin on `/proc/<pid>/status` not being `State: Z` so the regression signal is inline) - `boxlite::tests::health_check:: health_check_becomes_unhealthy_when_shim_sigtermed` (new — same shape, SIGTERM, covers k8s liveness initial signal / cgroup OOM initial / systemctl stop / plain kill) - `boxlite_cli::tests::serve_rm_force_zombie:: boxlite_serve_rm_force_active_box_no_zombie_left_in_proc` (new — spawns real `boxlite serve` subprocess, drives REST `DELETE /v1/boxes/<id>?force=true` via CLI `--url`, watches `/proc/<shim_pid>/status` from *outside* the daemon. The failing-path panic captures `PPid:` against the daemon's PID — the load-bearing diagnostic that proves init isn't going to clean this up because the daemon is still alive and still the parent.) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1d9da69 to
3fbb87b
Compare
The `shim_died` arm sits inside the health check `tokio::spawn` task, so `std::thread::sleep` between WNOHANG retries was blocking a tokio worker thread for the duration of the 10 ms tick. Single-box workloads never notice — the worst case is a 500 ms reap ceiling on one task. But during a host-wide shim death event (host reboot, batch SIGKILL, runtime tear-down), every box's `shim_died` arm fires at roughly the same time. With tokio's default worker count = host CPU count, even 8–16 concurrent `shim_died` arms can saturate the runtime workers and stall unrelated tasks (REST handlers, exec waits, snapshot work) for up to the full 500 ms reap budget. Swap to `tokio::time::sleep(...).await`. The retry now yields between WNOHANG probes, the worker thread is free to drive other tasks, and the 500 ms ceiling stays in place. No semantic change for the single-event case the 3 reproducer tests exercise — all three still green: - `health_check_becomes_unhealthy_when_shim_killed` (21.8 s) - `health_check_becomes_unhealthy_when_shim_sigtermed` (21.8 s) - `boxlite_serve_rm_force_active_box_no_zombie_left_in_proc` (12.3 s) Also: a small perf-measurement test alongside in `src/boxlite/tests/health_check_perf.rs`. Spawns one box with `health_check = None` and one with `health_check = Some(100ms tick)`, measures `/proc/self/stat` CPU delta of the boxlite test process over a 10 s window, and prints the per-tick cost so a reviewer can verify the default-on decision empirically. On this host: ~900 us per tick in the on-box minus the off-box baseline. Projected to the default 30 s interval: ~30 us/sec per box — 0.3 % of one core at 100 boxes, 3 % at 1000. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… hatch boxlite-ai#613 makes the per-box health check task default-on so the new zombie-shim reaper (Issue boxlite-ai#523) has a watcher to piggy-back on. The schema field `BoxOptions.advanced.health_check: Option<HealthCheckOptions>` already lets Rust library users opt out via `None`, but CLI users (`boxlite run` / `create`) and REST/SDK clients (Python, Node, Go all route through `POST /v1/boxes`) had no way to express that choice — exactly the gap boxlite-ai#597 hit for `cap_overrides`. Surface plumbed end-to-end: - `ManagementFlags { no_health_check: bool }` — `--no-health-check` on `boxlite run` / `create`. When set, `apply_to` clears `opts.advanced.health_check = None`. - REST client-side `CreateBoxRequest` gets `health_check_disabled: Option<bool>`; `from_options` flips it to `Some(true)` only when the operator already disabled health check in the input options, so the wire form stays byte-identical for every other call. - Server-side `CreateBoxRequest` (in `cli/src/commands/serve/types.rs`, `#[serde(deny_unknown_fields)]`) gains the same field. The `build_box_options` mapper builds a manual `AdvancedBoxOptions` instead of falling through `..Default::default()`, then forces `health_check = None` iff the wire says so. Three-state semantics for `health_check_disabled`: - `Some(true)` → explicit disable; box runs with no watcher, no zombie reaping (operator takes responsibility — documented inline). - `Some(false)` → "use the server default" (= currently `Some(...)`). Same effect as omitting the field, but documents the intent on the wire. - absent → server default. Pre-boxlite-ai#613 clients keep the legacy POST body shape and get the new default-on behaviour without code changes. Seven unit tests cover the surface symmetrically: - CLI: `management_flags_no_health_check_clears_advanced_field` / `management_flags_no_health_check_unset_keeps_default_on` — flag-on clears, flag-off preserves; pre-asserts the baseline so a regression that broke the default-on schema would also fail. - REST client: `test_create_box_request_carries_health_check_disabled_on_the_wire` / `test_create_box_request_omits_health_check_disabled_when_default` — `None` in BoxOptions → `Some(true)` on the wire; `Some(default)` → field absent (backward-compat with pre-boxlite-ai#613 servers). - REST server: `build_box_options_health_check_disabled_true_clears_health_check` / `build_box_options_no_health_check_field_keeps_default_on` / `build_box_options_health_check_disabled_false_keeps_default_on` — three-state JSON → BoxOptions mapping, including the explicit "Some(false) means default" sentinel. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…p_force`
`rt_impl::remove_box(force=true)` historically reached for `kill_process(pid)`
+ inline `libc::waitpid` to clean up the shim — a PID-only shortcut that
worked, but ran a different kill path from `stop` / non-force `rm` /
`restart`, all of which already went through `ShimHandler` and held the
`Child` handle. Two patterns in the codebase for the same conceptual
operation; this commit collapses them into one.
Wiring, top-down:
- `VmmHandler::stop_force` (trait): new sibling of `stop()`. Same
return type, same `&mut self`, same sync inside-of-async pattern
the existing `stop()` already uses.
- `ShimHandler::stop_force` (impl on the trait): SIGKILL via
`Child::kill()` followed by blocking `Child::wait()` when the
handler owns a `Child` (the common case for boxes spawned by this
runtime). Attached-mode fallback uses raw `libc::kill + waitpid`
with the same 2 s deadline shape as the inline block previously
held in `rt_impl::remove_box`.
- `BoxBackend::stop_force` (trait): async counterpart on the higher-
level box trait. Mirrors `stop()`.
- `BoxImpl::stop_force` (impl on the trait): shares the entire
teardown body with `BoxImpl::stop()` via a new `stop_impl(force:
bool)` helper — health-check cancel, shutdown-token cancel, PID-
file cleanup, state transition, DB sync, listener notification,
metrics counter increment all stay identical between the two
public methods. The only branches on `force`:
* Skip the 10 s `guest_session.shutdown()` await (operator asked
for `rm --force`, not a polite request).
* Pick `handler.stop_force()` over `handler.stop()`.
* Skip the trailing `auto_remove` self-call — the caller
(`LocalRuntime::remove(force=true)`) is the remover; firing
it again from inside `stop_force` would race.
- `RestBox::stop_force` (impl on the trait): forwards to the existing
graceful `stop` for now. The REST wire surface doesn't expose a
force-stop verb today; clients reach the same outcome through
`DELETE /v1/boxes/<id>?force=true`, which lands on
`RuntimeBackend::remove(force=true)` server-side and follows the
new 持-Child path there. Documented inline.
- `LiteBox::stop_force` (public, in `litebox/mod.rs`): proxy over
`box_backend.stop_force().await`. Public-facing API; what
operators / SDK users would call directly.
- `LocalRuntime::remove(id_or_name, force=true)` (the async backend
impl in `rt_impl.rs`): looks up the LiteBox by id_or_name, calls
`litebox.stop_force().await`, then delegates to the sync
`RuntimeInnerImpl::remove` for DB / disk cleanup. After
`stop_force()` returns, `mark_stop()` has already cleared
`state.pid` to `None` AND `state.status` to `Stopped`, so the
sync `remove_box`'s `if state.status.is_active() || state.pid
.is_some()` guard short-circuits and the inline-waitpid block is
not exercised on the async path. Production rm-force is now
100 % 持-Child.
The inline `kill_process + libc::waitpid` block in `rt_impl::remove_box`
stays in place as a fallback for direct sync callers (Drop paths,
unit tests that exercise `RuntimeImpl::remove_box` without the async
surface). It's no longer the production code path; documenting it
inline as such would obscure the test-runner-only nature, so left
unannotated for the follow-up that strips it (if/when the 5 sync
tests are migrated to drive the async API). Without that migration,
killing the fallback would break the test suite for cosmetic gain.
Verification:
- 3 reproducer tests (SIGKILL via health check + SIGTERM via health
check + `boxlite serve` REST `rm --force`): all green, no
regression on the zombie-reaping contract these tests pin.
- 7 sync unit tests that exercise `RuntimeImpl::remove_box(force=
true)` directly (`test_remove_box_*`, force_remove_*): all green
— they hit the sync fallback path with state.pid still set.
- 7 escape-hatch unit tests for `--no-health-check` / `health_check
_disabled` plumbing: unchanged, all green.
- `make fmt:check:rust` + workspace `cargo clippy` clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The prior boxlite-ai#613 commits left three near-identical `kill + WNOHANG + sleep` loops scattered across `box_impl.rs::spawn_health_check`, `rt_impl.rs::remove_box`, and `shim.rs::ShimHandler::stop_force`'s attached fallback. Each was hand-coded with a 10–50 ms sleep inside its retry, a custom deadline check, and a slightly different error- branch shape. Reviewers (rightly) called the duplication out — and a sleep-poll loop is the wrong primitive for "wake when this pid exits" anyway. This commit replaces all three with two helpers in `util::process`: - `reap_pid_blocking(pid, deadline_ms) -> ReapOutcome` - `reap_pid_async(pid, deadline_ms) -> ReapOutcome` Both open a pidfd (`pidfd_open` via `SYS_pidfd_open`, Linux 5.3+), wait for it to become readable — sync via `poll(2)`, async via `tokio::io::unix::AsyncFd` — then `waitpid(pid, _, 0)` (blocking but immediate, since the fd already signalled exit). The wait is *bounded* by the deadline (`poll`'s timeout / `tokio::time::timeout`) without a polling sleep loop, and *interruptible* in both contexts without a `spawn_blocking` thread leak on a wedged shim. Why pidfd vs blocking `waitpid(_, _, 0)` in `spawn_blocking`: the latter has no way to cancel — a wedged shim (D state) holds the blocking-pool slot until the daemon exits. With pidfd, drop of `AsyncFd` / `close(pidfd)` releases the kernel-side wait the moment the timeout fires. Fallback path: on `pidfd_open` failure (PID not our child, kernel < 5.3, etc.), the helpers fall back to a single `waitpid(WNOHANG)` attempt — never to a sleep loop, so the worst case is "reaper missed once." All three call sites contract to a one-liner around the helper. Net diff: -90 LOC duplicate retry loops, +110 LOC helpers with safety comments + ReapOutcome enum. Behaviour unchanged for the three reproducer tests (kernel-level zombie state is what they assert on, and pidfd reaches the same waitpid syscall): - `health_check_becomes_unhealthy_when_shim_killed` - `health_check_becomes_unhealthy_when_shim_sigtermed` - `boxlite_serve_rm_force_active_box_no_zombie_left_in_proc` 7 sync `RuntimeImpl::remove_box(force=true)` unit tests still green — they hit the sync fallback path, which now uses `reap_pid_blocking`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`AdvancedBoxOptions::default()` flipping `health_check` to `Some(...)` in boxlite-ai#613 surfaced two regressions in the spawn-health-check path: 1. `init_live_state` propagated `live_state.guest_session.guest().await?` as a hard error, so any recovered/attached box whose guest session can't be re-established (e.g. unit tests using a dummy `sleep` PID) failed `live_state()` and `stop_impl`'s kill branch was skipped — leaking the process. Switch the `?` to a match: log a warning and skip spawning the health task; box init still completes. 2. The health task captured `Arc<RuntimeImpl>`, pinning the runtime alive forever for detached boxes whose box-scoped `shutdown_token` is never cancelled (user drops `BoxliteRuntime` without an explicit stop). Recovery / detach integration tests then failed at the second runtime startup with "Another BoxliteRuntime is already using directory" because the first runtime's home lock was never released. Switch to `Weak<RuntimeImpl>`; the task upgrade()s once per tick and exits if the runtime is gone. Together restores 9 of the 10 previously-failing tests in test:changed: 2 unit + 4 recovery + 3 detach integration. Remaining 2 jailer + 6 CLI run failures are pre-existing on main: CLI is addressed by open PR boxlite-ai#622 (RAII cleanup before hard exit); jailer shim-SIGABRT verified independent by toggling default_health_check to None — still fails. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`libc::SYS_pidfd_open` only exists in the Linux `libc` bindings, so the macOS clippy job broke after the pidfd reap helpers landed. The cross-platform contract is already "try_pidfd_open returns -1 → caller falls back to reap_once (WNOHANG-only)", so non-Linux just needs a stub that returns -1 unconditionally. No behaviour change on Linux; macOS now compiles + uses the existing fallback. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
Important Review skippedDraft detected. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Plus Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
📝 WalkthroughWalkthroughAdds a force-stop lifecycle that SIGKILLs shims with bounded PID reaping, hardens the health-check task via Weak runtime handles and non-fatal guest-session fallback, exposes CLI/REST opt-outs for health checks, and expands tests (integration + perf) to validate zombie-free removal. ChangesForce-Stop Implementation with Bounded Process Reaping
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related issues
Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 7
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/boxlite/tests/health_check.rs (1)
122-126:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winAssert that the
killcommand actually succeeded.
output()only proves the helper process launched. Ifkillreturns non-zero, the test keeps running and later fails with a much less obvious assertion. Checkstatus.success()and include stderr in the failure message.Suggested fix
- Command::new("kill") + let out = Command::new("kill") .arg("-9") .arg(shim_pid.to_string()) .output() .expect("Failed to kill shim process"); + assert!( + out.status.success(), + "kill -9 failed for shim PID {shim_pid}: {}", + String::from_utf8_lossy(&out.stderr) + );- Command::new("kill") + let out = Command::new("kill") .arg("-TERM") .arg(shim_pid.to_string()) .output() .expect("Failed to SIGTERM shim process"); + assert!( + out.status.success(), + "kill -TERM failed for shim PID {shim_pid}: {}", + String::from_utf8_lossy(&out.stderr) + );Also applies to: 207-211
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/boxlite/tests/health_check.rs` around lines 122 - 126, The test currently spawns the external kill helper with Command::new("kill")...output() but doesn’t assert the command succeeded; change both occurrences (the blocks that construct Command::new("kill") and pass shim_pid) to capture the Output, check output.status.success(), and on failure fail the test with a message that includes stderr (use String::from_utf8_lossy on output.stderr) so the test reports the kill error immediately and clearly.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/boxlite/src/rest/litebox.rs`:
- Around line 208-216: stop_force currently calls self.stop() which silently
downgrades semantics; change stop_force() to immediately return an Unsupported
error instead of delegating to stop(). Update the async fn stop_force(&self) ->
BoxliteResult<()> implementation to return Err(BoxliteError::Unsupported("force
stop not supported over REST".into())) (or the project's equivalent
Unsupported/UnsupportedOperation error constructor) so callers receive a clear
Unsupported response; reference the stop_force() method and
BoxliteResult/BoxliteError types when making the change.
In `@src/boxlite/src/rest/types.rs`:
- Around line 137-144: The code currently maps options.advanced.health_check to
a boolean health_check_disabled which drops any non-none HealthCheckOptions and
causes defaults to be rebuilt on the server; instead send the full health-check
config on the wire: replace the health_check_disabled logic with a direct
mapping that sets the request's health_check field to
options.advanced.health_check.clone() (or Some(cloned_value) / None as
appropriate) so the actual HealthCheckOptions is transmitted; if you prefer the
alternative, add client-side validation in the REST client to reject non-default
options. Ensure you reference and update the use of health_check_disabled where
the request/AdvancedBoxOptions is built so the wire payload carries the
Option<HealthCheckOptions> rather than a boolean.
In `@src/boxlite/src/runtime/rt_impl.rs`:
- Around line 845-850: The current fallback uses crate::util::reap_pid_blocking
with FORCE_REAP_DEADLINE_MS and logs outcome but then unconditionally proceeds
to persist Stopped state and perform full box deletion; change this to only
continue destructive cleanup when the reap outcome is terminal (e.g., process
reaped / exit confirmed / no such pid). After calling reap_pid_blocking and
capturing outcome, gate the subsequent persistence-to-Stopped and the full
disk/home removal on outcome indicating a terminal state; if outcome is a
timeout or other non-terminal result, abort the removal flow (return an error or
retry/rollback) and log a clear warning so the shim is not deleted while still
running.
In `@src/boxlite/src/util/process.rs`:
- Around line 363-368: The current branch unconditionally returns
ReapOutcome::Reaped after calling unsafe { libc::waitpid(...) } even though
waitpid can fail with -1/ECHILD when the PID is not our child; update the logic
in the block around the waitpid call to check the return value and errno: if
waitpid returns -1 and errno == libc::ECHILD return ReapOutcome::NotOurChild, if
waitpid returns > 0 return ReapOutcome::Reaped, and handle other errors
appropriately (e.g., propagate or log). Apply the same change to the other reap
helper (the similar waitpid usage near the later block) so both functions use
the waitpid return/errno checks instead of assuming success.
In `@src/boxlite/src/vmm/controller/shim.rs`:
- Around line 189-194: The current shutdown block in shim.rs unconditionally
returns Ok(()) after calling libc::kill and
crate::util::reap_pid_blocking(self.pid, REAP_DEADLINE_MS); change this to check
the result of reap_pid_blocking and propagate an Err when the reap timed out or
indicates the child is still running so BoxImpl::stop_impl() does not clear the
PID or mark the shim Stopped. Specifically, keep the kill call and
REAP_DEADLINE_MS, capture the value returned by reap_pid_blocking(self.pid,
REAP_DEADLINE_MS), and if it signals timeout/still-running return a suitable
error (instead of Ok(())); only return Ok(()) when the reap indicates success.
In `@src/boxlite/tests/health_check_perf.rs`:
- Around line 107-119: The loop waiting for health uses healthy_deadline and
polls t_on.runtime.get_info but doesn’t assert that the box reached
HealthState::Healthy; add a post-loop check that re-fetches the box info (via
t_on.runtime.get_info(t_on.bx.id().as_str()).await.unwrap().unwrap()) and assert
that info.health_status.state == HealthState::Healthy (or panic with a clear
message) so the test fails fast if the box never became healthy before
measuring.
- Around line 44-50: The cpu_ms_from_ticks function incorrectly assumes a fixed
clock tick rate of 100 Hz by multiplying ticks by 10. Update this function to
dynamically retrieve the actual clock tick rate at runtime using
sysconf(_SC_CLK_TCK) and convert ticks to milliseconds based on this value.
Additionally, in the perf_health_check_default_on_vs_off_cpu_delta function, add
a check to assert or early-return if the health check does not reach
HealthState::Healthy before the 15-second polling deadline to avoid measuring
CPU usage prematurely.
---
Outside diff comments:
In `@src/boxlite/tests/health_check.rs`:
- Around line 122-126: The test currently spawns the external kill helper with
Command::new("kill")...output() but doesn’t assert the command succeeded; change
both occurrences (the blocks that construct Command::new("kill") and pass
shim_pid) to capture the Output, check output.status.success(), and on failure
fail the test with a message that includes stderr (use String::from_utf8_lossy
on output.stderr) so the test reports the kill error immediately and clearly.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro Plus
Run ID: 06d9196b-2bb0-45c7-a45c-13bcf976e894
📒 Files selected for processing (17)
src/boxlite/src/litebox/box_impl.rssrc/boxlite/src/litebox/mod.rssrc/boxlite/src/rest/litebox.rssrc/boxlite/src/rest/types.rssrc/boxlite/src/runtime/advanced_options.rssrc/boxlite/src/runtime/backend.rssrc/boxlite/src/runtime/rt_impl.rssrc/boxlite/src/util/mod.rssrc/boxlite/src/util/process.rssrc/boxlite/src/vmm/controller/handler.rssrc/boxlite/src/vmm/controller/shim.rssrc/boxlite/tests/health_check.rssrc/boxlite/tests/health_check_perf.rssrc/cli/src/cli.rssrc/cli/src/commands/serve/mod.rssrc/cli/src/commands/serve/types.rssrc/cli/tests/serve_rm_force_zombie.rs
Seven findings, six fixes (two minor + four major), all motivated by the same root issue: timed-out / non-child reap outcomes were being treated as success, so the runtime moved on to destructive cleanup while the shim could still be alive. (1) `util/process.rs` reap helpers — `reap_pid_blocking` and `reap_pid_async` returned `Reaped` unconditionally after `waitpid`, even though `pidfd_open` succeeds for non-child PIDs and `waitpid` then returns -1/ECHILD instead of reaping. Branch on the `waitpid` return value and surface `NotOurChild` when it's not > 0. (2) `vmm/controller/shim.rs:194` (`ShimHandler::stop_force` attached mode) — discarded the reap outcome and returned `Ok(())`. Now returns `Engine` error for `TimedOut` / `NotOurChild` so the caller bails before clearing the PID / persisting `Stopped`. (3) `runtime/rt_impl.rs:850` (force-remove sync fallback) — logged the reap outcome and proceeded with home dir / disk deletion. Now match-gated: only `Reaped` permits the destructive cleanup; non-terminal outcomes return an `Engine` error pointing the operator at the still-alive PID. (4) `rest/litebox.rs:216` (REST `stop_force`) — silently downgraded to graceful `stop()`. Returns `Unsupported` with a hint pointing at `runtime.remove(.., force=true)` until REST gains a real force-stop verb. (5) REST `CreateBoxRequest` / `serve::types::CreateBoxRequest` / `build_box_options` — previously only the disabled bit travelled on the wire, so any custom `HealthCheckOptions` on the caller side (non-default interval/timeout/retries/start_period) was silently rewritten to defaults server-side. Added an optional `health_check_settings: HealthCheckOptions` field; client emits it only when the caller's settings differ from `HealthCheckOptions::default()` (so the wire body is byte-identical for pre-boxlite-ai#613 clients). Server honours it under `AdvancedBoxOptions::default()` unless `health_check_disabled` is also set. (6) `tests/health_check_perf.rs:119` — wait-for-Healthy loop now asserts the box actually reached `Healthy`. A timed-out wait used to record CPU against startup / unhealthy behaviour and print it as steady-state overhead. (7) `tests/health_check_perf.rs:cpu_ms_from_ticks` — replaced the `ticks * 10` hardcode (= 100 Hz assumption) with `sysconf(_SC_CLK_TCK)` lookup. Falls back to 100 only if sysconf returns something nonsensical. ## Tests added - `test_create_box_request_carries_custom_health_check_settings_on_the_wire` — client puts the field on the wire when the caller's settings differ from default. Reverting either the new struct field or the `from_options` match arm flips it red. - `test_create_box_request_omits_health_check_settings_when_default` — wire body stays byte-identical for default settings (back-compat). - `build_box_options_honours_custom_health_check_settings` — server reads the field and produces the matching `HealthCheckOptions`. - `build_box_options_disabled_wins_over_settings` — opt-out beats custom settings when both are present. - The existing 2 client + 3 server health_check tests still pass. `HealthCheckOptions` now derives `PartialEq + Eq` so the `from_options` "are these the default?" check can do a simple equality comparison. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a single tokio task in `RuntimeImpl::new` that subscribes to `SIGCHLD` and drains every available zombie via `waitpid(-1, ..., WNOHANG)`. Health-check still calls `reap_pid_async` per-pid for the boxes it knows about; this catches the cases health- check never sees — operator disabled the watcher, the shim died during init before the watcher spawned, or a dropped `Child` handle left an unwatched pid. Disabled under `cfg(test)` because cargo runs all lib unit tests in one process and `waitpid(-1)` is process-wide — a reaper spawned by a `#[tokio::test]` would race-steal `Child`ren that adjacent `#[test]` cases are about to `try_wait()`. Production binaries and integration-test binaries (one process per `tests/*.rs`) keep the auto-spawn. The reaper's own contract is covered by the `runtime_child_reaper` integration test, which lives in its own binary precisely to avoid the same hazard. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The runtime-wide SIGCHLD reaper added in the previous commit already
drains every zombie via `waitpid(-1, ..., WNOHANG)` the moment the
kernel posts SIGCHLD. The `reap_pid_async` call inside the per-box
health-check loop is therefore redundant — and conceptually wrong:
liveness observation should not own teardown. Removed.
Communication flow is now:
shim dies (SIGKILL / OOM / panic / dropped Child)
→ kernel posts SIGCHLD to the runtime process
→ util::child_reaper task drains via waitpid(-1, WNOHANG)
→ /proc slot freed
health-check loop, on the next tick, observes is_process_alive=false
and flips state to Stopped + Unhealthy. It no longer touches waitpid.
Behaviour-preserving for every PR boxlite-ai#613 scenario:
- external SIGKILL/SIGTERM: SIGCHLD fires immediately, runtime reaper
drains within microseconds; health-check still observes the death
and flips state on its next tick.
- rm --force: unchanged — ShimHandler::stop_force still calls
Child::wait synchronously (the canonical, deterministic path for
pids we own a Child handle for).
- --no-health-check / health_check_disabled: runtime reaper still
drains zombies whether or not the per-box watcher exists, which is
the whole point of decoupling.
Verified:
- tests/health_check.rs — sigkill / sigterm reproducers still green
(3/3); state transition to Unhealthy is preserved.
- tests/runtime_child_reaper.rs — 2/2 green; reaper drains within
2 s of a dropped Child handle.
- 728/728 boxlite lib unit tests pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
open health check by default(--no-health-check support), add wait so that the zombie shim PID can be swept while
Then refactor
rm --forceto share the same hold-Child kill path asstop/ non-forcerm/restart, so the codebase has one canonical kill route instead of two.Test plan
SIGKILLshim/proc/<pid>staysState: Z;PerTestBoxHome::droppanics withlive shim(s)/procentry vanishes within one health-check tick + 500 ms reap retrySIGTERMshimState: Zpersistsboxlite serve+ RESTDELETE …?force=trueState: Z,PPid:matches daemon (init can't reap)Child::kill + Child::waithealth_check = Noneboxlite run --no-health-check; REST body"health_check_disabled": true(omitted by pre-#613 clients)rm --forcekill+reapkill_process + libc::waitpidinrt_impl::remove_box, separate fromstoppathLiteBox::stop_force→ShimHandler::stop_force→Child::wait; one canonical kill routeReproducer tests
health_check_becomes_unhealthy_when_shim_killedkill -9health_check_becomes_unhealthy_when_shim_sigtermedkill -TERMboxlite_serve_rm_force_active_box_no_zombie_left_in_procboxlite serve+ RESTDELETE …?force=trueEscape-hatch unit tests
management_flags_no_health_check_clears_advanced_fieldmanagement_flags_no_health_check_unset_keeps_default_ontest_create_box_request_carries_health_check_disabled_on_the_wiretest_create_box_request_omits_health_check_disabled_when_defaultbuild_box_options_health_check_disabled_true_clears_health_checkbuild_box_options_no_health_check_field_keeps_default_onbuild_box_options_health_check_disabled_false_keeps_default_onRefactor coverage
LocalRuntime::remove(force=true)LiteBox::stop_force→ShimHandler::stop_forceChild::kill + Child::waitRuntimeImpl::remove_box(force=true)(Drop + 7 internal tests)kill_process + libc::waitpidRefactor three-side verification (rm-force production path)
State: Z(zombie)State: Z,PPid: <daemon>Perf measurement
perf_health_check_default_on_vs_off_cpu_deltameasures/proc/self/statCPU delta with vs without health check at 100 ms interval. On this host: ~900 µs per tick. At the default 30 s interval: ~30 µs/sec per box (0.3 % of one core at 100 boxes, 3 % at 1000).Summary by CodeRabbit
New Features
Bug Fixes
Tests