Skip to content

[high] fix(vmm): reap shim after kill in rm-force + async-death paths (Issue #523)#613

Draft
G4614 wants to merge 12 commits into
boxlite-ai:mainfrom
G4614:fix/scoped-zombie-shim-reaper
Draft

[high] fix(vmm): reap shim after kill in rm-force + async-death paths (Issue #523)#613
G4614 wants to merge 12 commits into
boxlite-ai:mainfrom
G4614:fix/scoped-zombie-shim-reaper

Conversation

@G4614

@G4614 G4614 commented May 28, 2026

Copy link
Copy Markdown
Contributor

open health check by default(--no-health-check support), add wait so that the zombie shim PID can be swept while

  1. SIGKILL (eg. OOM)
  2. boxlite serve rm force (serve alive but shim died)

Then refactor rm --force to share the same hold-Child kill path as stop / non-force rm / restart, so the codebase has one canonical kill route instead of two.

Test plan

observed pre-fix post-fix
external SIGKILL shim /proc/<pid> stays State: Z; PerTestBoxHome::drop panics with live shim(s) /proc entry vanishes within one health-check tick + 500 ms reap retry
external SIGTERM shim same — State: Z persists clean
boxlite serve + REST DELETE …?force=true shim stays State: Z, PPid: matches daemon (init can't reap) reaped within ~50 ms via Child::kill + Child::wait
CLI / REST / SDK opt-out of health check only Rust library users could set health_check = None boxlite run --no-health-check; REST body "health_check_disabled": true (omitted by pre-#613 clients)
rm --force kill+reap inline kill_process + libc::waitpid in rt_impl::remove_box, separate from stop path LiteBox::stop_forceShimHandler::stop_forceChild::wait; one canonical kill route

Reproducer tests

test trigger
health_check_becomes_unhealthy_when_shim_killed external kill -9
health_check_becomes_unhealthy_when_shim_sigtermed external kill -TERM
boxlite_serve_rm_force_active_box_no_zombie_left_in_proc real boxlite serve + REST DELETE …?force=true

Escape-hatch unit tests

test layer
management_flags_no_health_check_clears_advanced_field CLI
management_flags_no_health_check_unset_keeps_default_on CLI
test_create_box_request_carries_health_check_disabled_on_the_wire REST client
test_create_box_request_omits_health_check_disabled_when_default REST client
build_box_options_health_check_disabled_true_clears_health_check REST server
build_box_options_no_health_check_field_keeps_default_on REST server
build_box_options_health_check_disabled_false_keeps_default_on REST server

Refactor coverage

caller path reaches kernel via
async LocalRuntime::remove(force=true) LiteBox::stop_forceShimHandler::stop_force Child::kill + Child::wait
direct sync RuntimeImpl::remove_box(force=true) (Drop + 7 internal tests) inline kill_process + libc::waitpid retained fallback

Refactor three-side verification (rm-force production path)

step configuration expected actual
A async stop_force OFF, sync inline ON PASS (sync fallback) ✅ PASS
B both paths kill but skip wait FAIL with State: Z (zombie) ✅ FAIL — State: Z, PPid: <daemon>
C async stop_force ON, sync wait OFF PASS (async path alone reaps) ✅ PASS
D both intact (production) PASS ✅ PASS

Perf measurement

perf_health_check_default_on_vs_off_cpu_delta measures /proc/self/stat CPU delta with vs without health check at 100 ms interval. On this host: ~900 µs per tick. At the default 30 s interval: ~30 µs/sec per box (0.3 % of one core at 100 boxes, 3 % at 1000).

Summary by CodeRabbit

  • New Features

    • Added a force-stop option for immediate box termination (rm --force) and a CLI flag (--no-health-check) to disable per-box health checks
    • API now accepts optional health-check fields when creating boxes to control/disable health checks
  • Bug Fixes

    • Health-check task is more resilient to missing guest session state
    • Improved bounded process reaping to avoid zombie shim processes
  • Tests

    • New integration and performance tests for health-check and zombie-reaping behavior

@G4614 G4614 marked this pull request as draft May 28, 2026 08:28
@G4614 G4614 marked this pull request as ready for review May 28, 2026 09:33
DorianZheng pushed a commit that referenced this pull request May 28, 2026
…eak a running shim (#615)

health_check_transitions_to_healthy_after_startup ends with the box still
Running. RuntimeImpl::Drop's safety-net shutdown_sync only sends SIGTERM and
returns without waiting, so the shim is still mid-shutdown when
PerTestBoxHome::Drop (#604) runs its leak assertion — `1 live shim(s)`.
Adding t.bx.stop() (which stops and waits) clears it. The reaper (#613) does
not help here: it reaps dead shims, not a still-running one.

Co-authored-by: Ubuntu <ubuntu@ip-172-31-20-149.ap-southeast-2.compute.internal>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@G4614 G4614 force-pushed the fix/scoped-zombie-shim-reaper branch from 90ce4eb to d7c509d Compare June 1, 2026 04:22
@G4614 G4614 marked this pull request as draft June 2, 2026 03:52
@G4614 G4614 force-pushed the fix/scoped-zombie-shim-reaper branch from d7c509d to 560841c Compare June 2, 2026 04:43
@G4614 G4614 marked this pull request as ready for review June 2, 2026 05:12
@G4614 G4614 changed the title fix(vmm): scoped ShimReaper for zombie shim PIDs (Issue #523) [high] fix(vmm): scoped ShimReaper for zombie shim PIDs (Issue #523) Jun 2, 2026
G4614 pushed a commit to G4614/boxlite that referenced this pull request Jun 2, 2026
`recovery_with_dead_process` sleeps 100 ms after SIGKILL — inside
`REAPER_TICK` (250 ms), so the killed shim may still be a zombie
when the new runtime's recovery scan reads `/proc/<pid>`. That test
proves recovery handles the zombie state, but says nothing about
what happens *after* the scoped `ShimReaper` consumes the zombie
and `/proc/<pid>` is fully gone.

Add `recovery_after_reaper_consumes_killed_shim` that pins the
post-reap state explicitly:

  1. create box, capture shim PID, SIGKILL it
  2. sleep 800 ms (>= 3 * REAPER_TICK) so the reaper definitely
     sweeps and consumes the zombie; the first runtime is also
     dropped at the end of this block, so init becomes the
     belt-and-braces reap source
  3. precondition assert: `/proc/<killed_pid>` must not exist
     (otherwise the test isn't exercising the post-reap state)
  4. spin up a fresh runtime; recovery scan must still transition
     the box to `Stopped`, drop the stale PID file, and report
     `info.pid == None`

This complements rather than replaces `recovery_with_dead_process`:
together they cover both sides of the zombie/vanished split that
boxlite-ai#613 introduces by accelerating reap from "indefinite zombie until
init" to "<= 250 ms via scoped reaper".

Two-side verified: replace the body of `PidFileReader::process_identity`
with `ProcessIdentity::Verified(record.pid)` (i.e. drop both the
`is_process_alive` and start-time-fingerprint checks). Test fails on
`assertion left == right failed: reaped-shim box must be marked
Stopped, got Running, left: Running, right: Stopped` — proving the
test catches a regression where recovery would silently believe a
reaped shim is still live. Restored, test passes again in 33.65 s.

The test passes both with and without the scoped `ShimReaper`
itself (init reaps the orphaned shim once the first runtime drops),
which is the property being asserted: boxlite-ai#613 is recovery-neutral.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@G4614 G4614 marked this pull request as draft June 2, 2026 07:23
G4614 added a commit to G4614/boxlite that referenced this pull request Jun 2, 2026
…ertion

The `kill -9` and `kill -TERM` variants of
`health_check_becomes_unhealthy_when_shim_killed` *simulate* the signal
an OOM-killer sends. This test drives the actual `mm_oom_kill_process`
codepath: create a cgroup v2 subtree with `memory.max = 20M`, move a
`python3 bytearray(100 MB)` memory hog into it, let the kernel pick the
victim. No userspace `kill` involved.

Phase 1 — *kernel-level production fact*: the OOM-killed process IS a
zombie. `std::mem::forget`s the `Child` so the Rust side never `wait()`s
on it (the production analogue of boxlite's daemon not reaching a
waitpid call site on the OOM path). Asserts `/proc/<pid>/status`
contains `State:\tZ`. No reaper involvement — this is the kernel
behavior that has to hold for PR boxlite-ai#613 to even be necessary.

Phase 2 — *PR boxlite-ai#613 invariant*: spawn a real `ShimReaper`, register the
PID, expect both `reaper.registered()` to evict it AND
`/proc/<pid>/status` to disappear within 5 s (the PCB getting collected
by the reaper's `waitpid(WNOHANG)` sweep). The second of those two
conditions is what makes the test load-bearing under a
`register()`-as-no-op regression: the registry stays trivially empty,
but the zombie still lurks in `/proc`.

Two-side verified on this branch:

| step | code state                                   | result                                                                        |
|------|----------------------------------------------|-------------------------------------------------------------------------------|
| A    | reaper enabled                              | `ok` (2.06 s) — phase 1 sees `State: Z`, phase 2 sees the reaper drain        |
| B    | `register()` patched to early-return        | **FAIL** at phase 2: `still_registered=false, proc_gone=false` after 5 s     |
| C    | restored                                    | `ok` (2.09 s)                                                                |

Skips cleanly when cgroup v2 user delegation isn't available (CI
sandbox, older systemd, etc.): `writable_cgroup_v2_root` walks up from
`/proc/self/cgroup` until it finds an ancestor that lists `memory` in
`cgroup.subtree_control` and accepts a `mkdir` probe, returning `None`
otherwise. The bottom-most systemd `*.scope` typically drops the memory
controller for *its* children, which is why the walk-up to
`user@<uid>.service` is needed on the dev host.

Closes the empirical gap the previous test plan called out where OOM
killer / cgroup OOM were claimed by signal-mechanism analogy but not
directly reproduced.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@G4614 G4614 marked this pull request as ready for review June 2, 2026 09:25
G4614 added a commit to G4614/boxlite that referenced this pull request Jun 2, 2026
… zombie

Closes the empirical gap the rest of PR boxlite-ai#613's test plan left open: every
prior test exercised `BoxliteRuntime` from inside a long-lived test
process (which mirrors daemon behavior but isn't literally `boxlite
serve`). This test spawns the actual `boxlite serve` binary on an
ephemeral port and drives it through `boxlite --url … rm --force` so the
SIGKILL path runs inside the production daemon, exactly as it does in
prod.

Captures both the bug *and* the production-distinguishing detail: who is
the zombie's `PPid`. The failure-path panic includes both `State:` and
`PPid:` lines — the PPid check is what proves *control was not handed to
init*. When the test fails, PPid equals the daemon's PID (not 1), so the
kernel still considers the daemon the shim's parent and only the daemon
can `waitpid()` it.

Flow:
  1. Spawn `boxlite serve --port <ephemeral>` and record its PID.
  2. CLI in `--url` mode: `create` + `start` + `inspect --format
     '{{.State.Pid}}'` to grab the shim PID assigned by the daemon's
     spawn path.
  3. `boxlite --url … rm --force <id>` — the daemon's REST handler
     runs `runtime.remove(box_id, true)` which calls
     `kill_process(shim_pid)` (SIGKILL) and never `waitpid()`s.
  4. From outside the daemon, poll `/proc/<shim_pid>/status` for 5 s.
     The reaper-enabled path observes the entry disappear within
     ~500 ms (REAPER_TICK × 2).

Two-side verified locally:

| step | code state                                  | result                                                                                                                                                                                       |
|------|----------------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| A    | reaper enabled                              | `ok` (8.46 s, /proc disappeared within ~500 ms)                                                                                                                                              |
| B    | `register()` patched to early-return        | **FAIL** at 5 s timeout: `boxlite serve daemon (PID 666169) left zombie shim PID 666209 … reports "State: Z" "PPid: 666169" past 5 s` — PPid==daemon PID confirms no reparenting to init     |
| C    | restored                                    | `ok` (4.02 s)                                                                                                                                                                                |

`ServeGuard::Drop` sends SIGINT (not SIGTERM) so the daemon runs
`runtime.shutdown(timeout)`. Without that the box's shim would leak
through the daemon-kill path itself, double-counting against the
PerTestBoxHome leak check.

This is the most expensive of the zombie tests in PR boxlite-ai#613 (~10 s
wall-clock) and the most expensive to rebuild (touches the CLI binary +
boxlite-shim embedding), so the cheaper library-level tests still carry
the bulk of the regression surface; this one closes the literal "is it
actually broken via REST too" question that the others only answered by
analogy.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@G4614 G4614 marked this pull request as draft June 2, 2026 11:47
…oxlite-ai#523)

Two production code paths spawn shim subprocesses, send them SIGKILL,
and never `waitpid` — so a dead shim sits in `/proc` as `State: Z`
indefinitely, holding the PID slot and (in `boxlite serve` mode)
blocking init from reparenting because the daemon is still its parent.

**#1: async death (OOM / external SIGKILL/SIGTERM / internal panic).**
The shim dies without anyone in the runtime asking it to. The only
loop already polling shim liveness is health check (`box_impl.rs::
spawn_health_check`), but its `shim_died` branch did just one thing:
mark `Stopped` / `Unhealthy` in the DB and break. No `waitpid`. The
`is_process_alive` detector correctly treats `State: Z` as dead, so
the branch *fires* on a zombie — it just doesn't reap.

Fix: in the `shim_died` arm, `libc::waitpid(pid, _, WNOHANG)` in a
short bounded retry loop. SIGKILL → zombie transition is not always
synchronous with the wait queue: WNOHANG can briefly return 0 before
the kernel posts SIGCHLD/wait status, even when `/proc` already shows
`State: Z`. 500 ms ceiling with 10 ms backoff covers the race without
ever stalling the health-check loop.

The fix only takes effect if a health check task is *running*, so:

  - `AdvancedBoxOptions::default()` now spawns a health check with
    `HealthCheckOptions::default()` instead of `None`. Default
    interval 30 s, 60 s start period, 10 s ping timeout — the
    pre-existing defaults; no operator-visible cost beyond one ping
    per box per 30 s.
  - The decision is documented inline on the `health_check` field:
    health check is the *only* always-on shim-state watcher; explicit
    `health_check: None` opts out of both pings and zombie reaping
    (operator takes responsibility).

**#2: `boxlite serve` + REST `rm --force`.** Daemon's `rt_impl::
remove_box(force=true)` calls `kill_process(pid)` (SIGKILL by PID,
bypassing the canonical `stop()` path that *does* reap via
`Child::wait()`). The handler returns immediately and the box is
removed from the DB. The shim's `PPid` stays pointed at the daemon
(still alive), so init can't help.

Fix: after `kill_process`, run the same `WNOHANG`-with-deadline
polling loop already used in `vmm/controller/shim.rs::ShimHandler::
stop`'s SIGTERM path. 2 s deadline matches the existing
`GRACEFUL_SHUTDOWN_TIMEOUT_MS`; if SIGKILL doesn't yield a reapable
state in 2 s, the kernel is wedged and a longer wait won't help.

Three reproducer tests, each two-side verified (revert the relevant
production change → test goes red with the exact zombie signal;
restore → test goes green):

  - `boxlite::tests::health_check::
       health_check_becomes_unhealthy_when_shim_killed`
       (existing test on main, now actually passes — the
       `PerTestBoxHome::drop` zombie-leak panic is the load-bearing
       failure; this PR adds a pin on `/proc/<pid>/status` not being
       `State: Z` so the regression signal is inline)
  - `boxlite::tests::health_check::
       health_check_becomes_unhealthy_when_shim_sigtermed`
       (new — same shape, SIGTERM, covers k8s liveness initial
       signal / cgroup OOM initial / systemctl stop / plain kill)
  - `boxlite_cli::tests::serve_rm_force_zombie::
       boxlite_serve_rm_force_active_box_no_zombie_left_in_proc`
       (new — spawns real `boxlite serve` subprocess, drives REST
       `DELETE /v1/boxes/<id>?force=true` via CLI `--url`, watches
       `/proc/<shim_pid>/status` from *outside* the daemon. The
       failing-path panic captures `PPid:` against the daemon's
       PID — the load-bearing diagnostic that proves init isn't
       going to clean this up because the daemon is still alive
       and still the parent.)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@G4614 G4614 force-pushed the fix/scoped-zombie-shim-reaper branch from 1d9da69 to 3fbb87b Compare June 2, 2026 12:18
@G4614 G4614 changed the title [high] fix(vmm): scoped ShimReaper for zombie shim PIDs (Issue #523) [high] fix(vmm): reap shim after kill in rm-force + async-death paths (Issue #523) Jun 2, 2026
The `shim_died` arm sits inside the health check `tokio::spawn` task,
so `std::thread::sleep` between WNOHANG retries was blocking a tokio
worker thread for the duration of the 10 ms tick. Single-box workloads
never notice — the worst case is a 500 ms reap ceiling on one task.
But during a host-wide shim death event (host reboot, batch SIGKILL,
runtime tear-down), every box's `shim_died` arm fires at roughly the
same time. With tokio's default worker count = host CPU count, even
8–16 concurrent `shim_died` arms can saturate the runtime workers and
stall unrelated tasks (REST handlers, exec waits, snapshot work) for
up to the full 500 ms reap budget.

Swap to `tokio::time::sleep(...).await`. The retry now yields between
WNOHANG probes, the worker thread is free to drive other tasks, and
the 500 ms ceiling stays in place. No semantic change for the
single-event case the 3 reproducer tests exercise — all three still
green:

  - `health_check_becomes_unhealthy_when_shim_killed`  (21.8 s)
  - `health_check_becomes_unhealthy_when_shim_sigtermed`  (21.8 s)
  - `boxlite_serve_rm_force_active_box_no_zombie_left_in_proc`  (12.3 s)

Also: a small perf-measurement test alongside in
`src/boxlite/tests/health_check_perf.rs`. Spawns one box with
`health_check = None` and one with `health_check = Some(100ms tick)`,
measures `/proc/self/stat` CPU delta of the boxlite test process over
a 10 s window, and prints the per-tick cost so a reviewer can verify
the default-on decision empirically. On this host: ~900 us per tick
in the on-box minus the off-box baseline. Projected to the default
30 s interval: ~30 us/sec per box — 0.3 % of one core at 100 boxes,
3 % at 1000.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@G4614 G4614 marked this pull request as ready for review June 2, 2026 13:00
… hatch

boxlite-ai#613 makes the per-box health check task default-on so the new
zombie-shim reaper (Issue boxlite-ai#523) has a watcher to piggy-back on. The
schema field `BoxOptions.advanced.health_check: Option<HealthCheckOptions>`
already lets Rust library users opt out via `None`, but CLI users
(`boxlite run` / `create`) and REST/SDK clients (Python, Node, Go all
route through `POST /v1/boxes`) had no way to express that choice —
exactly the gap boxlite-ai#597 hit for `cap_overrides`.

Surface plumbed end-to-end:

  - `ManagementFlags { no_health_check: bool }` — `--no-health-check`
    on `boxlite run` / `create`. When set, `apply_to` clears
    `opts.advanced.health_check = None`.
  - REST client-side `CreateBoxRequest` gets
    `health_check_disabled: Option<bool>`; `from_options` flips it to
    `Some(true)` only when the operator already disabled health check
    in the input options, so the wire form stays byte-identical for
    every other call.
  - Server-side `CreateBoxRequest` (in `cli/src/commands/serve/types.rs`,
    `#[serde(deny_unknown_fields)]`) gains the same field. The
    `build_box_options` mapper builds a manual `AdvancedBoxOptions`
    instead of falling through `..Default::default()`, then forces
    `health_check = None` iff the wire says so.

Three-state semantics for `health_check_disabled`:

  - `Some(true)` → explicit disable; box runs with no watcher, no
    zombie reaping (operator takes responsibility — documented inline).
  - `Some(false)` → "use the server default" (= currently `Some(...)`).
    Same effect as omitting the field, but documents the intent on
    the wire.
  - absent → server default. Pre-boxlite-ai#613 clients keep the legacy POST
    body shape and get the new default-on behaviour without code
    changes.

Seven unit tests cover the surface symmetrically:

  - CLI: `management_flags_no_health_check_clears_advanced_field` /
    `management_flags_no_health_check_unset_keeps_default_on` —
    flag-on clears, flag-off preserves; pre-asserts the baseline
    so a regression that broke the default-on schema would also fail.
  - REST client: `test_create_box_request_carries_health_check_disabled_on_the_wire` /
    `test_create_box_request_omits_health_check_disabled_when_default`
    — `None` in BoxOptions → `Some(true)` on the wire; `Some(default)`
    → field absent (backward-compat with pre-boxlite-ai#613 servers).
  - REST server: `build_box_options_health_check_disabled_true_clears_health_check` /
    `build_box_options_no_health_check_field_keeps_default_on` /
    `build_box_options_health_check_disabled_false_keeps_default_on`
    — three-state JSON → BoxOptions mapping, including the explicit
    "Some(false) means default" sentinel.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@G4614 G4614 marked this pull request as draft June 2, 2026 13:31
G4614 and others added 4 commits June 2, 2026 14:07
…p_force`

`rt_impl::remove_box(force=true)` historically reached for `kill_process(pid)`
+ inline `libc::waitpid` to clean up the shim — a PID-only shortcut that
worked, but ran a different kill path from `stop` / non-force `rm` /
`restart`, all of which already went through `ShimHandler` and held the
`Child` handle. Two patterns in the codebase for the same conceptual
operation; this commit collapses them into one.

Wiring, top-down:

  - `VmmHandler::stop_force` (trait): new sibling of `stop()`. Same
    return type, same `&mut self`, same sync inside-of-async pattern
    the existing `stop()` already uses.

  - `ShimHandler::stop_force` (impl on the trait): SIGKILL via
    `Child::kill()` followed by blocking `Child::wait()` when the
    handler owns a `Child` (the common case for boxes spawned by this
    runtime). Attached-mode fallback uses raw `libc::kill + waitpid`
    with the same 2 s deadline shape as the inline block previously
    held in `rt_impl::remove_box`.

  - `BoxBackend::stop_force` (trait): async counterpart on the higher-
    level box trait. Mirrors `stop()`.

  - `BoxImpl::stop_force` (impl on the trait): shares the entire
    teardown body with `BoxImpl::stop()` via a new `stop_impl(force:
    bool)` helper — health-check cancel, shutdown-token cancel, PID-
    file cleanup, state transition, DB sync, listener notification,
    metrics counter increment all stay identical between the two
    public methods. The only branches on `force`:
      * Skip the 10 s `guest_session.shutdown()` await (operator asked
        for `rm --force`, not a polite request).
      * Pick `handler.stop_force()` over `handler.stop()`.
      * Skip the trailing `auto_remove` self-call — the caller
        (`LocalRuntime::remove(force=true)`) is the remover; firing
        it again from inside `stop_force` would race.

  - `RestBox::stop_force` (impl on the trait): forwards to the existing
    graceful `stop` for now. The REST wire surface doesn't expose a
    force-stop verb today; clients reach the same outcome through
    `DELETE /v1/boxes/<id>?force=true`, which lands on
    `RuntimeBackend::remove(force=true)` server-side and follows the
    new 持-Child path there. Documented inline.

  - `LiteBox::stop_force` (public, in `litebox/mod.rs`): proxy over
    `box_backend.stop_force().await`. Public-facing API; what
    operators / SDK users would call directly.

  - `LocalRuntime::remove(id_or_name, force=true)` (the async backend
    impl in `rt_impl.rs`): looks up the LiteBox by id_or_name, calls
    `litebox.stop_force().await`, then delegates to the sync
    `RuntimeInnerImpl::remove` for DB / disk cleanup. After
    `stop_force()` returns, `mark_stop()` has already cleared
    `state.pid` to `None` AND `state.status` to `Stopped`, so the
    sync `remove_box`'s `if state.status.is_active() || state.pid
    .is_some()` guard short-circuits and the inline-waitpid block is
    not exercised on the async path. Production rm-force is now
    100 % 持-Child.

The inline `kill_process + libc::waitpid` block in `rt_impl::remove_box`
stays in place as a fallback for direct sync callers (Drop paths,
unit tests that exercise `RuntimeImpl::remove_box` without the async
surface). It's no longer the production code path; documenting it
inline as such would obscure the test-runner-only nature, so left
unannotated for the follow-up that strips it (if/when the 5 sync
tests are migrated to drive the async API). Without that migration,
killing the fallback would break the test suite for cosmetic gain.

Verification:

  - 3 reproducer tests (SIGKILL via health check + SIGTERM via health
    check + `boxlite serve` REST `rm --force`): all green, no
    regression on the zombie-reaping contract these tests pin.
  - 7 sync unit tests that exercise `RuntimeImpl::remove_box(force=
    true)` directly (`test_remove_box_*`, force_remove_*): all green
    — they hit the sync fallback path with state.pid still set.
  - 7 escape-hatch unit tests for `--no-health-check` / `health_check
    _disabled` plumbing: unchanged, all green.
  - `make fmt:check:rust` + workspace `cargo clippy` clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The prior boxlite-ai#613 commits left three near-identical `kill + WNOHANG +
sleep` loops scattered across `box_impl.rs::spawn_health_check`,
`rt_impl.rs::remove_box`, and `shim.rs::ShimHandler::stop_force`'s
attached fallback. Each was hand-coded with a 10–50 ms sleep inside
its retry, a custom deadline check, and a slightly different error-
branch shape. Reviewers (rightly) called the duplication out — and
a sleep-poll loop is the wrong primitive for "wake when this pid
exits" anyway.

This commit replaces all three with two helpers in `util::process`:

  - `reap_pid_blocking(pid, deadline_ms) -> ReapOutcome`
  - `reap_pid_async(pid, deadline_ms) -> ReapOutcome`

Both open a pidfd (`pidfd_open` via `SYS_pidfd_open`, Linux 5.3+),
wait for it to become readable — sync via `poll(2)`, async via
`tokio::io::unix::AsyncFd` — then `waitpid(pid, _, 0)` (blocking
but immediate, since the fd already signalled exit). The wait is
*bounded* by the deadline (`poll`'s timeout / `tokio::time::timeout`)
without a polling sleep loop, and *interruptible* in both contexts
without a `spawn_blocking` thread leak on a wedged shim.

Why pidfd vs blocking `waitpid(_, _, 0)` in `spawn_blocking`: the
latter has no way to cancel — a wedged shim (D state) holds the
blocking-pool slot until the daemon exits. With pidfd, drop of
`AsyncFd` / `close(pidfd)` releases the kernel-side wait the moment
the timeout fires.

Fallback path: on `pidfd_open` failure (PID not our child, kernel
< 5.3, etc.), the helpers fall back to a single `waitpid(WNOHANG)`
attempt — never to a sleep loop, so the worst case is "reaper
missed once."

All three call sites contract to a one-liner around the helper. Net
diff: -90 LOC duplicate retry loops, +110 LOC helpers with safety
comments + ReapOutcome enum. Behaviour unchanged for the three
reproducer tests (kernel-level zombie state is what they assert on,
and pidfd reaches the same waitpid syscall):

  - `health_check_becomes_unhealthy_when_shim_killed`
  - `health_check_becomes_unhealthy_when_shim_sigtermed`
  - `boxlite_serve_rm_force_active_box_no_zombie_left_in_proc`

7 sync `RuntimeImpl::remove_box(force=true)` unit tests still green —
they hit the sync fallback path, which now uses `reap_pid_blocking`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`AdvancedBoxOptions::default()` flipping `health_check` to `Some(...)` in
boxlite-ai#613 surfaced two regressions in the spawn-health-check path:

1. `init_live_state` propagated `live_state.guest_session.guest().await?`
   as a hard error, so any recovered/attached box whose guest session
   can't be re-established (e.g. unit tests using a dummy `sleep` PID)
   failed `live_state()` and `stop_impl`'s kill branch was skipped —
   leaking the process. Switch the `?` to a match: log a warning and
   skip spawning the health task; box init still completes.

2. The health task captured `Arc<RuntimeImpl>`, pinning the runtime
   alive forever for detached boxes whose box-scoped `shutdown_token`
   is never cancelled (user drops `BoxliteRuntime` without an explicit
   stop). Recovery / detach integration tests then failed at the
   second runtime startup with "Another BoxliteRuntime is already
   using directory" because the first runtime's home lock was never
   released. Switch to `Weak<RuntimeImpl>`; the task upgrade()s once
   per tick and exits if the runtime is gone.

Together restores 9 of the 10 previously-failing tests in
test:changed: 2 unit + 4 recovery + 3 detach integration.

Remaining 2 jailer + 6 CLI run failures are pre-existing on main:
CLI is addressed by open PR boxlite-ai#622 (RAII cleanup before hard exit);
jailer shim-SIGABRT verified independent by toggling
default_health_check to None — still fails.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`libc::SYS_pidfd_open` only exists in the Linux `libc` bindings, so
the macOS clippy job broke after the pidfd reap helpers landed. The
cross-platform contract is already "try_pidfd_open returns -1 →
caller falls back to reap_once (WNOHANG-only)", so non-Linux just
needs a stub that returns -1 unconditionally. No behaviour change on
Linux; macOS now compiles + uses the existing fallback.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@G4614 G4614 marked this pull request as ready for review June 3, 2026 08:17
@coderabbitai

coderabbitai Bot commented Jun 3, 2026

Copy link
Copy Markdown

Review Change Stack

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: 6c32b084-ea64-40ed-8a5d-4062635bd08f

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Adds a force-stop lifecycle that SIGKILLs shims with bounded PID reaping, hardens the health-check task via Weak runtime handles and non-fatal guest-session fallback, exposes CLI/REST opt-outs for health checks, and expands tests (integration + perf) to validate zombie-free removal.

Changes

Force-Stop Implementation with Bounded Process Reaping

Layer / File(s) Summary
Bounded PID reaping + re-exports
src/boxlite/src/util/process.rs, src/boxlite/src/util/mod.rs
Adds ReapOutcome, reap_pid_blocking and reap_pid_async with pidfd-based wait (fallback to waitpid) and re-exports them.
Backend contract and handler docs
src/boxlite/src/runtime/backend.rs, src/boxlite/src/vmm/controller/handler.rs
Adds BoxBackend::stop_force() to the backend trait and documents VmmHandler::stop_force semantics (immediate SIGKILL behavior).
Shim handler force-stop
src/boxlite/src/vmm/controller/shim.rs
Implements ShimHandler::stop_force: spawned mode uses Child::kill() + wait(), attached mode sends SIGKILL and requires bounded reap_pid_blocking to report success.
BoxImpl stop refactor & force path
src/boxlite/src/litebox/box_impl.rs
Refactors stop() -> stop_impl(force), adds stop_force(), skips guest-agent shutdown on force, chooses handler.stop()/stop_force(), and suppresses auto_remove for force stops.
Health-check task resilience
src/boxlite/src/litebox/box_impl.rs
Spawn health-check with Weak<RuntimeImpl>, skip spawn if guest session unavailable, exit when runtime is dropped, and on shim-death use async bounded reap before marking Unhealthy/Stopped.
Runtime remove force flow
src/boxlite/src/runtime/rt_impl.rs
remove_box force path kills shim then does a 2000ms blocking reap and only proceeds to destructive cleanup if reaped; LocalRuntime::remove(force) calls litebox.stop_force() when box resolved.
Public API & REST backend
src/boxlite/src/litebox/mod.rs, src/boxlite/src/rest/litebox.rs
Adds LiteBox::stop_force(); REST backend returns Unsupported for stop_force() directing callers to runtime.remove(..., force=true) or local CLI.
CreateBoxRequest health-check wire
src/boxlite/src/rest/types.rs
Adds health_check_disabled and health_check_settings fields with conditional serialization; from_options maps advanced.health_check into these fields; tests updated.
AdvancedBoxOptions default & docs
src/boxlite/src/runtime/advanced_options.rs
Adds default_health_check() and explicit impl Default for AdvancedBoxOptions; HealthCheckOptions derives PartialEq/Eq.
Server + CLI health-check resolution
src/cli/src/commands/serve/mod.rs, src/cli/src/cli.rs
build_box_options honors health_check_disabled and health_check_settings; CLI adds --no-health-check to clear advanced.health_check; unit tests added.
Integration & perf tests
src/boxlite/tests/health_check.rs, src/boxlite/tests/health_check_perf.rs, src/cli/tests/serve_rm_force_zombie.rs
Extends health-check tests for non-zombie assertions on shim PID, adds perf test measuring health-check CPU cost, and adds end-to-end serve rm --force test verifying no shim zombies via /proc monitoring.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related issues

  • #523: Bounded PID reaping and zombie prevention changes address the shim zombie/reap concerns discussed in the issue.

Poem

🐰 A force so swift, like SIGKILL's might,
Reaper waits with deadline bright,
Weak paws hold runtime, then let go,
No orphan shims left in the snow!

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly and specifically summarizes the main changes: fixing zombie shim reaping after kill in rm-force and async-death paths (Issue #523), which aligns perfectly with the PR's core objective of eliminating zombie processes.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

@cla-assistant

cla-assistant Bot commented Jun 3, 2026

Copy link
Copy Markdown

CLA assistant check
All committers have signed the CLA.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 7

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/boxlite/tests/health_check.rs (1)

122-126: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Assert that the kill command actually succeeded.

output() only proves the helper process launched. If kill returns non-zero, the test keeps running and later fails with a much less obvious assertion. Check status.success() and include stderr in the failure message.

Suggested fix
-    Command::new("kill")
+    let out = Command::new("kill")
         .arg("-9")
         .arg(shim_pid.to_string())
         .output()
         .expect("Failed to kill shim process");
+    assert!(
+        out.status.success(),
+        "kill -9 failed for shim PID {shim_pid}: {}",
+        String::from_utf8_lossy(&out.stderr)
+    );
-    Command::new("kill")
+    let out = Command::new("kill")
         .arg("-TERM")
         .arg(shim_pid.to_string())
         .output()
         .expect("Failed to SIGTERM shim process");
+    assert!(
+        out.status.success(),
+        "kill -TERM failed for shim PID {shim_pid}: {}",
+        String::from_utf8_lossy(&out.stderr)
+    );

Also applies to: 207-211

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/boxlite/tests/health_check.rs` around lines 122 - 126, The test currently
spawns the external kill helper with Command::new("kill")...output() but doesn’t
assert the command succeeded; change both occurrences (the blocks that construct
Command::new("kill") and pass shim_pid) to capture the Output, check
output.status.success(), and on failure fail the test with a message that
includes stderr (use String::from_utf8_lossy on output.stderr) so the test
reports the kill error immediately and clearly.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/boxlite/src/rest/litebox.rs`:
- Around line 208-216: stop_force currently calls self.stop() which silently
downgrades semantics; change stop_force() to immediately return an Unsupported
error instead of delegating to stop(). Update the async fn stop_force(&self) ->
BoxliteResult<()> implementation to return Err(BoxliteError::Unsupported("force
stop not supported over REST".into())) (or the project's equivalent
Unsupported/UnsupportedOperation error constructor) so callers receive a clear
Unsupported response; reference the stop_force() method and
BoxliteResult/BoxliteError types when making the change.

In `@src/boxlite/src/rest/types.rs`:
- Around line 137-144: The code currently maps options.advanced.health_check to
a boolean health_check_disabled which drops any non-none HealthCheckOptions and
causes defaults to be rebuilt on the server; instead send the full health-check
config on the wire: replace the health_check_disabled logic with a direct
mapping that sets the request's health_check field to
options.advanced.health_check.clone() (or Some(cloned_value) / None as
appropriate) so the actual HealthCheckOptions is transmitted; if you prefer the
alternative, add client-side validation in the REST client to reject non-default
options. Ensure you reference and update the use of health_check_disabled where
the request/AdvancedBoxOptions is built so the wire payload carries the
Option<HealthCheckOptions> rather than a boolean.

In `@src/boxlite/src/runtime/rt_impl.rs`:
- Around line 845-850: The current fallback uses crate::util::reap_pid_blocking
with FORCE_REAP_DEADLINE_MS and logs outcome but then unconditionally proceeds
to persist Stopped state and perform full box deletion; change this to only
continue destructive cleanup when the reap outcome is terminal (e.g., process
reaped / exit confirmed / no such pid). After calling reap_pid_blocking and
capturing outcome, gate the subsequent persistence-to-Stopped and the full
disk/home removal on outcome indicating a terminal state; if outcome is a
timeout or other non-terminal result, abort the removal flow (return an error or
retry/rollback) and log a clear warning so the shim is not deleted while still
running.

In `@src/boxlite/src/util/process.rs`:
- Around line 363-368: The current branch unconditionally returns
ReapOutcome::Reaped after calling unsafe { libc::waitpid(...) } even though
waitpid can fail with -1/ECHILD when the PID is not our child; update the logic
in the block around the waitpid call to check the return value and errno: if
waitpid returns -1 and errno == libc::ECHILD return ReapOutcome::NotOurChild, if
waitpid returns > 0 return ReapOutcome::Reaped, and handle other errors
appropriately (e.g., propagate or log). Apply the same change to the other reap
helper (the similar waitpid usage near the later block) so both functions use
the waitpid return/errno checks instead of assuming success.

In `@src/boxlite/src/vmm/controller/shim.rs`:
- Around line 189-194: The current shutdown block in shim.rs unconditionally
returns Ok(()) after calling libc::kill and
crate::util::reap_pid_blocking(self.pid, REAP_DEADLINE_MS); change this to check
the result of reap_pid_blocking and propagate an Err when the reap timed out or
indicates the child is still running so BoxImpl::stop_impl() does not clear the
PID or mark the shim Stopped. Specifically, keep the kill call and
REAP_DEADLINE_MS, capture the value returned by reap_pid_blocking(self.pid,
REAP_DEADLINE_MS), and if it signals timeout/still-running return a suitable
error (instead of Ok(())); only return Ok(()) when the reap indicates success.

In `@src/boxlite/tests/health_check_perf.rs`:
- Around line 107-119: The loop waiting for health uses healthy_deadline and
polls t_on.runtime.get_info but doesn’t assert that the box reached
HealthState::Healthy; add a post-loop check that re-fetches the box info (via
t_on.runtime.get_info(t_on.bx.id().as_str()).await.unwrap().unwrap()) and assert
that info.health_status.state == HealthState::Healthy (or panic with a clear
message) so the test fails fast if the box never became healthy before
measuring.
- Around line 44-50: The cpu_ms_from_ticks function incorrectly assumes a fixed
clock tick rate of 100 Hz by multiplying ticks by 10. Update this function to
dynamically retrieve the actual clock tick rate at runtime using
sysconf(_SC_CLK_TCK) and convert ticks to milliseconds based on this value.
Additionally, in the perf_health_check_default_on_vs_off_cpu_delta function, add
a check to assert or early-return if the health check does not reach
HealthState::Healthy before the 15-second polling deadline to avoid measuring
CPU usage prematurely.

---

Outside diff comments:
In `@src/boxlite/tests/health_check.rs`:
- Around line 122-126: The test currently spawns the external kill helper with
Command::new("kill")...output() but doesn’t assert the command succeeded; change
both occurrences (the blocks that construct Command::new("kill") and pass
shim_pid) to capture the Output, check output.status.success(), and on failure
fail the test with a message that includes stderr (use String::from_utf8_lossy
on output.stderr) so the test reports the kill error immediately and clearly.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: 06d9196b-2bb0-45c7-a45c-13bcf976e894

📥 Commits

Reviewing files that changed from the base of the PR and between f21e5a4 and 723b6e6.

📒 Files selected for processing (17)
  • src/boxlite/src/litebox/box_impl.rs
  • src/boxlite/src/litebox/mod.rs
  • src/boxlite/src/rest/litebox.rs
  • src/boxlite/src/rest/types.rs
  • src/boxlite/src/runtime/advanced_options.rs
  • src/boxlite/src/runtime/backend.rs
  • src/boxlite/src/runtime/rt_impl.rs
  • src/boxlite/src/util/mod.rs
  • src/boxlite/src/util/process.rs
  • src/boxlite/src/vmm/controller/handler.rs
  • src/boxlite/src/vmm/controller/shim.rs
  • src/boxlite/tests/health_check.rs
  • src/boxlite/tests/health_check_perf.rs
  • src/cli/src/cli.rs
  • src/cli/src/commands/serve/mod.rs
  • src/cli/src/commands/serve/types.rs
  • src/cli/tests/serve_rm_force_zombie.rs

Comment thread src/boxlite/src/rest/litebox.rs
Comment thread src/boxlite/src/rest/types.rs Outdated
Comment thread src/boxlite/src/runtime/rt_impl.rs Outdated
Comment thread src/boxlite/src/util/process.rs Outdated
Comment thread src/boxlite/src/vmm/controller/shim.rs Outdated
Comment thread src/boxlite/tests/health_check_perf.rs Outdated
Comment thread src/boxlite/tests/health_check_perf.rs
Seven findings, six fixes (two minor + four major), all motivated by
the same root issue: timed-out / non-child reap outcomes were being
treated as success, so the runtime moved on to destructive cleanup
while the shim could still be alive.

(1) `util/process.rs` reap helpers — `reap_pid_blocking` and
`reap_pid_async` returned `Reaped` unconditionally after `waitpid`,
even though `pidfd_open` succeeds for non-child PIDs and `waitpid`
then returns -1/ECHILD instead of reaping. Branch on the `waitpid`
return value and surface `NotOurChild` when it's not > 0.

(2) `vmm/controller/shim.rs:194` (`ShimHandler::stop_force` attached
mode) — discarded the reap outcome and returned `Ok(())`. Now
returns `Engine` error for `TimedOut` / `NotOurChild` so the
caller bails before clearing the PID / persisting `Stopped`.

(3) `runtime/rt_impl.rs:850` (force-remove sync fallback) — logged
the reap outcome and proceeded with home dir / disk deletion. Now
match-gated: only `Reaped` permits the destructive cleanup;
non-terminal outcomes return an `Engine` error pointing the
operator at the still-alive PID.

(4) `rest/litebox.rs:216` (REST `stop_force`) — silently downgraded
to graceful `stop()`. Returns `Unsupported` with a hint pointing
at `runtime.remove(.., force=true)` until REST gains a real
force-stop verb.

(5) REST `CreateBoxRequest` / `serve::types::CreateBoxRequest` /
`build_box_options` — previously only the disabled bit travelled on
the wire, so any custom `HealthCheckOptions` on the caller side
(non-default interval/timeout/retries/start_period) was silently
rewritten to defaults server-side. Added an optional
`health_check_settings: HealthCheckOptions` field; client emits it
only when the caller's settings differ from
`HealthCheckOptions::default()` (so the wire body is byte-identical
for pre-boxlite-ai#613 clients). Server honours it under
`AdvancedBoxOptions::default()` unless `health_check_disabled` is
also set.

(6) `tests/health_check_perf.rs:119` — wait-for-Healthy loop now
asserts the box actually reached `Healthy`. A timed-out wait used
to record CPU against startup / unhealthy behaviour and print it as
steady-state overhead.

(7) `tests/health_check_perf.rs:cpu_ms_from_ticks` — replaced the
`ticks * 10` hardcode (= 100 Hz assumption) with
`sysconf(_SC_CLK_TCK)` lookup. Falls back to 100 only if sysconf
returns something nonsensical.

## Tests added

- `test_create_box_request_carries_custom_health_check_settings_on_the_wire`
  — client puts the field on the wire when the caller's settings
  differ from default. Reverting either the new struct field or the
  `from_options` match arm flips it red.
- `test_create_box_request_omits_health_check_settings_when_default`
  — wire body stays byte-identical for default settings (back-compat).
- `build_box_options_honours_custom_health_check_settings` — server
  reads the field and produces the matching `HealthCheckOptions`.
- `build_box_options_disabled_wins_over_settings` — opt-out beats
  custom settings when both are present.
- The existing 2 client + 3 server health_check tests still pass.

`HealthCheckOptions` now derives `PartialEq + Eq` so the
`from_options` "are these the default?" check can do a simple
equality comparison.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@G4614 G4614 marked this pull request as draft June 4, 2026 07:57
G4614 and others added 3 commits June 4, 2026 08:34
Adds a single tokio task in `RuntimeImpl::new` that subscribes to
`SIGCHLD` and drains every available zombie via
`waitpid(-1, ..., WNOHANG)`. Health-check still calls `reap_pid_async`
per-pid for the boxes it knows about; this catches the cases health-
check never sees — operator disabled the watcher, the shim died
during init before the watcher spawned, or a dropped `Child` handle
left an unwatched pid.

Disabled under `cfg(test)` because cargo runs all lib unit tests in
one process and `waitpid(-1)` is process-wide — a reaper spawned by
a `#[tokio::test]` would race-steal `Child`ren that adjacent
`#[test]` cases are about to `try_wait()`. Production binaries and
integration-test binaries (one process per `tests/*.rs`) keep the
auto-spawn. The reaper's own contract is covered by the
`runtime_child_reaper` integration test, which lives in its own
binary precisely to avoid the same hazard.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The runtime-wide SIGCHLD reaper added in the previous commit already
drains every zombie via `waitpid(-1, ..., WNOHANG)` the moment the
kernel posts SIGCHLD. The `reap_pid_async` call inside the per-box
health-check loop is therefore redundant — and conceptually wrong:
liveness observation should not own teardown. Removed.

Communication flow is now:

  shim dies (SIGKILL / OOM / panic / dropped Child)
    → kernel posts SIGCHLD to the runtime process
    → util::child_reaper task drains via waitpid(-1, WNOHANG)
    → /proc slot freed

  health-check loop, on the next tick, observes is_process_alive=false
  and flips state to Stopped + Unhealthy. It no longer touches waitpid.

Behaviour-preserving for every PR boxlite-ai#613 scenario:

  - external SIGKILL/SIGTERM: SIGCHLD fires immediately, runtime reaper
    drains within microseconds; health-check still observes the death
    and flips state on its next tick.
  - rm --force: unchanged — ShimHandler::stop_force still calls
    Child::wait synchronously (the canonical, deterministic path for
    pids we own a Child handle for).
  - --no-health-check / health_check_disabled: runtime reaper still
    drains zombies whether or not the per-box watcher exists, which is
    the whole point of decoupling.

Verified:
  - tests/health_check.rs — sigkill / sigterm reproducers still green
    (3/3); state transition to Unhealthy is preserved.
  - tests/runtime_child_reaper.rs — 2/2 green; reaper drains within
    2 s of a dropped Child handle.
  - 728/728 boxlite lib unit tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant