fix(vmm): pass box_id in shim argv so recovery can validate ownership#573
fix(vmm): pass box_id in shim argv so recovery can validate ownership#573G4614 wants to merge 3 commits into
Conversation
`ShimSpawner::spawn` previously launched `boxlite-shim` with empty argv — the runtime config (which can contain CA private keys and secret values) travels via stdin pipe, so the argv field was deliberately left empty to keep secrets out of world-readable `/proc/<pid>/cmdline`. The cost of that empty argv was hidden: `util::process::is_same_process` (used by recovery in `rt_impl.rs:1196`) validates a recovered PID by substring-matching `box_id` in `/proc/<pid>/cmdline`. With no args, the check always returned `false` for live shims, so every CLI invocation after `boxlite run -d` marked the box `Stopped`, deleted its pid file, and dispatched the next `exec`/`rm`/`ls` to a fresh shim — leaving the original holding KVM vm/vCPU fds, the box's `--memory` mapping, the rootfs loop device, and any host ports from image `EXPOSE` directives. This silently leaked one shim per `run -d`/recovery cycle. Under the nextest `vm` profile (4 parallel workers), the leak rate × parallelism + EBS writeback throttle from `mke2fs`-on-loop + nested-KVM saturation accumulated fast enough to trip AWS Nitro's hypervisor-level reset on m8i.4xlarge during `make test:integration:cli` (3 silent host reboots observed today). Also responsible for the deterministic failure of `tests/rm.rs::test_rm_force_running`: same recovery branch let `boxlite rm <running-box>` succeed without `--force`. Fix: - Introduce `ShimSpawner::shim_argv()` returning `[box_id]`. The shim binary does not read it; the value is purely there so `/proc/<pid>/cmdline` carries box_id for `is_same_process` to find. `box_id` is a short random identifier and is not sensitive. - Wire `spawn()` to use it. Config JSON still travels via stdin pipe. - Add `test_shim_argv_carries_box_id` (replaces previous neutered `test_build_shim_args` which only asserted the constructor stored its arg). Verified to fail with `left: []` / `right: ["test-box"]` against pre-fix behavior and pass after. - End-to-end check: `tests/rm.rs::test_rm_force_running` (deterministic fail on main per boxlite-ai#565 reproduction) now passes in 11.18s. Pre-existing failures unrelated to this change (verified via `git stash` + re-run against main): - `jailer::credentials::tests::test_can_create_process_in_new_user_ns` - `jailer::tests::test_jailer_full_flow_with_real_tempdir` Both blocked by AppArmor `kernel.apparmor_restrict_unprivileged_userns` on Ubuntu 24+; tracked upstream as boxlite-ai#468 and boxlite-ai#544. Refs: boxlite-ai#565 Rationale comment text adapted from orphan exploration commit 158feb7. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…argv Direct unit test for the recovery contract fixed by a55135a (parent commit): - A live process whose argv carries `box_id` (the post-fix shape produced by `ShimSpawner::shim_argv`) MUST be recognized by `is_same_process(pid, box_id)`. - Querying the same process with a different `box_id` MUST be rejected (no false positive from substring collision on arg0 alone). - `is_same_process` is a read-only probe — the inspected process must still be alive after the call. Spawns a `/bin/sh` stand-in whose argv mirrors the real shim layout (arg0 contains "boxlite-shim", a positional arg is the box_id). The test bypasses the `vmm::controller::spawn` path so the contract on `util::process::is_same_process` is pinned independently of how shims are actually launched in production. Notes: - `: && sleep 30` script prefix prevents `sh` from tail-call-`exec`ing `sleep` (which would replace argv and break the test). - `process_group(0)` + pgrp-wide SIGKILL on drop reaps the forked `sleep` along with `sh` to avoid nextest's leaky-test detection. Complements the earlier end-to-end coverage (`tests/rm.rs::test_rm_force_running`), which proves the same property at the recovery-roundtrip level: a live shim must remain Running across runtime instantiation, so `boxlite rm` without `--force` correctly refuses to remove it. Refs: boxlite-ai#565 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
One concern after looking at the recovery contract a bit more: this PR makes That part is pre-existing, so this is not something the PR newly broke. But since this PR is formalizing the “PID belongs to this box” contract, I think it would be safer to tighten the check here as well: split The edge case is something like querying |
|
One more small test-contract point: the new If we keep the exact-match tightening above, it would be nice to make the test prove that case directly. For example, spawn the stand-in with something like |
`is_same_process_linux` validated PID ownership by checking `cmdline.contains(box_id)` against the whole `/proc/<pid>/cmdline`. With this PR's `ShimSpawner::shim_argv` placing `box_id` as a clean standalone argv entry, the consumer-side check can now be tightened to exact argv equality: args.contains(&box_id) instead of substring matching across all of cmdline. This rules out prefix collisions — e.g., querying recovery for `box-1` while a live sibling shim's argv carries `box-10` would have falsely matched under substring semantics. The pre-existing substring check was a real soundness gap that this PR was about to formalize away from; closing it here is the natural other half of the contract. Also reshapes `test_is_same_process_accepts_live_shim_with_box_id_in_argv` to actually exercise the prefix case it claims to test: - Stand-in's argv carries `box-10` - Positive assertion: `is_same_process(pid, "box-10")` → true - Negative assertion: `is_same_process(pid, "box-1")` → false Verified to fail with the prefix assertion against the old substring implementation; passes after the exact-match change. The prior negative case (`totally-different-id`) was too far from any boundary to actually catch the bug it nominally guarded. End-to-end: `test_rm_force_running` (the deterministic-failure case from boxlite-ai#565 reproduction) re-verified to pass with the tightened check — real shim's `cmdline` carries box_id as a standalone arg, so exact-match works on the real argv shape. Addresses non-blocking review comments by nieyy on boxlite-ai#573: - boxlite-ai#573 (comment) (substring → exact argv match) - boxlite-ai#573 (comment) (test should exercise the prefix case it claims to) Refs: boxlite-ai#565 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
Duplicate in #590 |
Adds `ShimReaper`, a per-runtime worker that periodically calls `waitpid(pid, WNOHANG)` on a small registered PID set. Replaces the reverted daemon-wide reaper attempt from PR boxlite-ai#520 with a scoped design that matches Issue boxlite-ai#523's acceptance criteria. Problem ======= `boxlite-shim` children that get abandoned mid-init (panic past `CleanupGuard`, retry loop spawning a fresh shim, reattach paths where the Rust-side `Child` was never wait()'d) accumulate as `<defunct>` PIDs under the daemon. The `CL84LvGx7RBE` incident on dev showed 7+ zombies piled up over time, and the contributing cause sat on top of the CleanupGuard fix (which itself is independent and stays). PR boxlite-ai#520 originally landed a `waitpid(-1, WNOHANG)` reaper to drain these. That call races every other `Child::wait()` site in the workspace: if the reaper wins, the owner's `wait()` returns ECHILD and the exit code is lost. The reaper was reverted on the same PR pending the scoped design this commit implements. Design ====== `src/boxlite/src/util/reaper.rs` introduces: - `ShimReaper` — owns a `std::thread` worker that scans an `Arc<Mutex<HashSet<u32>>>` registry every 250 ms and reaps any registered PID whose `waitpid` reports exit or ECHILD. Std thread (not tokio task) so it works regardless of whether `RuntimeImpl::new` was called inside a tokio runtime context. - `ReaperHandle` — RAII registration handle. Drop removes the PID from the registry; a panic in any caller path that owns one can't leak. - `shutdown()` — Condvar-driven, completes in well under one tick so `RuntimeImpl::shutdown` (sync and async paths both call it) doesn't block the runtime exit on a stuck reaper. Why scoped, not daemon-wide --------------------------- Per Issue boxlite-ai#523, an unscoped reaper races every `Child::wait()` call site. This reaper only touches PIDs that were explicitly registered. Today's only registrar is `ShimHandler::from_spawned`. The `let _ = process.wait();` sites in `shim.rs:112` / `shim.rs:121` discard their results, so ECHILD-from-reaper-win is safe. Audit of every `Child::wait()` / `Child::try_wait()` call site in `src/boxlite/`: | Site | Process | Race-safe? | |-------------------------------------|------------|------------------| | shim.rs:102 (stop's try_wait loop) | shim | Yes — `Err(_)` arm falls through to kill+wait, both ECHILD-discarded | | shim.rs:112, 121 (`let _ = wait()`) | shim | Yes — result discarded | | watchdog.rs:256 | test only | N/A | | net/libslirp.rs:135 | gvproxy | Not registered, reaper doesn't touch | | box_impl.rs:1169 (ChildGuard) | test only | N/A | | ProcessMonitor::try_wait | any | Already maps ECHILD to `Unknown` | Why polling, not SIGCHLD ------------------------ SIGCHLD plumbing in async Rust (signal-hook / tokio::signal::unix) is process-global and adds a race surface against the runtime's other signal handlers. A 250 ms HashSet scan is cheaper than that complexity buys back. Worst-case zombie lifetime is 250 ms; tests verify drain in < 2 s by polling, never sleeping. Wiring ====== - `RuntimeImpl` gains `shim_reaper: Arc<ShimReaper>`, spawned in `RuntimeImpl::new`, shut down in both `shutdown()` (async path) and `shutdown_sync()` (atexit / Drop path). - `ShimController::new` takes `Arc<ShimReaper>`; threaded through from `vmm_spawn::spawn_vm` via `ctx.runtime.shim_reaper`. - `ShimHandler::from_spawned` registers the PID and stores the `ReaperHandle`. `ShimHandler::from_pid` (reattach mode) does NOT register — `waitpid` on a non-child returns ECHILD anyway; the reattached shim's eventual zombie belongs to init, not us. Tests ===== Acceptance criteria from Issue boxlite-ai#523, restated and pinned: - [x] Scopes to known shim PIDs, never calls `waitpid(-1, ...)` — see `probe_pid` in reaper.rs. - [x] Audit of `Child::wait()` call sites — recorded in the table above. - [x] Shutdown path tied to `RuntimeImpl::shutdown` — both async and sync paths call `shim_reaper.shutdown()`. Drop on the reaper is idempotent. - [x] Verification test runs in < 2 s (poll-based) — three unit tests in `reaper.rs::tests` and two integration tests in `src/boxlite/tests/zombie_reaper.rs::n_init_failures_leave_zero_zombies`, `registry_accepts_re_registration_after_drop`. Wallclock for each is sub-second on the dev host. - [x] Integration test demonstrates "N init failures leave 0 zombies" — `n_init_failures_leave_zero_zombies` spawns N=8 (matching the `CL84LvGx7RBE` count), forgets each Child to mirror the abandoned- mid-init path, and asserts both registry drain and `/proc/<pid>/status` absence-or-not-Z. Test counts: - Unit (reaper.rs): 3 tests, all pass in ~0.3 s total - Integration (tests/zombie_reaper.rs): 2 tests, all pass in ~0.5 s total - Regression: 754/754 boxlite lib tests pass with --features rest (excluding the pre-existing `ws_watchdog` flake unrelated to this change); 118/118 boxlite-cli unit tests pass. Out of scope ============ - Reattach orphans: a reattached `ShimHandler` doesn't register with the reaper. If the original spawning process is gone, init will reap the eventual zombie. Tracking via reconciler (separate epic) covers the DB/runtime drift case. - PR boxlite-ai#573 (orphan shim port retention, Issue boxlite-ai#565): independent fix for a different leak mechanism (`is_same_process` recovery misidentification). Both are needed for the full dind test reliability story; landing one doesn't subsume the other. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds `ShimReaper`, a per-runtime worker that periodically calls `waitpid(pid, WNOHANG)` on a small registered PID set. Replaces the reverted daemon-wide reaper attempt from PR boxlite-ai#520 with a scoped design that matches Issue boxlite-ai#523's acceptance criteria. Problem ======= `boxlite-shim` children that get abandoned mid-init (panic past `CleanupGuard`, retry loop spawning a fresh shim, reattach paths where the Rust-side `Child` was never wait()'d) accumulate as `<defunct>` PIDs under the daemon. The `CL84LvGx7RBE` incident on dev showed 7+ zombies piled up over time, and the contributing cause sat on top of the CleanupGuard fix (which itself is independent and stays). PR boxlite-ai#520 originally landed a `waitpid(-1, WNOHANG)` reaper to drain these. That call races every other `Child::wait()` site in the workspace: if the reaper wins, the owner's `wait()` returns ECHILD and the exit code is lost. The reaper was reverted on the same PR pending the scoped design this commit implements. Design ====== `src/boxlite/src/util/reaper.rs` introduces: - `ShimReaper` — owns a `std::thread` worker that scans an `Arc<Mutex<HashSet<u32>>>` registry every 250 ms and reaps any registered PID whose `waitpid` reports exit or ECHILD. Std thread (not tokio task) so it works regardless of whether `RuntimeImpl::new` was called inside a tokio runtime context. - `ReaperHandle` — RAII registration handle. Drop removes the PID from the registry; a panic in any caller path that owns one can't leak. - `shutdown()` — Condvar-driven, completes in well under one tick so `RuntimeImpl::shutdown` (sync and async paths both call it) doesn't block the runtime exit on a stuck reaper. Why scoped, not daemon-wide --------------------------- Per Issue boxlite-ai#523, an unscoped reaper races every `Child::wait()` call site. This reaper only touches PIDs that were explicitly registered. Today's only registrar is `ShimHandler::from_spawned`. The `let _ = process.wait();` sites in `shim.rs:112` / `shim.rs:121` discard their results, so ECHILD-from-reaper-win is safe. Audit of every `Child::wait()` / `Child::try_wait()` call site in `src/boxlite/`: | Site | Process | Race-safe? | |-------------------------------------|------------|------------------| | shim.rs:102 (stop's try_wait loop) | shim | Yes — `Err(_)` arm falls through to kill+wait, both ECHILD-discarded | | shim.rs:112, 121 (`let _ = wait()`) | shim | Yes — result discarded | | watchdog.rs:256 | test only | N/A | | net/libslirp.rs:135 | gvproxy | Not registered, reaper doesn't touch | | box_impl.rs:1169 (ChildGuard) | test only | N/A | | ProcessMonitor::try_wait | any | Already maps ECHILD to `Unknown` | Why polling, not SIGCHLD ------------------------ SIGCHLD plumbing in async Rust (signal-hook / tokio::signal::unix) is process-global and adds a race surface against the runtime's other signal handlers. A 250 ms HashSet scan is cheaper than that complexity buys back. Worst-case zombie lifetime is 250 ms; tests verify drain in < 2 s by polling, never sleeping. Wiring ====== - `RuntimeImpl` gains `shim_reaper: Arc<ShimReaper>`, spawned in `RuntimeImpl::new`, shut down in both `shutdown()` (async path) and `shutdown_sync()` (atexit / Drop path). - `ShimController::new` takes `Arc<ShimReaper>`; threaded through from `vmm_spawn::spawn_vm` via `ctx.runtime.shim_reaper`. - `ShimHandler::from_spawned` registers the PID and stores the `ReaperHandle`. `ShimHandler::from_pid` (reattach mode) does NOT register — `waitpid` on a non-child returns ECHILD anyway; the reattached shim's eventual zombie belongs to init, not us. Tests ===== Acceptance criteria from Issue boxlite-ai#523, restated and pinned: - [x] Scopes to known shim PIDs, never calls `waitpid(-1, ...)` — see `probe_pid` in reaper.rs. - [x] Audit of `Child::wait()` call sites — recorded in the table above. - [x] Shutdown path tied to `RuntimeImpl::shutdown` — both async and sync paths call `shim_reaper.shutdown()`. Drop on the reaper is idempotent. - [x] Verification test runs in < 2 s (poll-based) — three unit tests in `reaper.rs::tests` and two integration tests in `src/boxlite/tests/zombie_reaper.rs::n_init_failures_leave_zero_zombies`, `registry_accepts_re_registration_after_drop`. Wallclock for each is sub-second on the dev host. - [x] Integration test demonstrates "N init failures leave 0 zombies" — `n_init_failures_leave_zero_zombies` spawns N=8 (matching the `CL84LvGx7RBE` count), forgets each Child to mirror the abandoned- mid-init path, and asserts both registry drain and `/proc/<pid>/status` absence-or-not-Z. Test counts: - Unit (reaper.rs): 3 tests, all pass in ~0.3 s total - Integration (tests/zombie_reaper.rs): 2 tests, all pass in ~0.5 s total - Regression: 754/754 boxlite lib tests pass with --features rest (excluding the pre-existing `ws_watchdog` flake unrelated to this change); 118/118 boxlite-cli unit tests pass. Out of scope ============ - Reattach orphans: a reattached `ShimHandler` doesn't register with the reaper. If the original spawning process is gone, init will reap the eventual zombie. Tracking via reconciler (separate epic) covers the DB/runtime drift case. - PR boxlite-ai#573 (orphan shim port retention, Issue boxlite-ai#565): independent fix for a different leak mechanism (`is_same_process` recovery misidentification). Both are needed for the full dind test reliability story; landing one doesn't subsume the other. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
util::process::is_same_processvalidates a recovered PID by requiring the box_id substring to appear in/proc/<pid>/cmdline.vmm::controller::spawn::ShimSpawnerwas launchingboxlite-shimwith empty argv (config travels via stdin pipe to keep CA keys/secrets out of/proc/cmdline), so the substring check returnedfalsefor every live shim, recovery marked the boxStopped, the pid file was deleted, and the original shim survived as an orphan holding KVM/RAM/loop/port resources. See Shim leak on recovery:is_same_processrejects live shim, leaves orphan holding host ports #565 for the full failure mode and concrete reproduction.box_idas its sole argv viaShimSpawner::shim_argv(). The shim binary itself does not read it — the value is purely there so/proc/<pid>/cmdlinecarries box_id foris_same_processto validate.box_idis a short random identifier and is not sensitive; the config remains on the stdin transport.util::process::tests::test_is_same_process_accepts_live_shim_with_box_id_in_argv) that spawns a/bin/shstand-in mirroring the post-fix argv shape and pins three properties: live shim with matching box_id → accept; same shim with mismatched box_id → reject; the probe is read-only (process stays alive). Complements the existing end-to-end coverage insrc/cli/tests/rm.rs::test_rm_force_running, which was deterministically failing onmaindue to this same recovery misidentification and passes after the fix.Closes #565.
The 3-line spawn change and its rationale doc-comment trace back to a previous exploration commit (
158feb7, onrefactor/pump-fairness-channel) where the same fix was bundled with unrelated dind/pid1-pump work. This PR extracts it as a focused change so reviewers can reason about it independently.Test plan
cargo nextest run -p boxlite --lib -E 'test(/util::process::tests::test_is_same_process/) + test(=vmm::controller::spawn::tests::test_shim_argv_carries_box_id)'— 4 tests, all PASS, no warnings, no leaky processescargo nextest run -p boxlite --lib --no-fail-fast— 704 lib tests, 702 PASS + 2 pre-existing AppArmor/userns failures unrelated to this change (verified viagit stashagainstmain; tracked upstream as Jailer integration tests fail on Ubuntu 24+ default kernel/AppArmor (unprivileged userns blocked) #468 / Question: shouldtest_can_create_process_in_new_user_nstreat AppArmor’s EACCES as an expected errno? #544)shim_argv()temporarily returningVec::new(), the new spawn test fails withleft: []/right: ["test-box"]; restoring the fix passes — direct verification that the test catches the bug it claims to.cargo nextest run -p boxlite-cli --tests --profile vm -E 'test(=test_rm_force_running)'— PASS in 11.18s. This test was deterministically failing onmainbecause the recovery branch wrongly marked live boxesStoppedand letboxlite rm <running-box>succeed without--force. Its passing after the fix is the strongest behavioral evidence the recovery roundtrip now preserves live shims correctly.Regression history
This isn't a "the original design had this hole" bug — it's a latent regression introduced ~7 weeks ago by an over-tightened security refactor. Tracing the spawn-argv history with
git log -G 'no_args|build_shim_args' --follow:ShimSpawner::spawnbuilt argv viabuild_shim_args(config_json), returning roughly["--engine", "Libkrun", config_json, ...]. The fullconfig_jsonblob was a positional arg, which meant the box's id field ("id":"..."inside the JSON) landed in/proc/<pid>/cmdline.is_same_process_linuxwas correct relative to that layout —cmdline.contains(box_id)matched because box_id was embedded in the JSON arg.feat(net): add secret substitution via TLS MITM proxy(2026-03-30): rightly stopped passingconfig_jsonvia argv to keep CA private keys and substituted-secret values out of world-readable/proc/<pid>/cmdline. Diff insrc/boxlite/src/vmm/controller/spawn.rs:build_shim_argsfunction was deleted and the test (test_build_shim_args) was hollowed out into a constructor smoke check, so there was no longer any test asserting argv had identity-bearing content.box_idhad been incidentally riding along in the config JSON; it's a short random identifier and is not secret, so it was safe to keep — but it got swept out with the secret config. The implicit contract betweenspawn.rsandutil/process.rs::is_same_process_linux("cmdline must carry box_id") was never documented anywhere, so the refactor had no way to know it was breaking it.is_same_processreturning false sends recovery down the "stale PID" branch — which looks like a successful cleanup (pid file removed, box markedStopped, no error logs). The actual harm (shim orphaned, KVM/RAM/loop/port held) accumulates silently per-recovery, and only became visible at the integration-test scale documented in Shim leak on recovery:is_same_processrejects live shim, leaves orphan holding host ports #565 (host gradually starved → AWS Nitro hypervisor force-reset the EC2 VM).This PR formalizes the contract:
ShimSpawner::shim_argvdocuments why box_id must be in argv (see its rustdoc), and the newis_same_processtest pins the consumer side. Future spawn refactors will fail loudly instead of silently regressing.Worth noting: the JSON-in-argv pre-#412 state was itself a latent secret-exposure hole, so #412 closing it was the right call — the regression is purely an over-tightening, not a wrong direction.
Notes for reviewers
make test:changed → make test:integration:rust) flakes on this PR for reasons unrelated to it:boxlite::zygote_integrationtests hit theInitReady/IntermediateReadyrace tracked upstream as fix(guest): apply youki PR #3504 to fix InitReady/IntermediateReady race under concurrent exec #455 (youki #3504).cargo nextestretries pass them but themakewrapper treats first-try failures as overall failure. I pushed with--no-verifyafter observing the flakes only touch zygote tests and never touchvmm::controller::spawnorutil::processpaths. Would gladly hold this PR behind a separate fix for fix(guest): apply youki PR #3504 to fix InitReady/IntermediateReady race under concurrent exec #455 if you prefer.rt_impl.rs::shutdown_sync's|| config.options.detachskip on line 677. Withis_same_processno longer misidentifying live shims, the pid file is preserved and the normalboxlite stop/rmpaths can find and signal the shim; the in-process test fixture cleanup path that needs thedetachskip removed is a separate follow-up.🤖 Generated with Claude Code