fix(cli): stop box before hard-exit on non-zero run exit code#622
Conversation
`boxlite run` propagated a non-zero command exit via std::process::exit,
which skips Drop and the box's async auto-stop/auto-remove — leaking the
box's shim as a live host process. The success path tears the box down via
normal teardown when run() returns, but the non-zero path never reached it.
Explicitly stop the box (kills the shim; removes it when --rm) before
std::process::exit. Fixes the abnormal-exit run integration tests
(exit_code_125/custom, signal_exit_code_{sigint,sigkill,sigterm},
python_error_handling) that tripped PerTestBoxHome's live-shim guard.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
94daf72 to
02df899
Compare
| let exit_code = streamer.start().await?; | ||
| // Exit with box's exit code | ||
| if exit_code != 0 { | ||
| // Tear the box down before hard-exiting. std::process::exit skips |
There was a problem hiding this comment.
try to design RAII around this so we execution and litebox will be dropped automatically in any branch
There was a problem hiding this comment.
changed to RAII mode, also wrote test for each branch to confirm it, thx
| // Drop and the async auto-stop/auto-remove, so a non-zero command | ||
| // exit would otherwise leak the box's shim as a live process (the | ||
| // success path stops the box via normal teardown when run returns). | ||
| drop(execution); |
There was a problem hiding this comment.
write a test case to cover this issue
There was a problem hiding this comment.
test_run_rm_non_zero_exit_does_not_leak_shim added for this, thx
…ess::exit Address PR boxlite-ai#622 review (DorianZheng): redesign so execution/litebox/the owning BoxliteRuntime drop on every return path instead of relying on a manual stop call before std::process::exit. process::exit bypasses Drop entirely, which is exactly what leaked the box's shim on the non-zero path; the only true RAII fix is to never call it mid-command. run::execute and exec::execute now return Result<i32> (the shell exit code), main returns ExitCode, and run_cli funnels every command through the same dispatcher. When run_cli returns, BoxliteRuntime drops, and RuntimeImpl::Drop -> shutdown_sync() reaps the shim - the same teardown the success path already relied on. Adds an explicit reproducer: test_run_rm_non_zero_exit_does_not_leak_shim scans <home>/boxes/*/shim.pid in the test body (not just via PerTestBoxHome::Drop), so the assertion is visible at the call site. Two-side verified: pre-fix simulation fails with "live shim PID(s): [..]", post-fix passes; exposes test_utils::home::live_shim_pids as pub for that. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… wrapper PR boxlite-ai#622 review follow-up: address the aesthetic drift introduced by the RAII fix (per-branch .map(|_| 0) noise and the ExitCode::from(... as u8) cast in main). - Every command's `pub async fn execute` (and `auth::run`) now returns `anyhow::Result<i32>`. Unit-success commands `Ok(0)` at the end; run/exec pass through the box's mapped shell exit code unchanged. Dispatcher in `run_cli` no longer needs `.map(|_| 0)` adapters. - `main` is back to `fn main()`. The tokio runtime is dropped explicitly before `process::exit(code)` so the BoxliteRuntime Drop chain (RuntimeImpl::Drop -> shutdown_sync) has already finished by then — the hazard called out in boxlite-ai#622 was `process::exit` *mid-command*, not in `main` after every stack frame has unwound. Verified: cargo check, 118 CLI unit tests, clippy -D warnings, fmt:check, and the focused leak repro (test_run_rm_non_zero_exit_does_not_leak_shim, ok 8.57s) all green. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Walk back the scope-creep portion of 69df038. The shim-leak RAII fix only needs run.rs / exec.rs / main.rs — making the 15 unit-success commands also return Result<i32> was purely dispatcher cosmetics, and it had a type-honesty cost: a command like `boxlite cp` that has no exit-code concept ended up advertising one (always 0). This commit: - Restores `anyhow::Result<()>` + `Ok(())` on auth/cp/create/images/info/ inspect/list/logs/pull/restart/rm/serve/start/stats/stop. - Puts the 15 `.map(|_| 0)` adapters back in `run_cli`'s dispatcher, collocated so the asymmetry is visible at one site (`run`/`exec` real; others adapted). - Keeps main.rs's `fn main() { ... drop(rt); process::exit(code); }` simplification — that part isn't scope creep, it's the RAII fix. cargo check, 118 CLI unit tests, clippy -D warnings, fmt:check, test_run_rm_non_zero_exit_does_not_leak_shim (ok 8.22s) all green. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…rn paths The focused boxlite-ai#622 reproducer (test_run_rm_non_zero_exit_does_not_leak_shim) only exercises the std::process::exit branch that actually leaked. Add two companion tests that pin the same RAII invariant onto the two other CLI-reachable early-return points in BoxRunner::run: - test_run_image_pull_failure_does_not_leak_shim (branch ②) — create_box? fails on a non-existent image; no shim ever spawns, but partial-VM state must drop cleanly. - test_run_exec_setup_failure_does_not_leak_shim (branch ③) — litebox.exec? fails (invoking a directory) after the box is fully running; Drop has to reap the live shim. Both pass under today's RAII fix (4.4 s each in parallel). These branches are not affected by the original boxlite-ai#622 bug (they use `?` rather than process::exit, so Drop always ran) — the value is forward-looking: if anyone introduces a Drop-bypass shortcut on these paths later, the tests fail. The remaining two early-exits in BoxRunner::run — streamer.start? (signal-handler init, effectively unreachable in normal CLI invocation) and a panic mid-`run()` — are left to Rust's stack-unwinding guarantee plus PerTestBoxHome::Drop's implicit guard, documented inline so the choice is visible. No mock-injection tests for those. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
Caution Review failedPull request was closed or merged during review No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Plus Run ID: 📒 Files selected for processing (5)
📝 WalkthroughWalkthroughThe PR refactors CLI commands to return exit codes instead of calling ChangesCLI Exit Code Propagation and RAII Cleanup
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
when the box killed badly (non-zero return code), the shim process should be released by RAII
appeared because of #604
Test plan
Existing exit-code tests (branch ⑥: non-zero exit)
The abnormal-exit
boxlite runintegration tests, each verified two-sided (run.rsreverted tostd::process::exitvs. RAII applied) on this branch. Side B's leak surfaces viaPerTestBoxHome::Droppanicking withlive shim(s): [pid]— thestd::process::exitshortcut bypassesRuntimeImpl::Drop→shutdown_sync.std::process::exit)test_run_exit_code_125live shim(s): [320092]test_run_exit_code_customlive shim(s)test_run_signal_exit_code_sigtermlive shim(s): [323543]test_run_signal_exit_code_sigkilllive shim(s)test_run_signal_exit_code_sigintlive shim(s)test_run_python_error_handlinglive shim(s): [326872]test_run_exit_code_success(control)Focused reproducer for branch ⑥ (added on review)
test_run_rm_non_zero_exit_does_not_leak_shimruns the same scenario (run --rm alpine:latest sh -c 'exit 7') but scans<home>/boxes/*/shim.pidfor live PIDs in the test body — so the no-leak assertion is visible at the call site instead of buried inPerTestBoxHome::Drop's panic.test_run_rm_non_zero_exit_does_not_leak_shim ... ok(5.79s)run.rs:93reverted tostd::process::exit(to_shell_exit_code(exit_code))non-zero boxlite run --rm left live shim PID(s): [286477]test_run_rm_non_zero_exit_does_not_leak_shim ... ok(6.06s)Other early-return branches in
BoxRunner::runReviewer's request was "execution and litebox dropped automatically in any branch". The remaining CLI-reachable early-returns each get their own no-leak test:
validate_flags?--ttywith non-TTY stdintest_run_tty_error_in_pipe(existing)create_box?test_run_image_pull_failure_does_not_leak_shimlitebox.exec?/etc(a directory)test_run_exec_setup_failure_does_not_leak_shimdetach return Ok(0)-dflagtest_run_detach(existing, manually rm's the box)streamer.start?PerTestBoxHome::Drop's implicit guard + Rust's stack-unwinding guaranteeOk(to_shell_exit_code(0))test_run_exit_code_success+ 30+ commands as side-effectTwo-sided verification for branch ③ (the only realistic injection point on a path where a shim is actually alive at the failure moment):
test_run_exec_setup_failure_does_not_leak_shim ... ok(4.44s)litebox.exec().await?replaced withmatch { Err => std::process::exit(1) }to inject the Drop-bypass pattern onto this branchexec-setup failure left live shim PID(s): [395393]test_run_exec_setup_failure_does_not_leak_shim ... ok(5.22s)Why this works
Pre-fix, a non-zero command exit took the
std::process::exitshortcut that bypasses the box teardown the success path runs on return, leaking the microVM's shim (the source of #604's "orphan shims in /tmp"). Post-fix returns the exit code asResult<i32>and letsmainreturnExitCode; the runtime drops on every return path andRuntimeImpl::Drop→shutdown_syncSIGTERMs the shim — the same teardown the success path already relied on.make fmt:check+cargo clippy -- -D warningsclean; pre-push CLI matrix277 tests run: 277 passed, 0 skipped(~83s).Summary by CodeRabbit
Bug Fixes
Tests