fix(make): propagate test failures instead of masking via POSIX ; chaining (#550)#1
Closed
G4614 wants to merge 1 commit into
Closed
fix(make): propagate test failures instead of masking via POSIX ; chaining (#550)#1G4614 wants to merge 1 commit into
; chaining (#550)#1G4614 wants to merge 1 commit into
Conversation
…ining Fixes boxlite-ai#550. Three recipes chained multiple commands inside one shell with `;` instead of `&&` / status-accumulation. POSIX evaluates `if A; then B; C; fi` to the rc of C, so: - test:unit:rust: a `-p boxlite` lib-test failure was hidden by a passing `-p boxlite-shared` run — CI saw exit 0 with 29 failed tests in the log. Now uses status-accumulation (both crates always run, final rc reflects any failure), plus `--no-tests=warn` to avoid a false-fail when FILTER scopes to a test that only lives in one crate (nextest's default --no-tests=fail exits 4 with zero matches). - dist:python Linux branch: build-guest.sh failure was masked by cibuildwheel's rc. Switched to && so cibuildwheel can't proceed against a stale guest binary. - dist:c per-platform branches: a failing dynamic-lib cp left the static-lib cp to "succeed", producing a half-staged sdks/c/dist with exit 0. Switched to && so any cp failure aborts the platform branch. Also widens .pre-commit-config.yaml `lint-fix` hook's files filter from `make/quality\.mk` to `make/.*\.mk` so future make/test.mk / make/dist.mk edits run through the same fmt+lint autofix gate (the original narrow filter let this exact class of issue land without local validation). Verified locally: - happy path: make test:unit:rust FILTER=container_layout → exit 0 (matches 1 in boxlite-shared, 0 in boxlite — no false-fail) - failure path: make test:unit:rust FILTER=can_create_process_in_new_user_ns → recipe exit 100, make exit 2 (the boxlite-ai#544 EACCES panic in -p boxlite correctly propagates instead of being swallowed by -p boxlite-shared finishing clean) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Owner
Author
|
Re-opening against upstream boxlite-ai/boxlite — wrong target repo. |
G4614
added a commit
that referenced
this pull request
Jun 1, 2026
The pre-fix GC scope was a single sweep of images/disk-images/*.ext4.
On a long-running `boxlite serve` (or any host that crash-leaked artifacts
without a runtime restart), the two biggest reclaim sources — orphan
boxes/<id>/ dirs (per-box qcow2 overlay + logs + sockets) and orphan
bases/*.qcow2 (snapshot/clone bases left when the `try_gc_base` cascade
crashed) — were unreachable from `boxlite gc`. Cleanup ran only at
startup or as a remove_box() side effect.
Now `collect_garbage` runs three sweeps in a fixed order:
1. boxes/<id>/ — id absent from box DB, past 10-min mtime grace
2. bases/*.qcow2 — base_path absent from base_disk table, past grace
3. images/disk-images/*.ext4 — unchanged behavior, but pinning is built
from live-only boxes (id in DB) so an orphan's stale qcow2 backing
chain doesn't keep image cache pinned
Empty `known` (DB read failure or fresh runtime with no boxes) short-
circuits #1 and treats every on-disk box as live in #3 — safer than
mistakenly deleting every box dir or every image-disk under a transient
DB error.
GcReport now carries box_dirs_removed / bases_removed alongside the
existing disk_images_removed counters; `boxlite gc` prints all three.
Tests: 3 new (sweeps_orphan_box_dirs_only,
orphan_box_dir_does_not_pin_image_disk_image, sweeps_orphan_bases_only)
plus existing 4 still green; CLI gc_cli updated for the new output shape.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
G4614
added a commit
that referenced
this pull request
Jun 2, 2026
…oxlite-ai#523) Two production code paths spawn shim subprocesses, send them SIGKILL, and never `waitpid` — so a dead shim sits in `/proc` as `State: Z` indefinitely, holding the PID slot and (in `boxlite serve` mode) blocking init from reparenting because the daemon is still its parent. **#1: async death (OOM / external SIGKILL/SIGTERM / internal panic).** The shim dies without anyone in the runtime asking it to. The only loop already polling shim liveness is health check (`box_impl.rs:: spawn_health_check`), but its `shim_died` branch did just one thing: mark `Stopped` / `Unhealthy` in the DB and break. No `waitpid`. The `is_process_alive` detector correctly treats `State: Z` as dead, so the branch *fires* on a zombie — it just doesn't reap. Fix: in the `shim_died` arm, `libc::waitpid(pid, _, WNOHANG)` in a short bounded retry loop. SIGKILL → zombie transition is not always synchronous with the wait queue: WNOHANG can briefly return 0 before the kernel posts SIGCHLD/wait status, even when `/proc` already shows `State: Z`. 500 ms ceiling with 10 ms backoff covers the race without ever stalling the health-check loop. The fix only takes effect if a health check task is *running*, so: - `AdvancedBoxOptions::default()` now spawns a health check with `HealthCheckOptions::default()` instead of `None`. Default interval 30 s, 60 s start period, 10 s ping timeout — the pre-existing defaults; no operator-visible cost beyond one ping per box per 30 s. - The decision is documented inline on the `health_check` field: health check is the *only* always-on shim-state watcher; explicit `health_check: None` opts out of both pings and zombie reaping (operator takes responsibility). **#2: `boxlite serve` + REST `rm --force`.** Daemon's `rt_impl:: remove_box(force=true)` calls `kill_process(pid)` (SIGKILL by PID, bypassing the canonical `stop()` path that *does* reap via `Child::wait()`). The handler returns immediately and the box is removed from the DB. The shim's `PPid` stays pointed at the daemon (still alive), so init can't help. Fix: after `kill_process`, run the same `WNOHANG`-with-deadline polling loop already used in `vmm/controller/shim.rs::ShimHandler:: stop`'s SIGTERM path. 2 s deadline matches the existing `GRACEFUL_SHUTDOWN_TIMEOUT_MS`; if SIGKILL doesn't yield a reapable state in 2 s, the kernel is wedged and a longer wait won't help. Three reproducer tests, each two-side verified (revert the relevant production change → test goes red with the exact zombie signal; restore → test goes green): - `boxlite::tests::health_check:: health_check_becomes_unhealthy_when_shim_killed` (existing test on main, now actually passes — the `PerTestBoxHome::drop` zombie-leak panic is the load-bearing failure; this PR adds a pin on `/proc/<pid>/status` not being `State: Z` so the regression signal is inline) - `boxlite::tests::health_check:: health_check_becomes_unhealthy_when_shim_sigtermed` (new — same shape, SIGTERM, covers k8s liveness initial signal / cgroup OOM initial / systemctl stop / plain kill) - `boxlite_cli::tests::serve_rm_force_zombie:: boxlite_serve_rm_force_active_box_no_zombie_left_in_proc` (new — spawns real `boxlite serve` subprocess, drives REST `DELETE /v1/boxes/<id>?force=true` via CLI `--url`, watches `/proc/<shim_pid>/status` from *outside* the daemon. The failing-path panic captures `PPid:` against the daemon's PID — the load-bearing diagnostic that proves init isn't going to clean this up because the daemon is still alive and still the parent.) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Three Makefile recipes chained commands with
;instead of&&/ status-accumulation. POSIX evaluatesif A; then B; C; fito the rc ofC, so an earlier failure was silently swallowed — CI sees green even when tests fail. Meta-bug: this masks every other bug, so it's the highest-leverage fix to land first.What was broken
test:unit:rust— a-p boxlitelib-test failure was hidden by a passing-p boxlite-sharedrun. Captured log in Should test:unit:rust / dist:python / dist:c chain commands with && (or accumulate $status) instead of ;? boxlite-ai/boxlite#550 shows674 passed; 29 failedfollowed by recipeexit=0.dist:pythonLinux branch —build-guest.shfailure masked bycibuildwheel's rc; wheel built against stale guest.dist:cper-platform branch — failing dynamic-libcpleft the static-libcpto "succeed", producing a half-stagedsdks/c/dist/with exit 0.Fix
test:unit:rust: status accumulation (|| rc=\$?) so both crates always run and final rc reflects any failure. Added--no-tests=warnto keepmake test:unit:rust FILTER=<test-only-in-one-crate>working (nextest's default--no-tests=failwould otherwise exit 4 from the crate with zero matches).dist:python/dist:c: switched intra-branch chains to&&so downstream commands can't run against a failed upstream..pre-commit-config.yaml: widenedlint-fixhook's files filter frommake/quality\\.mktomake/.*\\.mkso futuremake/test.mk/make/dist.mkedits go through the same autofix gate (the original narrow filter is exactly what let this class of issue land without local validation).Test plan
make test:unit:rust FILTER=container_layout(matches 1 in boxlite-shared, 0 in boxlite) → exit 0 — no false-fail from the empty crate.make test:unit:rust FILTER=can_create_process_in_new_user_nson an AppArmor-restricted host (the Question: shouldtest_can_create_process_in_new_user_nstreat AppArmor’s EACCES as an expected errno? boxlite-ai/boxlite#544 EACCES panic in-p boxlite) → recipe exit 100, make exit 2 — the failure correctly propagates instead of being swallowed by-p boxlite-sharedfinishing clean.bash -c 'rc=0; false || rc=\$?; true || rc=\$?; exit \$rc'→ exit 1.Closes boxlite-ai#550.
🤖 Generated with Claude Code