fix(make): propagate test failures instead of masking via POSIX `;` chaining (#550) by G4614 · Pull Request #1 · G4614/boxlite

G4614 · 2026-05-25T04:09:32Z

Summary

Three Makefile recipes chained commands with ; instead of && / status-accumulation. POSIX evaluates if A; then B; C; fi to the rc of C, so an earlier failure was silently swallowed — CI sees green even when tests fail. Meta-bug: this masks every other bug, so it's the highest-leverage fix to land first.

What was broken

test:unit:rust — a -p boxlite lib-test failure was hidden by a passing -p boxlite-shared run. Captured log in Should test:unit:rust / dist:python / dist:c chain commands with && (or accumulate $status) instead of ;? boxlite-ai/boxlite#550 shows 674 passed; 29 failed followed by recipe exit=0.
dist:python Linux branch — build-guest.sh failure masked by cibuildwheel's rc; wheel built against stale guest.
dist:c per-platform branch — failing dynamic-lib cp left the static-lib cp to "succeed", producing a half-staged sdks/c/dist/ with exit 0.

Fix

test:unit:rust: status accumulation (|| rc=\$?) so both crates always run and final rc reflects any failure. Added --no-tests=warn to keep make test:unit:rust FILTER=<test-only-in-one-crate> working (nextest's default --no-tests=fail would otherwise exit 4 from the crate with zero matches).
dist:python / dist:c: switched intra-branch chains to && so downstream commands can't run against a failed upstream.
.pre-commit-config.yaml: widened lint-fix hook's files filter from make/quality\\.mk to make/.*\\.mk so future make/test.mk / make/dist.mk edits go through the same autofix gate (the original narrow filter is exactly what let this class of issue land without local validation).

Test plan

Happy path: make test:unit:rust FILTER=container_layout (matches 1 in boxlite-shared, 0 in boxlite) → exit 0 — no false-fail from the empty crate.
Failure path: make test:unit:rust FILTER=can_create_process_in_new_user_ns on an AppArmor-restricted host (the Question: should test_can_create_process_in_new_user_ns treat AppArmor’s EACCES as an expected errno? boxlite-ai/boxlite#544 EACCES panic in -p boxlite) → recipe exit 100, make exit 2 — the failure correctly propagates instead of being swallowed by -p boxlite-shared finishing clean.
Shell logic sanity-checked with bash -c 'rc=0; false || rc=\$?; true || rc=\$?; exit \$rc' → exit 1.

Closes boxlite-ai#550.

🤖 Generated with Claude Code

…ining Fixes boxlite-ai#550. Three recipes chained multiple commands inside one shell with `;` instead of `&&` / status-accumulation. POSIX evaluates `if A; then B; C; fi` to the rc of C, so: - test:unit:rust: a `-p boxlite` lib-test failure was hidden by a passing `-p boxlite-shared` run — CI saw exit 0 with 29 failed tests in the log. Now uses status-accumulation (both crates always run, final rc reflects any failure), plus `--no-tests=warn` to avoid a false-fail when FILTER scopes to a test that only lives in one crate (nextest's default --no-tests=fail exits 4 with zero matches). - dist:python Linux branch: build-guest.sh failure was masked by cibuildwheel's rc. Switched to && so cibuildwheel can't proceed against a stale guest binary. - dist:c per-platform branches: a failing dynamic-lib cp left the static-lib cp to "succeed", producing a half-staged sdks/c/dist with exit 0. Switched to && so any cp failure aborts the platform branch. Also widens .pre-commit-config.yaml `lint-fix` hook's files filter from `make/quality\.mk` to `make/.*\.mk` so future make/test.mk / make/dist.mk edits run through the same fmt+lint autofix gate (the original narrow filter let this exact class of issue land without local validation). Verified locally: - happy path: make test:unit:rust FILTER=container_layout → exit 0 (matches 1 in boxlite-shared, 0 in boxlite — no false-fail) - failure path: make test:unit:rust FILTER=can_create_process_in_new_user_ns → recipe exit 100, make exit 2 (the boxlite-ai#544 EACCES panic in -p boxlite correctly propagates instead of being swallowed by -p boxlite-shared finishing clean) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

G4614 · 2026-05-25T04:12:35Z

Re-opening against upstream boxlite-ai/boxlite — wrong target repo.

The pre-fix GC scope was a single sweep of images/disk-images/*.ext4. On a long-running `boxlite serve` (or any host that crash-leaked artifacts without a runtime restart), the two biggest reclaim sources — orphan boxes/<id>/ dirs (per-box qcow2 overlay + logs + sockets) and orphan bases/*.qcow2 (snapshot/clone bases left when the `try_gc_base` cascade crashed) — were unreachable from `boxlite gc`. Cleanup ran only at startup or as a remove_box() side effect. Now `collect_garbage` runs three sweeps in a fixed order: 1. boxes/<id>/ — id absent from box DB, past 10-min mtime grace 2. bases/*.qcow2 — base_path absent from base_disk table, past grace 3. images/disk-images/*.ext4 — unchanged behavior, but pinning is built from live-only boxes (id in DB) so an orphan's stale qcow2 backing chain doesn't keep image cache pinned Empty `known` (DB read failure or fresh runtime with no boxes) short- circuits #1 and treats every on-disk box as live in #3 — safer than mistakenly deleting every box dir or every image-disk under a transient DB error. GcReport now carries box_dirs_removed / bases_removed alongside the existing disk_images_removed counters; `boxlite gc` prints all three. Tests: 3 new (sweeps_orphan_box_dirs_only, orphan_box_dir_does_not_pin_image_disk_image, sweeps_orphan_bases_only) plus existing 4 still green; CLI gc_cli updated for the new output shape. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…oxlite-ai#523) Two production code paths spawn shim subprocesses, send them SIGKILL, and never `waitpid` — so a dead shim sits in `/proc` as `State: Z` indefinitely, holding the PID slot and (in `boxlite serve` mode) blocking init from reparenting because the daemon is still its parent. **#1: async death (OOM / external SIGKILL/SIGTERM / internal panic).** The shim dies without anyone in the runtime asking it to. The only loop already polling shim liveness is health check (`box_impl.rs:: spawn_health_check`), but its `shim_died` branch did just one thing: mark `Stopped` / `Unhealthy` in the DB and break. No `waitpid`. The `is_process_alive` detector correctly treats `State: Z` as dead, so the branch *fires* on a zombie — it just doesn't reap. Fix: in the `shim_died` arm, `libc::waitpid(pid, _, WNOHANG)` in a short bounded retry loop. SIGKILL → zombie transition is not always synchronous with the wait queue: WNOHANG can briefly return 0 before the kernel posts SIGCHLD/wait status, even when `/proc` already shows `State: Z`. 500 ms ceiling with 10 ms backoff covers the race without ever stalling the health-check loop. The fix only takes effect if a health check task is *running*, so: - `AdvancedBoxOptions::default()` now spawns a health check with `HealthCheckOptions::default()` instead of `None`. Default interval 30 s, 60 s start period, 10 s ping timeout — the pre-existing defaults; no operator-visible cost beyond one ping per box per 30 s. - The decision is documented inline on the `health_check` field: health check is the *only* always-on shim-state watcher; explicit `health_check: None` opts out of both pings and zombie reaping (operator takes responsibility). **#2: `boxlite serve` + REST `rm --force`.** Daemon's `rt_impl:: remove_box(force=true)` calls `kill_process(pid)` (SIGKILL by PID, bypassing the canonical `stop()` path that *does* reap via `Child::wait()`). The handler returns immediately and the box is removed from the DB. The shim's `PPid` stays pointed at the daemon (still alive), so init can't help. Fix: after `kill_process`, run the same `WNOHANG`-with-deadline polling loop already used in `vmm/controller/shim.rs::ShimHandler:: stop`'s SIGTERM path. 2 s deadline matches the existing `GRACEFUL_SHUTDOWN_TIMEOUT_MS`; if SIGKILL doesn't yield a reapable state in 2 s, the kernel is wedged and a longer wait won't help. Three reproducer tests, each two-side verified (revert the relevant production change → test goes red with the exact zombie signal; restore → test goes green): - `boxlite::tests::health_check:: health_check_becomes_unhealthy_when_shim_killed` (existing test on main, now actually passes — the `PerTestBoxHome::drop` zombie-leak panic is the load-bearing failure; this PR adds a pin on `/proc/<pid>/status` not being `State: Z` so the regression signal is inline) - `boxlite::tests::health_check:: health_check_becomes_unhealthy_when_shim_sigtermed` (new — same shape, SIGTERM, covers k8s liveness initial signal / cgroup OOM initial / systemctl stop / plain kill) - `boxlite_cli::tests::serve_rm_force_zombie:: boxlite_serve_rm_force_active_box_no_zombie_left_in_proc` (new — spawns real `boxlite serve` subprocess, drives REST `DELETE /v1/boxes/<id>?force=true` via CLI `--url`, watches `/proc/<shim_pid>/status` from *outside* the daemon. The failing-path panic captures `PPid:` against the daemon's PID — the load-bearing diagnostic that proves init isn't going to clean this up because the daemon is still alive and still the parent.) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

G4614 closed this May 25, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(make): propagate test failures instead of masking via POSIX `;` chaining (#550)#1

fix(make): propagate test failures instead of masking via POSIX `;` chaining (#550)#1
G4614 wants to merge 1 commit into
mainfrom
fix/issue-550-makefile-chain-status

G4614 commented May 25, 2026

Uh oh!

G4614 commented May 25, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

G4614 commented May 25, 2026

Summary

What was broken

Fix

Test plan

Uh oh!

G4614 commented May 25, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant