Skip to content

fix(make): propagate test failures instead of masking via POSIX ; chaining (#550)#1

Closed
G4614 wants to merge 1 commit into
mainfrom
fix/issue-550-makefile-chain-status
Closed

fix(make): propagate test failures instead of masking via POSIX ; chaining (#550)#1
G4614 wants to merge 1 commit into
mainfrom
fix/issue-550-makefile-chain-status

Conversation

@G4614

@G4614 G4614 commented May 25, 2026

Copy link
Copy Markdown
Owner

Summary

Three Makefile recipes chained commands with ; instead of && / status-accumulation. POSIX evaluates if A; then B; C; fi to the rc of C, so an earlier failure was silently swallowed — CI sees green even when tests fail. Meta-bug: this masks every other bug, so it's the highest-leverage fix to land first.

What was broken

Fix

  • test:unit:rust: status accumulation (|| rc=\$?) so both crates always run and final rc reflects any failure. Added --no-tests=warn to keep make test:unit:rust FILTER=<test-only-in-one-crate> working (nextest's default --no-tests=fail would otherwise exit 4 from the crate with zero matches).
  • dist:python / dist:c: switched intra-branch chains to && so downstream commands can't run against a failed upstream.
  • .pre-commit-config.yaml: widened lint-fix hook's files filter from make/quality\\.mk to make/.*\\.mk so future make/test.mk / make/dist.mk edits go through the same autofix gate (the original narrow filter is exactly what let this class of issue land without local validation).

Test plan

  • Happy path: make test:unit:rust FILTER=container_layout (matches 1 in boxlite-shared, 0 in boxlite) → exit 0 — no false-fail from the empty crate.
  • Failure path: make test:unit:rust FILTER=can_create_process_in_new_user_ns on an AppArmor-restricted host (the Question: should test_can_create_process_in_new_user_ns treat AppArmor’s EACCES as an expected errno? boxlite-ai/boxlite#544 EACCES panic in -p boxlite) → recipe exit 100, make exit 2 — the failure correctly propagates instead of being swallowed by -p boxlite-shared finishing clean.
  • Shell logic sanity-checked with bash -c 'rc=0; false || rc=\$?; true || rc=\$?; exit \$rc' → exit 1.

Closes boxlite-ai#550.

🤖 Generated with Claude Code

…ining

Fixes boxlite-ai#550. Three recipes chained multiple commands inside one shell
with `;` instead of `&&` / status-accumulation. POSIX evaluates
`if A; then B; C; fi` to the rc of C, so:

  - test:unit:rust: a `-p boxlite` lib-test failure was hidden by a
    passing `-p boxlite-shared` run — CI saw exit 0 with 29 failed tests
    in the log. Now uses status-accumulation (both crates always run,
    final rc reflects any failure), plus `--no-tests=warn` to avoid a
    false-fail when FILTER scopes to a test that only lives in one crate
    (nextest's default --no-tests=fail exits 4 with zero matches).
  - dist:python Linux branch: build-guest.sh failure was masked by
    cibuildwheel's rc. Switched to && so cibuildwheel can't proceed
    against a stale guest binary.
  - dist:c per-platform branches: a failing dynamic-lib cp left the
    static-lib cp to "succeed", producing a half-staged sdks/c/dist with
    exit 0. Switched to && so any cp failure aborts the platform branch.

Also widens .pre-commit-config.yaml `lint-fix` hook's files filter from
`make/quality\.mk` to `make/.*\.mk` so future make/test.mk / make/dist.mk
edits run through the same fmt+lint autofix gate (the original narrow
filter let this exact class of issue land without local validation).

Verified locally:
  - happy path: make test:unit:rust FILTER=container_layout → exit 0
    (matches 1 in boxlite-shared, 0 in boxlite — no false-fail)
  - failure path: make test:unit:rust FILTER=can_create_process_in_new_user_ns
    → recipe exit 100, make exit 2 (the boxlite-ai#544 EACCES panic in -p boxlite
    correctly propagates instead of being swallowed by -p boxlite-shared
    finishing clean)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@G4614

G4614 commented May 25, 2026

Copy link
Copy Markdown
Owner Author

Re-opening against upstream boxlite-ai/boxlite — wrong target repo.

@G4614 G4614 closed this May 25, 2026
G4614 added a commit that referenced this pull request Jun 1, 2026
The pre-fix GC scope was a single sweep of images/disk-images/*.ext4.
On a long-running `boxlite serve` (or any host that crash-leaked artifacts
without a runtime restart), the two biggest reclaim sources — orphan
boxes/<id>/ dirs (per-box qcow2 overlay + logs + sockets) and orphan
bases/*.qcow2 (snapshot/clone bases left when the `try_gc_base` cascade
crashed) — were unreachable from `boxlite gc`. Cleanup ran only at
startup or as a remove_box() side effect.

Now `collect_garbage` runs three sweeps in a fixed order:

  1. boxes/<id>/ — id absent from box DB, past 10-min mtime grace
  2. bases/*.qcow2 — base_path absent from base_disk table, past grace
  3. images/disk-images/*.ext4 — unchanged behavior, but pinning is built
     from live-only boxes (id in DB) so an orphan's stale qcow2 backing
     chain doesn't keep image cache pinned

Empty `known` (DB read failure or fresh runtime with no boxes) short-
circuits #1 and treats every on-disk box as live in #3 — safer than
mistakenly deleting every box dir or every image-disk under a transient
DB error.

GcReport now carries box_dirs_removed / bases_removed alongside the
existing disk_images_removed counters; `boxlite gc` prints all three.

Tests: 3 new (sweeps_orphan_box_dirs_only,
orphan_box_dir_does_not_pin_image_disk_image, sweeps_orphan_bases_only)
plus existing 4 still green; CLI gc_cli updated for the new output shape.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
G4614 added a commit that referenced this pull request Jun 2, 2026
…oxlite-ai#523)

Two production code paths spawn shim subprocesses, send them SIGKILL,
and never `waitpid` — so a dead shim sits in `/proc` as `State: Z`
indefinitely, holding the PID slot and (in `boxlite serve` mode)
blocking init from reparenting because the daemon is still its parent.

**#1: async death (OOM / external SIGKILL/SIGTERM / internal panic).**
The shim dies without anyone in the runtime asking it to. The only
loop already polling shim liveness is health check (`box_impl.rs::
spawn_health_check`), but its `shim_died` branch did just one thing:
mark `Stopped` / `Unhealthy` in the DB and break. No `waitpid`. The
`is_process_alive` detector correctly treats `State: Z` as dead, so
the branch *fires* on a zombie — it just doesn't reap.

Fix: in the `shim_died` arm, `libc::waitpid(pid, _, WNOHANG)` in a
short bounded retry loop. SIGKILL → zombie transition is not always
synchronous with the wait queue: WNOHANG can briefly return 0 before
the kernel posts SIGCHLD/wait status, even when `/proc` already shows
`State: Z`. 500 ms ceiling with 10 ms backoff covers the race without
ever stalling the health-check loop.

The fix only takes effect if a health check task is *running*, so:

  - `AdvancedBoxOptions::default()` now spawns a health check with
    `HealthCheckOptions::default()` instead of `None`. Default
    interval 30 s, 60 s start period, 10 s ping timeout — the
    pre-existing defaults; no operator-visible cost beyond one ping
    per box per 30 s.
  - The decision is documented inline on the `health_check` field:
    health check is the *only* always-on shim-state watcher; explicit
    `health_check: None` opts out of both pings and zombie reaping
    (operator takes responsibility).

**#2: `boxlite serve` + REST `rm --force`.** Daemon's `rt_impl::
remove_box(force=true)` calls `kill_process(pid)` (SIGKILL by PID,
bypassing the canonical `stop()` path that *does* reap via
`Child::wait()`). The handler returns immediately and the box is
removed from the DB. The shim's `PPid` stays pointed at the daemon
(still alive), so init can't help.

Fix: after `kill_process`, run the same `WNOHANG`-with-deadline
polling loop already used in `vmm/controller/shim.rs::ShimHandler::
stop`'s SIGTERM path. 2 s deadline matches the existing
`GRACEFUL_SHUTDOWN_TIMEOUT_MS`; if SIGKILL doesn't yield a reapable
state in 2 s, the kernel is wedged and a longer wait won't help.

Three reproducer tests, each two-side verified (revert the relevant
production change → test goes red with the exact zombie signal;
restore → test goes green):

  - `boxlite::tests::health_check::
       health_check_becomes_unhealthy_when_shim_killed`
       (existing test on main, now actually passes — the
       `PerTestBoxHome::drop` zombie-leak panic is the load-bearing
       failure; this PR adds a pin on `/proc/<pid>/status` not being
       `State: Z` so the regression signal is inline)
  - `boxlite::tests::health_check::
       health_check_becomes_unhealthy_when_shim_sigtermed`
       (new — same shape, SIGTERM, covers k8s liveness initial
       signal / cgroup OOM initial / systemctl stop / plain kill)
  - `boxlite_cli::tests::serve_rm_force_zombie::
       boxlite_serve_rm_force_active_box_no_zombie_left_in_proc`
       (new — spawns real `boxlite serve` subprocess, drives REST
       `DELETE /v1/boxes/<id>?force=true` via CLI `--url`, watches
       `/proc/<shim_pid>/status` from *outside* the daemon. The
       failing-path panic captures `PPid:` against the daemon's
       PID — the load-bearing diagnostic that proves init isn't
       going to clean this up because the daemon is still alive
       and still the parent.)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Should test:unit:rust / dist:python / dist:c chain commands with && (or accumulate $status) instead of ;?

1 participant