feat(kernel): build-time kernel selection with lean/net variants #2

Closed
G4614 wants to merge 2 commits into main from feat/kernel-selection

G4614 commented May 26, 2026

Summary

  • Add kernel-lean (default) and kernel-net cargo features to control which kernel blob is embedded at build time
  • --kernel net runtime flag selects the net kernel in dual-mode builds
  • Net kernel adds ~50 netfilter/nf_tables/bridge/NET_NS modules on top of lean, required by dockerd/dind
  • Pre-built net blob auto-downloaded from GitHub releases (same pipeline as lean kernel)
  • BOXLITE_LIBKRUNFW_NET_PATH env var for developers building the kernel locally
  • Includes BoxCleanup RAII test guard extraction to test-utils

Test plan

  • Unit: kernel_net_without_blob_returns_clear_error (feature mismatch reports a clear error)
  • Unit: kernel_default_succeeds_without_net_blob (lean-only build works)
  • Unit: CLI flag parsing (4 tests: net/default/lean/none)
  • E2E: kernel_net_has_iptables (a --kernel net box has iptables)
  • E2E: kernel_lean_no_iptables (a lean box does NOT have iptables)
  • aarch64: net kernel blob SHA256 pending (x86_64 only for now)

🤖 Generated with Claude Code

G4614 and others added 2 commits May 26, 2026 13:23
RAII guard that SIGKILLs detached boxes on Drop. Scans /proc/*/fd
for FDs referencing the box's working directory — the only reliable
fingerprint after the shim daemonizes and removes its PID file.

Runs on panic too, preventing test leakage of libkrun VMs.
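
A minimal sketch of the scan-and-kill idea described above, not the actual `BoxCleanup` code. The names `pids_with_fds_under` and `WorkdirKillGuard` are hypothetical, and shelling out to the `kill` utility is an assumption made only to keep the example dependency-free; the real guard may deliver the signal differently. It is Linux-only because it relies on `/proc`.

```rust
use std::fs;
use std::path::{Path, PathBuf};
use std::process::Command;

/// Return the PIDs that hold at least one open file descriptor under `dir`.
/// `dir` should be an absolute, canonical path, because the symlinks in
/// /proc/<pid>/fd resolve to absolute paths.
fn pids_with_fds_under(dir: &Path) -> Vec<u32> {
    let mut pids = Vec::new();
    let Ok(entries) = fs::read_dir("/proc") else { return pids };
    for entry in entries.flatten() {
        // Only numeric directory names are processes.
        let name = entry.file_name();
        let Some(pid) = name.to_str().and_then(|s| s.parse::<u32>().ok()) else { continue };
        // Processes we are not allowed to inspect are skipped silently.
        let Ok(fds) = fs::read_dir(entry.path().join("fd")) else { continue };
        let holds_fd = fds.flatten().any(|fd| {
            fs::read_link(fd.path())
                .map(|target| target.starts_with(dir))
                .unwrap_or(false)
        });
        if holds_fd {
            pids.push(pid);
        }
    }
    pids
}

/// Hypothetical guard: on drop, SIGKILL every other process that still has
/// a file open under `workdir`. Drop also runs while a panic unwinds.
struct WorkdirKillGuard {
    workdir: PathBuf,
}

impl Drop for WorkdirKillGuard {
    fn drop(&mut self) {
        for pid in pids_with_fds_under(&self.workdir) {
            if pid == std::process::id() {
                continue; // never kill the test process itself
            }
            let _ = Command::new("kill").arg("-KILL").arg(pid.to_string()).status();
        }
    }
}
```

Matching on open file descriptors rather than a PID file is what lets the guard find a process after it has daemonized, at the cost of one pass over `/proc` per drop.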

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Kernel blob selection is now split across two layers:

**Build time** (cargo features):
  cargo build                                    → lean only (default)
  cargo build --features kernel-net              → net only
  cargo build --features kernel-lean,kernel-net  → both (dual mode)

**Runtime** (CLI flag, only meaningful in dual mode):
  boxlite run alpine                  → uses default (lean) kernel
  boxlite run --kernel net alpine     → uses net kernel

In a single-kernel build the --kernel flag cannot change which kernel
runs. Requesting a kernel that was not built in (e.g. --kernel net on
a lean-only build) produces a clear error pointing to the missing
feature flag.
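
The two-layer selection can be sketched as a pure function. This is an illustration under stated assumptions, not the code in this PR: `Kernel`, `Embedded`, and `select_kernel` are hypothetical names, the booleans stand in for `cfg!(feature = "kernel-lean")` and `cfg!(feature = "kernel-net")`, and the behavior of a net-only build (no flag selects net, `--kernel lean` is an error) is a guess at the intended semantics.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Kernel {
    Lean,
    Net,
}

/// Which blobs were compiled in. In a real build these would come from
/// cargo features; here they are plain booleans so the logic is testable.
#[derive(Debug, Clone, Copy)]
struct Embedded {
    lean: bool,
    net: bool,
}

/// Resolve the kernel to boot from the build-time set and the optional
/// runtime request (`None` means the flag was not passed).
fn select_kernel(embedded: Embedded, requested: Option<Kernel>) -> Result<Kernel, String> {
    match (requested, embedded.lean, embedded.net) {
        (_, false, false) => Err("no kernel blob was embedded in this build".to_string()),
        // No flag: prefer lean, fall back to net in a net-only build (assumption).
        (None, true, _) => Ok(Kernel::Lean),
        (None, false, true) => Ok(Kernel::Net),
        // Explicit request that the build can satisfy.
        (Some(Kernel::Lean), true, _) => Ok(Kernel::Lean),
        (Some(Kernel::Net), _, true) => Ok(Kernel::Net),
        // Mismatch: name the missing feature in the error.
        (Some(Kernel::Net), _, false) => Err(
            "net kernel requested but not embedded; rebuild with --features kernel-net".to_string(),
        ),
        (Some(Kernel::Lean), false, _) => Err(
            "lean kernel requested but not embedded; rebuild with --features kernel-lean".to_string(),
        ),
    }
}
```

Keeping the decision free of `cfg!` calls means every combination of build and flag can be covered by ordinary unit tests in a single build.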

The net kernel adds ~50 modules (netfilter/nf_tables/bridge/NET_NS)
on top of the lean kernel. Required by dockerd/dind workloads that
need iptables and bridge networking inside the VM.

Build infra: kconfig overlays, build-libkrunfw-net.sh, auto-download
from GitHub releases (same pipeline as lean kernel). Developers can
override with BOXLITE_LIBKRUNFW_NET_PATH for locally built blobs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@G4614 G4614 closed this May 26, 2026
G4614 pushed a commit that referenced this pull request May 28, 2026
…ives

Pins the property whose absence got PR boxlite-ai#520's global waitpid(-1) reaper
reverted (Issue boxlite-ai#523, criterion #2): a child the reaper never registered
must be left for its owner to wait(). Two-side verified — injecting a
global waitpid(-1) into the sweep makes the owner's wait() return ECHILD
("No child processes") and the test fail; the scoped sweep preserves
exit code 42.
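
The property can be illustrated with a small scoped reaper. This is a sketch under assumptions, not the reaper from the referenced PR: `ScopedReaper` is a hypothetical name, and it uses the standard library's `Child::try_wait` (a non-blocking wait on one specific child) instead of raw `waitpid` calls.

```rust
use std::process::Child;

/// Hypothetical reaper that only ever waits on children it was given.
/// A global `waitpid(-1)` would also collect children owned by other
/// code, whose later wait() would then fail with ECHILD.
struct ScopedReaper {
    registered: Vec<Child>,
}

impl ScopedReaper {
    fn new() -> Self {
        ScopedReaper { registered: Vec::new() }
    }

    fn register(&mut self, child: Child) {
        self.registered.push(child);
    }

    /// Reap registered children that have exited and return how many
    /// entries were dropped from the list. Children that are still
    /// running stay registered; an entry whose wait errors is dropped too.
    fn sweep(&mut self) -> usize {
        let before = self.registered.len();
        self.registered
            .retain_mut(|child| matches!(child.try_wait(), Ok(None)));
        before - self.registered.len()
    }
}
```

Because `sweep` touches only the handles it holds, a child spawned elsewhere keeps its exit status until its owner calls `wait()`.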

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
G4614 pushed a commit that referenced this pull request Jun 1, 2026
G4614 pushed a commit that referenced this pull request Jun 2, 2026
G4614 added a commit that referenced this pull request Jun 2, 2026
…oxlite-ai#523)

Two production code paths spawn shim subprocesses, send them SIGKILL,
and never `waitpid` — so a dead shim sits in `/proc` as `State: Z`
indefinitely, holding the PID slot and (in `boxlite serve` mode)
blocking init from reparenting because the daemon is still its parent.

**#1: async death (OOM / external SIGKILL/SIGTERM / internal panic).**
The shim dies without anyone in the runtime asking it to. The only
loop already polling shim liveness is health check (`box_impl.rs::
spawn_health_check`), but its `shim_died` branch did just one thing:
mark `Stopped` / `Unhealthy` in the DB and break. No `waitpid`. The
`is_process_alive` detector correctly treats `State: Z` as dead, so
the branch *fires* on a zombie — it just doesn't reap.
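
A detector of this kind can be sketched as a parser over the text of `/proc/<pid>/status`. The function names are hypothetical and this is not the body of `is_process_alive`; treating a missing status file as "not alive" is an assumption of the sketch.

```rust
/// Extract the one-letter process state from the contents of
/// /proc/<pid>/status, e.g. 'Z' from the line "State:\tZ (zombie)".
fn proc_state(status: &str) -> Option<char> {
    status
        .lines()
        .find_map(|line| line.strip_prefix("State:"))
        .and_then(|rest| rest.trim_start().chars().next())
}

/// Liveness as seen from /proc: a zombie still has a /proc entry but is
/// counted as dead. `None` models a status file that could not be read.
fn is_alive_from_status(status: Option<&str>) -> bool {
    match status.and_then(proc_state) {
        Some('Z') | None => false,
        Some(_) => true,
    }
}
```

Note that this only detects the zombie. Removing it from the process table still requires the parent to wait on it.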

Fix: in the `shim_died` arm, `libc::waitpid(pid, _, WNOHANG)` in a
short bounded retry loop. SIGKILL → zombie transition is not always
synchronous with the wait queue: WNOHANG can briefly return 0 before
the kernel posts SIGCHLD/wait status, even when `/proc` already shows
`State: Z`. 500 ms ceiling with 10 ms backoff covers the race without
ever stalling the health-check loop.
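
The bounded retry can be sketched as follows. This is an illustration, not the code in the `shim_died` arm: `reap_bounded` is a hypothetical name, and it uses the standard library's `Child::try_wait` as a stand-in for the `libc::waitpid(pid, _, WNOHANG)` call the commit describes, which assumes a `Child` handle is available.

```rust
use std::process::{Child, ExitStatus};
use std::thread::sleep;
use std::time::{Duration, Instant};

/// Poll a child with a non-blocking wait until it is reaped or `ceiling`
/// has elapsed, sleeping `backoff` between attempts. Returns the exit
/// status if the child was reaped in time, otherwise `None`.
fn reap_bounded(child: &mut Child, ceiling: Duration, backoff: Duration) -> Option<ExitStatus> {
    let deadline = Instant::now() + ceiling;
    loop {
        match child.try_wait() {
            // Wait status was available: the child is reaped.
            Ok(Some(status)) => return Some(status),
            // Not reapable yet; retry while time remains.
            Ok(None) if Instant::now() < deadline => sleep(backoff),
            // Deadline passed or the wait itself failed: give up.
            _ => return None,
        }
    }
}
```

With the values from the commit message the call would be `reap_bounded(&mut child, Duration::from_millis(500), Duration::from_millis(10))`, which caps the extra latency added to one health-check iteration at roughly half a second.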

The fix only takes effect if a health check task is *running*, so:

  - `AdvancedBoxOptions::default()` now spawns a health check with
    `HealthCheckOptions::default()` instead of `None`. Default
    interval 30 s, 60 s start period, 10 s ping timeout — the
    pre-existing defaults; no operator-visible cost beyond one ping
    per box per 30 s.
  - The decision is documented inline on the `health_check` field:
    health check is the *only* always-on shim-state watcher; explicit
    `health_check: None` opts out of both pings and zombie reaping
    (operator takes responsibility).
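
The change in defaults can be sketched like this. The type names and the `health_check` field come from the commit message; the other field names (`interval`, `start_period`, `timeout`) and the helper `reaps_zombies` are hypothetical, and the real structs almost certainly carry more fields.

```rust
use std::time::Duration;

#[derive(Debug, Clone, PartialEq)]
struct HealthCheckOptions {
    interval: Duration,     // time between pings
    start_period: Duration, // grace period after box start
    timeout: Duration,      // how long to wait for one ping
}

impl Default for HealthCheckOptions {
    fn default() -> Self {
        HealthCheckOptions {
            interval: Duration::from_secs(30),
            start_period: Duration::from_secs(60),
            timeout: Duration::from_secs(10),
        }
    }
}

#[derive(Debug, Clone, PartialEq)]
struct AdvancedBoxOptions {
    /// `None` opts out of both pings and zombie reaping.
    health_check: Option<HealthCheckOptions>,
}

impl Default for AdvancedBoxOptions {
    fn default() -> Self {
        // Previously `None`; now the watcher is on unless disabled.
        AdvancedBoxOptions {
            health_check: Some(HealthCheckOptions::default()),
        }
    }
}

/// Reaping rides on the health-check task, so it is active exactly
/// when a health check is configured.
fn reaps_zombies(opts: &AdvancedBoxOptions) -> bool {
    opts.health_check.is_some()
}
```

An operator who wants the old behavior would construct the options with `health_check: None` explicitly and accept that dead shims are no longer reaped by this path.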

**#2: `boxlite serve` + REST `rm --force`.** Daemon's `rt_impl::
remove_box(force=true)` calls `kill_process(pid)` (SIGKILL by PID,
bypassing the canonical `stop()` path that *does* reap via
`Child::wait()`). The handler returns immediately and the box is
removed from the DB. The shim's `PPid` stays pointed at the daemon
(still alive), so init can't help.

Fix: after `kill_process`, run the same `WNOHANG`-with-deadline
polling loop already used in `vmm/controller/shim.rs::ShimHandler::
stop`'s SIGTERM path. 2 s deadline matches the existing
`GRACEFUL_SHUTDOWN_TIMEOUT_MS`; if SIGKILL doesn't yield a reapable
state in 2 s, the kernel is wedged and a longer wait won't help.

Three reproducer tests, each two-side verified (revert the relevant
production change → test goes red with the exact zombie signal;
restore → test goes green):

  - `boxlite::tests::health_check::
       health_check_becomes_unhealthy_when_shim_killed`
       (existing test on main, now actually passes — the
       `PerTestBoxHome::drop` zombie-leak panic is the load-bearing
       failure; this PR adds a pin on `/proc/<pid>/status` not being
       `State: Z` so the regression signal is inline)
  - `boxlite::tests::health_check::
       health_check_becomes_unhealthy_when_shim_sigtermed`
       (new — same shape, SIGTERM, covers k8s liveness initial
       signal / cgroup OOM initial / systemctl stop / plain kill)
  - `boxlite_cli::tests::serve_rm_force_zombie::
       boxlite_serve_rm_force_active_box_no_zombie_left_in_proc`
       (new — spawns real `boxlite serve` subprocess, drives REST
       `DELETE /v1/boxes/<id>?force=true` via CLI `--url`, watches
       `/proc/<shim_pid>/status` from *outside* the daemon. The
       failing-path panic captures `PPid:` against the daemon's
       PID — the load-bearing diagnostic that proves init isn't
       going to clean this up because the daemon is still alive
       and still the parent.)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>