test(health_check): explicit stop() for the healthy-box test#615
Merged
DorianZheng merged 1 commit intoMay 28, 2026
Merged
Conversation
…eak a running shim health_check_transitions_to_healthy_after_startup ends with the box still Running. RuntimeImpl::Drop's safety-net shutdown_sync only sends SIGTERM and returns without waiting, so the shim is still mid-shutdown when PerTestBoxHome::Drop (boxlite-ai#604) runs its leak assertion — `1 live shim(s)`. Adding t.bx.stop() (which stops and waits) clears it. The reaper (boxlite-ai#613) does not help here: it reaps dead shims, not a still-running one. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
pull Bot
pushed a commit
to CrazyForks/boxlite
that referenced
this pull request
Jun 12, 2026
…sun_path) (boxlite-ai#742) ## Problem A deep `BOXLITE_HOME` (e.g. a repo-scoped `<repo>/.apps-local/boxlite`) pushes `boxes/<id>/sockets/net.sock-krun.sock` past the 104-byte `sun_path` limit on macOS. libkrun's bind fails **silently**: gvproxy comes up, devices configure, but no vCPU is ever created — empty console, 30s guest-ready timeout, zero diagnostics. (`net.sock-krun.sock` is derived by vendored libkrun by appending `-krun.sock` to the gvproxy socket path — `vendor/libkrun/src/devices/src/virtio/net/unixgram.rs:79`.) The repo already contained an OVS-style `SocketShortener` for exactly this — but it was **dead code** (zero production callers), and its threshold was sized against `ready.sock` instead of the true longest name, leaving a 9-char dead zone where our sockets bind but krun's silently fails. Two further construction sites (`filenames::unix_socket_path`, a raw `join("ready.sock")`) bypassed the layout getters entirely and **persisted** the resulting long paths into the DB via `BoxConfig`. ## Design `BoxSockets` (net/socket_path.rs) becomes the single authority for box socket paths: - **Unconditional short binding**: every socket binds/dials through `/tmp/bl-<uid>/<box_id>` → `<box_home>/sockets/`. One code path for every box — no length-conditional behavior, no dead zones, and the `sun_path` budget constant is demoted to a sanity check. Precedent: OVS `shorten_name_via_symlink()` (`lib/socket-util-unix.c`), containerd's always-short `SocketAddress()` (`runtime/v2/shim/util_unix.go`). - **tmux-style parent dir**: `/tmp/bl-<uid>` is created 0700 and owner-verified (permissions repaired, foreign ownership refused loudly) — predictable symlink names can't be squatted by other local users. - **Paths derive from identity, never persisted**: `BoxConfig.transport`/`ready_socket_path` fields are deleted; `transport()`/`ready_transport()` derive from `(box_home, id)` at point of use. Legacy DB rows with the old fields still deserialize (unknown-field tolerance, tested). - **Lifecycle**: created at `prepare()`; every removal path unlinks it; runtime startup sweeps dangling leftovers (crash/temp-home runs); guest-connect re-ensures — self-healing against the macOS periodic /tmp cleaner. Upgrade invariant: host and shim may know a socket under different strings; correctness is resolve-equality (same inode), never string equality. - **Seatbelt**: `canonicalize_or_original` keeps symlink leaves literal (parent-canonicalized) — full resolution emitted policy rules for the *target* and the sandbox denied the bind at the symlink path. The socket policy entries (real dir RW, binding symlink RW, parent RO) are emitted deterministically, gated only on the real dir existing. ## Second commit: pre-existing test leak unblocked `health_check_becomes_unhealthy_when_shim_killed` fails the harness leak check on main too (reproduced at the merge-base 577f4a5 with this branch's src flipped back): a live shim survives the test's kill even though the box reports Stopped — `BoxInfo.pid` is evidently not the process that wrote `shim.pid`. Since the pre-push hook runs this suite for any rust change, the test now tears down through the production remove path (same class of fix as boxlite-ai#615 applied to the sibling test), with a `FIXME(health)` for the owners on the pid question. ## Tests - 15 `socket_path` unit tests incl. the **dead-zone regression** (real dir where `ready.sock` fits but `net.sock-krun.sock` doesn't) and a **vendor-pin test** that fails if vendored libkrun stops deriving `-krun.sock`. - Sandbox regression guards: `test_canonicalize_keeps_symlink_leaf_literal`, `test_build_path_access_socket_policy_entries`. - `BoxConfig` legacy-row deserialization test; `remove_box` unlinks-symlink assertion; layout-level long-home test. ## Verification - 721 boxlite lib tests green; `make lint:rust` clean; pre-push VM integration suite **229/229**. - Long-home boot probe (the original repro): VM boots; `/tmp/bl-<uid>/<id>` present, parent 0700; gone after remove. - Reaper simulation: deleted the entire `/tmp/bl-<uid>` tree under a running box → fresh process reattach + exec succeeds (self-heal). - 11-service infra-local integration round-trip passes (24s warm). ## Notes for reviewers - Rolling back to an older binary after this change: old binaries can't read new (field-less) config rows; forward migration is seamless. - The infra-local `.apps-local` workspace consolidation that *surfaced* this bug is a separate changeset/PR. <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **New Features** * Improved socket path handling using symlink-based binding to overcome filesystem path length constraints. * Enhanced box recovery with automatic cleanup of stale socket resources. * **Bug Fixes** * Fixed handling of legacy persisted configurations containing outdated socket path references. * Improved sandbox policy enforcement for socket-related filesystem permissions. * Enhanced symlink path canonicalization to preserve literal symlink references. * **Improvements** * Strengthened box cleanup processes to remove socket binding artifacts. <!-- end of auto-generated comment: release notes by coderabbit.ai -->
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
After #604 , the healthy box test should be stopped manually after tasks finished
Test plan
--features krun,gvproxy, nextest--profile vm.health_check_transitions_to_healthy_after_startupstop()PerTestBoxHome dropped with 1 live shim(s): [<pid>](shim still Running at the leak check)stop()