Skip to content

test(health_check): explicit stop() for the healthy-box test#615

Merged
DorianZheng merged 1 commit into
boxlite-ai:mainfrom
G4614:fix/health-check-test1-explicit-stop
May 28, 2026
Merged

test(health_check): explicit stop() for the healthy-box test#615
DorianZheng merged 1 commit into
boxlite-ai:mainfrom
G4614:fix/health-check-test1-explicit-stop

Conversation

@G4614

@G4614 G4614 commented May 28, 2026

Copy link
Copy Markdown
Contributor

After #604 , the healthy box test should be stopped manually after tasks finished

Test plan

--features krun,gvproxy, nextest --profile vm.

health_check_transitions_to_healthy_after_startup result
without stop() FAILPerTestBoxHome dropped with 1 live shim(s): [<pid>] (shim still Running at the leak check)
with stop() PASS

…eak a running shim

health_check_transitions_to_healthy_after_startup ends with the box still
Running. RuntimeImpl::Drop's safety-net shutdown_sync only sends SIGTERM and
returns without waiting, so the shim is still mid-shutdown when
PerTestBoxHome::Drop (boxlite-ai#604) runs its leak assertion — `1 live shim(s)`.
Adding t.bx.stop() (which stops and waits) clears it. The reaper (boxlite-ai#613) does
not help here: it reaps dead shims, not a still-running one.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@G4614 G4614 marked this pull request as ready for review May 28, 2026 09:54
@DorianZheng DorianZheng merged commit d5b93da into boxlite-ai:main May 28, 2026
21 checks passed
@G4614 G4614 deleted the fix/health-check-test1-explicit-stop branch May 28, 2026 10:42
pull Bot pushed a commit to CrazyForks/boxlite that referenced this pull request Jun 12, 2026
…sun_path) (boxlite-ai#742)

## Problem

A deep `BOXLITE_HOME` (e.g. a repo-scoped `<repo>/.apps-local/boxlite`)
pushes `boxes/<id>/sockets/net.sock-krun.sock` past the 104-byte
`sun_path` limit on macOS. libkrun's bind fails **silently**: gvproxy
comes up, devices configure, but no vCPU is ever created — empty
console, 30s guest-ready timeout, zero diagnostics.
(`net.sock-krun.sock` is derived by vendored libkrun by appending
`-krun.sock` to the gvproxy socket path —
`vendor/libkrun/src/devices/src/virtio/net/unixgram.rs:79`.)

The repo already contained an OVS-style `SocketShortener` for exactly
this — but it was **dead code** (zero production callers), and its
threshold was sized against `ready.sock` instead of the true longest
name, leaving a 9-char dead zone where our sockets bind but krun's
silently fails. Two further construction sites
(`filenames::unix_socket_path`, a raw `join("ready.sock")`) bypassed the
layout getters entirely and **persisted** the resulting long paths into
the DB via `BoxConfig`.

## Design

`BoxSockets` (net/socket_path.rs) becomes the single authority for box
socket paths:

- **Unconditional short binding**: every socket binds/dials through
`/tmp/bl-<uid>/<box_id>` → `<box_home>/sockets/`. One code path for
every box — no length-conditional behavior, no dead zones, and the
`sun_path` budget constant is demoted to a sanity check. Precedent: OVS
`shorten_name_via_symlink()` (`lib/socket-util-unix.c`), containerd's
always-short `SocketAddress()` (`runtime/v2/shim/util_unix.go`).
- **tmux-style parent dir**: `/tmp/bl-<uid>` is created 0700 and
owner-verified (permissions repaired, foreign ownership refused loudly)
— predictable symlink names can't be squatted by other local users.
- **Paths derive from identity, never persisted**:
`BoxConfig.transport`/`ready_socket_path` fields are deleted;
`transport()`/`ready_transport()` derive from `(box_home, id)` at point
of use. Legacy DB rows with the old fields still deserialize
(unknown-field tolerance, tested).
- **Lifecycle**: created at `prepare()`; every removal path unlinks it;
runtime startup sweeps dangling leftovers (crash/temp-home runs);
guest-connect re-ensures — self-healing against the macOS periodic /tmp
cleaner. Upgrade invariant: host and shim may know a socket under
different strings; correctness is resolve-equality (same inode), never
string equality.
- **Seatbelt**: `canonicalize_or_original` keeps symlink leaves literal
(parent-canonicalized) — full resolution emitted policy rules for the
*target* and the sandbox denied the bind at the symlink path. The socket
policy entries (real dir RW, binding symlink RW, parent RO) are emitted
deterministically, gated only on the real dir existing.

## Second commit: pre-existing test leak unblocked

`health_check_becomes_unhealthy_when_shim_killed` fails the harness leak
check on main too (reproduced at the merge-base 577f4a5 with this
branch's src flipped back): a live shim survives the test's kill even
though the box reports Stopped — `BoxInfo.pid` is evidently not the
process that wrote `shim.pid`. Since the pre-push hook runs this suite
for any rust change, the test now tears down through the production
remove path (same class of fix as boxlite-ai#615 applied to the sibling test),
with a `FIXME(health)` for the owners on the pid question.

## Tests

- 15 `socket_path` unit tests incl. the **dead-zone regression** (real
dir where `ready.sock` fits but `net.sock-krun.sock` doesn't) and a
**vendor-pin test** that fails if vendored libkrun stops deriving
`-krun.sock`.
- Sandbox regression guards:
`test_canonicalize_keeps_symlink_leaf_literal`,
`test_build_path_access_socket_policy_entries`.
- `BoxConfig` legacy-row deserialization test; `remove_box`
unlinks-symlink assertion; layout-level long-home test.

## Verification

- 721 boxlite lib tests green; `make lint:rust` clean; pre-push VM
integration suite **229/229**.
- Long-home boot probe (the original repro): VM boots;
`/tmp/bl-<uid>/<id>` present, parent 0700; gone after remove.
- Reaper simulation: deleted the entire `/tmp/bl-<uid>` tree under a
running box → fresh process reattach + exec succeeds (self-heal).
- 11-service infra-local integration round-trip passes (24s warm).

## Notes for reviewers

- Rolling back to an older binary after this change: old binaries can't
read new (field-less) config rows; forward migration is seamless.
- The infra-local `.apps-local` workspace consolidation that *surfaced*
this bug is a separate changeset/PR.

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

* **New Features**
* Improved socket path handling using symlink-based binding to overcome
filesystem path length constraints.
* Enhanced box recovery with automatic cleanup of stale socket
resources.

* **Bug Fixes**
* Fixed handling of legacy persisted configurations containing outdated
socket path references.
* Improved sandbox policy enforcement for socket-related filesystem
permissions.
* Enhanced symlink path canonicalization to preserve literal symlink
references.

* **Improvements**
* Strengthened box cleanup processes to remove socket binding artifacts.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants