test(test-utils): PerTestBoxHome::Drop fails loudly if a shim is still alive#604
Merged
Merged
Conversation
…l alive The original "8 orphan shims in /tmp" report came from tests that created detached boxes and let the home go out of scope without explicitly stopping them. The detached shim survived (per the BoxOptions::detach contract), TempDir cleaned up the directory, the process kept running forever. A reaper-on-drop would have masked the bug: the leak moves from /tmp (visible) into the implicit Drop cleanup (invisible). Instead, drop now asserts that no shim referenced by <home>/boxes/*/shim.pid is still alive — if one is, the test fails with the PID list and a hint about calling box.stop() / auto_remove=true / runtime.remove(). A test panic is the right signal because the production stop paths (box.stop, runtime.remove, auto_remove) are what should reap the shim. A live shim at drop time means one of those paths did not run — either the test forgot to call them or production has a gap. Either way the test author needs to look. `std::thread::panicking()` guard skips the check when another panic is in progress so the original test failure stays the primary signal. Two-side verified by drop_panics_if_shim_still_alive: with an empty Drop body the catch_unwind returns Ok and the assertion fires; with the full Drop body it returns Err (panic) and the test passes.
DorianZheng
pushed a commit
that referenced
this pull request
May 27, 2026
…606) #604 added `impl Drop for PerTestBoxHome` but missed updating `structured_image_registries_validate_at_runtime_start` — the rest of the file (line 14) and every other test file (e.g. shutdown.rs, 10+ sites) were updated to `home.path.clone()`, but line 40 still moves `home.path` directly, triggering E0509 on a Drop-implementing type. Reproduce on origin/main (2b9b545): $ cargo build --tests -p boxlite error[E0509]: cannot move out of type `PerTestBoxHome`, which implements the `Drop` trait --> src/boxlite/tests/image_registries.rs:40:19 After this change `cargo build --tests -p boxlite` compiles clean. Co-authored-by: ngolosong <ngolosong@ngolosongdeMacBook-Pro.local> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
G4614
added a commit
to G4614/boxlite
that referenced
this pull request
May 27, 2026
When an image declares EXPOSE ports (e.g., redis:6379, docker:dind:2375), boxlite previously bound them 1:1 to the host. If the desired host port was already in use, gvproxy_create would fail and the box could not start — even though the user never explicitly requested that host port. resolve_expose_host_port() now probes the desired host port; on collision it falls back to an OS-allocated ephemeral host port. Guest still listens on the EXPOSE port; only the host binding moves. User-provided -p mappings are never remapped — they continue to fail fast via the gvproxy initErr path (covered by gvproxy_port_conflict regression). ResolvedPortMapping carries the final binding (host_port, guest_port, protocol, source) and is persisted on BoxState.port_mappings, exposed via inspect/list/REST so users can discover the resolved host port for detached boxes. Ported from upstream PR boxlite-ai#568 (only the EXPOSE auto-resolve subset; the --privileged/dind machinery remains in boxlite-ai#568 for separate consideration). Drive-by: tests/image_registries.rs needed `.clone()` after boxlite-ai#604 added Drop to PerTestBoxHome (one-char fix). Regression: expose_auto_remap_falls_back_when_desired_port_busy. Two-side verified — fails without this change ("listen tcp 0.0.0.0:6379: bind: address already in use"), passes with it (inspect shows host:<ephemeral> → guest:6379 source=auto_remap).
G4614
pushed a commit
to G4614/boxlite
that referenced
this pull request
May 28, 2026
…istries clone fix InitReady/IntermediateReady race ================================ libcontainer 0.5.7 on crates.io has a parallel-scheduling race in ContainerBuilder::build(). After `intermediate` forks `init` via clone3(CLONE_PARENT), the two processes run concurrently and both write readiness messages to the same MainReceiver socket. Under CPU pressure or with tenant containers (less init setup work), the init process can call main_sender.init_ready() before intermediate calls main_sender.intermediate_ready(pid). The parent's wait_for_intermediate_ready() then reads InitReady first and aborts build with: spawn_failed: build failed: failed to create container: received unexpected message: InitReady, expected: IntermediateReady(0) In this repo the race triggered ~60% of the time on the local box for test_zygote_concurrent_with_crash (8 concurrent execs) and caused intermittent failures across 16 integration tests (rust 7 + cli 5 + node 1 + 3 snapshot_clone_deep) all sharing the box-exec startup path. Upstream fix: youki PR #3504 (`fix(channel): split intermediate and init readiness channels`) — separates the main receiver into intermediate-owned and init-owned channels so message ordering is no longer racy. Merged 2026-05-18, but no published release (youki v0.6.0 was Feb 2026, predates the fix). Patch to the merge commit until the next youki release ships. Bump src/guest/Cargo.toml from 0.5.6 to 0.6 since the patched version has that manifest. libcgroups (transitive dep) is picked up from the same git workspace automatically. Verified by running each previously-failing test individually with patched binary: - rust integration: copy, 2 execution_shutdown, 4 zygote_integration, 3 snapshot_clone_deep — 10/10 pass - cli integration: 5/5 pass (test_exec_python_command, test_exec_inherits_box_workdir, test_exec_on_running_box, test_pull_nonexistent, test_run_python_json_processing) - node integration: copy.integration > round-trip a directory — pass - test_zygote_concurrent_with_crash looped 8x post-patch — 8/8 pass (vs 3/5 pre-patch). Remaining failures in those suites are independent bugs surfaced now that this race no longer masks them: - jailer SIGABRT in shim stdio (2 rust tests) - C SDK test_simple_api assertion `code == Ok` image_registries.rs clone fix ============================= Compile error from PR boxlite-ai#604 (PerTestBoxHome::Drop fails loudly): the new Drop impl makes `home_dir: home.path` (move) illegal. All other test files were updated to home.path.clone(); image_registries.rs:40 was missed and breaks `cargo test --no-run -p boxlite --test '*'` before any test runs. Add the .clone(). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
G4614
pushed a commit
to G4614/boxlite
that referenced
this pull request
May 28, 2026
…check PR boxlite-ai#604 added a strict PerTestBoxHome::Drop assertion that fails if any shim referenced by <home>/boxes/*/shim.pid is still alive when the home drops. Two health_check tests tripped it: - health_check_transitions_to_healthy_after_startup: the box stays healthy; RuntimeImpl::Drop's safety-net shutdown_sync skips boxes that aren't plainly Running and does not reliably stop a box with an active health-check actor before the home's leak check fires. Add `t.bx.stop().await.ok()`. - health_check_becomes_unhealthy_when_shim_killed: the test kill -9's the shim, leaving a zombie (dead, unreaped while the runtime parent lives). The leak check counts zombies as live (kill(pid,0)==0), and the now-Stopped box is skipped by shutdown_sync, so nothing reaps it. Add `t.runtime.remove(box_id, true).await.ok()` which reaps the child. Test-only change. The box already reports Stopped correctly; only the test-side cleanup was missing. The production safety-net gap is noted in the comments but left unchanged to avoid unprompted scope creep. (image_registries.rs .clone() compile fix dropped from this branch — upstream boxlite-ai#606 landed the same fix; rebased onto current main.) Two-side verified: pre-fix: 2 tests run: 0 passed, 2 failed (both "PerTestBoxHome dropped with 1 live shim") post-fix: 2 tests run: 2 passed Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
G4614
added a commit
to G4614/boxlite
that referenced
this pull request
May 28, 2026
When an image declares EXPOSE ports (e.g., redis:6379, docker:dind:2375), boxlite previously bound them 1:1 to the host. If the desired host port was already in use, gvproxy_create would fail and the box could not start — even though the user never explicitly requested that host port. resolve_expose_host_port() now probes the desired host port; on collision it falls back to an OS-allocated ephemeral host port. Guest still listens on the EXPOSE port; only the host binding moves. User-provided -p mappings are never remapped — they continue to fail fast via the gvproxy initErr path (covered by gvproxy_port_conflict regression). ResolvedPortMapping carries the final binding (host_port, guest_port, protocol, source) and is persisted on BoxState.port_mappings, exposed via inspect/list/REST so users can discover the resolved host port for detached boxes. Ported from upstream PR boxlite-ai#568 (only the EXPOSE auto-resolve subset; the --privileged/dind machinery remains in boxlite-ai#568 for separate consideration). Drive-by: tests/image_registries.rs needed `.clone()` after boxlite-ai#604 added Drop to PerTestBoxHome (one-char fix). Regression: expose_auto_remap_falls_back_when_desired_port_busy. Two-side verified — fails without this change ("listen tcp 0.0.0.0:6379: bind: address already in use"), passes with it (inspect shows host:<ephemeral> → guest:6379 source=auto_remap).
This was referenced May 28, 2026
DorianZheng
pushed a commit
that referenced
this pull request
May 28, 2026
…eak a running shim (#615) health_check_transitions_to_healthy_after_startup ends with the box still Running. RuntimeImpl::Drop's safety-net shutdown_sync only sends SIGTERM and returns without waiting, so the shim is still mid-shutdown when PerTestBoxHome::Drop (#604) runs its leak assertion — `1 live shim(s)`. Adding t.bx.stop() (which stops and waits) clears it. The reaper (#613) does not help here: it reaps dead shims, not a still-running one. Co-authored-by: Ubuntu <ubuntu@ip-172-31-20-149.ap-southeast-2.compute.internal> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
G4614
added a commit
to G4614/boxlite
that referenced
this pull request
Jun 1, 2026
When an image declares EXPOSE ports (e.g., redis:6379, docker:dind:2375), boxlite previously bound them 1:1 to the host. If the desired host port was already in use, gvproxy_create would fail and the box could not start — even though the user never explicitly requested that host port. resolve_expose_host_port() now probes the desired host port; on collision it falls back to an OS-allocated ephemeral host port. Guest still listens on the EXPOSE port; only the host binding moves. User-provided -p mappings are never remapped — they continue to fail fast via the gvproxy initErr path (covered by gvproxy_port_conflict regression). ResolvedPortMapping carries the final binding (host_port, guest_port, protocol, source) and is persisted on BoxState.port_mappings, exposed via inspect/list/REST so users can discover the resolved host port for detached boxes. Ported from upstream PR boxlite-ai#568 (only the EXPOSE auto-resolve subset; the --privileged/dind machinery remains in boxlite-ai#568 for separate consideration). Drive-by: tests/image_registries.rs needed `.clone()` after boxlite-ai#604 added Drop to PerTestBoxHome (one-char fix). Regression: expose_auto_remap_falls_back_when_desired_port_busy. Two-side verified — fails without this change ("listen tcp 0.0.0.0:6379: bind: address already in use"), passes with it (inspect shows host:<ephemeral> → guest:6379 source=auto_remap).
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
The original "8 orphan shims in /tmp" report came from tests that created detached boxes and let
PerTestBoxHomego out of scope without explicitly stopping them. The detached shim survived (per theBoxOptions::detachcontract from #601),TempDircleaned up the directory, and the process kept running forever.A reaper-on-drop would have masked the bug: the leak just moves from
/tmp/(visible) into the implicitDropcleanup (invisible). Instead,Dropnow asserts that no shim referenced by<home>/boxes/*/shim.pidis still alive. If one is, the test fails with the PID list and a hint pointing at the missing stop call.A test panic is the right signal because the production stop paths (
box.stop(),runtime.remove(),auto_remove=true) are what should reap the shim. A live shim at drop time means one of those paths did not run — either the test forgot the call or production has a gap. Either way the test author needs to look.The
std::thread::panicking()guard skips the check when another panic is already in progress so the original test failure stays the primary signal.Test plan
drop_panics_if_shim_still_alive:Dropbody:catch_unwindreturnsOk(_)→ test fails with "PerTestBoxHome::drop should have panicked on a live shim."Dropbody:catch_unwindreturnsErr(_)→ test passes.