Skip to content

test(test-utils): PerTestBoxHome::Drop fails loudly if a shim is still alive#604

Merged
DorianZheng merged 1 commit into
mainfrom
fix/test-utils-reap-detached-shims
May 27, 2026
Merged

test(test-utils): PerTestBoxHome::Drop fails loudly if a shim is still alive#604
DorianZheng merged 1 commit into
mainfrom
fix/test-utils-reap-detached-shims

Conversation

@DorianZheng

Copy link
Copy Markdown
Member

Summary

The original "8 orphan shims in /tmp" report came from tests that created detached boxes and let PerTestBoxHome go out of scope without explicitly stopping them. The detached shim survived (per the BoxOptions::detach contract from #601), TempDir cleaned up the directory, and the process kept running forever.

A reaper-on-drop would have masked the bug: the leak just moves from /tmp/ (visible) into the implicit Drop cleanup (invisible). Instead, Drop now asserts that no shim referenced by <home>/boxes/*/shim.pid is still alive. If one is, the test fails with the PID list and a hint pointing at the missing stop call.

A test panic is the right signal because the production stop paths (box.stop(), runtime.remove(), auto_remove=true) are what should reap the shim. A live shim at drop time means one of those paths did not run — either the test forgot the call or production has a gap. Either way the test author needs to look.

The std::thread::panicking() guard skips the check when another panic is already in progress so the original test failure stays the primary signal.

Test plan

  • Two-side verified by drop_panics_if_shim_still_alive:
    • With an empty Drop body: catch_unwind returns Ok(_) → test fails with "PerTestBoxHome::drop should have panicked on a live shim."
    • With the full Drop body: catch_unwind returns Err(_) → test passes.
  • Existing 42 test-utils tests still pass.

…l alive

The original "8 orphan shims in /tmp" report came from tests that
created detached boxes and let the home go out of scope without
explicitly stopping them. The detached shim survived (per the
BoxOptions::detach contract), TempDir cleaned up the directory, the
process kept running forever.

A reaper-on-drop would have masked the bug: the leak moves from /tmp
(visible) into the implicit Drop cleanup (invisible). Instead, drop
now asserts that no shim referenced by <home>/boxes/*/shim.pid is
still alive — if one is, the test fails with the PID list and a
hint about calling box.stop() / auto_remove=true / runtime.remove().

A test panic is the right signal because the production stop paths
(box.stop, runtime.remove, auto_remove) are what should reap the
shim. A live shim at drop time means one of those paths did not run
— either the test forgot to call them or production has a gap.
Either way the test author needs to look.

`std::thread::panicking()` guard skips the check when another panic
is in progress so the original test failure stays the primary signal.

Two-side verified by drop_panics_if_shim_still_alive: with an empty
Drop body the catch_unwind returns Ok and the assertion fires; with
the full Drop body it returns Err (panic) and the test passes.
@DorianZheng DorianZheng merged commit 2b9b545 into main May 27, 2026
11 checks passed
@DorianZheng DorianZheng deleted the fix/test-utils-reap-detached-shims branch May 27, 2026 12:54
DorianZheng pushed a commit that referenced this pull request May 27, 2026
…606)

#604 added `impl Drop for PerTestBoxHome` but missed updating
`structured_image_registries_validate_at_runtime_start` — the rest of
the file (line 14) and every other test file (e.g. shutdown.rs, 10+
sites) were updated to `home.path.clone()`, but line 40 still moves
`home.path` directly, triggering E0509 on a Drop-implementing type.

Reproduce on origin/main (2b9b545):

    $ cargo build --tests -p boxlite
    error[E0509]: cannot move out of type `PerTestBoxHome`,
                  which implements the `Drop` trait
      --> src/boxlite/tests/image_registries.rs:40:19

After this change `cargo build --tests -p boxlite` compiles clean.

Co-authored-by: ngolosong <ngolosong@ngolosongdeMacBook-Pro.local>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
G4614 added a commit to G4614/boxlite that referenced this pull request May 27, 2026
When an image declares EXPOSE ports (e.g., redis:6379, docker:dind:2375),
boxlite previously bound them 1:1 to the host. If the desired host port
was already in use, gvproxy_create would fail and the box could not
start — even though the user never explicitly requested that host port.

resolve_expose_host_port() now probes the desired host port; on collision
it falls back to an OS-allocated ephemeral host port. Guest still listens
on the EXPOSE port; only the host binding moves. User-provided -p
mappings are never remapped — they continue to fail fast via the
gvproxy initErr path (covered by gvproxy_port_conflict regression).

ResolvedPortMapping carries the final binding (host_port, guest_port,
protocol, source) and is persisted on BoxState.port_mappings, exposed
via inspect/list/REST so users can discover the resolved host port for
detached boxes.

Ported from upstream PR boxlite-ai#568 (only the EXPOSE auto-resolve subset;
the --privileged/dind machinery remains in boxlite-ai#568 for separate
consideration).

Drive-by: tests/image_registries.rs needed `.clone()` after boxlite-ai#604 added
Drop to PerTestBoxHome (one-char fix).

Regression: expose_auto_remap_falls_back_when_desired_port_busy.
Two-side verified — fails without this change ("listen tcp 0.0.0.0:6379:
bind: address already in use"), passes with it (inspect shows
host:<ephemeral> → guest:6379 source=auto_remap).
G4614 pushed a commit to G4614/boxlite that referenced this pull request May 28, 2026
…istries clone fix

InitReady/IntermediateReady race
================================
libcontainer 0.5.7 on crates.io has a parallel-scheduling race in
ContainerBuilder::build(). After `intermediate` forks `init` via
clone3(CLONE_PARENT), the two processes run concurrently and both
write readiness messages to the same MainReceiver socket. Under CPU
pressure or with tenant containers (less init setup work), the init
process can call main_sender.init_ready() before intermediate calls
main_sender.intermediate_ready(pid). The parent's
wait_for_intermediate_ready() then reads InitReady first and aborts
build with:

  spawn_failed: build failed: failed to create container:
  received unexpected message: InitReady, expected: IntermediateReady(0)

In this repo the race triggered ~60% of the time on the local box for
test_zygote_concurrent_with_crash (8 concurrent execs) and caused
intermittent failures across 16 integration tests (rust 7 + cli 5 +
node 1 + 3 snapshot_clone_deep) all sharing the box-exec startup path.

Upstream fix: youki PR #3504 (`fix(channel): split intermediate and
init readiness channels`) — separates the main receiver into
intermediate-owned and init-owned channels so message ordering is no
longer racy. Merged 2026-05-18, but no published release (youki
v0.6.0 was Feb 2026, predates the fix).

Patch to the merge commit until the next youki release ships. Bump
src/guest/Cargo.toml from 0.5.6 to 0.6 since the patched version has
that manifest. libcgroups (transitive dep) is picked up from the same
git workspace automatically.

Verified by running each previously-failing test individually with
patched binary:
- rust integration: copy, 2 execution_shutdown, 4 zygote_integration,
  3 snapshot_clone_deep — 10/10 pass
- cli integration: 5/5 pass (test_exec_python_command,
  test_exec_inherits_box_workdir, test_exec_on_running_box,
  test_pull_nonexistent, test_run_python_json_processing)
- node integration: copy.integration > round-trip a directory — pass
- test_zygote_concurrent_with_crash looped 8x post-patch — 8/8 pass
  (vs 3/5 pre-patch).

Remaining failures in those suites are independent bugs surfaced now
that this race no longer masks them:
- jailer SIGABRT in shim stdio (2 rust tests)
- C SDK test_simple_api assertion `code == Ok`

image_registries.rs clone fix
=============================
Compile error from PR boxlite-ai#604 (PerTestBoxHome::Drop fails loudly): the
new Drop impl makes `home_dir: home.path` (move) illegal. All other
test files were updated to home.path.clone(); image_registries.rs:40
was missed and breaks `cargo test --no-run -p boxlite --test '*'`
before any test runs. Add the .clone().

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
G4614 pushed a commit to G4614/boxlite that referenced this pull request May 28, 2026
…check

PR boxlite-ai#604 added a strict PerTestBoxHome::Drop assertion that fails if any
shim referenced by <home>/boxes/*/shim.pid is still alive when the home
drops. Two health_check tests tripped it:

- health_check_transitions_to_healthy_after_startup: the box stays
  healthy; RuntimeImpl::Drop's safety-net shutdown_sync skips boxes that
  aren't plainly Running and does not reliably stop a box with an active
  health-check actor before the home's leak check fires. Add
  `t.bx.stop().await.ok()`.

- health_check_becomes_unhealthy_when_shim_killed: the test kill -9's
  the shim, leaving a zombie (dead, unreaped while the runtime parent
  lives). The leak check counts zombies as live (kill(pid,0)==0), and
  the now-Stopped box is skipped by shutdown_sync, so nothing reaps it.
  Add `t.runtime.remove(box_id, true).await.ok()` which reaps the child.

Test-only change. The box already reports Stopped correctly; only the
test-side cleanup was missing. The production safety-net gap is noted
in the comments but left unchanged to avoid unprompted scope creep.

(image_registries.rs .clone() compile fix dropped from this branch —
upstream boxlite-ai#606 landed the same fix; rebased onto current main.)

Two-side verified:
  pre-fix:  2 tests run: 0 passed, 2 failed (both "PerTestBoxHome dropped with 1 live shim")
  post-fix: 2 tests run: 2 passed

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
G4614 added a commit to G4614/boxlite that referenced this pull request May 28, 2026
When an image declares EXPOSE ports (e.g., redis:6379, docker:dind:2375),
boxlite previously bound them 1:1 to the host. If the desired host port
was already in use, gvproxy_create would fail and the box could not
start — even though the user never explicitly requested that host port.

resolve_expose_host_port() now probes the desired host port; on collision
it falls back to an OS-allocated ephemeral host port. Guest still listens
on the EXPOSE port; only the host binding moves. User-provided -p
mappings are never remapped — they continue to fail fast via the
gvproxy initErr path (covered by gvproxy_port_conflict regression).

ResolvedPortMapping carries the final binding (host_port, guest_port,
protocol, source) and is persisted on BoxState.port_mappings, exposed
via inspect/list/REST so users can discover the resolved host port for
detached boxes.

Ported from upstream PR boxlite-ai#568 (only the EXPOSE auto-resolve subset;
the --privileged/dind machinery remains in boxlite-ai#568 for separate
consideration).

Drive-by: tests/image_registries.rs needed `.clone()` after boxlite-ai#604 added
Drop to PerTestBoxHome (one-char fix).

Regression: expose_auto_remap_falls_back_when_desired_port_busy.
Two-side verified — fails without this change ("listen tcp 0.0.0.0:6379:
bind: address already in use"), passes with it (inspect shows
host:<ephemeral> → guest:6379 source=auto_remap).
DorianZheng pushed a commit that referenced this pull request May 28, 2026
…eak a running shim (#615)

health_check_transitions_to_healthy_after_startup ends with the box still
Running. RuntimeImpl::Drop's safety-net shutdown_sync only sends SIGTERM and
returns without waiting, so the shim is still mid-shutdown when
PerTestBoxHome::Drop (#604) runs its leak assertion — `1 live shim(s)`.
Adding t.bx.stop() (which stops and waits) clears it. The reaper (#613) does
not help here: it reaps dead shims, not a still-running one.

Co-authored-by: Ubuntu <ubuntu@ip-172-31-20-149.ap-southeast-2.compute.internal>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
G4614 added a commit to G4614/boxlite that referenced this pull request Jun 1, 2026
When an image declares EXPOSE ports (e.g., redis:6379, docker:dind:2375),
boxlite previously bound them 1:1 to the host. If the desired host port
was already in use, gvproxy_create would fail and the box could not
start — even though the user never explicitly requested that host port.

resolve_expose_host_port() now probes the desired host port; on collision
it falls back to an OS-allocated ephemeral host port. Guest still listens
on the EXPOSE port; only the host binding moves. User-provided -p
mappings are never remapped — they continue to fail fast via the
gvproxy initErr path (covered by gvproxy_port_conflict regression).

ResolvedPortMapping carries the final binding (host_port, guest_port,
protocol, source) and is persisted on BoxState.port_mappings, exposed
via inspect/list/REST so users can discover the resolved host port for
detached boxes.

Ported from upstream PR boxlite-ai#568 (only the EXPOSE auto-resolve subset;
the --privileged/dind machinery remains in boxlite-ai#568 for separate
consideration).

Drive-by: tests/image_registries.rs needed `.clone()` after boxlite-ai#604 added
Drop to PerTestBoxHome (one-char fix).

Regression: expose_auto_remap_falls_back_when_desired_port_busy.
Two-side verified — fails without this change ("listen tcp 0.0.0.0:6379:
bind: address already in use"), passes with it (inspect shows
host:<ephemeral> → guest:6379 source=auto_remap).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant