Skip to content

fix(runtime): BoxSockets — bind box sockets via short /tmp symlinks (sun_path)#742

Merged
law-chain-hot merged 2 commits into
mainfrom
fix/boxsockets-sun-path
Jun 12, 2026
Merged

fix(runtime): BoxSockets — bind box sockets via short /tmp symlinks (sun_path)#742
law-chain-hot merged 2 commits into
mainfrom
fix/boxsockets-sun-path

Conversation

@DorianZheng

@DorianZheng DorianZheng commented Jun 12, 2026

Copy link
Copy Markdown
Member

Problem

A deep BOXLITE_HOME (e.g. a repo-scoped <repo>/.apps-local/boxlite) pushes boxes/<id>/sockets/net.sock-krun.sock past the 104-byte sun_path limit on macOS. libkrun's bind fails silently: gvproxy comes up, devices configure, but no vCPU is ever created — empty console, 30s guest-ready timeout, zero diagnostics. (net.sock-krun.sock is derived by vendored libkrun by appending -krun.sock to the gvproxy socket path — vendor/libkrun/src/devices/src/virtio/net/unixgram.rs:79.)

The repo already contained an OVS-style SocketShortener for exactly this — but it was dead code (zero production callers), and its threshold was sized against ready.sock instead of the true longest name, leaving a 9-char dead zone where our sockets bind but krun's silently fails. Two further construction sites (filenames::unix_socket_path, a raw join("ready.sock")) bypassed the layout getters entirely and persisted the resulting long paths into the DB via BoxConfig.

Design

BoxSockets (net/socket_path.rs) becomes the single authority for box socket paths:

  • Unconditional short binding: every socket binds/dials through /tmp/bl-<uid>/<box_id><box_home>/sockets/. One code path for every box — no length-conditional behavior, no dead zones, and the sun_path budget constant is demoted to a sanity check. Precedent: OVS shorten_name_via_symlink() (lib/socket-util-unix.c), containerd's always-short SocketAddress() (runtime/v2/shim/util_unix.go).
  • tmux-style parent dir: /tmp/bl-<uid> is created 0700 and owner-verified (permissions repaired, foreign ownership refused loudly) — predictable symlink names can't be squatted by other local users.
  • Paths derive from identity, never persisted: BoxConfig.transport/ready_socket_path fields are deleted; transport()/ready_transport() derive from (box_home, id) at point of use. Legacy DB rows with the old fields still deserialize (unknown-field tolerance, tested).
  • Lifecycle: created at prepare(); every removal path unlinks it; runtime startup sweeps dangling leftovers (crash/temp-home runs); guest-connect re-ensures — self-healing against the macOS periodic /tmp cleaner. Upgrade invariant: host and shim may know a socket under different strings; correctness is resolve-equality (same inode), never string equality.
  • Seatbelt: canonicalize_or_original keeps symlink leaves literal (parent-canonicalized) — full resolution emitted policy rules for the target and the sandbox denied the bind at the symlink path. The socket policy entries (real dir RW, binding symlink RW, parent RO) are emitted deterministically, gated only on the real dir existing.

Second commit: pre-existing test leak unblocked

health_check_becomes_unhealthy_when_shim_killed fails the harness leak check on main too (reproduced at the merge-base 577f4a5 with this branch's src flipped back): a live shim survives the test's kill even though the box reports Stopped — BoxInfo.pid is evidently not the process that wrote shim.pid. Since the pre-push hook runs this suite for any rust change, the test now tears down through the production remove path (same class of fix as #615 applied to the sibling test), with a FIXME(health) for the owners on the pid question.

Tests

  • 15 socket_path unit tests incl. the dead-zone regression (real dir where ready.sock fits but net.sock-krun.sock doesn't) and a vendor-pin test that fails if vendored libkrun stops deriving -krun.sock.
  • Sandbox regression guards: test_canonicalize_keeps_symlink_leaf_literal, test_build_path_access_socket_policy_entries.
  • BoxConfig legacy-row deserialization test; remove_box unlinks-symlink assertion; layout-level long-home test.

Verification

  • 721 boxlite lib tests green; make lint:rust clean; pre-push VM integration suite 229/229.
  • Long-home boot probe (the original repro): VM boots; /tmp/bl-<uid>/<id> present, parent 0700; gone after remove.
  • Reaper simulation: deleted the entire /tmp/bl-<uid> tree under a running box → fresh process reattach + exec succeeds (self-heal).
  • 11-service infra-local integration round-trip passes (24s warm).

Notes for reviewers

  • Rolling back to an older binary after this change: old binaries can't read new (field-less) config rows; forward migration is seamless.
  • The infra-local .apps-local workspace consolidation that surfaced this bug is a separate changeset/PR.

Summary by CodeRabbit

  • New Features

    • Improved socket path handling using symlink-based binding to overcome filesystem path length constraints.
    • Enhanced box recovery with automatic cleanup of stale socket resources.
  • Bug Fixes

    • Fixed handling of legacy persisted configurations containing outdated socket path references.
    • Improved sandbox policy enforcement for socket-related filesystem permissions.
    • Enhanced symlink path canonicalization to preserve literal symlink references.
  • Improvements

    • Strengthened box cleanup processes to remove socket binding artifacts.

…sun_path)

A deep BOXLITE_HOME pushes boxes/<id>/sockets/net.sock-krun.sock past the
104-byte sun_path limit. libkrun's bind fails silently: no vCPU, empty
console, and a 30s guest-ready timeout with zero console bytes. The
existing SocketShortener was dead code, and its threshold was sized
against ready.sock instead of krun's derived net.sock-krun.sock — a
9-char dead zone where our sockets bind but krun's silently fails.

Replace it with BoxSockets, the single authority for box socket paths:

- Unconditional short binding: every socket binds/dials through
  /tmp/bl-<uid>/<box_id> -> <box_home>/sockets (OVS symlink indirection;
  containerd-style always-short addresses). One code path, no
  length-conditional behavior, no dead zones.
- tmux-style 0700 owner-verified per-uid parent dir — predictable names
  can't be squatted by other local users (loud refusal, never silent).
- BoxConfig stops persisting transport/ready_socket_path; both derive
  from (box_home, id) at point of use, so stored paths can never go
  stale. Legacy DB rows with the old fields still deserialize (tested).
- Every box-removal path unlinks the binding symlink; runtime startup
  sweeps dangling ones (crash/temp-home leftovers); guest-connect
  re-ensures the symlink, self-healing the macOS /tmp reaper case.
- Seatbelt profile generation keeps symlink leaves literal when
  canonicalizing — full resolution emitted rules for the target and the
  sandbox denied the bind at the symlink path. Both sandbox behaviors
  carry regression tests.
- Vendor-pin test fails if vendored libkrun stops deriving -krun.sock,
  so the sanity-check constant cannot drift silently.

Verified: 721 boxlite lib tests; long-home boot probe (the original
repro); reaper-simulation reattach+exec; 11-service infra-local
integration round-trip.
health_check_becomes_unhealthy_when_shim_killed leaves a live shim at
PerTestBoxHome drop — pre-existing on main (reproduces identically with
src/boxlite flipped to parent commit 577f4a5): the pid the test kills
(BoxInfo.pid) is evidently not the process that wrote shim.pid, so a
live shim survives even though the box reports Stopped. This fails the
harness leak check and blocks every rust push via the pre-push
test:changed hook.

Tear down through the production remove path per the leak check's own
contract (same class of fix as #615 applied to the sibling healthy-box
test). FIXME(health) left for the owners to reconcile BoxInfo.pid with
shim.pid and identify the surviving process.
@DorianZheng DorianZheng requested a review from a team as a code owner June 12, 2026 03:20
@coderabbitai

coderabbitai Bot commented Jun 12, 2026

Copy link
Copy Markdown

Review Change Stack

📝 Walkthrough

Walkthrough

This PR refactors socket path management in Boxlite by replacing the SocketShortener pattern with a new BoxSockets per-box abstraction. Socket paths are now derived from box_home and box identity at runtime rather than persisted, and symlink lifecycle (ensure, remove, sweep_stale) is integrated throughout provisioning, initialization, and cleanup paths.

Changes

Socket Path Refactoring

Layer / File(s) Summary
BoxSockets abstraction and socket path authority
src/boxlite/src/net/socket_path.rs
New BoxSockets struct replaces SocketShortener with explicit socket path constructors (box.sock, ready.sock, net.sock) using /tmp/bl-{uid}/{box_id} symlink binding. Implements idempotent ensure() with sun_path validation, remove() cleanup, policy_paths() for sandbox access, and sweep_stale() for stale symlink removal. Removed old SocketShortener, resolve_socket_path(), and cleanup_stale_symlinks() APIs.
BoxConfig API refactoring
src/boxlite/src/litebox/config.rs
BoxConfig removes persisted transport and ready_socket_path fields, exposes box_home, and adds sockets() method to derive BoxSockets from (id, box_home). New transport() and ready_transport() accessors construct Unix-socket Transport values from derived sockets at call time.
Filesystem layout integration
src/boxlite/src/runtime/layout.rs
BoxFilesystemLayout adds sockets() method, updates socket path getters to delegate to BoxSockets, calls self.sockets().ensure() during prepare() to create binding symlink before socket binding, and calls self.sockets().remove() during cleanup(). Tests verify deep-path symlink shortening and multi-box socket collision regression.
Runtime socket lifecycle management
src/boxlite/src/runtime/rt_impl.rs
Runtime startup sweeps stale socket symlinks. Box provision and init omit transport/ready_socket_path from BoxConfig; box removal (both DB-backed and in-memory) unlinks socket binding symlinks. Provisioning failure and recovery cleanup remove symlinks. Orphan cleanup explicitly removes socket binding symlinks before directory deletion. Tests updated and removal assertions added.
Jailer sandbox policy and path canonicalization
src/boxlite/src/jailer/mod.rs, src/boxlite/src/jailer/sandbox/seatbelt.rs
build_path_access removes sockets_dir from initial writable-dir loop, conditionally enumerates layout.sockets().policy_paths() for sandbox rules (real dir writable, binding symlink writable, parent read-only). canonicalize_or_original special-cases symlink leafs to keep them literal. New regression tests verify socket policy entries and symlink-leaf behavior.
Guest connection and init updates
src/boxlite/src/litebox/init/tasks/guest_connect.rs, src/boxlite/src/litebox/init/types.rs
GuestConnectTask::run calls layout.sockets().ensure() for self-healing, uses ctx.config.transport() and ctx.config.ready_transport() accessors. Init types test removes Transport import and updates BoxConfig construction.
Deprecated socket-path helpers cleanup
src/boxlite/src/runtime/constants.rs
Remove public helper functions box_home() and unix_socket_path() from filenames module.
Test suite updates across modules
src/boxlite/src/db/boxes.rs, src/boxlite/src/litebox/box_impl.rs, src/boxlite/src/litebox/manager.rs, src/boxlite/src/runtime/types.rs, src/boxlite/tests/health_check.rs
All test modules updated to match new BoxConfig initialization (omit socket-path fields, set box_home). New regression test validates legacy persisted socket-path fields are ignored on deserialization. Manager and box_impl test helpers expanded with temp filesystem/DB setup. Health check test adds explicit teardown to avoid shim process leaks.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

Suggested reviewers

  • law-chain-hot

Poem

🐰 Socket paths now dance in /tmp/bl, symlinks binding true,
No more persisted strings to muddy the morning dew—
BoxSockets ensures they're swept when boxes disappear,
Derived at every call, keeping the runtime crystal clear! ✨

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately describes the primary change: introducing BoxSockets to bind box sockets via short /tmp symlinks to address the sun_path length limit issue. It is specific, concise, and directly reflects the main purpose of the changeset.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/boxsockets-sun-path

Comment @coderabbitai help to get the list of available commands and usage tips.

@law-chain-hot law-chain-hot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@law-chain-hot law-chain-hot added this pull request to the merge queue Jun 12, 2026

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/boxlite/src/jailer/mod.rs`:
- Around line 207-210: The policy generation incorrectly relies on on-disk state
for the writable binding directory: call sockets.ensure() (or otherwise ensure
layout.sockets_dir() binding symlink exists) immediately before building the
sandbox context so the subsequent call to layout.sockets().policy_paths() yields
directory-aware entries; alternatively, modify PathAccess creation in that code
path to carry explicit directory intent (e.g., mark the binding_dir entry as a
directory/writable) so seatbelt::build_dynamic_* receives a directory grant
regardless of /tmp state. Ensure the change targets the code that iterates
layout.sockets().policy_paths() and produces PathAccess so descendant socket
grants for net.sock/ready.sock are emitted as subpath-capable directory rules.

In `@src/boxlite/src/net/socket_path.rs`:
- Around line 155-177: Replace string equality checks against self.real_dir()
with an inode/dev comparison: when you see
std::fs::read_link(&symlink_path).ok().as_deref() == Some(self.real_dir()) (in
the initial Ok(meta) branch) and the racing_winner_ok check that reads the link,
instead call std::fs::read_link(&symlink_path) to get the target PathBuf and
then std::fs::metadata(&target).ok() and std::fs::metadata(self.real_dir()).ok()
and compare their device and inode via std::os::unix::fs::MetadataExt::dev() and
::ino(); treat them as equal if both metadata calls succeed and dev/ino match.
Update both the "already correct" early return and the AlreadyExists race check
to use this dev/ino comparison and fall back to the original string check only
if metadata info is unavailable.

In `@src/boxlite/tests/health_check.rs`:
- Around line 153-162: The teardown using t.runtime.remove(box_id.as_str(),
true) can mask the PID-mismatch bug because remove only kills when
state.status.is_active() or state.pid.is_some(), but the health-check path
clears state.pid and sets BoxStatus::Stopped; update the test to assert the shim
process is actually dead before calling remove (or split into a separate
regression): capture the shim pid (from shim.pid or BoxInfo.pid) after starting
the box, verify the process identity is not running (e.g., check /proc/<pid> or
attempt a signal) and assert that it is dead when info.status ==
BoxStatus::Stopped, and only then perform t.runtime.remove(...); alternatively
create a dedicated test that reproduces PID mismatch and asserts the surviving
shim is killed by remove when appropriate, referencing remove, BoxInfo,
BoxStatus::Stopped, state.pid, and kill_process to locate the relevant code
paths.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: e1773de5-7dcc-458a-8f01-8686e56eb24f

📥 Commits

Reviewing files that changed from the base of the PR and between ba7458d and 5aaa7d9.

📒 Files selected for processing (14)
  • src/boxlite/src/db/boxes.rs
  • src/boxlite/src/jailer/mod.rs
  • src/boxlite/src/jailer/sandbox/seatbelt.rs
  • src/boxlite/src/litebox/box_impl.rs
  • src/boxlite/src/litebox/config.rs
  • src/boxlite/src/litebox/init/tasks/guest_connect.rs
  • src/boxlite/src/litebox/init/types.rs
  • src/boxlite/src/litebox/manager.rs
  • src/boxlite/src/net/socket_path.rs
  • src/boxlite/src/runtime/constants.rs
  • src/boxlite/src/runtime/layout.rs
  • src/boxlite/src/runtime/rt_impl.rs
  • src/boxlite/src/runtime/types.rs
  • src/boxlite/tests/health_check.rs
💤 Files with no reviewable changes (5)
  • src/boxlite/src/litebox/manager.rs
  • src/boxlite/src/litebox/box_impl.rs
  • src/boxlite/src/runtime/constants.rs
  • src/boxlite/src/litebox/init/types.rs
  • src/boxlite/src/runtime/types.rs

Comment on lines +207 to +210
if layout.sockets_dir().exists() {
for (path, writable) in layout.sockets().policy_paths() {
paths.push(PathAccess { path, writable });
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Re-ensure the binding symlink before emitting its PathAccess.

Line 207 only gates on layout.sockets_dir().exists(), but the writable binding_dir entry from Line 208 is still interpreted later from live filesystem state. If /tmp/bl-{uid}/{box_id} has been cleaned up when the Seatbelt policy is built, seatbelt::build_dynamic_* will treat that path as non-directory and emit only a literal rule, so descendant binds/connects under .../net.sock / .../ready.sock lose the required subpath grant. The new test here already has to call sockets.ensure() to make the binding path behave like a directory. Please either ensure() immediately before building the sandbox context or carry explicit directory intent in PathAccess so the policy no longer depends on /tmp state at generation time.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/boxlite/src/jailer/mod.rs` around lines 207 - 210, The policy generation
incorrectly relies on on-disk state for the writable binding directory: call
sockets.ensure() (or otherwise ensure layout.sockets_dir() binding symlink
exists) immediately before building the sandbox context so the subsequent call
to layout.sockets().policy_paths() yields directory-aware entries;
alternatively, modify PathAccess creation in that code path to carry explicit
directory intent (e.g., mark the binding_dir entry as a directory/writable) so
seatbelt::build_dynamic_* receives a directory grant regardless of /tmp state.
Ensure the change targets the code that iterates layout.sockets().policy_paths()
and produces PathAccess so descendant socket grants for net.sock/ready.sock are
emitted as subpath-capable directory rules.

Comment on lines 155 to +177
match std::fs::symlink_metadata(&symlink_path) {
Ok(meta) if meta.file_type().is_symlink() => {
// Stale symlink — safe to replace
if std::fs::read_link(&symlink_path).ok().as_deref() == Some(self.real_dir()) {
return Ok(()); // already correct
}
// Points elsewhere (stale from a previous box) — replace.
let _ = std::fs::remove_file(&symlink_path);
}
Ok(_) => {
// Regular file or directory — refuse to overwrite
// Regular file or directory — refuse to overwrite.
return Err(BoxliteError::Internal(format!(
"{} exists but is not a symlink — refusing to overwrite",
symlink_path.display(),
)));
}
Err(_) => {
// Doesn't exist — good
}
Err(_) => {} // doesn't exist — good
}

std::os::unix::fs::symlink(sockets_dir, &symlink_path).map_err(|e| {
BoxliteError::Storage(format!(
"Failed to create socket symlink {} → {}: {}",
symlink_path.display(),
sockets_dir.display(),
e,
))
})?;
if let Err(e) = std::os::unix::fs::symlink(self.real_dir(), &symlink_path) {
// Tolerate a concurrent ensure() for the same box as long as the
// winner points at our target.
let racing_winner_ok = e.kind() == std::io::ErrorKind::AlreadyExists
&& std::fs::read_link(&symlink_path).ok().as_deref() == Some(self.real_dir());

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Compare the existing binding target by inode, not by read_link() string.

Lines 157-177 still treat the binding as "correct" only when the symlink target string exactly matches self.real_dir(). That breaks the upgrade/self-heal invariant described in this refactor: two callers can legitimately describe the same sockets directory with different path strings (BOXLITE_HOME through a symlink, .. segments, canonicalized vs original). In that case ensure() will either replace a valid binding or fail the concurrent-winner path with AlreadyExists, even though both paths resolve to the same directory. Please compare metadata() dev/inode instead.

Suggested fix
 impl BoxSockets {
+    fn binding_matches_real_dir(&self, path: &Path) -> bool {
+        let Ok(bound_meta) = std::fs::metadata(path) else {
+            return false;
+        };
+        let Ok(real_meta) = std::fs::metadata(self.real_dir()) else {
+            return false;
+        };
+        bound_meta.dev() == real_meta.dev() && bound_meta.ino() == real_meta.ino()
+    }
+
     /// Ensure the binding symlink exists and is correct. Idempotent;
     /// tolerates concurrent callers for the same box.
     pub fn ensure(&self) -> BoxliteResult<()> {
@@
         match std::fs::symlink_metadata(&symlink_path) {
             Ok(meta) if meta.file_type().is_symlink() => {
-                if std::fs::read_link(&symlink_path).ok().as_deref() == Some(self.real_dir()) {
+                if self.binding_matches_real_dir(&symlink_path) {
                     return Ok(()); // already correct
                 }
                 // Points elsewhere (stale from a previous box) — replace.
                 let _ = std::fs::remove_file(&symlink_path);
             }
@@
         if let Err(e) = std::os::unix::fs::symlink(self.real_dir(), &symlink_path) {
             // Tolerate a concurrent ensure() for the same box as long as the
             // winner points at our target.
             let racing_winner_ok = e.kind() == std::io::ErrorKind::AlreadyExists
-                && std::fs::read_link(&symlink_path).ok().as_deref() == Some(self.real_dir());
+                && self.binding_matches_real_dir(&symlink_path);
             if !racing_winner_ok {
                 return Err(BoxliteError::Storage(format!(
                     "Failed to create socket symlink {} → {}: {}",
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
match std::fs::symlink_metadata(&symlink_path) {
Ok(meta) if meta.file_type().is_symlink() => {
// Stale symlink — safe to replace
if std::fs::read_link(&symlink_path).ok().as_deref() == Some(self.real_dir()) {
return Ok(()); // already correct
}
// Points elsewhere (stale from a previous box) — replace.
let _ = std::fs::remove_file(&symlink_path);
}
Ok(_) => {
// Regular file or directory — refuse to overwrite
// Regular file or directory — refuse to overwrite.
return Err(BoxliteError::Internal(format!(
"{} exists but is not a symlink — refusing to overwrite",
symlink_path.display(),
)));
}
Err(_) => {
// Doesn't exist — good
}
Err(_) => {} // doesn't exist — good
}
std::os::unix::fs::symlink(sockets_dir, &symlink_path).map_err(|e| {
BoxliteError::Storage(format!(
"Failed to create socket symlink {} → {}: {}",
symlink_path.display(),
sockets_dir.display(),
e,
))
})?;
if let Err(e) = std::os::unix::fs::symlink(self.real_dir(), &symlink_path) {
// Tolerate a concurrent ensure() for the same box as long as the
// winner points at our target.
let racing_winner_ok = e.kind() == std::io::ErrorKind::AlreadyExists
&& std::fs::read_link(&symlink_path).ok().as_deref() == Some(self.real_dir());
fn binding_matches_real_dir(&self, path: &Path) -> bool {
let Ok(bound_meta) = std::fs::metadata(path) else {
return false;
};
let Ok(real_meta) = std::fs::metadata(self.real_dir()) else {
return false;
};
bound_meta.dev() == real_meta.dev() && bound_meta.ino() == real_meta.ino()
}
/// Ensure the binding symlink exists and is correct. Idempotent;
/// tolerates concurrent callers for the same box.
pub fn ensure(&self) -> BoxliteResult<()> {
match std::fs::symlink_metadata(&symlink_path) {
Ok(meta) if meta.file_type().is_symlink() => {
if self.binding_matches_real_dir(&symlink_path) {
return Ok(()); // already correct
}
// Points elsewhere (stale from a previous box) — replace.
let _ = std::fs::remove_file(&symlink_path);
}
Ok(_) => {
// Regular file or directory — refuse to overwrite.
return Err(BoxliteError::Internal(format!(
"{} exists but is not a symlink — refusing to overwrite",
symlink_path.display(),
)));
}
Err(_) => {} // doesn't exist — good
}
if let Err(e) = std::os::unix::fs::symlink(self.real_dir(), &symlink_path) {
// Tolerate a concurrent ensure() for the same box as long as the
// winner points at our target.
let racing_winner_ok = e.kind() == std::io::ErrorKind::AlreadyExists
&& self.binding_matches_real_dir(&symlink_path);
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/boxlite/src/net/socket_path.rs` around lines 155 - 177, Replace string
equality checks against self.real_dir() with an inode/dev comparison: when you
see std::fs::read_link(&symlink_path).ok().as_deref() == Some(self.real_dir())
(in the initial Ok(meta) branch) and the racing_winner_ok check that reads the
link, instead call std::fs::read_link(&symlink_path) to get the target PathBuf
and then std::fs::metadata(&target).ok() and
std::fs::metadata(self.real_dir()).ok() and compare their device and inode via
std::os::unix::fs::MetadataExt::dev() and ::ino(); treat them as equal if both
metadata calls succeed and dev/ino match. Update both the "already correct"
early return and the AlreadyExists race check to use this dev/ino comparison and
fall back to the original string check only if metadata info is unavailable.

Comment on lines +153 to +162
// Teardown through the production path. Without this, PerTestBoxHome's
// leak check fails: a live shim remains even though status is Stopped —
// the pid killed above (BoxInfo.pid) is evidently not the process that
// wrote shim.pid. Pre-existing on main (reproduces at 577f4a57).
// FIXME(health): reconcile BoxInfo.pid with shim.pid and decide what
// the surviving process is.
t.runtime
.remove(box_id.as_str(), true)
.await
.expect("force remove at teardown");

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Inspect whether the harness has an independent leak check and whether
# remove(force=true) can bypass it without killing the real shim.
fd -i 'health_check.rs|box_test.rs|box_test_base.rs|per_test_box_home.rs|rt_impl.rs|box_impl.rs|types.rs' src tests . 2>/dev/null | while read -r f; do
  printf '\n===== %s =====\n' "$f"
  rg -n -C3 'PerTestBoxHome|leak|shim\.pid|BoxInfo|pid|remove\(|force_status|set_pid|kill_process' "$f" || true
done

Repository: boxlite-ai/boxlite

Length of output: 50374


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# 1) Locate and print the relevant portion of remove_box in rt_impl.rs
RT_IMPL="src/boxlite/src/runtime/rt_impl.rs"
echo "===== remove_box in $RT_IMPL ====="
rg -n "fn remove_box\\(" "$RT_IMPL"
line=$(rg -n "fn remove_box\\(" "$RT_IMPL" | head -n1 | cut -d: -f1)
start=$((line-40))
end=$((line+180))
sed -n "${start},${end}p" "$RT_IMPL" | nl -ba | sed -n '1,260p'

# 2) Inspect the health-check failure branch around where it sets pid/status
HC="src/boxlite/tests/health_check.rs"
echo
echo "===== health_check.rs (context around teardown/setup) ====="
rg -n "health failed|Unhealthy|set_pid\\(None\\)|force_status\\(|remove\\(" "$HC"
sed -n '70,200p' "$HC" | nl -ba

# 3) Locate PerTestBoxHome leak check implementation
PE="src"
echo
echo "===== PerTestBoxHome implementation (leak check) ====="
home_file=$(fd -i "per_test_box_home.rs" "$PE" | head -n1 || true)
echo "Found: $home_file"
if [ -n "${home_file:-}" ]; then
  rg -n "leak|Drop|assert|shim\\.pid|pid_file|is_process_alive" "$home_file" || true
  # Print Drop impl vicinity if present
  drop_line=$(rg -n "impl Drop for PerTestBoxHome|impl\\s+Drop\\s+for\\s+PerTestBoxHome" "$home_file" | head -n1 | cut -d: -f1 || true)
  if [ -n "${drop_line:-}" ]; then
    start=$((drop_line-60)); end=$((drop_line+140))
    sed -n "${start},${end}p" "$home_file" | nl -ba | sed -n '1,260p'
  else
    # fallback: print around leak assertions
    leak_line=$(rg -n "leak|shim\\.pid|is_process_alive" "$home_file" | head -n1 | cut -d: -f1 || true)
    if [ -n "${leak_line:-}" ]; then
      start=$((leak_line-60)); end=$((leak_line+140))
      sed -n "${start},${end}p" "$home_file" | nl -ba | sed -n '1,260p'
    fi
  fi
fi

# 4) Also inspect any test helpers for teardown/remove semantics (optional)
BTB="src/test-utils/src/box_test.rs"
echo
echo "===== box_test.rs context around t.runtime.remove(force=true) ====="
rg -n "remove\\(.*true\\)" "$BTB" || true

Repository: boxlite-ai/boxlite

Length of output: 273


🏁 Script executed:

#!/bin/bash
set -euo pipefail

RT_IMPL="src/boxlite/src/runtime/rt_impl.rs"
HC="src/boxlite/tests/health_check.rs"

echo "===== remove_box control flow (rt_impl.rs, around 828) ====="
# show enough to capture the force/kill decision
sed -n '800,910p' "$RT_IMPL" | cat -n

echo
echo "===== health_check.rs around teardown ====="
sed -n '130,190p' "$HC" | cat -n

echo
echo "===== PerTestBoxHome location ====="
HOME_FILE="$(fd -i 'per_test_box_home\.rs' src tests . | head -n 1 || true)"
echo "Found: ${HOME_FILE:-<none>}"

if [ -n "${HOME_FILE:-}" ]; then
  echo
  echo "===== PerTestBoxHome leak check / Drop (context) ====="
  # print the Drop impl plus nearby leak assertions
  DROP_LINE="$(rg -n 'impl\s+Drop\s+for\s+PerTestBoxHome|impl\s+Drop\s+for\s+.*PerTestBoxHome' "$HOME_FILE" | head -n1 | cut -d: -f1 || true)"
  if [ -n "${DROP_LINE:-}" ]; then
    start=$((DROP_LINE-80)); end=$((DROP_LINE+220))
    sed -n "${start},${end}p" "$HOME_FILE" | cat -n
  else
    # fallback: print first leak assertion block
    rg -n 'leak|shim\.pid|pid_file|is_process_alive|assert!|Drop' "$HOME_FILE" | head -n 120
  fi
fi

echo
echo "===== box_test helper calls (t.runtime.remove(force=true)) ====="
sed -n '1,260p' src/test-utils/src/box_test.rs | cat -n

Repository: boxlite-ai/boxlite

Length of output: 17230


Teardown can mask the PID-mismatch bug—remove(force=true) won’t kill when health-check has cleared state.pid
remove_box(force=true) only calls kill_process when state.status.is_active() or state.pid.is_some(), and it kills only if state.pid is Some(pid). The health-check failure path sets BoxStatus::Stopped and state.pid = None, and the test teardown runs after asserting info.status == Stopped, so this cleanup can remove DB/files without killing the surviving shim process—weakening the FIXME’s regression signal. Add an explicit assertion that the shim PID (or pidfile process identity) is dead before/around teardown, or split this into a dedicated regression that checks PID mismatch behavior without relying on remove(force=true).

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/boxlite/tests/health_check.rs` around lines 153 - 162, The teardown using
t.runtime.remove(box_id.as_str(), true) can mask the PID-mismatch bug because
remove only kills when state.status.is_active() or state.pid.is_some(), but the
health-check path clears state.pid and sets BoxStatus::Stopped; update the test
to assert the shim process is actually dead before calling remove (or split into
a separate regression): capture the shim pid (from shim.pid or BoxInfo.pid)
after starting the box, verify the process identity is not running (e.g., check
/proc/<pid> or attempt a signal) and assert that it is dead when info.status ==
BoxStatus::Stopped, and only then perform t.runtime.remove(...); alternatively
create a dedicated test that reproduces PID mismatch and asserts the surviving
shim is killed by remove when appropriate, referencing remove, BoxInfo,
BoxStatus::Stopped, state.pid, and kill_process to locate the relevant code
paths.

Source: Coding guidelines

Merged via the queue into main with commit 20873af Jun 12, 2026
35 checks passed
@law-chain-hot law-chain-hot deleted the fix/boxsockets-sun-path branch June 12, 2026 03:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants