Skip to content

fix: recover detached boxes after stdin shim config#492

Closed
zxyasfas wants to merge 5 commits into
boxlite-ai:mainfrom
zxyasfas:codex/fix-detached-recovery-pid-check
Closed

fix: recover detached boxes after stdin shim config#492
zxyasfas wants to merge 5 commits into
boxlite-ai:mainfrom
zxyasfas:codex/fix-detached-recovery-pid-check

Conversation

@zxyasfas

@zxyasfas zxyasfas commented May 5, 2026

Copy link
Copy Markdown
Contributor

Summary

  • tag spawned shim processes with BOXLITE_BOX_ID so recovery can validate the owning box without putting the InstanceSpec back in argv
  • update Linux PID validation to require a boxlite-shim process with the matching env marker instead of checking for the box id in /proc/<pid>/cmdline
  • add regression coverage for the Linux env-based process check

Fixes #434.

Testing

  • git diff --check
  • Not run: cargo fmt --check / cargo test -p boxlite util::process because cargo is not installed in this Windows environment and WSL commands timed out before reaching the repo.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 7901f252e1

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

@@ -1,598 +0,0 @@
//! Process validation utilities for PID checking and verification.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P0 Badge Keep util::process module available to the crate

Deleting this file breaks compilation because src/boxlite/src/util/mod.rs still declares pub mod process;, so Rust will emit E0583 (module file not found) as soon as this commit is checked out. Either keep process.rs (or move its contents) and update module wiring in the same commit.

Useful? React with 👍 / 👎.

@@ -1,293 +0,0 @@
//! Subprocess spawning for boxlite-shim binary.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P0 Badge Keep controller::spawn module available to the crate

Removing this file also introduces a hard compile failure: src/boxlite/src/vmm/controller/mod.rs still contains mod spawn;, so the module resolver cannot find spawn.rs. This commit must include corresponding module rewiring or retain the file to keep the crate buildable.

Useful? React with 👍 / 👎.

@zxyasfas zxyasfas force-pushed the codex/fix-detached-recovery-pid-check branch 4 times, most recently from 2b4bc54 to df9ff08 Compare May 6, 2026 03:07
@zxyasfas

zxyasfas commented May 9, 2026

Copy link
Copy Markdown
Contributor Author

Status update: the PR is still mergeable and the relevant CI is green. Rust formatting, Rust tests across darwin/linux targets, Clippy, config checks, and codecov patch have all passed. The diff remains scoped to shim process tagging and detached box recovery validation.

@DorianZheng

Copy link
Copy Markdown
Member

Hi @zxyasfas. Could you check why there are lots of irrelevant diff?

@zxyasfas zxyasfas force-pushed the codex/fix-detached-recovery-pid-check branch from df9ff08 to cedc836 Compare May 10, 2026 05:20
@zxyasfas

Copy link
Copy Markdown
Contributor Author

Thanks for catching this. I rebuilt the branch on top of the current main and force-pushed the same scoped change, so the irrelevant diff is gone now.

The PR diff is back to the intended files only:

  • src/boxlite/src/util/process.rs
  • src/boxlite/src/vmm/controller/spawn.rs

Current compare is 1 commit ahead / 0 behind, with the diff limited to the shim env marker and PID recovery validation changes. CI is re-running on the cleaned branch.

@DorianZheng

Copy link
Copy Markdown
Member

two things look off to me:

  1. read_to_string on /proc/<pid>/environ will fail if any inherited env var isn't valid utf-8. one bad byte and we false-negative → live box gets marked stopped. should read as bytes and split on \0.
  2. boxes already detached from before this PR don't have BOXLITE_BOX_ID set. after upgrade, recovery will think they're dead and delete their pid file. need a fallback or treat unmarked live shims as Unknown instead of stale.

direction is right though, just want these two covered with tests before we merge.

@zxyasfas

Copy link
Copy Markdown
Contributor Author

Thanks, pushed a follow-up for both cases in b0f3ca2.

  • /proc//environ is now read as bytes and split on NUL, so an unrelated non-UTF-8 env var does not make a correctly tagged shim look stale.
  • PID validation now distinguishes match, mismatch, and unverified. Recovery keeps an unmarked live shim's PID file and marks the box Unknown instead of Stopped, which preserves detached shims created before BOXLITE_BOX_ID existed.
  • Added Linux coverage for tagged shims with non-UTF-8 env entries, unmarked legacy shims, and the recovery Unknown path.

@zxyasfas

Copy link
Copy Markdown
Contributor Author

Fixed the export/format follow-up in 35c4648. CI is green now.

@DorianZheng

Copy link
Copy Markdown
Member

looked at this more closely and the Unknown + live pid model has issues across the lifecycle, not just rm:

  1. rm orphans the vmcan_remove(Unknown) = true and the kill branch in remove_box only runs when is_active(). so on a legacy detached shim recovered as Unknown, boxlite rm <id> (no force) deletes the db row + box dir while the vm keeps running. (rt_impl.rs:776-801, state.rs:91-96, state.rs:52-54)

  2. stop also orphans itbox_impl.rs:314-442: for a recovered Unknown box, self.live.get() returns None (recovery only writes the db, never rebuilds LiveState), so the VM-shutdown block at 344-362 is skipped. then it removes the pid file, clears state.pid, marks Stopped, returns Ok. user sees success, runtime drops the handle, vm keeps running.

  3. exec panicslive_stateinit_live_stateget_execution_plan(Unknown) hits _ => panic!("Invalid BoxStatus for initialization: {:?}", status). crashes the runtime.

small thing separate from the above: process_identity_linux returns Unverified when fs::read(environ_path) errors. legacy shims have a readable environ (just no marker), so that Err branch is only reached on different-uid pid reuse — should be Mismatch.

1/2/3 share a root: Unknown historically meant "weird db state, no live process", and the lifecycle ops never check pid.is_some().

@zxyasfas

Copy link
Copy Markdown
Contributor Author

Addressed the lifecycle issues from the latest review in 88b30dc.

  • Legacy unmarked shims now recover as Running instead of Unknown, so rm/exec keep the box in the active lifecycle path.
  • Non-force rm now refuses any persisted live PID, including unusual recovered states; force rm still kills and removes it.
  • stop now kills a recovered PID even when no LiveState has been rebuilt, so it cannot remove the PID file and return success while leaving the VM process alive.
  • Unknown initialization now returns InvalidState instead of panicking, and unreadable /proc environ is treated as a mismatch.

CI is green across formatting, Clippy, Rust tests, and codecov patch.

@nieyy

nieyy commented May 12, 2026

Copy link
Copy Markdown

@zxyasfas could you add several test cases?

@law-chain-hot

Copy link
Copy Markdown
Contributor

Really appreciate the iteration here. The env-marker, 3-state ProcessIdentity, legacy-shim recovery path, and lifecycle fallbacks are all coming together nicely. Three things we'd like to fold in before merging, if you have cycles.


1. Identity check: cross-platform alignment

Linux side is solid. macOS is weaker: process_identity only checks the process name there, can't tell which box, no PID-reuse defense.

┌──────────┬─────────────────────────────┬───────────────────────┐
│ Platform │ Current check                │ What it gives us       │
├──────────┼─────────────────────────────┼───────────────────────┤
│ Linux    │ /proc/<pid>/environ for      │ Identifies the box;    │
│          │ BOXLITE_BOX_ID               │ implicitly PID-reuse-  │
│          │                              │ safe ✓                 │
├──────────┼─────────────────────────────┼───────────────────────┤
│ macOS    │ sysinfo, process-name only   │ Only confirms "this is │
│          │ (`util/process.rs`:          │ a shim", can't tell    │
│          │  `let _ = box_id; //         │ which box, no PID-     │
│          │   Unused on macOS`)          │ reuse defense ⚠️       │
└──────────┴─────────────────────────────┴───────────────────────┘

Since macOS is a first-class platform, we'd like to align both sides on (pid, start_time), the standard pattern (systemd PidRef, postgres postmaster.pid, runc; pidfd_open is the kernel version). Store start_time at spawn via sysinfo, compare on recovery; mismatch = PID reused.

# On spawn
──────────────────────────────────────────────
cmd.spawn() → pid
     ↓
sysinfo::Process::start_time(pid) → t0    (seconds, UTC epoch)
     ↓                                    (same API on Linux & macOS)
BoxState { pid, shim_start_time = Some(t0) } → DB


# On recovery
──────────────────────────────────────────────
DB → (pid, expected = t0)
     ↓
sysinfo::Process::start_time(pid) → current = t1
     ↓
match (expected, current):
  (Some(t0), Some(t1)) if t0 == t1  → Matches     (same process)
  (Some(_),  Some(_))               → Mismatch    (PID was reused)
  (None,     Some(_))               → Unverified  (legacy box)
  (_,        None)                  → Mismatch    (process gone)

Landing plan:

  • BoxState gets shim_start_time: Option<u64> with #[serde(default)], same trick health_status uses, no SQL migration.
  • process_identity collapses into one cross-platform fn; Linux environ parsing goes away.
  • BOXLITE_BOX_ID env can stay as a debug tag, just not part of validation.
  • Legacy rows (no start_time) → Unverified, backfill on first recovery.

2. Graceful shutdown: use the existing ShimHandler

The new else branch calls kill_process(pid) = straight SIGKILL. That skips the shim's own SIGTERM handler (src/shim/src/main.rs:243-310, does Guest.Shutdown RPC + qcow2 flush), which is exactly the qcow2 corruption case that handler exists to prevent. Recovered boxes hitting stop() are effectively getting a power-cord pull.

ShimHandler already covers this: from_pid (shim.rs:70) + stop() (line 86) does SIGTERM → waitpid poll → SIGKILL fallback. The if Some(live) branch right above already uses it (owned-mode). Both can converge:

Today (three paths, each on its own)
────────────────────────────────────────────────────────────

box_impl.rs::stop()
    ├── if Some(live)  ──► live.handler.stop()     (= ShimHandler ✓)
    └── else           ──► kill_process(pid)        (SIGKILL ✗ new in PR)

rt_impl.rs::shutdown_sync (lines 693-714)
                       ──► 20+ lines of inline SIGTERM loop + kill_process
                           (correct, but duplicates ShimHandler::stop)

Proposed (everything converges on ShimHandler)
────────────────────────────────────────────────────────────

box_impl.rs::stop()
    ├── if Some(live)  ──► live.handler.stop()                   ┐
    └── else           ──► ShimHandler::from_pid(pid).stop()     ├──► shared
                                                                 │
rt_impl.rs::shutdown_sync                                        │
                       ──► ShimHandler::from_pid(pid).stop()     ┘

                           Internals (unified):
                           SIGTERM → waitpid poll (with reap) → SIGKILL fallback

Both spots become:

let mut h = ShimHandler::from_pid(pid, self.config.id.clone());
h.stop()?;

rt_impl.rs:788's remove_box(force=true) we'd leave alone, respecting the "force = kill directly" intent.

Also: GRACEFUL_SHUTDOWN_TIMEOUT_MS = 2000 (shim.rs:90) < shim's own 8s budget (main.rs:237-241), so parent SIGKILLs mid-Guest.Shutdown. Bump to 10_000 with a one-line comment (parent wait ≥ shim budget). Different things, no shared constant needed.


3. Tests

Lining up with @nieyy, we'd like the test cases to follow:

(a) the test reproduces the bug   ── red before the fix
(b) the test passes after the fix ── green after

A possible follow-up we're kicking around (not for this PR): could recover_boxes rebuild LiveState via ShimHandler::from_pid at recovery? That way stop / exec / etc. would all flow through the normal if Some(live) path and the "no LiveState" fallbacks could be retired over time. Either way recover_boxes doesn't need to be touched in this PR — flagging it mainly so the picture is clear. Curious what you and @DorianZheng think.

Push back on anything that looks off.

@DorianZheng

Copy link
Copy Markdown
Member

Hi @zxyasfas. I have make some refinement on this PR. If you think it's fine, I will push the commits in this PR

@DorianZheng

Copy link
Copy Markdown
Member

Closed for new PR #590

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

listInfo() reports detached boxes as "stopped" while VM is still running

4 participants