Skip to content

fix(recovery): PID-fingerprint identity + crash-aware attach pipeline#590

Merged
DorianZheng merged 1 commit into
mainfrom
fix/recovery-pid-identity-attach-hardening
May 25, 2026
Merged

fix(recovery): PID-fingerprint identity + crash-aware attach pipeline#590
DorianZheng merged 1 commit into
mainfrom
fix/recovery-pid-identity-attach-hardening

Conversation

@DorianZheng

Copy link
Copy Markdown
Member

Summary

  • Two-line shim.pid carries (pid, start_time); ProcessIdentity is the canonical "is this our shim?" gate across recovery / attach / stop, replacing the argv-substring check that became blind after the shim spec moved from argv to stdin (issue listInfo() reports detached boxes as "stopped" while VM is still running #434).
  • vmm_attach reads PID from the file (not the DB cache), gates on ProcessIdentity, and surfaces crashes via CrashReport.user_message from the exit file. PreflightCrashCheckTask deleted — vmm_attach now covers both safety and crash-enrichment.
  • stop() collapsed into one graceful path through live_state().await for Running boxes; cleanup-only for Configured/Failed/Stopped (no accidental spawn).
  • Exit files stash to exit.previous on successful start and on recovery-with-alive-shim; preserves prior-lifecycle crash evidence and frees the canonical slot.

Builds on the design direction of #492 (cc @zxyasfas) with start-time fingerprint replacing the env-marker approach for unforgeable PID-reuse defense.

Test plan

  • `cargo test -p boxlite --lib` — 708/708
  • `cargo test -p boxlite --test recovery` — 9/9 (incl. new `recovery_marks_box_failed_when_exit_file_present`, `successful_start_stashes_stale_exit_file`)
  • `cargo test -p boxlite --test detach` — 5/5
  • `cargo test -p boxlite --test pid_file` — 9/9
  • `cargo test -p boxlite --lib rt_impl::tests` — 31/31 (incl. updated `test_stop_does_not_kill_unrelated_process_repro` using fingerprint mismatch, new `test_stop_configured_does_not_spawn` + `test_stop_failed_does_not_spawn`)
  • Pre-push integration suite (228 tests via nextest) — clean after lifecycle assertion fix

Resolves the silent reattach hazards exposed when shim spec moved from
argv to stdin (issue #434) and hardens the attach boundary:

- Two-line `shim.pid` written by the child in pre_exec captures
  `(pid, start_time)` so recovery can verify identity across PID reuse.
  New `PidRecord` / `PidFileReader` / `PidFileWriter` split: pure codec,
  parent-side I/O, and pre-allocated signal-safe child-side writer.
- `ProcessIdentity::{Verified, Legacy, Absent}` from `PidFileReader::
  process_identity()` becomes the canonical "is this our shim?" check.
- `vmm_attach` now gates on ProcessIdentity (reads PID from the file,
  not the DB cache) and surfaces crashes via CrashReport.user_message
  when an exit file is parseable. PreflightCrashCheckTask removed —
  vmm_attach covers both safety and enrichment.
- `recover_boxes` reconciles via ProcessIdentity first; exit_file only
  matters when no shim is alive. Verified/Legacy with a stale exit_file
  stashes it to `exit.previous` with a warn so the next crash gets the
  canonical slot.
- Successful start stashes any leftover `exit` to `exit.previous` so
  preflight (now in vmm_attach) doesn't trip on prior-lifecycle
  artifacts.
- `stop()` collapsed: one graceful path via `live_state().await` for
  Running boxes, cleanup-only for Configured/Failed/Stopped. The
  state.status pre-filter prevents init_live_state from accidentally
  spawning a new VM during stop.
- `BoxImpl` owns its `BoxFilesystemLayout` (centralizes per-box paths,
  removes open-coded `boxes_dir().join(id).join("shim.pid")` sites).
- `get_execution_plan` returns Result instead of panicking on Unknown
  status; duplicate allow-list guard in init_live_state removed.
- `BoxID::parse` discipline documented: never silently mint, never
  accept identity from in-VM RPC bodies (MEMORY.md REST SDK bug class).

Tests:
- recovery: crash detection (Failed + reason), successful_start stash
  to exit.previous (preserves crash record across restart cycles)
- rt_impl: stop on Configured/Failed must not spawn; stop on recovered
  Running with mismatched fingerprint must not kill foreign process
- All 708 lib tests + recovery (9) + detach (5) + pid_file (9) green.

Out of scope for this PR (filed as follow-ups in plan file):
- Stopped+Verified orphan drain (rare manual-intervention edge case)
- DB-less orphan filesystem scan (recover_boxes only iterates DB rows)
- Feature C — HMAC challenge-response for host↔guest defense in depth

Co-authored-by: zxyasfas <zxyasfas@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant