fix(recovery): PID-fingerprint identity + crash-aware attach pipeline#590
Merged
Merged
Conversation
Resolves the silent reattach hazards exposed when shim spec moved from argv to stdin (issue #434) and hardens the attach boundary: - Two-line `shim.pid` written by the child in pre_exec captures `(pid, start_time)` so recovery can verify identity across PID reuse. New `PidRecord` / `PidFileReader` / `PidFileWriter` split: pure codec, parent-side I/O, and pre-allocated signal-safe child-side writer. - `ProcessIdentity::{Verified, Legacy, Absent}` from `PidFileReader:: process_identity()` becomes the canonical "is this our shim?" check. - `vmm_attach` now gates on ProcessIdentity (reads PID from the file, not the DB cache) and surfaces crashes via CrashReport.user_message when an exit file is parseable. PreflightCrashCheckTask removed — vmm_attach covers both safety and enrichment. - `recover_boxes` reconciles via ProcessIdentity first; exit_file only matters when no shim is alive. Verified/Legacy with a stale exit_file stashes it to `exit.previous` with a warn so the next crash gets the canonical slot. - Successful start stashes any leftover `exit` to `exit.previous` so preflight (now in vmm_attach) doesn't trip on prior-lifecycle artifacts. - `stop()` collapsed: one graceful path via `live_state().await` for Running boxes, cleanup-only for Configured/Failed/Stopped. The state.status pre-filter prevents init_live_state from accidentally spawning a new VM during stop. - `BoxImpl` owns its `BoxFilesystemLayout` (centralizes per-box paths, removes open-coded `boxes_dir().join(id).join("shim.pid")` sites). - `get_execution_plan` returns Result instead of panicking on Unknown status; duplicate allow-list guard in init_live_state removed. - `BoxID::parse` discipline documented: never silently mint, never accept identity from in-VM RPC bodies (MEMORY.md REST SDK bug class). Tests: - recovery: crash detection (Failed + reason), successful_start stash to exit.previous (preserves crash record across restart cycles) - rt_impl: stop on Configured/Failed must not spawn; stop on recovered Running with mismatched fingerprint must not kill foreign process - All 708 lib tests + recovery (9) + detach (5) + pid_file (9) green. Out of scope for this PR (filed as follow-ups in plan file): - Stopped+Verified orphan drain (rare manual-intervention edge case) - DB-less orphan filesystem scan (recover_boxes only iterates DB rows) - Feature C — HMAC challenge-response for host↔guest defense in depth Co-authored-by: zxyasfas <zxyasfas@users.noreply.github.com>
6e09b0a to
fe9a9a3
Compare
This was referenced May 25, 2026
This was referenced May 26, 2026
Closed
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
shim.pidcarries(pid, start_time);ProcessIdentityis the canonical "is this our shim?" gate across recovery / attach / stop, replacing the argv-substring check that became blind after the shim spec moved from argv to stdin (issuelistInfo()reports detached boxes as "stopped" while VM is still running #434).vmm_attachreads PID from the file (not the DB cache), gates on ProcessIdentity, and surfaces crashes viaCrashReport.user_messagefrom the exit file.PreflightCrashCheckTaskdeleted — vmm_attach now covers both safety and crash-enrichment.stop()collapsed into one graceful path throughlive_state().awaitfor Running boxes; cleanup-only for Configured/Failed/Stopped (no accidental spawn).exit.previouson successful start and on recovery-with-alive-shim; preserves prior-lifecycle crash evidence and frees the canonical slot.Builds on the design direction of #492 (cc @zxyasfas) with start-time fingerprint replacing the env-marker approach for unforgeable PID-reuse defense.
Test plan