fix(updates): gate auto-restart on boot-ready to avoid ORT teardown SIGSEGV (#3622)#3660
Conversation
…IGSEGV (#3622) The auto-updater calls `tauri::AppHandle::restart` → `std::process::exit`, which runs C++ static destructors in onnxruntime. If `AudioManager::new` is still mid-`create_session` on the server worker thread, the global DataTypeRegistry is torn down while `PlannerImpl::GetElementSize` is still walking it → SIGSEGV at 0x2c8 (see #3622, #3557 for stack). Existing `catch_panic_into_error` (PR #3290) only catches Rust panics; this is a C++ segfault during process teardown, observability-blind to Sentry because the process dies before the event ships. Add `health::wait_for_boot_ready` and call it in the auto-restart path before the user-facing 30s countdown. In the common case boot is already "ready" and the wait returns immediately. If startup is genuinely stuck past 5 min, defer the restart — pending update stays downloaded, banner visible, next check cycle retries. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
@Anshgrover23 @divanshu-go can u test? |
…tests Iteration on #3622 fix in response to self-review: Edge cases the first commit missed: - Error-phase wait would spin until timeout instead of failing fast - Lock poisoning silently returned "not ready" (would force 5-min wait on a poisoned lock) - Banner-click and rollback paths bypassed the gate — Windows downloadAndInstall calls process::exit internally, same race - Zero tests for any of it Changes: - health.rs: introduce `BootReadiness { Ready, Errored, Pending }` enum. `wait_for_boot_ready` now returns the readiness so callers can distinguish error from timeout. Lock-poisoning uses the existing `unwrap_or_else(|e| e.into_inner())` recovery pattern. - updates.rs: extract `await_restart_gate(timeout, label) -> RestartGate`. Auto-update block now calls it via the shared helper. Adds `await_safe_restart` Tauri command exposing the same gate to the frontend with a 60 s default cap. - update-banner.tsx: awaits `await_safe_restart` before relaunch / downloadAndInstall. On `errored` or `pending` shows a toast and aborts cleanly — no crash, no stale banner. - main.rs: register the new Tauri command. - 7 unit tests covering: each variant of `boot_readiness`, immediate return when ready, fail-fast on error phase, timeout while pending, and observation of a mid-wait transition to ready. All pass in ~1 s. Verified: cargo check clean, cargo test passes 7/7, bun build clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Iteration based on self-reviewAfter looking again, the first commit was solid but not "amazing." Pushed df2989568 to address the gaps: Edge cases the first commit missed
Changes
Verified: |
|
Still reproducing on v2.4.288 / macOS 26.5 (Apple Silicon, MacBookPro18,3), ~3h after launch. Same race as diagnosed in #3557 — Thread 1 Workaround for anyone hitting this in the meantime: disable speaker diarization in settings, or wait until the 30s updater countdown is past before touching audio. Happy to test a TestFlight / nightly build of this branch if useful. |
… auto-collect - Resolve main.rs conflict in favor of main's tauri_collect_commands!() auto-collection (PR #3679) — drops the manual generate_handler! list. - Add #[specta::specta] to await_safe_restart so the auto-collector actually registers it (without it the command is silently excluded and the frontend raw invoke("await_safe_restart") would fail at runtime). get_boot_phase already had both attrs. - Cargo.lock: take main's superset (#3660 adds no deps). - Add crates/screenpipe-audio/examples/race_repro.rs: a standalone harness that reproduces the ORT teardown SIGSEGV (#3622) 40/40 in RACE mode and stays clean 40/40 in GATED mode (validating the boot-ready gate). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Summary
health::is_boot_ready/wait_for_boot_readyand gates the auto-update restart on itThe bug
tauri::AppHandle::restart→std::process::exitruns C++ static destructors in onnxruntime. IfAudioManager::newis still mid-init on thescreenpipe-serverworker thread (first boot, slow model download, large DB migration), the globalDataTypeRegistrysingleton is torn down whilePlannerImpl::GetElementSizeis still walking it →EXC_BAD_ACCESS at 0x2c8.Full stack in #3557.
Why existing fixes don't catch this
catch_panic_into_errorfrom PR #3290 wrapscreate_sessionwithstd::panic::catch_unwind. That catches Rust panics — but this is a C++ segfault inside__cxa_finalize_ranges(static dtor teardown). Different bug class, different fix.Sentry never sees this crash (process dies before the event ships). Confirmed by searching
mediar/screenpipe-appforonnxruntime/DataTypeRegistry/PlannerImpl/SIGSEGV— zero matches for this signature. Observability-blind, which is exactly why a lifecycle gate matters.The fix
health.rs: two helpers reusing existingBOOT_PHASEinfra.is_boot_ready()—trueiff current phase =="ready".wait_for_boot_ready(timeout)— async poll loop with cap.updates.rs: in theauto_update && update_installedblock, callwait_for_boot_ready(5 min)before the 30s user-facing countdown. If boot is already ready (~all production cases) the wait returns immediately and behavior is unchanged. If startup is genuinely stuck past 5 min, defer the restart —update_availableandupdate_installedstay true, banner stays visible, next manual restart picks up the staged bundle.Why before the 30s countdown (not after)
If we waited after, the UI would emit
update-restarting { delay_secs: 30 }and then potentially block up to 5 min more — the displayed countdown would lie. Waiting first means the 30s countdown remains an honest 30s.What I considered but didn't do
catch_unwindwon't see it, structured logging from Rust can't intercept it. Issue body's "step 1" framing was wrong.AudioManager::new: much larger surface, would touch every model-load path, and doesn't solve the underlying race — the gate at the restart site is the narrow, correct fix.update_available = falseon defer: would re-download on next check (wasteful, and on Windows the download triggers the installer). Leaving it true means the banner stays accurate and the user can restart manually.Test plan
cargo check -p screenpipe-appclean (1 pre-existing warning unrelated to this PR)bun run buildunderapps/screenpipe-app-tauri/cleanServerCore::startreaches"ready", verify log lineauto-update v...: boot phase not ready ... deferring restartand no crashauto_update=true, verify the 30s countdown fires immediately and restart succeeds as beforeRelated
OnceLockinit — still useful, separate concernSCREENPIPE-APP-9X(124 events, 86 users) — Rust ORT-init panic still leaking past fix(audio): catch ort init panic so speaker init can't crash tokio worker #3290's wrapper at v2.4.199; out of scope here, worth a separate look🤖 Generated with Claude Code