fix(updates): gate auto-restart on boot-ready to avoid ORT teardown SIGSEGV (#3622) by louis030195 · Pull Request #3660 · screenpipe/screenpipe

louis030195 · 2026-05-27T23:09:21Z

Summary

Closes #3622 (onnxruntime panic during auto-updater restart) and supersedes #3557 ([bug] Crash (SIGSEGV in onnxruntime) when auto-updater restarts app during audio-manager / speaker-model init), which was a duplicate
Adds health::is_boot_ready / wait_for_boot_ready and gates the auto-update restart on it

The bug

tauri::AppHandle::restart → std::process::exit runs C++ static destructors in onnxruntime. If AudioManager::new is still mid-init on the screenpipe-server worker thread (first boot, slow model download, large DB migration), the global DataTypeRegistry singleton is torn down while PlannerImpl::GetElementSize is still walking it → EXC_BAD_ACCESS at 0x2c8.

Full stack in #3557.

Why existing fixes don't catch this

catch_panic_into_error from PR #3290 wraps create_session with std::panic::catch_unwind. That catches Rust panics — but this is a C++ segfault inside __cxa_finalize_ranges (static dtor teardown). Different bug class, different fix.
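
To make the distinction concrete, here is a minimal sketch of what a `catch_unwind`-based wrapper can and cannot see. The function body is an illustration written for this note, not the code from #3290; only the name is taken from the PR.

```rust
use std::panic;

/// Hypothetical stand-in for the wrapper from #3290: run a closure and turn a
/// Rust panic into an `Err` instead of letting it unwind further.
pub fn catch_panic_into_error<T>(
    f: impl FnOnce() -> T + panic::UnwindSafe,
) -> Result<T, String> {
    panic::catch_unwind(f).map_err(|payload| {
        // Panic payloads are normally a `&str` or a `String`.
        if let Some(s) = payload.downcast_ref::<&str>() {
            (*s).to_string()
        } else if let Some(s) = payload.downcast_ref::<String>() {
            s.clone()
        } else {
            "unknown panic payload".to_string()
        }
    })
}

// Note: a SIGSEGV raised inside C++ code during process teardown is a signal,
// not an unwinding Rust panic, so it never reaches the `map_err` branch above.
```

A Rust panic inside the closure comes back as `Err(message)`. A segfault in a C++ static destructor kills the process outright, which is why the fix has to live at the restart site rather than in a wrapper like this.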

Sentry never sees this crash (process dies before the event ships). Confirmed by searching mediar/screenpipe-app for onnxruntime / DataTypeRegistry / PlannerImpl / SIGSEGV — zero matches for this signature. Observability-blind, which is exactly why a lifecycle gate matters.

The fix

health.rs: two helpers reusing existing BOOT_PHASE infra.
- is_boot_ready() — true iff current phase == "ready".
- wait_for_boot_ready(timeout) — async poll loop with cap.
updates.rs: in the auto_update && update_installed block, call wait_for_boot_ready(5 min) before the 30s user-facing countdown. If boot is already ready (~all production cases) the wait returns immediately and behavior is unchanged. If startup is genuinely stuck past 5 min, defer the restart — update_available and update_installed stay true, banner stays visible, next manual restart picks up the staged bundle.
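
A rough sketch of the gate described above, under stated assumptions: the PR's `wait_for_boot_ready` is an async poll loop over the existing BOOT_PHASE state, whereas this version is synchronous, std-only, and takes the phase as an explicit argument so it can run on its own. `BootPhase`, `RestartDecision`, `gate_auto_restart`, and the 10 ms poll interval are hypothetical names and values chosen for illustration.

```rust
use std::sync::Mutex;
use std::thread;
use std::time::{Duration, Instant};

/// Assumed stand-in for the existing BOOT_PHASE state (hypothetical type).
pub struct BootPhase(Mutex<String>);

impl BootPhase {
    pub fn new(phase: &str) -> Self {
        BootPhase(Mutex::new(phase.to_string()))
    }
}

/// True iff the current phase is "ready".
pub fn is_boot_ready(boot: &BootPhase) -> bool {
    boot.0.lock().unwrap().as_str() == "ready"
}

/// Polls until the phase is "ready" or `timeout` elapses.
/// Returns whether boot was ready when the wait ended.
pub fn wait_for_boot_ready(boot: &BootPhase, timeout: Duration) -> bool {
    let deadline = Instant::now() + timeout;
    loop {
        if is_boot_ready(boot) {
            return true;
        }
        if Instant::now() >= deadline {
            return false;
        }
        thread::sleep(Duration::from_millis(10)); // assumed poll interval
    }
}

/// What the auto-update path does once an update is installed (hypothetical).
#[derive(Debug, PartialEq)]
pub enum RestartDecision {
    /// Boot is ready: start the 30s user-facing countdown, then restart.
    StartCountdown,
    /// Boot is stuck: leave the update flags set and skip the restart.
    Defer,
}

pub fn gate_auto_restart(boot: &BootPhase, timeout: Duration) -> RestartDecision {
    if wait_for_boot_ready(boot, timeout) {
        RestartDecision::StartCountdown
    } else {
        RestartDecision::Defer
    }
}
```

When the phase is already "ready", the first check succeeds and the function returns without sleeping, which matches the "behavior is unchanged" claim for the common case.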

Why before the 30s countdown (not after)

If we waited after, the UI would emit update-restarting { delay_secs: 30 } and then potentially block up to 5 min more — the displayed countdown would lie. Waiting first means the 30s countdown remains an honest 30s.
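
The ordering argument is simple arithmetic, sketched here with hypothetical names (`GateOrder`, `time_from_event_to_restart`); nothing below is taken from the PR's code.

```rust
use std::time::Duration;

/// Where the boot-ready wait sits relative to the user-facing countdown.
#[derive(Clone, Copy)]
pub enum GateOrder {
    BeforeCountdown,
    AfterCountdown,
}

/// Time between the UI announcing the countdown and the actual restart.
pub fn time_from_event_to_restart(
    order: GateOrder,
    countdown: Duration,
    boot_wait: Duration,
) -> Duration {
    match order {
        // The wait has already finished when the countdown is announced.
        GateOrder::BeforeCountdown => countdown,
        // The wait is stacked on top of the announced countdown.
        GateOrder::AfterCountdown => countdown + boot_wait,
    }
}
```

With a 30s countdown and a worst-case 5 min wait, gating first keeps the announced 30s accurate, while gating afterwards could stretch it to 330s.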

What I considered but didn't do

Add structured panic logging at ORT boundaries (per the body of #3622, onnxruntime panic during auto-updater restart): a SIGSEGV inside C++ static dtors isn't a Rust panic, so catch_unwind won't see it and structured logging from Rust can't intercept it. The issue body's "step 1" framing was wrong.
Cancellation-aware AudioManager::new: much larger surface, would touch every model-load path, and doesn't solve the underlying race — the gate at the restart site is the narrow, correct fix.
Reset update_available = false on defer: would re-download on next check (wasteful, and on Windows the download triggers the installer). Leaving it true means the banner stays accurate and the user can restart manually.

Test plan

cargo check -p screenpipe-app clean (1 pre-existing warning unrelated to this PR)
bun run build under apps/screenpipe-app-tauri/ clean
Manual: launch fresh install (forces first-boot model download), trigger an update check before ServerCore::start reaches "ready", verify log line auto-update v...: boot phase not ready ... deferring restart and no crash
Manual: normal warm-launch with pending update + auto_update=true, verify the 30s countdown fires immediately and restart succeeds as before

Related

#3290 (fix(audio): catch ort init panic so speaker init can't crash tokio worker): catches ORT Rust panic from OnceLock init; still useful, separate concern
#3318 (fix: catch ORT panics in rfdetr model initialization): same pattern for rfdetr
#3612 (fix(audio): quiet TCC-denied retry spam + stop ffmpeg-missing panics): quieted TCC + ffmpeg panic spam
Sentry SCREENPIPE-APP-9X (124 events, 86 users): Rust ORT-init panic still leaking past #3290's wrapper at v2.4.199; out of scope here, worth a separate look

🤖 Generated with Claude Code

…IGSEGV (#3622)

The auto-updater calls `tauri::AppHandle::restart` → `std::process::exit`, which runs C++ static destructors in onnxruntime. If `AudioManager::new` is still mid-`create_session` on the server worker thread, the global DataTypeRegistry is torn down while `PlannerImpl::GetElementSize` is still walking it → SIGSEGV at 0x2c8 (see #3622, #3557 for stack).

Existing `catch_panic_into_error` (PR #3290) only catches Rust panics; this is a C++ segfault during process teardown, observability-blind to Sentry because the process dies before the event ships.

Add `health::wait_for_boot_ready` and call it in the auto-restart path before the user-facing 30s countdown. In the common case boot is already "ready" and the wait returns immediately. If startup is genuinely stuck past 5 min, defer the restart: pending update stays downloaded, banner visible, next check cycle retries.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

louis030195 · 2026-05-27T23:47:01Z

@Anshgrover23 @divanshu-go can u test?

…tests

Iteration on #3622 fix in response to self-review.

Edge cases the first commit missed:
- Error-phase wait would spin until timeout instead of failing fast
- Lock poisoning silently returned "not ready" (would force 5-min wait on a poisoned lock)
- Banner-click and rollback paths bypassed the gate; Windows downloadAndInstall calls process::exit internally, same race
- Zero tests for any of it

Changes:
- health.rs: introduce `BootReadiness { Ready, Errored, Pending }` enum. `wait_for_boot_ready` now returns the readiness so callers can distinguish error from timeout. Lock-poisoning uses the existing `unwrap_or_else(|e| e.into_inner())` recovery pattern.
- updates.rs: extract `await_restart_gate(timeout, label) -> RestartGate`. Auto-update block now calls it via the shared helper. Adds `await_safe_restart` Tauri command exposing the same gate to the frontend with a 60 s default cap.
- update-banner.tsx: awaits `await_safe_restart` before relaunch / downloadAndInstall. On `errored` or `pending` shows a toast and aborts cleanly: no crash, no stale banner.
- main.rs: register the new Tauri command.
- 7 unit tests covering: each variant of `boot_readiness`, immediate return when ready, fail-fast on error phase, timeout while pending, and observation of a mid-wait transition to ready. All pass in ~1 s.

Verified: cargo check clean, cargo test passes 7/7, bun build clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

louis030195 · 2026-05-28T00:02:50Z

Iteration based on self-review

After looking again, the first commit was solid but not "amazing." Pushed df2989568 to address the gaps:

Edge cases the first commit missed

Error-phase wait would spin until timeout instead of failing fast
Lock poisoning silently returned "not ready" → would force a 5-min wait on a poisoned lock
Banner-click and rollback restart paths bypassed the gate (on Windows, update.downloadAndInstall() calls process::exit internally — same race)
Zero tests for any of it

Changes

New BootReadiness { Ready, Errored, Pending } enum so callers distinguish error from timeout
Lock poisoning recovered via the existing unwrap_or_else(|e| e.into_inner()) pattern in this file
Extracted await_restart_gate(timeout, label) -> RestartGate as the single internal helper
New await_safe_restart Tauri command exposes the same gate to the frontend (60 s default cap)
update-banner.tsx awaits the gate before downloadAndInstall / relaunch; on errored or pending shows a toast and aborts cleanly
7 unit tests, all passing in ~1 s — cover every variant of boot_readiness, immediate return when ready, fail-fast on error phase, timeout while pending, and observation of a mid-wait transition to ready

Verified: cargo check clean, cargo test -- boot_readiness wait_for_boot_ready → 7/7 passed, bun run build clean.
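
The changes above can be sketched roughly as follows. This is a synchronous, std-only illustration rather than the code in df2989568: the real helper is async and reads the existing boot-phase state, while `BootPhase`, the `"error"` phase string, and the 10 ms poll interval here are assumptions made so the example runs on its own.

```rust
use std::sync::Mutex;
use std::thread;
use std::time::{Duration, Instant};

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum BootReadiness {
    Ready,
    Errored,
    Pending,
}

/// Assumed stand-in for the shared boot-phase state (hypothetical type).
pub struct BootPhase(Mutex<String>);

impl BootPhase {
    pub fn new(phase: &str) -> Self {
        BootPhase(Mutex::new(phase.to_string()))
    }

    pub fn set(&self, phase: &str) {
        *self.0.lock().unwrap_or_else(|e| e.into_inner()) = phase.to_string();
    }
}

pub fn boot_readiness(boot: &BootPhase) -> BootReadiness {
    // Recover the guard from a poisoned lock instead of reporting "not ready".
    let phase = boot.0.lock().unwrap_or_else(|e| e.into_inner());
    match phase.as_str() {
        "ready" => BootReadiness::Ready,
        "error" => BootReadiness::Errored, // assumed name of the error phase
        _ => BootReadiness::Pending,
    }
}

/// Returns as soon as the phase is Ready or Errored; returns Pending only
/// when `timeout` elapses first.
pub fn wait_for_boot_ready(boot: &BootPhase, timeout: Duration) -> BootReadiness {
    let deadline = Instant::now() + timeout;
    loop {
        match boot_readiness(boot) {
            BootReadiness::Pending if Instant::now() < deadline => {
                thread::sleep(Duration::from_millis(10)); // assumed poll interval
            }
            // Ready and Errored return at once; Pending returns on timeout.
            settled => return settled,
        }
    }
}
```

Returning the enum instead of a bool is what lets a caller tell "boot failed" (abort now) apart from "boot is still running" (timed out), which the first commit could not do.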

pleasedodisturb · 2026-05-28T18:32:20Z

Still reproducing on v2.4.288 / macOS 26.5 (Apple Silicon, MacBookPro18,3), ~3h after launch.

Same race as diagnosed in #3557 — Thread 1 UpdatesManager::check_for_updates → AppHandle::restart → std::process::exit running concurrently with Thread 60 AudioManager::new → speaker::create_session → OrtApis::CreateSession. Crash is EXC_BAD_ACCESS in onnxruntime::PlannerImpl::GetElementSize reading a torn pointer (0xeb26…, pointer-auth failure) out of the static DataTypeImpl type map during SessionState::FinalizeSessionState. Thread 81 shows a second screenpipe-server thread also inside ONNX session init (Graph::Resolve → protobuf InternalExtend → free), so it looks like two speaker sessions are being built concurrently while exit() runs static destructors on the main thread — matches the boot-ready gate hypothesis exactly.

Workaround for anyone hitting this in the meantime: disable speaker diarization in settings, or wait until the 30s updater countdown is past before touching audio.

Happy to test a TestFlight / nightly build of this branch if useful.

… auto-collect

- Resolve main.rs conflict in favor of main's tauri_collect_commands!() auto-collection (PR #3679); drops the manual generate_handler! list.
- Add #[specta::specta] to await_safe_restart so the auto-collector actually registers it (without it the command is silently excluded and the frontend raw invoke("await_safe_restart") would fail at runtime). get_boot_phase already had both attrs.
- Cargo.lock: take main's superset (#3660 adds no deps).
- Add crates/screenpipe-audio/examples/race_repro.rs: a standalone harness that reproduces the ORT teardown SIGSEGV (#3622) 40/40 in RACE mode and stays clean 40/40 in GATED mode (validating the boot-ready gate).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

This was referenced May 28, 2026

[feature] built-in pipe to file github bug reports & feature requests with full diagnostic artifacts #3674

Closed

[bug] owned-default /eval endpoint silently no-ops — returns success:true but JS does not execute against page #3676

Closed

louis030195 merged commit 505b1ce into main May 30, 2026
24 of 25 checks passed

louis030195 merged 3 commits into main from fix/3622-startup-complete-gate
