
fix(updates): gate auto-restart on boot-ready to avoid ORT teardown SIGSEGV (#3622) #3660

Merged
louis030195 merged 3 commits into main from fix/3622-startup-complete-gate
May 30, 2026


@louis030195

Collaborator

Summary

The bug

tauri::AppHandle::restart → std::process::exit runs C++ static destructors in onnxruntime. If AudioManager::new is still mid-init on the screenpipe-server worker thread (first boot, slow model download, large DB migration), the global DataTypeRegistry singleton is torn down while PlannerImpl::GetElementSize is still walking it → EXC_BAD_ACCESS at 0x2c8.

Full stack in #3557.

Why existing fixes don't catch this

catch_panic_into_error from PR #3290 wraps create_session with std::panic::catch_unwind. That catches Rust panics — but this is a C++ segfault inside __cxa_finalize_ranges (static dtor teardown). Different bug class, different fix.
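
To make the "different bug class" point concrete, here is a minimal sketch of what a catch_unwind wrapper can and cannot do. The function below is a hypothetical stand-in written for illustration, not the actual catch_panic_into_error from PR #3290; only the use of std::panic::catch_unwind is taken from the description above.

```rust
use std::panic;

/// Hypothetical stand-in for the PR #3290 wrapper (illustration only):
/// runs `f` and turns a Rust panic into an `Err` instead of unwinding
/// into the caller.
fn catch_panic_into_error<T>(
    f: impl FnOnce() -> T + panic::UnwindSafe,
) -> Result<T, String> {
    panic::catch_unwind(f).map_err(|payload| {
        // Panic payloads are normally a &str or a String.
        if let Some(s) = payload.downcast_ref::<&str>() {
            s.to_string()
        } else if let Some(s) = payload.downcast_ref::<String>() {
            s.clone()
        } else {
            "unknown panic".to_string()
        }
    })
}

// Note: catch_unwind only intercepts Rust unwinding. A SIGSEGV raised inside
// C++ code is a signal delivered to the process, so no Rust-level wrapper
// like the one above ever gets a chance to run for it.
```

Calling it with a closure that panics yields an Err carrying the panic message; a segfault in native code never reaches this path at all, which is why the fix has to live at the restart site instead.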

Sentry never sees this crash (process dies before the event ships). Confirmed by searching mediar/screenpipe-app for onnxruntime / DataTypeRegistry / PlannerImpl / SIGSEGV — zero matches for this signature. Observability-blind, which is exactly why a lifecycle gate matters.

The fix

  • health.rs: two helpers reusing existing BOOT_PHASE infra.
    • is_boot_ready(): true iff the current phase == "ready".
    • wait_for_boot_ready(timeout): async poll loop with a cap.
  • updates.rs: in the auto_update && update_installed block, call wait_for_boot_ready(5 min) before the 30s user-facing countdown. If boot is already ready (~all production cases) the wait returns immediately and behavior is unchanged. If startup is genuinely stuck past 5 min, defer the restart — update_available and update_installed stay true, banner stays visible, next manual restart picks up the staged bundle.
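
A rough, std-only sketch of the two helpers, for readers who want the shape without opening the diff. Assumptions: the static BOOT_PHASE below and set_boot_phase are simplified stand-ins for the existing infra, the 10 ms poll interval is arbitrary, and the loop is blocking here whereas the helper in this PR is async.

```rust
use std::sync::Mutex;
use std::thread;
use std::time::{Duration, Instant};

// Simplified stand-in for the existing BOOT_PHASE infra (assumption).
static BOOT_PHASE: Mutex<&'static str> = Mutex::new("starting");

// Hypothetical setter, used only to drive the sketch.
fn set_boot_phase(phase: &'static str) {
    *BOOT_PHASE.lock().unwrap() = phase;
}

/// True iff the current phase is "ready".
fn is_boot_ready() -> bool {
    *BOOT_PHASE.lock().unwrap() == "ready"
}

/// Polls until boot is ready or `timeout` elapses.
/// Returns true if ready was observed, false if the cap was hit.
fn wait_for_boot_ready(timeout: Duration) -> bool {
    let deadline = Instant::now() + timeout;
    loop {
        if is_boot_ready() {
            return true; // common case: returns on the first check
        }
        if Instant::now() >= deadline {
            return false; // caller defers the restart
        }
        thread::sleep(Duration::from_millis(10)); // arbitrary poll interval
    }
}
```

In the auto-update block the call would look like wait_for_boot_ready(Duration::from_secs(5 * 60)), with the restart deferred when it returns false.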

Why before the 30s countdown (not after)

If we waited after, the UI would emit update-restarting { delay_secs: 30 } and then potentially block up to 5 min more — the displayed countdown would lie. Waiting first means the 30s countdown remains an honest 30s.
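
The ordering can be written down as a tiny plan. This is only an illustration of the sequence described above; the Step enum and plan_auto_restart are hypothetical names, not code from updates.rs.

```rust
/// Hypothetical model of the auto-restart sequence (illustration only).
#[derive(Debug, PartialEq)]
enum Step {
    /// Silent wait on the boot gate, capped at 5 min.
    WaitForBoot,
    /// The update-restarting event shown to the user.
    EmitCountdown { delay_secs: u64 },
    /// The actual restart.
    Restart,
    /// Gate timed out: update flags stay true, banner stays visible.
    Defer,
}

/// Builds the step sequence: the gate always comes first, so the countdown
/// is only emitted once nothing else can delay the restart.
fn plan_auto_restart(boot_ready_within_cap: bool) -> Vec<Step> {
    let mut steps = vec![Step::WaitForBoot];
    if boot_ready_within_cap {
        steps.push(Step::EmitCountdown { delay_secs: 30 });
        steps.push(Step::Restart);
    } else {
        steps.push(Step::Defer);
    }
    steps
}
```

Because EmitCountdown is never followed by another wait, the delay the UI displays matches the delay the user actually experiences.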

What I considered but didn't do

  • Add structured panic logging at ORT boundaries (per the body of #3622, "onnxruntime panic during auto-updater restart"): SIGSEGV inside C++ static dtors isn't a Rust panic; catch_unwind won't see it, and structured logging from Rust can't intercept it. The issue body's "step 1" framing was wrong.
  • Cancellation-aware AudioManager::new: much larger surface, would touch every model-load path, and doesn't solve the underlying race — the gate at the restart site is the narrow, correct fix.
  • Reset update_available = false on defer: would re-download on next check (wasteful, and on Windows the download triggers the installer). Leaving it true means the banner stays accurate and the user can restart manually.

Test plan

  • cargo check -p screenpipe-app clean (1 pre-existing warning unrelated to this PR)
  • bun run build under apps/screenpipe-app-tauri/ clean
  • Manual: launch fresh install (forces first-boot model download), trigger an update check before ServerCore::start reaches "ready", verify log line auto-update v...: boot phase not ready ... deferring restart and no crash
  • Manual: normal warm-launch with pending update + auto_update=true, verify the 30s countdown fires immediately and restart succeeds as before

Related

🤖 Generated with Claude Code

…IGSEGV (#3622)

The auto-updater calls `tauri::AppHandle::restart` → `std::process::exit`,
which runs C++ static destructors in onnxruntime. If `AudioManager::new` is
still mid-`create_session` on the server worker thread, the global
DataTypeRegistry is torn down while `PlannerImpl::GetElementSize` is still
walking it → SIGSEGV at 0x2c8 (see #3622, #3557 for stack).

Existing `catch_panic_into_error` (PR #3290) only catches Rust panics; this
is a C++ segfault during process teardown, observability-blind to Sentry
because the process dies before the event ships.

Add `health::wait_for_boot_ready` and call it in the auto-restart path
before the user-facing 30s countdown. In the common case boot is already
"ready" and the wait returns immediately. If startup is genuinely stuck
past 5 min, defer the restart — pending update stays downloaded, banner
visible, next check cycle retries.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@louis030195

Collaborator Author

@Anshgrover23 @divanshu-go can u test?

…tests

Iteration on #3622 fix in response to self-review:

Edge cases the first commit missed:
- Error-phase wait would spin until timeout instead of failing fast
- Lock poisoning silently returned "not ready" (would force 5-min wait
  on a poisoned lock)
- Banner-click and rollback paths bypassed the gate — Windows
  downloadAndInstall calls process::exit internally, same race
- Zero tests for any of it

Changes:
- health.rs: introduce `BootReadiness { Ready, Errored, Pending }` enum.
  `wait_for_boot_ready` now returns the readiness so callers can
  distinguish error from timeout. Lock-poisoning uses the existing
  `unwrap_or_else(|e| e.into_inner())` recovery pattern.
- updates.rs: extract `await_restart_gate(timeout, label) -> RestartGate`.
  Auto-update block now calls it via the shared helper. Adds
  `await_safe_restart` Tauri command exposing the same gate to the
  frontend with a 60 s default cap.
- update-banner.tsx: awaits `await_safe_restart` before relaunch /
  downloadAndInstall. On `errored` or `pending` shows a toast and
  aborts cleanly — no crash, no stale banner.
- main.rs: register the new Tauri command.
- 7 unit tests covering: each variant of `boot_readiness`, immediate
  return when ready, fail-fast on error phase, timeout while pending,
  and observation of a mid-wait transition to ready. All pass in ~1 s.

Verified: cargo check clean, cargo test passes 7/7, bun build clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@louis030195

Collaborator Author

Iteration based on self-review

After looking again, the first commit was solid but not "amazing." Pushed df2989568 to address the gaps:

Edge cases the first commit missed

  • Error-phase wait would spin until timeout instead of failing fast
  • Lock poisoning silently returned "not ready" → would force a 5-min wait on a poisoned lock
  • Banner-click and rollback restart paths bypassed the gate (on Windows, update.downloadAndInstall() calls process::exit internally — same race)
  • Zero tests for any of it

Changes

  • New BootReadiness { Ready, Errored, Pending } enum so callers distinguish error from timeout
  • Lock poisoning recovered via the existing unwrap_or_else(|e| e.into_inner()) pattern in this file
  • Extracted await_restart_gate(timeout, label) -> RestartGate as the single internal helper
  • New await_safe_restart Tauri command exposes the same gate to the frontend (60 s default cap)
  • update-banner.tsx awaits the gate before downloadAndInstall / relaunch; on errored or pending shows a toast and aborts cleanly
  • 7 unit tests, all passing in ~1 s — cover every variant of boot_readiness, immediate return when ready, fail-fast on error phase, timeout while pending, and observation of a mid-wait transition to ready
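
For reviewers skimming, a condensed std-only sketch of the three-way readiness and the fail-fast wait. Assumptions: the phase is passed in as a Mutex<String> rather than read from the global, the "error" phase string is a guess (only "ready" is named in this PR), the poll interval is arbitrary, and the loop is blocking whereas the real helper is async.

```rust
use std::sync::Mutex;
use std::thread;
use std::time::{Duration, Instant};

#[derive(Debug, Clone, Copy, PartialEq)]
enum BootReadiness {
    Ready,
    Errored,
    Pending,
}

/// Maps a phase string to a readiness value.
/// Assumption: the error phase is spelled "error".
fn boot_readiness(phase: &str) -> BootReadiness {
    match phase {
        "ready" => BootReadiness::Ready,
        "error" => BootReadiness::Errored,
        _ => BootReadiness::Pending,
    }
}

/// Polls the phase until it is Ready or Errored, or until `timeout` elapses.
/// Returns Pending only when the cap was hit while still pending.
fn wait_for_boot_ready(phase: &Mutex<String>, timeout: Duration) -> BootReadiness {
    let deadline = Instant::now() + timeout;
    loop {
        // Recover the guard from a poisoned lock instead of reporting "not ready".
        let current = boot_readiness(&phase.lock().unwrap_or_else(|e| e.into_inner()));
        match current {
            BootReadiness::Pending if Instant::now() < deadline => {
                thread::sleep(Duration::from_millis(10)); // arbitrary poll interval
            }
            // Ready and Errored return immediately; Pending returns at the cap.
            other => return other,
        }
    }
}
```

Returning the enum instead of a bool is what lets the caller tell "startup failed, stop waiting" apart from "still starting, cap reached".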

Verified: cargo check clean, cargo test -- boot_readiness wait_for_boot_ready → 7/7 passed, bun run build clean.
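
On the lock-poisoning point, a small standalone demonstration of why the unwrap_or_else(|e| e.into_inner()) pattern matters. The helper names read_phase and poison are hypothetical and exist only for this sketch; the Mutex behaviour shown is standard library behaviour.

```rust
use std::sync::{Arc, Mutex};
use std::thread;

/// Reads the phase even if a previous holder of the lock panicked.
/// A poisoned Mutex still contains valid data; into_inner() hands back the guard.
fn read_phase(phase: &Mutex<String>) -> String {
    phase.lock().unwrap_or_else(|e| e.into_inner()).clone()
}

/// Poisons the mutex by panicking on another thread while holding the lock.
fn poison(phase: &Arc<Mutex<String>>) {
    let p = Arc::clone(phase);
    let _ = thread::spawn(move || {
        let _guard = p.lock().unwrap();
        panic!("simulated panic while holding the lock");
    })
    .join(); // join returns Err because the thread panicked; ignored on purpose
}
```

After poison(&phase), a plain lock() returns Err, so code that maps that Err to "not ready" would sit out the full timeout. read_phase still returns the stored value, so a phase that was already "ready" is reported as ready.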

@pleasedodisturb


Still reproducing on v2.4.288 / macOS 26.5 (Apple Silicon, MacBookPro18,3), ~3h after launch.

Same race as diagnosed in #3557: Thread 1 UpdatesManager::check_for_updates → AppHandle::restart → std::process::exit running concurrently with Thread 60 AudioManager::new → speaker::create_session → OrtApis::CreateSession. Crash is EXC_BAD_ACCESS in onnxruntime::PlannerImpl::GetElementSize reading a torn pointer (0xeb26…, pointer-auth failure) out of the static DataTypeImpl type map during SessionState::FinalizeSessionState. Thread 81 shows a second screenpipe-server thread also inside ONNX session init (Graph::Resolve → protobuf InternalExtend → free), so it looks like two speaker sessions are being built concurrently while exit() runs static destructors on the main thread, which matches the boot-ready gate hypothesis exactly.

Workaround for anyone hitting this in the meantime: disable speaker diarization in settings, or wait until the 30s updater countdown is past before touching audio.

Happy to test a TestFlight / nightly build of this branch if useful.

… auto-collect

- Resolve main.rs conflict in favor of main's tauri_collect_commands!()
  auto-collection (PR #3679) — drops the manual generate_handler! list.
- Add #[specta::specta] to await_safe_restart so the auto-collector actually
  registers it (without it the command is silently excluded and the frontend
  raw invoke("await_safe_restart") would fail at runtime). get_boot_phase
  already had both attrs.
- Cargo.lock: take main's superset (#3660 adds no deps).
- Add crates/screenpipe-audio/examples/race_repro.rs: a standalone harness
  that reproduces the ORT teardown SIGSEGV (#3622) 40/40 in RACE mode and
  stays clean 40/40 in GATED mode (validating the boot-ready gate).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@louis030195 louis030195 merged commit 505b1ce into main May 30, 2026
24 of 25 checks passed

Development

Successfully merging this pull request may close these issues.

onnxruntime panic during auto-updater restart

2 participants