
fix(db): recover write queue from persistent disk-I/O wedges #3953

Merged: louis030195 merged 1 commit into main from claude/exciting-shannon-15985d on Jun 10, 2026

@louis030195

Collaborator

write queue wedge recovery

Problem

A persistent fatal disk error (disk I/O error / database disk image is malformed / pool lost) makes every write batch fail at acquire() / BEGIN IMMEDIATE. The drain loop retried the same poisoned pool 3× then dropped the batch — and the next batch did the same, forever. Result: all writes (audio chunks, OCR, transcriptions…) silently dropped for 10–15 min, the app stays "Running" with no crash report and no signal, and only a manual restart recovers it.

This is the root cause behind the recurring "screenpipe closed suddenly" reports. Occurrences are escalating: Jun 6: 123 · Jun 7: 84 · Jun 8: 4,277 · Jun 9: 1,509 per app log. mmap=0 (#3889) fixed a different (corruption) failure mode and never touched this path.

Fix — tiered, in-process recovery + an escalation seam

execute_batch now returns a BatchOutcome; the drain loop counts consecutive fatal-connection batches and escalates (thresholds tunable via WriteDrainOpts):

  • Tier 2 — reopen its own write pool in-process (drops poisoned write connections). Cheap, idempotent.
  • Tier 3a — flip a shared WriteQueueHealth { degraded, consecutive_fatal, reopens, … } (the observability production completely lacked).
  • Tier 3b — fire a one-shot on_persistent_failure hook: the seam the app uses to restart the engine — the only thing that rebuilds the shared WAL-index + read pool (the real cure for a process-wide desync).

The read pool is not reopened in-process on purpose: that's 137 self.pool call sites in db.rs = real regression risk. Tier-3's engine restart is the "reopen everything" cure. spawn_write_drain stays as a no-recovery back-compat wrapper; db.rs wires spawn_write_drain_with + a WritePoolRebuilder and exposes DatabaseManager::write_queue_health().
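The escalation described above can be sketched as a small state machine. This is a minimal illustration, not the code from this PR: the `BatchOutcome` variants, the `WriteDrainOpts` field names (`reopen_after`, `degrade_after`), the `Escalator` type, and the rule that a successful batch clears `degraded` are all assumptions made for the example. Only the type names `BatchOutcome`, `WriteDrainOpts`, `WriteQueueHealth` and the hook name `on_persistent_failure` come from the description.

```rust
use std::sync::atomic::{AtomicBool, AtomicU64, Ordering};
use std::sync::Arc;

/// Result of one write batch. The PR names the type; these variants are assumed.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum BatchOutcome {
    Ok,
    /// Failed for a reason that does not implicate the connection.
    Failed,
    /// acquire() / BEGIN IMMEDIATE hit a fatal connection error.
    FatalConnection,
}

/// Escalation thresholds. Field names are hypothetical.
pub struct WriteDrainOpts {
    pub reopen_after: u64,
    pub degrade_after: u64,
}

/// Shared health snapshot; only the fields listed in the PR are modeled.
#[derive(Default)]
pub struct WriteQueueHealth {
    pub degraded: AtomicBool,
    pub consecutive_fatal: AtomicU64,
    pub reopens: AtomicU64,
}

/// Hypothetical helper that holds the drain loop's escalation state.
pub struct Escalator {
    opts: WriteDrainOpts,
    health: Arc<WriteQueueHealth>,
    streak: u64,
    on_persistent_failure: Option<Box<dyn FnOnce() + Send>>,
}

impl Escalator {
    pub fn new(opts: WriteDrainOpts, health: Arc<WriteQueueHealth>) -> Self {
        Self { opts, health, streak: 0, on_persistent_failure: None }
    }

    pub fn with_hook(mut self, hook: impl FnOnce() + Send + 'static) -> Self {
        self.on_persistent_failure = Some(Box::new(hook));
        self
    }

    /// Record one batch outcome. Returns true when the caller should
    /// reopen the write pool (tier 2).
    pub fn observe(&mut self, outcome: BatchOutcome) -> bool {
        if outcome != BatchOutcome::FatalConnection {
            self.streak = 0;
            self.health.consecutive_fatal.store(0, Ordering::SeqCst);
            if outcome == BatchOutcome::Ok {
                // Assumption: a committed batch clears the degraded flag.
                self.health.degraded.store(false, Ordering::SeqCst);
            }
            return false;
        }
        self.streak += 1;
        self.health.consecutive_fatal.store(self.streak, Ordering::SeqCst);
        if self.streak >= self.opts.degrade_after {
            // Tier 3a: make the wedge observable.
            self.health.degraded.store(true, Ordering::SeqCst);
            // Tier 3b: take() guarantees the hook runs at most once.
            if let Some(hook) = self.on_persistent_failure.take() {
                hook();
            }
        }
        let reopen = self.streak >= self.opts.reopen_after;
        if reopen {
            self.health.reopens.fetch_add(1, Ordering::SeqCst);
        }
        reopen
    }
}
```

In this sketch the drain loop would call `observe` once per batch and rebuild its write pool whenever it returns `true`. Keeping the hook in an `Option` and calling `take()` is one simple way to get the one-shot behavior the description asks for.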

Reproduction (and a non-obvious finding)

A test-only SQLite VFS failpoint (failpoint_vfs.rs) injects a real disk read failure through live sqlx (works because screenpipe-db + sqlx-sqlite share one bundled libsqlite3-sys).

You cannot reproduce the wedge with SQLITE_IOERR_SHORT_READ (522) — SQLite zero-fills and tolerates short reads on read and write paths (an armed INSERT still commits). The wedge needs a hard SQLITE_IOERR (same "disk I/O error" message + recovery path). So production's persistent 522 wasn't a benign short read; reads were genuinely unable to complete — consistent with a WAL-index desync.
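For readers unfamiliar with the numbers: SQLite extended result codes keep the primary code in the low 8 bits, so 522 is SQLITE_IOERR (10) with sub-code 2. A small sketch of that arithmetic follows; the helper names `primary_code` and `sub_code` are illustrative and are not part of screenpipe or libsqlite3-sys.

```rust
/// Primary result code for I/O errors, as defined by SQLite.
pub const SQLITE_IOERR: i32 = 10;
/// Extended code: primary code in the low byte, sub-code 2 in the next byte.
pub const SQLITE_IOERR_SHORT_READ: i32 = SQLITE_IOERR | (2 << 8);

/// Low 8 bits of an extended result code (hypothetical helper).
pub fn primary_code(extended: i32) -> i32 {
    extended & 0xff
}

/// Bits above the low byte of an extended result code (hypothetical helper).
pub fn sub_code(extended: i32) -> i32 {
    extended >> 8
}
```

Both the short-read code and a hard SQLITE_IOERR share the same primary code, which is why they surface with the same "disk I/O error" message even though only the latter wedges writes.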

The harness models the real cure semantics: the fault heals only when every connection closes (a restart), never on a same-pool retry.
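The "heals only on full close" semantics can be illustrated with a toy model that involves no SQLite at all. This is a simplified sketch, not the contents of failpoint_vfs.rs: `Failpoint`, `Conn`, and `open` are hypothetical names, and the real harness injects the fault at the VFS layer rather than in a wrapper type.

```rust
use std::sync::atomic::{AtomicBool, AtomicUsize, Ordering};
use std::sync::Arc;

/// Toy fault shared by every connection to one "database".
#[derive(Default)]
pub struct Failpoint {
    armed: AtomicBool,
    open: AtomicUsize,
}

impl Failpoint {
    /// Start failing every write on every connection.
    pub fn arm(&self) {
        self.armed.store(true, Ordering::SeqCst);
    }
}

/// A connection handle; dropping it counts as closing the connection.
pub struct Conn {
    fp: Arc<Failpoint>,
}

/// Open a new connection against the shared failpoint.
pub fn open(fp: &Arc<Failpoint>) -> Conn {
    fp.open.fetch_add(1, Ordering::SeqCst);
    Conn { fp: Arc::clone(fp) }
}

impl Conn {
    /// Fails with the same message as a hard SQLITE_IOERR while armed.
    pub fn write(&self) -> Result<(), &'static str> {
        if self.fp.armed.load(Ordering::SeqCst) {
            Err("disk I/O error")
        } else {
            Ok(())
        }
    }
}

impl Drop for Conn {
    fn drop(&mut self) {
        // fetch_sub returns the previous count, so 1 means this was the
        // last open connection: only then does the fault heal.
        if self.fp.open.fetch_sub(1, Ordering::SeqCst) == 1 {
            self.fp.armed.store(false, Ordering::SeqCst);
        }
    }
}
```

Under this model, closing and reopening only the writer while a reader stays open leaves the fault armed, which mirrors why a write-pool reopen alone may not cure a process-wide desync and why the restart hook exists.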

Tests

  • failpoint_injects_disk_io_error_and_heals_only_on_full_close — the harness self-test.
  • write_queue_detects_wedge_signals_restart_and_recovers — end-to-end: arm → writes fail → degraded + in-process reopens + restart hook fires → fault clears → writes recover, durably.
  • 90/90 screenpipe-db lib tests pass (17 existing write-queue tests unchanged), screenpipe-engine builds, fmt + clippy clean.

Not in this PR (follow-up)

The app-side wiring: set the on_persistent_failure hook to trigger the recording restart, and surface is_degraded() in routes/health.rs. Until that lands, production gets tier-2 + the degraded flag but not the auto-restart (the hook is None in db.rs). That step changes live-engine restart behavior, so it's worth a separate verification pass.

🤖 Generated with Claude Code

A persistent fatal disk error ("disk I/O error" / malformed / pool lost) made every write batch fail at acquire/BEGIN IMMEDIATE. The drain loop retried the same poisoned pool 3x, then dropped the batch, and every later batch did the same, silently losing all writes (audio, OCR, etc.) until a manual restart. The app stayed "Running" with no crash report and no signal. This is the root cause behind the recurring "screenpipe closed suddenly" reports (escalating: Jun 8 had 4,277 such errors).

execute_batch now returns a BatchOutcome; the drain loop counts consecutive fatal-connection batches and escalates (thresholds tunable via WriteDrainOpts): tier-2 reopens its own write pool in-process; tier-3a flips a shared WriteQueueHealth{degraded,...}; tier-3b fires a one-shot on_persistent_failure hook — the seam the app uses to restart the engine (the only cure for a shared WAL-index desync; the read pool is intentionally NOT reopened in-process to avoid churning 137 call sites). spawn_write_drain stays as a no-recovery back-compat wrapper; db.rs wires spawn_write_drain_with + a WritePoolRebuilder and exposes DatabaseManager::write_queue_health().

Reproduced with a test-only SQLite VFS failpoint (failpoint_vfs.rs) that injects a real disk I/O error through live sqlx. Key finding: SQLITE_IOERR_SHORT_READ (522) is zero-filled/tolerated by SQLite and does NOT wedge writes — the wedge needs a hard SQLITE_IOERR (same "disk I/O error" message + recovery path). Tests: failpoint self-test + write_queue_detects_wedge_signals_restart_and_recovers (end-to-end: arm -> writes fail -> degraded+reopens+hook fired -> clear -> recovers durably). All 90 screenpipe-db lib tests pass; screenpipe-engine builds; fmt+clippy clean.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@github-actions

Contributor

Diarization eval results

Source: crates/screenpipe-audio-eval/evals/ · VoxConverse dev (CC-BY-4.0) + composed workday templates + screenpipe-shaped LibriSpeech fixtures

| fixture | DER | VAD FA | VAD FN | boundary err (s) | continuity | predicted / true spk |
| --- | --- | --- | --- | --- | --- | --- |
| interrupted_meeting | 0.186 | 0.01 | 0.063 | 20.286 | 0.833 | 9 / 5 |
| long_silence_day | 0.437 | 0.011 | 0.145 | 11.46 | 0.7 | 14 / 10 |
| screenpipe_meeting_rapid_handoffs | 0.241 | 0.196 | 0.099 | 2.305 | 1 | 5 / 3 |
| screenpipe_background_24_7_day | 0.315 | 0.025 | 0.159 | 2.203 | 1 | 4 / 3 |
| screenpipe_short_backchannels | 0.561 | 0.915 | 0.064 | 0.488 | n/a | 3 / 3 |
| screenpipe_mic_system_echo_leakage | 0.275 | 0.198 | 0.084 | 3.045 | 0.667 | 5 / 3 |
| screenpipe_overlap_crosstalk | 0.254 | 0.84 | 0.042 | 0.667 | n/a | 3 / 3 |
| abjxc | 0.016 | 0.098 | 0.002 | 1.151 | n/a | 2 / 1 |
| bxpwa | 0.111 | 0.453 | 0.029 | 20.793 | 0.714 | 8 / 5 |
| dhorc | 0.143 | 0.461 | 0.034 | 3.681 | 1 | 5 / 4 |

DER, VAD FA, VAD FN, boundary err: lower is better. Continuity: higher is better, 1.0 = same hyp cluster across all silence gaps. Composed workday rows and screenpipe_* rows exercise screenpipe-shaped usage: meetings, background gaps, backchannels, echo leakage, and crosstalk. Raw VoxConverse rows score broadcast-quality stems for comparison. See crates/screenpipe-audio-eval/evals/README.md for methodology.

Pipeline replay matrix

Source: generated screenpipe_* fixtures materialized into temp screenpipe SQLite DBs, then read back through search_audio. This catches storage/search regressions that pure DER scoring misses.

| scenarios | passed | failed | skipped | avg background DER | avg background speaker err | Deepgram |
| --- | --- | --- | --- | --- | --- | --- |
| 41 | 40 | 0 | 1 | 0.329 | 0.183 | skip |

The no-secret CI matrix runs local diarization under Parakeet/Whisper engine labels across live/background and mic/system device profiles. Real Deepgram/screenpipe-cloud smoke can be run locally with --deepgram required when credentials are present.

Transcription quality

Source: LibriSpeech test-clean (CC-BY-4.0) · per-model utterance cap · normalized lowercased word-level Levenshtein

| model | utterances | WER | CER | throughput (samples/s) |
| --- | --- | --- | --- | --- |
| tiny | 50 | 0.085 | 0.033 | 68896 |
| whisper-large-v3-turbo-quantized | 20 | 0.042 | 0.009 | 1847 |
| parakeet | 50 | 0.04 | 0.026 | 102928 |

WER + CER on read-aloud speech. Per-model utterance caps keep wall time bounded — tiny/parakeet at 50, the heavier large-v3-turbo-quantized at 20. See README for normalization rules.

@louis030195 louis030195 merged commit 62a6810 into main Jun 10, 2026
21 of 23 checks passed