fix(db): recover write queue from persistent disk-I/O wedges #3953
A persistent fatal disk error ("disk I/O error" / malformed / pool lost) made every write batch fail at acquire/BEGIN IMMEDIATE. The drain loop retried the same poisoned pool 3x, then dropped the batch, and every later batch did the same, silently losing all writes (audio, OCR, etc.) until a manual restart. The app stayed "Running" with no crash report and no signal. This is the root cause behind the recurring "screenpipe closed suddenly" reports (escalating: Jun 8 had 4,277 such errors).
execute_batch now returns a BatchOutcome; the drain loop counts consecutive fatal-connection batches and escalates (thresholds tunable via WriteDrainOpts): tier-2 reopens its own write pool in-process; tier-3a flips a shared WriteQueueHealth{degraded,...}; tier-3b fires a one-shot on_persistent_failure hook — the seam the app uses to restart the engine (the only cure for a shared WAL-index desync; the read pool is intentionally NOT reopened in-process to avoid churning 137 call sites). spawn_write_drain stays as a no-recovery back-compat wrapper; db.rs wires spawn_write_drain_with + a WritePoolRebuilder and exposes DatabaseManager::write_queue_health().
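The escalation ladder described above can be sketched as a small counter-driven state machine. This is a minimal illustration, not the PR's code: the `BatchOutcome` variants, the `WriteDrainOpts` field names, the `Escalator` type, and the choice to reopen on every fatal batch past the threshold are all assumptions; only the type names `BatchOutcome`, `WriteDrainOpts`, `WriteQueueHealth` and the `on_persistent_failure` hook come from the description.

```rust
use std::sync::atomic::{AtomicBool, AtomicU32, Ordering};
use std::sync::Arc;

/// Result of one write batch. Variant names are assumptions.
#[derive(Debug, Clone, Copy, PartialEq)]
enum BatchOutcome {
    Committed,
    /// Failed at acquire/BEGIN IMMEDIATE with a fatal connection error.
    FatalConnection,
}

/// Hypothetical thresholds, counted in consecutive fatal batches.
struct WriteDrainOpts {
    reopen_after: u32,
    degrade_after: u32,
}

/// Shared health snapshot; field names follow the PR text.
#[derive(Default)]
struct WriteQueueHealth {
    degraded: AtomicBool,
    consecutive_fatal: AtomicU32,
    reopens: AtomicU32,
}

/// Hypothetical helper that owns the escalation state of the drain loop.
struct Escalator {
    opts: WriteDrainOpts,
    health: Arc<WriteQueueHealth>,
    /// One-shot hook: `take()` guarantees it fires at most once.
    on_persistent_failure: Option<Box<dyn FnOnce() + Send>>,
}

impl Escalator {
    /// Feed one batch outcome in; `reopen_pool` stands in for the write-pool rebuilder.
    fn observe(&mut self, outcome: BatchOutcome, reopen_pool: &mut dyn FnMut()) {
        match outcome {
            BatchOutcome::Committed => {
                // A successful batch ends the streak and clears the flag.
                self.health.consecutive_fatal.store(0, Ordering::SeqCst);
                self.health.degraded.store(false, Ordering::SeqCst);
            }
            BatchOutcome::FatalConnection => {
                let n = self.health.consecutive_fatal.fetch_add(1, Ordering::SeqCst) + 1;
                if n >= self.opts.reopen_after {
                    // Tier 2: rebuild only the write pool, in-process.
                    reopen_pool();
                    self.health.reopens.fetch_add(1, Ordering::SeqCst);
                }
                if n >= self.opts.degrade_after {
                    // Tier 3a: publish the degraded state.
                    self.health.degraded.store(true, Ordering::SeqCst);
                    // Tier 3b: signal the app once so it can restart the engine.
                    if let Some(hook) = self.on_persistent_failure.take() {
                        hook();
                    }
                }
            }
        }
    }
}
```

Under these assumed thresholds (`reopen_after: 2`, `degrade_after: 3`), four fatal batches in a row would trigger three reopens, set `degraded`, and fire the hook exactly once; the next committed batch resets the streak.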
Reproduced with a test-only SQLite VFS failpoint (failpoint_vfs.rs) that injects a real disk I/O error through live sqlx. Key finding: SQLITE_IOERR_SHORT_READ (522) is zero-filled/tolerated by SQLite and does NOT wedge writes — the wedge needs a hard SQLITE_IOERR (same "disk I/O error" message + recovery path). Tests: failpoint self-test + write_queue_detects_wedge_signals_restart_and_recovers (end-to-end: arm -> writes fail -> degraded+reopens+hook fired -> clear -> recovers durably). All 90 screenpipe-db lib tests pass; screenpipe-engine builds; fmt+clippy clean.
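The relationship between the two result codes mentioned above can be checked with a little arithmetic. The constants below follow SQLite's public numbering (an extended code carries the primary code in its low byte); the helper names `primary_code` and `is_short_read` are hypothetical and only serve this illustration.

```rust
/// Primary SQLite result code for I/O errors (value from sqlite3.h).
const SQLITE_IOERR: i32 = 10;

/// Extended result code: SQLITE_IOERR with sub-code 2 in the next byte.
const SQLITE_IOERR_SHORT_READ: i32 = SQLITE_IOERR | (2 << 8);

/// The low 8 bits of an extended code are its primary code.
fn primary_code(extended: i32) -> i32 {
    extended & 0xff
}

/// Hypothetical check: true only for the short-read variant, which the
/// PR reports SQLite tolerates by zero-filling the buffer.
fn is_short_read(code: i32) -> bool {
    code == SQLITE_IOERR_SHORT_READ
}
```

Both codes share the primary code 10, so a failpoint that wants to reproduce the wedge has to return the plain `SQLITE_IOERR` rather than the short-read variant.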
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Diarization eval results
DER, VAD FA, VAD FN, boundary err: lower is better. Continuity: higher is better; 1.0 = same hyp cluster across all silence gaps.

Pipeline replay matrix
Source: generated
The no-secret CI matrix runs local diarization under Parakeet/Whisper engine labels across live/background and mic/system device profiles. Real Deepgram/screenpipe-cloud smoke can be run locally.

Transcription quality
Source: LibriSpeech test-clean (CC-BY-4.0) · per-model utterance cap · normalized lowercased word-level Levenshtein
WER + CER on read-aloud speech. Per-model utterance caps keep wall time bounded: tiny/parakeet at 50, the heavier large-v3-turbo-quantized at 20. See README for normalization rules.
Problem
A persistent fatal disk error (`disk I/O error` / `database disk image is malformed` / pool lost) makes every write batch fail at `acquire()` / `BEGIN IMMEDIATE`. The drain loop retried the same poisoned pool 3×, then dropped the batch, and the next batch did the same, forever. Result: all writes (audio chunks, OCR, transcriptions…) silently dropped for 10–15 min, the app stays "Running" with no crash report and no signal, and only a manual restart recovers it.

This is the root cause behind the recurring "screenpipe closed suddenly" reports. Occurrences are escalating: Jun 6: 123 · Jun 7: 84 · Jun 8: 4,277 · Jun 9: 1,509 per app log. `mmap=0` (#3889) fixed a different (corruption) failure mode and never touched this path.

Fix: tiered, in-process recovery + an escalation seam
`execute_batch` now returns a `BatchOutcome`; the drain loop counts consecutive fatal-connection batches and escalates (thresholds tunable via `WriteDrainOpts`):

- Tier 2: the drain loop reopens its own write pool in-process.
- Tier 3a: it flips a shared `WriteQueueHealth { degraded, consecutive_fatal, reopens, … }` (the observability production completely lacked).
- Tier 3b: it fires a one-shot `on_persistent_failure` hook: the seam the app uses to restart the engine, the only thing that rebuilds the shared WAL-index + read pool (the real cure for a process-wide desync).

The read pool is not reopened in-process on purpose: that's 137 `self.pool` call sites in `db.rs` = real regression risk. Tier-3's engine restart is the "reopen everything" cure.

`spawn_write_drain` stays as a no-recovery back-compat wrapper; `db.rs` wires `spawn_write_drain_with` + a `WritePoolRebuilder` and exposes `DatabaseManager::write_queue_health()`.

Reproduction (and a non-obvious finding)
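A test-only SQLite VFS failpoint (`failpoint_vfs.rs`) injects a real disk read failure through live sqlx (works because `screenpipe-db` + `sqlx-sqlite` share one bundled `libsqlite3-sys`).

The non-obvious finding: `SQLITE_IOERR_SHORT_READ` (522) is zero-filled/tolerated by SQLite and does not wedge writes; the wedge needs a hard `SQLITE_IOERR` (same "disk I/O error" message + recovery path).

The harness models the real cure semantics: the fault heals only when every connection closes (a restart), never on a same-pool retry.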
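The heal-on-full-close semantics can be illustrated with a toy model. This is not the VFS code from `failpoint_vfs.rs`: the `Failpoint` struct, its fields, and its methods are hypothetical, and the model is single-threaded for brevity.

```rust
/// Toy stand-in for the failpoint's state (all names are assumptions).
#[derive(Default)]
struct Failpoint {
    armed: bool,
    open_conns: usize,
}

impl Failpoint {
    /// A connection to the database file is opened.
    fn open(&mut self) {
        self.open_conns += 1;
    }

    /// A connection is closed; the fault heals only when none remain.
    fn close(&mut self) {
        self.open_conns = self.open_conns.saturating_sub(1);
        if self.open_conns == 0 {
            self.armed = false;
        }
    }

    /// Start injecting failures.
    fn arm(&mut self) {
        self.armed = true;
    }

    /// Simulated read: fails for as long as the fault is armed.
    fn read(&self) -> Result<(), &'static str> {
        if self.armed {
            Err("disk I/O error")
        } else {
            Ok(())
        }
    }
}
```

In this model, retrying `read()` while any connection is still open keeps failing, which mirrors why a same-pool retry cannot recover; only after the last `close()` does a fresh `open()` see a healthy file.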
Tests
- `failpoint_injects_disk_io_error_and_heals_only_on_full_close`: the harness self-test.
- `write_queue_detects_wedge_signals_restart_and_recovers`: end-to-end: arm → writes fail → `degraded` + in-process reopens + restart hook fires → fault clears → writes recover, durably.
- All 90 `screenpipe-db` lib tests pass (17 existing write-queue tests unchanged), `screenpipe-engine` builds, `fmt` + `clippy` clean.

Not in this PR (follow-up)
The app-side wiring: set the `on_persistent_failure` hook to trigger the recording restart, and surface `is_degraded()` in `routes/health.rs`. Until that lands, production gets tier-2 + the degraded flag but not the auto-restart (the hook is `None` in `db.rs`). That step changes live-engine restart behavior, so it's worth a separate verification pass.

🤖 Generated with Claude Code