fix(db): recover write queue from persistent disk-I/O wedges by louis030195 · Pull Request #3953 · screenpipe/screenpipe

louis030195 · 2026-06-09T23:48:09Z

Problem

A persistent fatal disk error (disk I/O error / database disk image is malformed / pool lost) makes every write batch fail at acquire() / BEGIN IMMEDIATE. The drain loop retried the same poisoned pool 3× then dropped the batch — and the next batch did the same, forever. Result: all writes (audio chunks, OCR, transcriptions…) silently dropped for 10–15 min, the app stays "Running" with no crash report and no signal, and only a manual restart recovers it.

This is the root cause behind the recurring "screenpipe closed suddenly" reports. Occurrences are escalating: Jun 6: 123 · Jun 7: 84 · Jun 8: 4,277 · Jun 9: 1,509 per app log. mmap=0 (#3889) fixed a different (corruption) failure mode and never touched this path.

Fix — tiered, in-process recovery + an escalation seam

execute_batch now returns a BatchOutcome; the drain loop counts consecutive fatal-connection batches and escalates (thresholds tunable via WriteDrainOpts):

Tier 2 — reopen its own write pool in-process (drops poisoned write connections). Cheap, idempotent.
Tier 3a — flip a shared WriteQueueHealth { degraded, consecutive_fatal, reopens, … } (the observability production completely lacked).
Tier 3b — fire a one-shot on_persistent_failure hook: the seam the app uses to restart the engine — the only thing that rebuilds the shared WAL-index + read pool (the real cure for a process-wide desync).

The read pool is not reopened in-process on purpose: that's 137 self.pool call sites in db.rs = real regression risk. Tier-3's engine restart is the "reopen everything" cure. spawn_write_drain stays as a no-recovery back-compat wrapper; db.rs wires spawn_write_drain_with + a WritePoolRebuilder and exposes DatabaseManager::write_queue_health().
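The escalation described above can be sketched as a small state machine. This is a minimal illustration, not the code in the PR: the `BatchOutcome` variant names, the `WriteDrainOpts` field names (`reopen_after`, `escalate_after`), the `Escalator` type, and the rule that a committed batch clears the degraded flag are all assumptions made for the example. Only the type names `BatchOutcome`, `WriteDrainOpts`, `WriteQueueHealth` and the fields `degraded`, `consecutive_fatal`, `reopens` come from the description.

```rust
use std::sync::atomic::{AtomicBool, AtomicU32, Ordering};
use std::sync::Arc;

/// Result of one write batch (variant names are assumptions).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum BatchOutcome {
    Committed,
    /// Failed for a reason unrelated to the connection; ignored in this sketch.
    Failed,
    /// Failed at acquire() / BEGIN IMMEDIATE with a fatal connection error.
    FatalConnection,
}

/// Shared health flags, readable from other threads.
#[derive(Debug, Default)]
pub struct WriteQueueHealth {
    pub degraded: AtomicBool,
    pub consecutive_fatal: AtomicU32,
    pub reopens: AtomicU32,
}

/// Hypothetical thresholds; the real fields are not shown in the PR text.
pub struct WriteDrainOpts {
    pub reopen_after: u32,
    pub escalate_after: u32,
}

/// Hypothetical helper that the drain loop would call once per batch.
pub struct Escalator {
    opts: WriteDrainOpts,
    health: Arc<WriteQueueHealth>,
    reopen_write_pool: Box<dyn FnMut() + Send>,
    on_persistent_failure: Option<Box<dyn FnOnce() + Send>>,
}

impl Escalator {
    pub fn new(
        opts: WriteDrainOpts,
        health: Arc<WriteQueueHealth>,
        reopen_write_pool: Box<dyn FnMut() + Send>,
        on_persistent_failure: Option<Box<dyn FnOnce() + Send>>,
    ) -> Self {
        Self { opts, health, reopen_write_pool, on_persistent_failure }
    }

    pub fn observe(&mut self, outcome: BatchOutcome) {
        match outcome {
            BatchOutcome::FatalConnection => {
                let n = self.health.consecutive_fatal.fetch_add(1, Ordering::SeqCst) + 1;
                // Tier 2: reopen the write pool (assumed safe to repeat).
                if n >= self.opts.reopen_after {
                    (self.reopen_write_pool)();
                    self.health.reopens.fetch_add(1, Ordering::SeqCst);
                }
                if n >= self.opts.escalate_after {
                    // Tier 3a: make the wedge observable.
                    self.health.degraded.store(true, Ordering::SeqCst);
                    // Tier 3b: take() guarantees the hook runs at most once.
                    if let Some(hook) = self.on_persistent_failure.take() {
                        hook();
                    }
                }
            }
            BatchOutcome::Committed => {
                // Assumption: a successful commit ends the streak and clears the flag.
                self.health.consecutive_fatal.store(0, Ordering::SeqCst);
                self.health.degraded.store(false, Ordering::SeqCst);
            }
            BatchOutcome::Failed => {}
        }
    }
}
```

Storing the hook as `Option<Box<dyn FnOnce()>>` is one simple way to get one-shot behavior: once `take()` has emptied the option, later fatal batches still reopen the pool and keep the flag set, but cannot trigger a second restart.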

Reproduction (and a non-obvious finding)

A test-only SQLite VFS failpoint (failpoint_vfs.rs) injects a real disk read failure through live sqlx (works because screenpipe-db + sqlx-sqlite share one bundled libsqlite3-sys).

You cannot reproduce the wedge with SQLITE_IOERR_SHORT_READ (522) — SQLite zero-fills and tolerates short reads on read and write paths (an armed INSERT still commits). The wedge needs a hard SQLITE_IOERR (same "disk I/O error" message + recovery path). So production's persistent 522 wasn't a benign short read; reads were genuinely unable to complete — consistent with a WAL-index desync.
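To make the relationship between the two codes concrete: SQLite extended result codes keep the primary code in the low byte, so 522 and the plain `SQLITE_IOERR` (10) share a primary code and therefore the same "disk I/O error" text. The sketch below shows that arithmetic. The numeric constants match `sqlite3.h`; the function names are hypothetical and the message classifier is only an illustration, not the matcher used in the PR.

```rust
/// Primary result code for I/O errors (SQLITE_IOERR in sqlite3.h).
pub const SQLITE_IOERR: i32 = 10;
/// Extended code, defined in sqlite3.h as SQLITE_IOERR | (2 << 8).
pub const SQLITE_IOERR_SHORT_READ: i32 = SQLITE_IOERR | (2 << 8);

/// The low 8 bits of an extended result code are the primary code.
pub fn primary_code(extended: i32) -> i32 {
    extended & 0xff
}

/// True only for the short-read variant, which the PR reports does not wedge writes.
pub fn is_short_read(code: i32) -> bool {
    code == SQLITE_IOERR_SHORT_READ
}

/// Hypothetical classifier over error text: both strings are standard SQLite
/// messages, but whether the real code matches on text is an assumption.
pub fn is_fatal_connection_message(msg: &str) -> bool {
    let m = msg.to_ascii_lowercase();
    m.contains("disk i/o error") || m.contains("database disk image is malformed")
}
```

Because the message is derived from the primary code, a log line that only carries the text cannot distinguish a tolerated short read from a hard I/O failure; the numeric extended code is needed for that.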

The harness models the real cure semantics: the fault heals only when every connection closes (a restart), never on a same-pool retry.
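The "heals only on full close" rule can be modeled with two counters. This is a sketch of the semantics only, under the assumption that the VFS calls a hook on every open, close, and read; the `Failpoint` type and its method names are hypothetical and are not taken from `failpoint_vfs.rs`.

```rust
use std::sync::atomic::{AtomicBool, AtomicUsize, Ordering};

/// Hypothetical failpoint state shared by all connections of a test VFS.
#[derive(Debug, Default)]
pub struct Failpoint {
    armed: AtomicBool,
    open_connections: AtomicUsize,
}

impl Failpoint {
    /// Start failing reads on every connection, old and new.
    pub fn arm(&self) {
        self.armed.store(true, Ordering::SeqCst);
    }

    /// Called when a connection opens the database file.
    pub fn on_open(&self) {
        self.open_connections.fetch_add(1, Ordering::SeqCst);
    }

    /// Called when a connection closes; must be paired with on_open.
    /// The fault clears only when the last connection goes away.
    pub fn on_close(&self) {
        let previous = self.open_connections.fetch_sub(1, Ordering::SeqCst);
        if previous == 1 {
            self.armed.store(false, Ordering::SeqCst);
        }
    }

    /// Checked on the read path; true means "return a hard I/O error".
    pub fn read_should_fail(&self) -> bool {
        self.armed.load(Ordering::SeqCst)
    }
}
```

Under this model, a retry on the same pool keeps at least one connection open and keeps failing, and reopening only the write pool does not help while read connections are still open, which mirrors why the description treats an engine restart as the real cure.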

Tests

failpoint_injects_disk_io_error_and_heals_only_on_full_close — the harness self-test.
write_queue_detects_wedge_signals_restart_and_recovers — end-to-end: arm → writes fail → degraded + in-process reopens + restart hook fires → fault clears → writes recover, durably.
90/90 screenpipe-db lib tests pass (17 existing write-queue tests unchanged), screenpipe-engine builds, fmt + clippy clean.

Not in this PR (follow-up)

The app-side wiring: set the on_persistent_failure hook to trigger the recording restart, and surface is_degraded() in routes/health.rs. Until that lands, production gets tier-2 + the degraded flag but not the auto-restart (the hook is None in db.rs). That step changes live-engine restart behavior, so it's worth a separate verification pass.

🤖 Generated with Claude Code

A persistent fatal disk error ("disk I/O error" / malformed / pool lost) made every write batch fail at acquire/BEGIN IMMEDIATE. The drain loop retried the same poisoned pool 3x then dropped the batch forever, silently losing all writes (audio, OCR, etc.) until a manual restart. The app stayed "Running" with no crash report and no signal. This is the root cause behind the recurring "screenpipe closed suddenly" reports (escalating: Jun 8 had 4277 such errors).

execute_batch now returns a BatchOutcome; the drain loop counts consecutive fatal-connection batches and escalates (thresholds tunable via WriteDrainOpts):

- tier-2 reopens its own write pool in-process;
- tier-3a flips a shared WriteQueueHealth{degraded,...};
- tier-3b fires a one-shot on_persistent_failure hook, the seam the app uses to restart the engine (the only cure for a shared WAL-index desync; the read pool is intentionally NOT reopened in-process to avoid churning 137 call sites).

spawn_write_drain stays as a no-recovery back-compat wrapper; db.rs wires spawn_write_drain_with + a WritePoolRebuilder and exposes DatabaseManager::write_queue_health().

Reproduced with a test-only SQLite VFS failpoint (failpoint_vfs.rs) that injects a real disk I/O error through live sqlx. Key finding: SQLITE_IOERR_SHORT_READ (522) is zero-filled/tolerated by SQLite and does NOT wedge writes; the wedge needs a hard SQLITE_IOERR (same "disk I/O error" message + recovery path).

Tests: failpoint self-test + write_queue_detects_wedge_signals_restart_and_recovers (end-to-end: arm -> writes fail -> degraded+reopens+hook fired -> clear -> recovers durably). All 90 screenpipe-db lib tests pass; screenpipe-engine builds; fmt+clippy clean.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

github-actions · 2026-06-10T00:23:16Z

Diarization eval results

Source: crates/screenpipe-audio-eval/evals/ · VoxConverse dev (CC-BY-4.0) + composed workday templates + screenpipe-shaped LibriSpeech fixtures

fixture	DER	VAD FA	VAD FN	boundary err (s)	continuity	predicted / true spk
interrupted_meeting	0.186	0.01	0.063	20.286	0.833	9 / 5
long_silence_day	0.437	0.011	0.145	11.46	0.7	14 / 10
screenpipe_meeting_rapid_handoffs	0.241	0.196	0.099	2.305	1	5 / 3
screenpipe_background_24_7_day	0.315	0.025	0.159	2.203	1	4 / 3
screenpipe_short_backchannels	0.561	0.915	0.064	0.488	n/a	3 / 3
screenpipe_mic_system_echo_leakage	0.275	0.198	0.084	3.045	0.667	5 / 3
screenpipe_overlap_crosstalk	0.254	0.84	0.042	0.667	n/a	3 / 3
abjxc	0.016	0.098	0.002	1.151	n/a	2 / 1
bxpwa	0.111	0.453	0.029	20.793	0.714	8 / 5
dhorc	0.143	0.461	0.034	3.681	1	5 / 4

DER, VAD FA, VAD FN, boundary err: lower is better. Continuity: higher is better, 1.0 = same hyp cluster across all silence gaps. Composed workday rows and screenpipe_* rows exercise screenpipe-shaped usage: meetings, background gaps, backchannels, echo leakage, and crosstalk. Raw VoxConverse rows score broadcast-quality stems for comparison. See crates/screenpipe-audio-eval/evals/README.md for methodology.

Pipeline replay matrix

Source: generated screenpipe_* fixtures materialized into temp screenpipe SQLite DBs, then read back through search_audio. This catches storage/search regressions that pure DER scoring misses.

scenarios	passed	failed	skipped	avg background DER	avg background speaker err	Deepgram
41	40	0	1	0.329	0.183	skip

The no-secret CI matrix runs local diarization under Parakeet/Whisper engine labels across live/background and mic/system device profiles. Real Deepgram/screenpipe-cloud smoke can be run locally with --deepgram required when credentials are present.

Transcription quality

Source: LibriSpeech test-clean (CC-BY-4.0) · per-model utterance cap · normalized lowercased word-level Levenshtein

model	utterances	WER	CER	throughput (samples/s)
tiny	50	0.085	0.033	68896
whisper-large-v3-turbo-quantized	20	0.042	0.009	1847
parakeet	50	0.04	0.026	102928

WER + CER on read-aloud speech. Per-model utterance caps keep wall time bounded: tiny/parakeet at 50, the heavier large-v3-turbo-quantized at 20. See README for normalization rules.
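For readers unfamiliar with the metric, word error rate is the word-level Levenshtein distance between hypothesis and reference, divided by the number of reference words. The sketch below assumes normalization is just lowercasing and whitespace splitting; the eval's actual rules live in its README and may differ, and the function names here are hypothetical.

```rust
/// Assumed normalization: lowercase, then split on whitespace.
pub fn normalize(text: &str) -> Vec<String> {
    text.to_lowercase()
        .split_whitespace()
        .map(str::to_string)
        .collect()
}

/// Levenshtein distance over arbitrary items, using two rolling rows.
pub fn edit_distance<T: PartialEq>(a: &[T], b: &[T]) -> usize {
    // prev[j] = distance between the first i items of a and the first j of b.
    let mut prev: Vec<usize> = (0..=b.len()).collect();
    for (i, x) in a.iter().enumerate() {
        let mut cur = vec![i + 1];
        for (j, y) in b.iter().enumerate() {
            let substitute = prev[j] + usize::from(x != y);
            let delete = prev[j + 1] + 1;
            let insert = cur[j] + 1;
            cur.push(substitute.min(delete).min(insert));
        }
        prev = cur;
    }
    prev[b.len()]
}

/// WER = word edits / reference word count; None for an empty reference.
pub fn wer(reference: &str, hypothesis: &str) -> Option<f64> {
    let r = normalize(reference);
    let h = normalize(hypothesis);
    if r.is_empty() {
        return None;
    }
    Some(edit_distance(&r, &h) as f64 / r.len() as f64)
}
```

CER follows the same shape with characters in place of words, for example by passing `Vec<char>` slices to `edit_distance`.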

louis030195 merged commit 62a6810 into main Jun 10, 2026
21 of 23 checks passed

louis030195 merged 1 commit into main from claude/exciting-shannon-15985d
