fix(db): disable mmap + back off redact worker on SQLite corruption #3889
Memory-mapped I/O (mmap_size=256MB) maps the SQLite DB file *writably* into screenpipe's address space. A stray write from any native component (capture, CoreAudio, ONNX/PII models, sqlite_vec) can silently corrupt DB pages on disk, surfacing as "database disk image is malformed" (SQLITE_CORRUPT). It is intermittent and lands on the hottest table (ui_events), which is the mmap stray-write signature and the reason the otherwise-correct WAL + synchronous=NORMAL config does not prevent it.

- Set mmap_size=0 on all device tiers (defaults.rs). Buffered I/O via the page cache removes the entire corruption class; the minor read-throughput cost is worth it for a capture product where data integrity is paramount.
- Redact reconciliation worker now detects SQLITE_CORRUPT and backs off 5min (logging once) instead of retrying every 2s. The 2s spin on a corrupt DB was pinning a CPU core and spamming the log, which is the user-visible "sudden high CPU".

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
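The back-off behaviour described in this commit message could look roughly like the sketch below. It is a minimal illustration only: `BackoffState`, `is_corruption`, and `on_error` are hypothetical names, and matching on the error text is an assumption rather than the worker's confirmed detection logic. Only the 2s and 5min intervals and the log-once behaviour come from the text above.

```rust
use std::time::Duration;

/// Normal retry interval for ordinary failures (2s, per the commit message).
const TRANSIENT_RETRY: Duration = Duration::from_secs(2);
/// Back-off applied once corruption is detected (5 min, per the commit message).
const CORRUPT_BACKOFF: Duration = Duration::from_secs(5 * 60);

/// Hypothetical per-worker state: remembers whether corruption was already logged.
#[derive(Default)]
pub struct BackoffState {
    corruption_logged: bool,
}

/// Assumption: corruption is recognised from the SQLite error text.
pub fn is_corruption(message: &str) -> bool {
    message
        .to_ascii_lowercase()
        .contains("database disk image is malformed")
}

impl BackoffState {
    /// Returns how long to sleep before the next attempt, plus an optional
    /// log line that is produced only the first time corruption is seen.
    pub fn on_error(&mut self, message: &str) -> (Duration, Option<String>) {
        if !is_corruption(message) {
            return (TRANSIENT_RETRY, None);
        }
        let log = if self.corruption_logged {
            None
        } else {
            self.corruption_logged = true;
            Some(format!(
                "database is corrupt, backing off {}s: {}",
                CORRUPT_BACKOFF.as_secs(),
                message
            ))
        };
        (CORRUPT_BACKOFF, log)
    }
}
```

A worker loop would call `on_error` on each failed query, sleep for the returned duration, and emit the log line only when it is `Some`, so a persistently corrupt database costs one log entry and one query every five minutes instead of a tight 2s spin.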
db_config_test still asserted the old per-tier mmap_size values (32/128/256 MB) but DbConfig now sets mmap_size=0 on every tier to prevent DB corruption. Update the three assertions and the doc comment to match.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
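As a rough illustration of what "mmap_size=0 on every tier" means for the connection pragmas, here is a minimal sketch. The `DeviceTier` variants, `DbConfig::for_tier`, and `pragmas` are hypothetical stand-ins; the real `DbConfig` in screenpipe-config has its own fields and tier names. The pragma names themselves (`mmap_size`, `journal_mode`, `synchronous`) are standard SQLite.

```rust
/// Hypothetical device tiers; the real names in screenpipe-config may differ.
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum DeviceTier {
    Low,
    Mid,
    High,
}

/// Hypothetical subset of the connection settings discussed above.
#[derive(Debug, PartialEq)]
pub struct DbConfig {
    /// Bytes of the DB file to memory-map; 0 disables mmap entirely.
    pub mmap_size: u64,
}

impl DbConfig {
    /// Previously the tiers used 32/128/256 MB; now every tier disables mmap.
    pub fn for_tier(_tier: DeviceTier) -> Self {
        DbConfig { mmap_size: 0 }
    }

    /// Renders the settings as SQLite PRAGMA statements.
    pub fn pragmas(&self) -> Vec<String> {
        vec![
            "PRAGMA journal_mode = WAL;".to_string(),
            "PRAGMA synchronous = NORMAL;".to_string(),
            format!("PRAGMA mmap_size = {};", self.mmap_size),
        ]
    }
}
```

A per-tier test in the spirit of the updated assertions would loop over all tiers and check that `mmap_size` is 0 and that the rendered pragma list contains `PRAGMA mmap_size = 0;`.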
Corruption previously surfaced only later, via worker query errors (which
used to spin a CPU core retrying a malformed DB). Add a one-shot background
PRAGMA quick_check(1) ~10s after boot in DatabaseManager::new() that logs a
loud, actionable error pointing at `screenpipe db recover` when the DB is
malformed. Backgrounded so it adds no boot latency on multi-GB databases
(quick_check still scans every page).
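A minimal sketch of how the result of that check might be interpreted, assuming the rows returned by `PRAGMA quick_check(1)` have already been collected as strings. `IntegrityStatus`, `interpret_quick_check`, and `startup_log_line` are hypothetical names, and treating an empty result as a failure is an assumption. The one fact relied on is that SQLite returns a single row containing `ok` when the check finds no problems.

```rust
/// Outcome of interpreting the rows returned by `PRAGMA quick_check(1)`.
#[derive(Debug, PartialEq)]
pub enum IntegrityStatus {
    Ok,
    Malformed(String),
}

/// SQLite reports a healthy database as a single row containing "ok";
/// anything else is treated here as a failed check.
pub fn interpret_quick_check(rows: &[String]) -> IntegrityStatus {
    match rows {
        [only] if only.eq_ignore_ascii_case("ok") => IntegrityStatus::Ok,
        // Assumption: no rows at all is unexpected, so report it as a failure.
        [] => IntegrityStatus::Malformed("quick_check returned no rows".to_string()),
        [first, ..] => IntegrityStatus::Malformed(first.clone()),
    }
}

/// Builds the actionable error line; returns None when the DB is healthy.
pub fn startup_log_line(status: &IntegrityStatus) -> Option<String> {
    match status {
        IntegrityStatus::Ok => None,
        IntegrityStatus::Malformed(detail) => Some(format!(
            "database integrity check failed ({}); close the app and run `screenpipe db recover`",
            detail
        )),
    }
}
```

Keeping the interpretation separate from the query makes it easy to unit-test without a database; the background task would only need to run the pragma after the delay and pass the rows in.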
Also classify SQLITE_NOTADB ("file is not a database", code 26) as fatal
alongside "malformed" so the write queue drops the poisoned handle instead
of cascading errors across the batch. Unit-tested.
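The classification could be sketched as follows. The function names are hypothetical and the actual code in sqlite_error.rs may match on different inputs; the numeric values 11 (SQLITE_CORRUPT) and 26 (SQLITE_NOTADB) are SQLite's primary result codes, and extended result codes carry the primary code in their low 8 bits.

```rust
/// SQLite primary result codes treated as fatal in this sketch.
const SQLITE_CORRUPT: i32 = 11;
const SQLITE_NOTADB: i32 = 26;

/// True when the (possibly extended) result code indicates a poisoned handle.
/// Extended codes keep the primary code in the low 8 bits, so mask first.
pub fn is_fatal_code(code: i32) -> bool {
    let primary = code & 0xff;
    primary == SQLITE_CORRUPT || primary == SQLITE_NOTADB
}

/// Fallback for errors that only expose a message string (an assumption about
/// how errors reach the write queue).
pub fn is_fatal_message(message: &str) -> bool {
    let m = message.to_ascii_lowercase();
    m.contains("malformed") || m.contains("file is not a database")
}
```

With a check like this, the write queue can drop the handle on the first fatal error rather than letting every remaining statement in the batch fail against it.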
Deliberately not auto-running recovery in-process: the existing
`screenpipe db recover` is designed to run as a separate process under a PID
lock while the app is closed (the app refuses to boot while it is held).
Detection plus guidance is the safe layer here.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Diarization eval results
Source:
DER, VAD FA, VAD FN, boundary err: lower is better. Continuity: higher is better, 1.0 = same hyp cluster across all silence gaps. Composed workday rows and

Pipeline replay matrix
Source: generated
The no-secret CI matrix runs local diarization under Parakeet/Whisper engine labels across live/background and mic/system device profiles. Real Deepgram/screenpipe-cloud smoke can be run locally with

Transcription quality
Source: LibriSpeech test-clean (CC-BY-4.0) · per-model utterance cap · normalized lowercased word-level Levenshtein
WER + CER on read-aloud speech. Per-model utterance caps keep wall time bounded: tiny/parakeet at 50, the heavier large-v3-turbo-quantized at 20. See README for normalization rules.
Problem
Recurring SQLite corruption: `database disk image is malformed` (SQLITE_CORRUPT, code 11). It is intermittent and lands on the hottest table (ui_events). When it hits, the redact reconciliation worker retries the failing query every 2s indefinitely, pinning a CPU core and spamming the log. To the user this looks like screenpipe "suddenly using a lot of CPU".

Root cause
Memory-mapped I/O (`mmap_size = 256MB`) maps the database file writably into screenpipe's address space. screenpipe is a very native-code-dense process (ScreenCaptureKit, CoreAudio taps, the accessibility tree walker, the ONNX runtime for PII/redaction models, the `sqlite_vec` C extension, FFmpeg). A stray pointer write, buffer overrun, or use-after-free in any of that native code can silently scribble onto the mapped DB pages and write the corruption straight to disk, bypassing SQLite entirely.

This matches every symptom: corruption on the hottest table (its pages are resident in the mmap window), intermittent timing (a rare wild write, not a deterministic bug), and the fact that the otherwise-correct WAL + `synchronous=NORMAL` configuration does not prevent it. Disabling mmap on corruption is also SQLite's own documented guidance. The other suspects were ruled out by evidence: connection/pragma logic is correct (single-writer semaphore, write queue, WAL pre-conversion), the MCP server only does a brief read-only `SELECT`, `incremental_vacuum` is a no-op (`auto_vacuum=NONE`), and `VACUUM INTO` writes a separate snapshot file.

Fix
- `mmap_size = 0` on all device tiers (`screenpipe-config/defaults.rs`). Buffered I/O through the page cache removes the entire stray-write corruption class. The minor read-throughput cost is the right trade for a capture product where data integrity is paramount (the 64MB page cache stays).
- Redact reconciliation worker backs off on `SQLITE_CORRUPT` (`screenpipe-redact/worker/mod.rs`). It now detects non-transient corruption, logs once, and backs off 5 minutes instead of retrying every 2s. This removes the CPU spin even if corruption ever recurs from another source.

Added hardening
- Startup integrity check (`screenpipe-db/db.rs`). `DatabaseManager::new()` spawns a one-shot background `PRAGMA quick_check(1)` ~10s after boot. On failure it logs a loud, actionable error pointing at the existing `screenpipe db recover` command. Backgrounded so it adds zero boot latency on multi-GB databases (quick_check still scans every page). Previously, corruption was only discovered later via worker errors, never surfaced cleanly with the fix command.
- `SQLITE_NOTADB` classified as fatal (`screenpipe-db/sqlite_error.rs`). "file is not a database" (code 26) is now treated as fatal alongside "malformed", so the write queue drops the poisoned handle instead of cascading errors across the batch.
- Test update (`screenpipe-db/tests/db_config_test.rs`, by @louis030195). Updated the per-tier assertions to expect `mmap_size=0`.

Deliberately not auto-running recovery in-process: `screenpipe db recover` is designed to run as a separate process under a PID lock while the app is closed (the app refuses to boot while the lock is held). Detection plus guidance is the safe layer; auto-heal-on-boot would be a larger, separate change.

Flow
flowchart TB
  subgraph Before["Before: mmap enabled"]
    A1["DB file mapped WRITABLE into process address space"]
    A2["stray native write (capture, CoreAudio, ONNX/PII, sqlite_vec)"]
    A3["corrupt page on disk (ui_events)"]
    A4["redact worker hits SQLITE_CORRUPT"]
    A5["retry every 2s forever: CPU core pinned, log spam"]
    A1 --> A2 --> A3 --> A4 --> A5
  end
  subgraph After["After: mmap=0 + corrupt backoff + boot check"]
    B1["DB NOT mapped writable: buffered I/O via page cache"]
    B2["stray-write corruption path removed"]
    B3["if corrupt anyway: boot quick_check logs recovery hint; worker backs off 5 min"]
    B1 --> B2
  end

Testing
- `cargo check -p screenpipe-config -p screenpipe-redact -p screenpipe-db`: clean.
- `cargo test -p screenpipe-config db_config`: pass.
- `cargo test -p screenpipe-db sqlite_error`: 2 passed (incl. new NOTADB cases).
- `cargo test -p screenpipe-db --test db_config_test`: 5 passed (constructs `DatabaseManager`, exercising the new startup path; asserts mmap=0 across tiers).
- `ui_events` btree corruption was recovered via `.recover` + FTS rebuild (`integrity_check = ok`), confirming the corruption signature and the recovery path.

Risk
`mmap_size=0` slightly reduces read throughput vs memory mapping, mitigated by the existing page cache. No schema or API changes. Behavior changes are limited to the DB connection pragma, the redact worker's error backoff, and a read-only background integrity check.

🤖 Generated with Claude Code