fix(db): disable mmap + back off redact worker on SQLite corruption #3889

Merged
louis030195 merged 3 commits into main from claude/happy-easley-f2eab6 on Jun 8, 2026

@louis030195 louis030195 commented Jun 6, 2026


Problem

Recurring SQLite corruption: "database disk image is malformed" (SQLITE_CORRUPT, code 11). It is intermittent and lands on the hottest table (ui_events). When it hits, the redact reconciliation worker retries the failing query every 2s indefinitely, pinning a CPU core and spamming the log. To the user this looks like screenpipe "suddenly using a lot of CPU".

Root cause

Memory-mapped I/O (mmap_size = 256MB) maps the database file writably into screenpipe's address space. screenpipe runs an unusually large amount of native code in a single process (ScreenCaptureKit, CoreAudio taps, the accessibility tree walker, the ONNX runtime for PII/redaction models, the sqlite_vec C extension, FFmpeg). A stray pointer write, buffer overrun, or use-after-free in any of that code can silently scribble onto the mapped DB pages, and the OS then flushes the corruption straight to disk, bypassing SQLite entirely.

This matches every symptom: corruption on the hottest table (its pages are resident in the mmap window), intermittent timing (a rare wild write, not a deterministic bug), and the fact that the otherwise-correct WAL + synchronous=NORMAL configuration does not prevent it. SQLite's own documentation also warns that memory-mapped I/O makes stray-pointer corruption more acute, because a write into the mapped region reaches the database file without passing through SQLite.

The other suspects were ruled out by evidence: connection/pragma logic is correct (single-writer semaphore, write queue, WAL pre-conversion), the MCP server only does a brief read-only SELECT, incremental_vacuum is a no-op (auto_vacuum=NONE), and VACUUM INTO writes a separate snapshot file.

Fix

  1. mmap_size = 0 on all device tiers (screenpipe-config/defaults.rs). Buffered I/O through the page cache removes the entire stray-write corruption class. The minor read-throughput cost is the right trade for a capture product where data integrity is paramount (the 64MB page cache stays).
  2. Redact worker backs off on SQLITE_CORRUPT (screenpipe-redact/worker/mod.rs). It now detects non-transient corruption, logs once, and backs off 5 minutes instead of retrying every 2s. This removes the CPU spin even if corruption ever recurs from another source.
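The backoff in item 2 can be pictured with a small sketch. This is not the code in screenpipe-redact/worker/mod.rs: `DbError`, `RetryPolicy`, and the constant names are hypothetical stand-ins, and only the intervals (2s, 5 minutes), the log-once behavior, and the SQLITE_CORRUPT code (11) come from the description above.

```rust
use std::time::Duration;

// Intervals taken from the PR description; the constant names are hypothetical.
const NORMAL_RETRY: Duration = Duration::from_secs(2);
const CORRUPT_BACKOFF: Duration = Duration::from_secs(5 * 60);

// SQLite primary result code for SQLITE_CORRUPT.
const SQLITE_CORRUPT: i32 = 11;

/// Minimal stand-in for a database error (hypothetical type, not the real one).
#[derive(Debug, Clone)]
pub struct DbError {
    pub code: Option<i32>,
    pub message: String,
}

/// True when the error reports a malformed database image.
pub fn is_corruption(err: &DbError) -> bool {
    // Extended result codes carry the primary code in the low 8 bits.
    let by_code = err.code.map_or(false, |c| (c & 0xff) == SQLITE_CORRUPT);
    by_code || err.message.contains("database disk image is malformed")
}

/// Remembers whether corruption was already reported, so it is logged once.
#[derive(Default)]
pub struct RetryPolicy {
    corruption_logged: bool,
}

impl RetryPolicy {
    /// Returns the delay before the next attempt and whether to emit a log line.
    pub fn on_error(&mut self, err: &DbError) -> (Duration, bool) {
        if is_corruption(err) {
            let should_log = !self.corruption_logged;
            self.corruption_logged = true;
            (CORRUPT_BACKOFF, should_log)
        } else {
            // Other errors keep the short retry interval and are always logged.
            (NORMAL_RETRY, true)
        }
    }
}
```

A worker loop would sleep for the returned duration and log only when the flag is true, so a corrupt database costs one log line and one query every 5 minutes instead of a query every 2s.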

Added hardening

  1. Startup integrity check (screenpipe-db/db.rs). DatabaseManager::new() spawns a one-shot background PRAGMA quick_check(1) ~10s after boot. On failure it logs a loud, actionable error pointing at the existing screenpipe db recover command. Backgrounded so it adds zero boot latency on multi-GB databases (quick_check still scans every page). Previously, corruption was only discovered later via worker errors, never surfaced cleanly with the fix command.
  2. Classify SQLITE_NOTADB (screenpipe-db/sqlite_error.rs). "file is not a database" (code 26) is now treated as fatal alongside "malformed", so the write queue drops the poisoned handle instead of cascading errors across the batch.
  3. Test fix (screenpipe-db/tests/db_config_test.rs, by @louis030195). Updated the per-tier assertions to expect mmap_size=0.
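For readers who want to see what the NOTADB classification in item 2 amounts to, here is a minimal sketch. The function name and signature are hypothetical and may not match screenpipe-db/sqlite_error.rs; the result codes (11 and 26) and the two message strings are the ones quoted above.

```rust
// SQLite primary result codes.
const SQLITE_CORRUPT: i32 = 11;
const SQLITE_NOTADB: i32 = 26;

/// Hypothetical classifier: true when the connection handle should be
/// dropped rather than reused for the rest of the batch.
pub fn is_fatal_sqlite_error(code: Option<i32>, message: &str) -> bool {
    if let Some(c) = code {
        // Extended result codes carry the primary code in the low 8 bits.
        let primary = c & 0xff;
        if primary == SQLITE_CORRUPT || primary == SQLITE_NOTADB {
            return true;
        }
    }
    // Fall back to message matching when no code is available.
    let msg = message.to_ascii_lowercase();
    msg.contains("database disk image is malformed") || msg.contains("file is not a database")
}
```

Matching on the code first and the message second is a design choice in this sketch, not necessarily what the crate does; it keeps the check working when an error has been wrapped and only its text survives.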

Deliberately not auto-running recovery in-process: screenpipe db recover is designed to run as a separate process under a PID lock while the app is closed (the app refuses to boot while the lock is held). Detection plus guidance is the safe layer; auto-heal-on-boot would be a larger, separate change.
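The detection-plus-guidance layer can be sketched without a database connection by looking only at how the rows returned by PRAGMA quick_check(1) are interpreted. SQLite returns a single row containing "ok" when it finds no problems. Everything else here is an assumption for illustration: the type and function names, the hint wording, and the choice to treat an empty result as a failure.

```rust
/// Outcome of interpreting the rows returned by `PRAGMA quick_check(1)`.
#[derive(Debug, PartialEq)]
pub enum IntegrityStatus {
    Ok,
    Corrupt { first_problem: String },
}

/// A single "ok" row means the check passed; any other row describes a problem.
pub fn interpret_quick_check(rows: &[String]) -> IntegrityStatus {
    match rows {
        [only] if only.trim().eq_ignore_ascii_case("ok") => IntegrityStatus::Ok,
        [first, ..] => IntegrityStatus::Corrupt { first_problem: first.clone() },
        // Assumption: an empty result is treated conservatively as a failure.
        [] => IntegrityStatus::Corrupt {
            first_problem: "quick_check returned no rows".to_string(),
        },
    }
}

/// Builds the log message pointing at the recovery command (wording is hypothetical).
pub fn recovery_hint(status: &IntegrityStatus) -> Option<String> {
    match status {
        IntegrityStatus::Ok => None,
        IntegrityStatus::Corrupt { first_problem } => Some(format!(
            "database integrity check failed ({}); close the app and run `screenpipe db recover`",
            first_problem
        )),
    }
}
```

The sketch only produces a message. It never attempts a repair, which mirrors the decision above to leave recovery to the separate process.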

Flow

flowchart TB
    subgraph Before["Before: mmap enabled"]
        A1["DB file mapped WRITABLE into process address space"]
        A2["stray native write (capture, CoreAudio, ONNX/PII, sqlite_vec)"]
        A3["corrupt page on disk (ui_events)"]
        A4["redact worker hits SQLITE_CORRUPT"]
        A5["retry every 2s forever: CPU core pinned, log spam"]
        A1 --> A2 --> A3 --> A4 --> A5
    end
    subgraph After["After: mmap=0 + corrupt backoff + boot check"]
        B1["DB NOT mapped writable: buffered I/O via page cache"]
        B2["stray-write corruption path removed"]
        B3["if corrupt anyway: boot quick_check logs recovery hint; worker backs off 5 min"]
        B1 --> B2
    end

Testing

  • cargo check -p screenpipe-config -p screenpipe-redact -p screenpipe-db: clean.
  • cargo test -p screenpipe-config db_config: pass.
  • cargo test -p screenpipe-db sqlite_error: 2 passed (incl. new NOTADB cases).
  • cargo test -p screenpipe-db --test db_config_test: 5 passed (constructs DatabaseManager, exercising the new startup path; asserts mmap=0 across tiers).
  • Field validation: a real corrupted DB showing this exact ui_events btree corruption was recovered via .recover + FTS rebuild (integrity_check = ok), confirming the corruption signature and the recovery path.

Risk

mmap_size=0 slightly reduces read throughput vs memory mapping, mitigated by the existing page cache. No schema or API changes. Behavior changes are limited to the DB connection pragma, the redact worker's error backoff, and a read-only background integrity check.
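To make the pragma change concrete, here is a rough sketch of a per-tier config that disables mmap everywhere and keeps a page cache. The tier names, struct fields, and `pragmas` helper are hypothetical and do not reflect the real layout of screenpipe-config/defaults.rs. The 64MB cache on every tier is also an assumption: the description only says the 64MB page cache stays.

```rust
/// Hypothetical device tiers; the real names in screenpipe-config may differ.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum DeviceTier {
    Low,
    Mid,
    High,
}

#[derive(Debug, PartialEq)]
pub struct DbConfig {
    pub mmap_size_bytes: u64,
    pub cache_size_kib: u64,
}

impl DbConfig {
    pub fn for_tier(_tier: DeviceTier) -> Self {
        DbConfig {
            // mmap disabled on every tier, as in this PR.
            mmap_size_bytes: 0,
            // Assumption for illustration: 64 MB page cache on every tier.
            cache_size_kib: 64 * 1024,
        }
    }

    /// Renders the connection pragmas as SQL statements.
    pub fn pragmas(&self) -> Vec<String> {
        vec![
            format!("PRAGMA mmap_size = {};", self.mmap_size_bytes),
            // A negative cache_size is interpreted by SQLite as a size in KiB.
            format!("PRAGMA cache_size = -{};", self.cache_size_kib),
        ]
    }
}
```

A per-tier test in the spirit of db_config_test would assert `mmap_size_bytes == 0` for each tier.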

🤖 Generated with Claude Code

Louis Beaumont and others added 3 commits June 6, 2026 13:37
Memory-mapped I/O (mmap_size=256MB) maps the SQLite DB file *writably* into
screenpipe's address space. A stray write from any native component (capture,
CoreAudio, ONNX/PII models, sqlite_vec) can silently corrupt DB pages on disk,
surfacing as "database disk image is malformed" (SQLITE_CORRUPT). It is
intermittent and lands on the hottest table (ui_events) — the mmap stray-write
signature, and why the otherwise-correct WAL + synchronous=NORMAL config does
not prevent it.

- Set mmap_size=0 on all device tiers (defaults.rs). Buffered I/O via the page
  cache removes the entire corruption class; the minor read-throughput cost is
  worth it for a capture product where data integrity is paramount.
- Redact reconciliation worker now detects SQLITE_CORRUPT and backs off 5min
  (logging once) instead of retrying every 2s. The 2s spin on a corrupt DB was
  pinning a CPU core and spamming the log — the user-visible "sudden high CPU".

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

db_config_test still asserted the old per-tier mmap_size values (32/128/256 MB)
but DbConfig now sets mmap_size=0 on every tier to prevent DB corruption.
Update the three assertions and the doc comment to match.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Corruption previously surfaced only later, via worker query errors (which
used to spin a CPU core retrying a malformed DB). Add a one-shot background
PRAGMA quick_check(1) ~10s after boot in DatabaseManager::new() that logs a
loud, actionable error pointing at `screenpipe db recover` when the DB is
malformed. Backgrounded so it adds no boot latency on multi-GB databases
(quick_check still scans every page).

Also classify SQLITE_NOTADB ("file is not a database", code 26) as fatal
alongside "malformed" so the write queue drops the poisoned handle instead
of cascading errors across the batch. Unit-tested.

Deliberately not auto-running recovery in-process: the existing
`screenpipe db recover` is designed to run as a separate process under a PID
lock while the app is closed (the app refuses to boot while it is held).
Detection plus guidance is the safe layer here.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@louis030195 louis030195 merged commit bb1457a into main Jun 8, 2026
19 of 23 checks passed

github-actions Bot commented Jun 8, 2026


Diarization eval results

Source: crates/screenpipe-audio-eval/evals/ · VoxConverse dev (CC-BY-4.0) + composed workday templates + screenpipe-shaped LibriSpeech fixtures

| fixture | DER | VAD FA | VAD FN | boundary err (s) | continuity | predicted / true spk |
|---|---|---|---|---|---|---|
| interrupted_meeting | 0.186 | 0.01 | 0.063 | 20.286 | 0.833 | 9 / 5 |
| long_silence_day | 0.437 | 0.011 | 0.145 | 11.46 | 0.7 | 14 / 10 |
| screenpipe_meeting_rapid_handoffs | 0.241 | 0.196 | 0.099 | 2.305 | 1 | 5 / 3 |
| screenpipe_background_24_7_day | 0.315 | 0.025 | 0.159 | 2.203 | 1 | 4 / 3 |
| screenpipe_short_backchannels | 0.561 | 0.915 | 0.064 | 0.488 | n/a | 3 / 3 |
| screenpipe_mic_system_echo_leakage | 0.275 | 0.198 | 0.084 | 3.045 | 0.667 | 5 / 3 |
| screenpipe_overlap_crosstalk | 0.254 | 0.84 | 0.042 | 0.667 | n/a | 3 / 3 |
| abjxc | 0.016 | 0.098 | 0.002 | 1.151 | n/a | 2 / 1 |
| bxpwa | 0.111 | 0.453 | 0.029 | 20.793 | 0.714 | 8 / 5 |
| dhorc | 0.143 | 0.461 | 0.034 | 3.681 | 1 | 5 / 4 |

DER, VAD FA, VAD FN, boundary err: lower is better. Continuity: higher is better, 1.0 = same hyp cluster across all silence gaps. Composed workday rows and screenpipe_* rows exercise screenpipe-shaped usage: meetings, background gaps, backchannels, echo leakage, and crosstalk. Raw VoxConverse rows score broadcast-quality stems for comparison. See crates/screenpipe-audio-eval/evals/README.md for methodology.

Pipeline replay matrix

Source: generated screenpipe_* fixtures materialized into temp screenpipe SQLite DBs, then read back through search_audio. This catches storage/search regressions that pure DER scoring misses.

| scenarios | passed | failed | skipped | avg background DER | avg background speaker err | Deepgram |
|---|---|---|---|---|---|---|
| 41 | 40 | 0 | 1 | 0.329 | 0.183 | skip |

The no-secret CI matrix runs local diarization under Parakeet/Whisper engine labels across live/background and mic/system device profiles. Real Deepgram/screenpipe-cloud smoke can be run locally with --deepgram required when credentials are present.

Transcription quality

Source: LibriSpeech test-clean (CC-BY-4.0) · per-model utterance cap · normalized lowercased word-level Levenshtein

| model | utterances | WER | CER | throughput (samples/s) |
|---|---|---|---|---|
| tiny | 50 | 0.085 | 0.033 | 68707 |
| whisper-large-v3-turbo-quantized | 20 | 0.042 | 0.009 | 1924 |
| parakeet | 50 | 0.04 | 0.026 | 107024 |

WER + CER on read-aloud speech. Per-model utterance caps keep wall time bounded — tiny/parakeet at 50, the heavier large-v3-turbo-quantized at 20. See README for normalization rules.
