[bug] Transcription reconciliation worker silently stops, accumulates days of pending chunks before app restart drains them #3498

@pleasedodisturb

Summary

The audio-transcription reconciliation worker stopped processing pending chunks at some point. Audio capture and live transcription kept working — but chunks that fell into the "pending" state never got picked back up, accumulating ~3000 chunks over ~6 days. The health-check warning fires correctly (audio transcription backlog stalled — N chunk(s) pending, oldest Ns old) but the dispatcher does not self-recover. Restarting the app reliably triggers reconciliation to run and drain the backlog at ~50 chunks per 2-3 min.

To the user this looks like silent data loss: audio files are written to disk but never transcribed, so they don't appear in search/UI for days.

Environment

  • macOS 26.5 (Tahoe) on Apple Silicon (M1 Pro)
  • Screenpipe app v2.4.247 (Tauri shell), engine v0.3.307
  • Transcription backend: Deepgram via screenpipe-cloud (api.screenpi.pe), mode Batch
  • Pro license active, cloud trial active
  • Audio devices: Shure MV7+ (input) + macOS System Audio (output)
  • ~/.screenpipe/db.sqlite ~1.2 GB at time of incident
  • No exotic config — defaults except retention (mode=media, retention_days=14)

Timeline (from ~/.screenpipe/screenpipe-app.2026-05-21.log)

  • ~2026-05-15 (extrapolated from oldest 517251s old first observed): reconciliation worker stops draining pending chunks. Live capture/transcription continues normally — no user-visible signal anywhere in the UI.
  • 2026-05-21 07:46:42 Z: capture transcription configured: background_engine=Deepgram ... transcription_mode=Batch — app restart on this date triggered reconciliation re-init.
  • 2026-05-21 07:49:33 Z: first audio transcription backlog stalled — 3128 chunk(s) pending, oldest 517251s old | pool: read=4/4 idle, write=2/2 idle (6 days of pending audio).
  • 2026-05-21 07:49:48 Z onward: reconciliation: transcribed 50 orphaned chunks every 2-3 min — backlog starts draining.
  • 2026-05-21 12:49 Z: backlog down to 31 chunks, last stalled warning emitted.
  • 2026-05-21 13:50+ Z: only reconciliation: transcribed N orphaned chunks log lines, no more stall warnings — system healthy.

Time-to-drain ~5 hours from restart, with capture continuously feeding the queue.
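The timeline figures can be cross-checked with a little arithmetic. The sketch below (helper names are hypothetical, not from the screenpipe codebase) splits the logged oldest-pending age into days/hours/minutes and computes a lower bound on drain time from the observed batch size and cadence, ignoring chunks captured while draining.

```rust
/// Split a duration in seconds into (days, hours, minutes, seconds).
fn split_age(secs: u64) -> (u64, u64, u64, u64) {
    (
        secs / 86_400,
        (secs % 86_400) / 3_600,
        (secs % 3_600) / 60,
        secs % 60,
    )
}

/// Lower bound on drain time in minutes, ignoring newly captured chunks:
/// ceil(backlog / batch_size) batches, one batch every `mins_per_batch`.
fn min_drain_minutes(backlog: u64, batch_size: u64, mins_per_batch: u64) -> u64 {
    let batches = (backlog + batch_size - 1) / batch_size;
    batches * mins_per_batch
}
```

With the logged values, `split_age(517_251)` gives 5 days, 23 h, 40 min, 51 s, which matches the "6 days" in the log line. `min_drain_minutes(3128, 50, 2)` and `min_drain_minutes(3128, 50, 3)` bracket the best case at 126 to 189 minutes. The observed ~5 hours is longer, which is consistent with capture feeding the queue throughout.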

Reproduction

We do not know the precise trigger. Hypotheses below. The repeatable observation:

  1. Run screenpipe-app with cloud Deepgram batch mode for several days.
  2. At some point — possibly tied to network drop / laptop sleep / cloud-side rate limiting / dispatcher state corruption — reconciliation stops processing pending chunks.
  3. From that moment on: pending count grows linearly with capture rate. oldest pending age grows by 1 second every 1 second. Health-check warns. Live path remains unaffected, so the UI looks healthy.
  4. Restart the app. Reconciliation re-runs at startup, drains the accumulated backlog.

Possible triggers I cannot rule out from this single incident:

  • macOS sleep / wake cycle interrupting an in-flight HTTP request to api.screenpi.pe
  • Cloud-side 429 / 5xx that wasn't handled with retry-with-backoff but with silent abort
  • Token refresh failure (the screenpipe-cloud bearer token has a refresh dance)
  • DNS / TLS error during a request that left the worker task in an awaiting-future state
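For the 429 / 5xx hypothesis, the difference between a silent abort and retry-with-backoff can be sketched as below. This is an illustration only: `retry_with_backoff`, `backoff_delay` and `RetryOutcome` are hypothetical names, errors are plain strings, and nothing here reflects how the screenpipe worker actually calls the cloud backend.

```rust
use std::thread::sleep;
use std::time::Duration;

/// Outcome of a bounded retry loop.
#[derive(Debug, PartialEq)]
enum RetryOutcome<T> {
    Succeeded { value: T, attempts: u32 },
    /// All attempts failed; the caller should log this and move on.
    Exhausted { attempts: u32, last_error: String },
}

/// Delay before retry number `attempt` (0-based): base * 2^attempt, capped.
fn backoff_delay(base: Duration, cap: Duration, attempt: u32) -> Duration {
    let factor = 2u32.saturating_pow(attempt);
    base.saturating_mul(factor).min(cap)
}

/// Run `op` up to `max_attempts` times, sleeping with exponential backoff
/// between failures. Never retries forever and never fails silently.
fn retry_with_backoff<T>(
    max_attempts: u32,
    base: Duration,
    cap: Duration,
    mut op: impl FnMut() -> Result<T, String>,
) -> RetryOutcome<T> {
    let mut last_error = String::new();
    for attempt in 0..max_attempts {
        match op() {
            Ok(value) => {
                return RetryOutcome::Succeeded { value, attempts: attempt + 1 };
            }
            Err(e) => {
                last_error = e;
                // Sleep only if another attempt will follow.
                if attempt + 1 < max_attempts {
                    sleep(backoff_delay(base, cap, attempt));
                }
            }
        }
    }
    RetryOutcome::Exhausted { attempts: max_attempts, last_error }
}
```

The point of the `Exhausted` variant is that the failure becomes an explicit value the caller has to handle (log it, mark the chunk, continue with the next one) rather than an early return that leaves the loop idle.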

Logs around the suspected stall start point are no longer available (only daily log files; 6 days back is rotated out). If you can hint at what to grep for, I can capture it for the next recurrence.

Expected behavior

When pending_count > 20 AND oldest_pending_age > 2 * AUDIO_RECONCILIATION_FRESHNESS_DELAY_SECS (20 min), the reconciliation worker should either:

  1. Auto-recover by restarting its task and re-claiming pending chunks, OR
  2. Surface a user-visible notification "transcription stalled — restart recommended" so users know the indexed data they see is stale by days. The health-check WARN goes to screenpipe-app.YYYY-MM-DD.log, which 99% of users will never read.
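Option 2 needs some debouncing so that a single transient stalled reading doesn't raise a notification. A minimal sketch, assuming the stalled flag is fed in once per health check; `StallNotifier` and its threshold are hypothetical and not part of the existing code:

```rust
/// Tracks consecutive stalled health checks and decides when to notify.
struct StallNotifier {
    threshold: u32,
    consecutive: u32,
    notified: bool,
}

impl StallNotifier {
    fn new(threshold: u32) -> Self {
        Self { threshold, consecutive: 0, notified: false }
    }

    /// Feed one health-check result. Returns true exactly once per stall
    /// episode, when the streak first exceeds `threshold`.
    fn observe(&mut self, stalled: bool) -> bool {
        if !stalled {
            // A healthy check ends the episode and re-arms the notifier.
            self.consecutive = 0;
            self.notified = false;
            return false;
        }
        self.consecutive = self.consecutive.saturating_add(1);
        if self.consecutive > self.threshold && !self.notified {
            self.notified = true;
            return true;
        }
        false
    }
}
```

With `StallNotifier::new(5)` and one check per minute, the sixth consecutive stalled check returns `true` and later stalled checks stay quiet until a healthy check resets the streak, so the user gets one notification per episode instead of one per minute.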

Actual behavior

The reconciliation worker logs nothing. The health-check warning fires once per minute but is log-only. The pool is fully idle (read=4/4, write=2/2 in the warning above), so capacity is available but no pending chunks are being dispatched. The live path proceeds independently (new captures are still transcribed promptly), masking the problem in the UI.

Source-code observations

From crates/screenpipe-engine/src/routes/health.rs:474-518 — comment is accurate, the heuristic correctly detects the stall:

// Direct measurement: count chunks stuck in 'pending' status. This
// replaces the previous pool-idle + stale-metric heuristic, which
// fired false positives whenever the live path's dedup short-circuit
// ate batches of common short words and went silent on the write
// pool. ...
//
// A real stall now means: the reconciliation worker has pending
// chunks older than the freshness window — i.e. they should have
// been processed by now and haven't.
let stalled = pending_count > 20
    && oldest_pending_age_secs
        > (AUDIO_RECONCILIATION_FRESHNESS_DELAY_SECS as u64).saturating_mul(2);

crates/screenpipe-audio/src/audio_manager/reconciliation.rs:28:

const RECONCILIATION_FRESHNESS_DELAY_SECS: i64 = 10 * 60;

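So, assuming AUDIO_RECONCILIATION_FRESHNESS_DELAY_SECS in health.rs refers to this constant, the detection threshold is more than 20 pending chunks with the oldest more than 20 min old. In my incident, both conditions were satisfied for days, not minutes. The reconciliation worker's scheduling / retry logic is the suspect.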
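The predicate can be pulled out as a standalone function to check it against the incident numbers. This is a sketch that mirrors the two quoted snippets; `is_stalled` is a hypothetical name, and it assumes the constant used in health.rs has the same 10-minute value as the one in reconciliation.rs.

```rust
/// Freshness delay as quoted from reconciliation.rs (10 minutes).
const RECONCILIATION_FRESHNESS_DELAY_SECS: i64 = 10 * 60;

/// Mirrors the quoted health-check predicate: more than 20 pending chunks
/// AND the oldest one is older than twice the freshness delay (1200 s).
fn is_stalled(pending_count: u64, oldest_pending_age_secs: u64) -> bool {
    pending_count > 20
        && oldest_pending_age_secs
            > (RECONCILIATION_FRESHNESS_DELAY_SECS as u64).saturating_mul(2)
}
```

`is_stalled(3128, 517_251)` is true, as logged at 07:49:33 Z. Both comparisons are strict, so exactly 20 chunks or exactly 1200 s does not trip the warning.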

Workaround

Quit and relaunch screenpipe.app. Reconciliation drains automatically over ~5h depending on backlog size (~50 chunks per 2-3 min in my run, cloud rate-limit dependent).

Suggested fixes (in roughly increasing complexity)

  1. User-facing notification when health-check stalled flag is true for >5 consecutive checks. "Transcription backlog hasn't drained in 5 minutes — restart screenpipe to recover."
  2. Watchdog timer on the reconciliation worker. If the task hasn't made progress (touched a row in audio_chunks WHERE status='pending') in 2 * RECONCILIATION_FRESHNESS_DELAY_SECS, the watchdog kills and restarts the task.
  3. Periodic re-init on schedule (every 30 min?), not just on app startup. Reconciliation re-init seems to be the fix; doing it proactively avoids the multi-day silent failure.
  4. Bounded retry with exponential backoff on cloud Deepgram failures, then dead-letter-style logging when retries exhausted. If a chunk fails N times, mark it so the dispatcher moves on instead of retrying it forever (if that's what's happening).
  5. Trace logs around reconciliation worker start/stop with the reason. Right now there's no log line that says "reconciliation worker stopped because X" — only the WARN that says "it's been stopped for a while." Adding a line at the exit point of the worker task (panic / clean shutdown / awaiting-forever) would make root-cause obvious next time.
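The decision logic for the watchdog in fix 2 can be sketched as follows. `ProgressWatchdog` is a hypothetical type; it covers only the "has the worker been silent too long while work is pending" check and leaves out the actual kill/restart of the task, which depends on how the worker is spawned.

```rust
use std::time::{Duration, Instant};

/// Tracks the last time the reconciliation worker made progress.
struct ProgressWatchdog {
    max_silence: Duration,
    last_progress: Instant,
}

impl ProgressWatchdog {
    fn new(max_silence: Duration, now: Instant) -> Self {
        Self { max_silence, last_progress: now }
    }

    /// Called by the worker each time it finishes (or fails) a chunk.
    fn record_progress(&mut self, now: Instant) {
        self.last_progress = now;
    }

    /// True when chunks are pending but the worker has been silent for
    /// longer than `max_silence`. An empty queue never triggers a restart.
    fn should_restart(&self, now: Instant, pending_count: u64) -> bool {
        pending_count > 0
            && now.saturating_duration_since(self.last_progress) > self.max_silence
    }
}
```

With `max_silence` set to 2 * RECONCILIATION_FRESHNESS_DELAY_SECS (1200 s), a supervisor polling `should_restart` would have flagged this incident after about 20 minutes of silence rather than leaving it for days. Passing `now` in explicitly keeps the logic testable without sleeping.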

Severity

Medium-High in my judgment. No data lost from disk (audio files remain), but indexed/searchable transcripts can fall days behind without user awareness. For a tool whose value proposition is "every word you said is searchable," silent multi-day staleness is a serious UX failure.

I can help debug next recurrence

I'm a heavy user with a contributor setup ready (mediar-ai/screenpipe cloned, full build environment on Mac Mini). If you want me to drop in additional instrumentation behind a feature flag and run on prod for a week to catch the next stall in the act, happy to.


Incident date: 2026-05-21 (drained), backlog onset estimated 2026-05-15
Filed by: @pleasedodisturb
Related: #3466 (port collision / silent recorder failure) — distinct symptom, same general theme of silent failure modes that the health-check sees but the user doesn't.
