[bug] Transcription reconciliation worker silently stops, accumulates days of pending chunks before app restart drains them

## Summary

The audio-transcription reconciliation worker stopped processing pending chunks at some point (estimated ~2026-05-15). Audio capture and **live** transcription kept working, but chunks that fell into the "pending" state were never picked back up: 3128 chunks accumulated over ~6 days. The health-check warning fires correctly (`audio transcription backlog stalled — N chunk(s) pending, oldest Ns old`), but the dispatcher does not self-recover. Restarting the app reliably triggers reconciliation, which then drains the backlog at ~50 chunks per 2-3 min.

From the user's point of view this looks like silent data loss: audio files are written to disk but never transcribed, so they don't appear in search/UI for days. The audio itself is not lost.

## Environment

- macOS 26.5 (Tahoe) on Apple Silicon (M1 Pro)
- Screenpipe app v2.4.247 (Tauri shell), engine v0.3.307
- Transcription backend: `Deepgram` via `screenpipe-cloud` (`api.screenpi.pe`), mode `Batch`
- Pro license active, cloud trial active
- Audio devices: `Shure MV7+ (input)` + macOS `System Audio (output)`
- `~/.screenpipe/db.sqlite` ~1.2 GB at time of incident
- No exotic config — defaults except retention (`mode=media, retention_days=14`)

## Timeline (from `~/.screenpipe/screenpipe-app.2026-05-21.log`)

- **~2026-05-15** (extrapolated from `oldest 517251s old` first observed): reconciliation worker stops draining pending chunks. Live capture/transcription continues normally — *no user-visible signal anywhere in the UI*.
- **2026-05-21 07:46:42** Z: `capture transcription configured: background_engine=Deepgram ... transcription_mode=Batch` — app restart on this date triggered reconciliation re-init.
- **2026-05-21 07:49:33** Z: first `audio transcription backlog stalled — 3128 chunk(s) pending, oldest 517251s old | pool: read=4/4 idle, write=2/2 idle` (6 days of pending audio).
- **2026-05-21 07:49:48** Z onward: `reconciliation: transcribed 50 orphaned chunks` every 2-3 min — backlog starts draining.
- **2026-05-21 12:49** Z: backlog down to **31 chunks**, last `stalled` warning emitted.
- **2026-05-21 13:50+** Z: only `reconciliation: transcribed N orphaned chunks` log lines, no more stall warnings — system healthy.

Time-to-drain ~5 hours from restart, with capture continuously feeding the queue.
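
For reference, the arithmetic behind the "6 days" and drain-time figures can be checked with a small sketch. The helper names are hypothetical (not from the screenpipe codebase); the inputs are the values from the log lines above.

```rust
/// Convert an "oldest Ns old" value from the stall warning into (days, hours).
fn age_days_hours(age_secs: u64) -> (u64, u64) {
    (age_secs / 86_400, (age_secs % 86_400) / 3_600)
}

/// Minutes needed to drain `pending` chunks when `batch` chunks are
/// transcribed every `batch_interval_min` minutes and no new chunks arrive.
fn drain_minutes_no_inflow(pending: u64, batch: u64, batch_interval_min: f64) -> f64 {
    let batches = (pending + batch - 1) / batch; // ceiling division
    batches as f64 * batch_interval_min
}
```

`age_days_hours(517_251)` gives 5 days 23 hours, which matches the 2026-05-15 onset estimate. `drain_minutes_no_inflow(3128, 50, 2.0)` and `(3128, 50, 3.0)` give 126 and 189 minutes, so the observed ~5 hours is consistent with capture adding new chunks while the backlog drained.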

## Reproduction

We do not know the precise trigger. Hypotheses below. The repeatable observation:

1. Run screenpipe-app with cloud Deepgram batch mode for several days.
2. At some point — possibly tied to network drop / laptop sleep / cloud-side rate limiting / dispatcher state corruption — reconciliation stops processing pending chunks.
3. From that moment on: `pending` count grows linearly with capture rate, and `oldest pending age` grows in real time (one second per second). Health-check warns. Live path remains unaffected, so the UI looks healthy.
4. Restart the app. Reconciliation re-runs at startup, drains the accumulated backlog.

Possible triggers I cannot rule out from this single incident:
- macOS sleep / wake cycle interrupting an in-flight HTTP request to `api.screenpi.pe`
- Cloud-side 429 / 5xx that wasn't handled with retry-with-backoff but with silent abort
- Token refresh failure (the screenpipe-cloud bearer token has a refresh dance)
- DNS / TLS error during a request that left the worker task in an awaiting-future state

Logs around the suspected stall start point are no longer available (only daily log files; 6 days back is rotated out). If you can hint at what to grep for, I can capture it for the next recurrence.

## Expected behavior

When `pending_count > 20` AND `oldest_pending_age > 2 * AUDIO_RECONCILIATION_FRESHNESS_DELAY_SECS` (20 min), the reconciliation worker should either:

1. Auto-recover by restarting its task and re-claiming pending chunks, OR
2. Surface a user-visible notification "transcription stalled — restart recommended" so users know the indexed data they see is stale by days. The health-check WARN goes to `screenpipe-app.YYYY-MM-DD.log`, which 99% of users will never read.
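
A minimal sketch of how option 2 could be debounced so the user is notified once per stall rather than once per health check. `StallNotifier` is a hypothetical type, not something that exists in screenpipe, and the threshold is left to the caller.

```rust
/// Hypothetical debouncer: returns `true` exactly once per stall episode,
/// when the `stalled` flag has been seen on more than `threshold`
/// consecutive health checks.
struct StallNotifier {
    threshold: u32,
    consecutive: u32,
    notified: bool,
}

impl StallNotifier {
    fn new(threshold: u32) -> Self {
        Self { threshold, consecutive: 0, notified: false }
    }

    /// Feed one health-check result; `true` means "show the notification now".
    fn observe(&mut self, stalled: bool) -> bool {
        if !stalled {
            // Backlog recovered: reset so a later stall notifies again.
            self.consecutive = 0;
            self.notified = false;
            return false;
        }
        self.consecutive = self.consecutive.saturating_add(1);
        if self.consecutive > self.threshold && !self.notified {
            self.notified = true;
            return true;
        }
        false
    }
}
```

With `StallNotifier::new(5)` and one health check per minute, the notification would fire on the sixth consecutive stalled check and stay quiet until the backlog recovers.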

## Actual behavior

The reconciliation worker logs nothing. The health-check warning fires once per minute but is log-only. The pool is fully idle (`read=4/4 idle, write=2/2 idle`): workers exist, but no chunks are being dispatched to them. The live path proceeds independently (new captures are still transcribed promptly), masking the problem in the UI.

## Source-code observations

From `crates/screenpipe-engine/src/routes/health.rs:474-518` — comment is accurate, the heuristic correctly detects the stall:

```rust
// Direct measurement: count chunks stuck in 'pending' status. This
// replaces the previous pool-idle + stale-metric heuristic, which
// fired false positives whenever the live path's dedup short-circuit
// ate batches of common short words and went silent on the write
// pool. ...
//
// A real stall now means: the reconciliation worker has pending
// chunks older than the freshness window — i.e. they should have
// been processed by now and haven't.
let stalled = pending_count > 20
    && oldest_pending_age_secs
        > (AUDIO_RECONCILIATION_FRESHNESS_DELAY_SECS as u64).saturating_mul(2);
```

`crates/screenpipe-audio/src/audio_manager/reconciliation.rs:28`:

```rust
const RECONCILIATION_FRESHNESS_DELAY_SECS: i64 = 10 * 60;
```

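Assuming the `AUDIO_RECONCILIATION_FRESHNESS_DELAY_SECS` used in `health.rs` refers to this constant (600 s), the detection threshold is more than 20 pending chunks with the oldest more than 20 min (2 × 600 s) old. In my incident, both conditions were satisfied for **days**, not minutes. The reconciliation worker's scheduling / retry logic is the suspect.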
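
To make the threshold concrete, here is a standalone sketch of the same predicate evaluated against the values from my log. The function name `is_stalled`, the constant name `MIN_PENDING_FOR_STALL`, and the `u64` parameter types are my assumptions; only the comparison itself is copied from the snippets above.

```rust
// Value copied from reconciliation.rs as quoted above.
const RECONCILIATION_FRESHNESS_DELAY_SECS: i64 = 10 * 60;
// Hypothetical name for the literal `20` in the health-check condition.
const MIN_PENDING_FOR_STALL: u64 = 20;

/// Standalone copy of the health-check condition (types are assumed).
fn is_stalled(pending_count: u64, oldest_pending_age_secs: u64) -> bool {
    pending_count > MIN_PENDING_FOR_STALL
        && oldest_pending_age_secs
            > (RECONCILIATION_FRESHNESS_DELAY_SECS as u64).saturating_mul(2)
}
```

`is_stalled(3128, 517_251)` is true, as the first warning after restart shows. Both comparisons are strict, so exactly 20 pending chunks or an oldest age of exactly 1200 s does not count as a stall.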

## Workaround

Quit and relaunch `screenpipe.app`. Reconciliation drains automatically over ~5h depending on backlog size (~50 chunks per 2-3 min in my run, cloud rate-limit dependent).

## Suggested fixes (in roughly increasing complexity)

1. **User-facing notification** when health-check `stalled` flag is true for >5 consecutive checks. "Transcription backlog hasn't drained in 5 minutes — restart screenpipe to recover."
2. **Watchdog timer on the reconciliation worker.** If the task hasn't made progress (touched a row in `audio_chunks WHERE status='pending'`) in `2 * RECONCILIATION_FRESHNESS_DELAY_SECS`, the watchdog kills and restarts the task.
3. **Periodic re-init on schedule** (every 30 min?), not just on app startup. Reconciliation re-init seems to be the fix; doing it proactively avoids the multi-day silent failure.
4. **Bounded retry with exponential backoff** on cloud Deepgram failures, then dead-letter-style logging when retries exhausted. If a chunk fails N times, mark it so the dispatcher moves on instead of retrying it forever (if that's what's happening).
5. **Trace logs around reconciliation worker start/stop** with the reason. Right now there's no log line that says "reconciliation worker stopped because X" — only the WARN that says "it's been stopped for a while." Adding a line at the exit point of the worker task (panic / clean shutdown / awaiting-forever) would make root-cause obvious next time.
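
Fixes 2 and 4 could look roughly like the sketch below. Everything here is hypothetical (`Watchdog`, `backoff_delay`, `retry_bounded` are not existing screenpipe APIs), uses only the standard library, and leaves out how the worker task is actually restarted, since I don't know how the real dispatcher is structured.

```rust
use std::time::{Duration, Instant};

/// Fix 2 (hypothetical): the worker calls `record_progress` after each
/// processed chunk; a supervisor polls `is_stuck` and restarts the task.
struct Watchdog {
    last_progress: Instant,
    max_silence: Duration,
}

impl Watchdog {
    fn new(now: Instant, max_silence: Duration) -> Self {
        Self { last_progress: now, max_silence }
    }

    fn record_progress(&mut self, now: Instant) {
        self.last_progress = now;
    }

    /// True when work is pending and nothing progressed for > `max_silence`.
    fn is_stuck(&self, now: Instant, pending_count: u64) -> bool {
        pending_count > 0 && now.duration_since(self.last_progress) > self.max_silence
    }
}

/// Fix 4 (hypothetical): delay before retry `attempt` (0-based),
/// doubling each time and capped at `max`.
fn backoff_delay(attempt: u32, base: Duration, max: Duration) -> Duration {
    let factor = 2u32.checked_pow(attempt).unwrap_or(u32::MAX);
    base.checked_mul(factor).map_or(max, |d| d.min(max))
}

/// Run `op` at most `max_attempts` times (at least once), calling `sleep`
/// between failures. The last error is returned so the caller can mark the
/// chunk as failed and move on instead of retrying forever.
fn retry_bounded<T, E>(
    max_attempts: u32,
    base: Duration,
    max: Duration,
    mut op: impl FnMut() -> Result<T, E>,
    mut sleep: impl FnMut(Duration),
) -> Result<T, E> {
    let mut attempt = 0;
    loop {
        match op() {
            Ok(v) => return Ok(v),
            Err(e) if attempt + 1 >= max_attempts => return Err(e),
            Err(_) => {
                sleep(backoff_delay(attempt, base, max));
                attempt += 1;
            }
        }
    }
}
```

For the watchdog, `max_silence` would be `Duration::from_secs(1200)` to match `2 * RECONCILIATION_FRESHNESS_DELAY_SECS`. The `sleep` parameter is injected so the retry logic can be tested without waiting; in the real worker it would presumably be an async sleep, and each request would also need a timeout so a hung connection counts as a failure.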

## Severity

Medium-High in my judgment. No data lost from disk (audio files remain), but indexed/searchable transcripts can fall *days* behind without user awareness. For a tool whose value proposition is "every word you said is searchable," silent multi-day staleness is a serious UX failure.

## I can help debug next recurrence

I'm a heavy user with a contributor setup ready (`mediar-ai/screenpipe` cloned, full build environment on Mac Mini). If you want me to drop in additional instrumentation behind a feature flag and run on prod for a week to catch the next stall in the act, happy to.

---

*Incident date:* 2026-05-21 (drained), backlog onset estimated 2026-05-15
*Filed by:* @pleasedodisturb
*Related:* [#3466 (port collision / silent recorder failure)](https://github.com/screenpipe/screenpipe/issues/3466) — distinct symptom, same general theme of *silent failure modes that the health-check sees but the user doesn't*.

