[bug] Audio capture acts as an active participant in the CoreAudio graph (hijacks default device, re-opens inputs on tap-timeout, forces BT headsets out of A2DP) — should be a passive post-virtual-device listener

## Summary

On macOS, screenpipe's audio capture does not behave like a passive listener. It re-opens devices and re-asserts itself in the CoreAudio graph on tap-timeouts and device changes, which causes three observable problems:

1. **It steals the default output/system-output slot.** After a `coreaudiod` restart (or graph rebuild), `Screenpipe Capture (Undocked)` ends up flagged as **Default Output + Default System Output**. Because that virtual device is 2-in/2-out, system audio gets routed *into* it and monitored back out — the user hears their own mic looped into their headphones until they manually re-select a real output device.

2. **It forces Bluetooth headsets out of A2DP into HFP/SCO.** When screenpipe (re-)opens a device for input that references the BT headset's microphone, macOS drops the headset from A2DP (stereo) to HFP (mono call-mode). This happens *without the user starting a call*, purely from screenpipe re-acquiring inputs. Cf. #3144 (closed, AirPods-specific) and #3020 (Windows BT) — those are symptoms of the same root behavior.

3. **The capture tap silently times out and re-establishes.** `/health` shows `stream_timeouts` climbing and `audio_status: active_no_data` with the System Audio device going `inactive` for long stretches while only the virtual capture device stays live. Each re-establish is another device re-open — i.e. another chance to hijack the default slot / re-trigger the HFP switch.

## Evidence from /health (this machine)

```
version: 0.3.351
status: degraded (503)
audio_status: active_no_data
device_status_details: System Audio (output): inactive (last activity: 1858s ago), Screenpipe Capture (Undocked) (input): active (last activity: 0s ago)
stream_timeouts: 20
audio_devices: ['System Audio (output)', 'Screenpipe Capture (Undocked) (input)']
per_device_audio_level_rms:
  Screenpipe Capture (Undocked) (input): 0.0016
  Screenpipe Capture (Docked) (input): 0.0
  System Audio (output): 0.0
  <BT headset> (input): 0.0
```
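To compare runs without eyeballing the string, the `device_status_details` field can be split into per-device records. This is a minimal sketch that assumes the field always has the shape shown in the sample above (`<name> (<input|output>): <state> (last activity: <N>s ago)`, comma-separated). The function name and the output keys are illustrative and not part of screenpipe.

```python
from __future__ import annotations

import re

# Assumed shape, taken from the /health sample above:
#   "<name> (<input|output>): <state> (last activity: <N>s ago)", comma-separated.
_DEVICE_RE = re.compile(
    r"(?:^|,\s*)(?P<name>.+?) \((?P<direction>input|output)\): "
    r"(?P<state>\w+) \(last activity: (?P<age>\d+)s ago\)"
)


def parse_device_status(details: str) -> list[dict]:
    """Turn a device_status_details string into one dict per device."""
    return [
        {
            "name": m["name"],
            "direction": m["direction"],
            "state": m["state"],
            "idle_seconds": int(m["age"]),
        }
        for m in _DEVICE_RE.finditer(details)
    ]


sample = (
    "System Audio (output): inactive (last activity: 1858s ago), "
    "Screenpipe Capture (Undocked) (input): active (last activity: 0s ago)"
)
for device in parse_device_status(sample):
    print(device)
```

Device names that themselves contain parentheses, such as `Screenpipe Capture (Undocked)`, are kept whole because only a trailing `(input)` or `(output)` is treated as the direction marker.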

Note on the **Docked vs Undocked** variants: these two are *not* screenpipe internals. They are devices I built in Loopback to auto-identify my hardware stack (e.g. dock attached vs not) and route screenpipe to the right sources/outputs for each context. That self-built routing logic only made things worse: it multiplied the corner cases and created its own dilemmas (which variant is "current"? what happens on a dock hot-plug mid-capture? which one does screenpipe latch onto?), on top of the underlying hijack behavior. The variants are not a screenpipe feature; they are further evidence that users are being forced to hand-build brittle routing schemes to compensate for capture not being a passive listener.

## Environment

- screenpipe 0.3.351
- macOS (Apple Silicon)
- Output device: Sony BT headphones (A2DP)
- Wired mic available (Shure MV7+) + MacBook mic
- Also running SoundSource (per-app routing) and Loopback (Rogue Amoeba)

## Why this matters / the real-world struggle

To even diagnose this, I had to build a **Loopback** virtual device as a stable buffer in front of screenpipe — mic + app sources mixed, monitored out to the headphones, with screenpipe listening on the Loopback device. **I want to be explicit: Loopback here is a test probe, not a solution.** It was a way to isolate and confirm the root cause (screenpipe stops hijacking the real devices once it's pointed at a virtual one it can't fight over), not something I want in my setup. It is *not* an answer to the actual bug — it only papers over it by adding a whole extra layer of routing complexity. For normal daily work I should not need Loopback at all, and the goal of this issue is to make it deletable.

As a workaround it's also genuinely fragile:

- It's extremely easy to accidentally include the BT headset's **mic** as a Loopback source (especially when the Loopback device is *named after* and built on the headset), which guarantees the A2DP→HFP drop the moment anything opens it.
- The monitor/mute-when-capturing matrix is error-prone and produces echo/double-audio.
- It fights with **SoundSource**: I want screenpipe to sit *downstream* of SoundSource's per-app routing on the output stream, not to grab inputs ahead of it and take control away.
- It will also collide with **superwhisper** and chat apps (Discord/Meet/etc.) that need the mic input. When screenpipe takes over the input device, those either crash or capture nothing.

In short: Loopback was the diagnostic that *proved* the root cause; it should not be the long-term fix, and a correct passive-listener implementation would let me remove it entirely.

## Requested behavior

1. **Be a passive listener.** Capture system audio via a process/output tap that sits downstream of the user's existing chain (post-SoundSource) and **never** sets itself as the default output / system-output device.
2. **Never re-assert default device on tap-timeout or device-change.** On a CoreAudio tap timeout, reconnect to the *current* default output without mutating the graph or claiming the default slot.
3. **Never open a physical/BT input that forces A2DP→HFP.** Don't acquire the BT headset's microphone for system-audio capture. Mic capture should be an explicitly chosen input only, and ideally avoid HFP-forcing devices unless the user picks them.
4. **Co-exist with other audio apps.** Capture should not exclusively claim input/output devices so that superwhisper, Discord, Meet, SoundSource, and Loopback can all run concurrently.
5. **Nice-to-have: first-class SoundSource / Loopback integration / docs** so users don't have to hand-build a virtual device just to get a stable capture point. A documented "capture from this virtual device, output stays on your real device" recipe would remove most of this pain — though the priority is that capture works *without* any such workaround.

If capture were a well-behaved passive listener, the entire Loopback workaround could be deleted — which is the point: it was only ever a test rig to find this bug, never a setup I want to keep.
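Point 3 above can be read as a small selection rule. The sketch below is one possible reading, not screenpipe's actual logic: `InputDevice`, `inputs_to_open`, and the `transport` labels are hypothetical names introduced for illustration. On macOS such a label could plausibly be derived from CoreAudio's `kAudioDevicePropertyTransportType`, but that mapping is an assumption here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InputDevice:
    name: str
    transport: str  # hypothetical label: "builtin", "usb", "bluetooth", "virtual"


def inputs_to_open(available, user_selected):
    """Pick the inputs a passive capturer may open.

    - Devices the user named explicitly are always honored, Bluetooth included.
    - With no explicit selection, Bluetooth inputs are skipped so that capture
      never triggers an A2DP -> HFP switch on its own.
    """
    if user_selected:
        return [d for d in available if d.name in user_selected]
    return [d for d in available if d.transport != "bluetooth"]


devices = [
    InputDevice("Shure MV7+", "usb"),
    InputDevice("MacBook mic", "builtin"),
    InputDevice("Sony BT headphones", "bluetooth"),
]
print([d.name for d in inputs_to_open(devices, user_selected=set())])
print([d.name for d in inputs_to_open(devices, user_selected={"Sony BT headphones"})])
```

The design choice is that the headset microphone is only ever opened as a consequence of a deliberate user pick, so an automatic re-acquire cannot drop the headset into call mode.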

## The manual-pin path is not a resolution either

The obvious suggestion is "just turn off *Auto-select audio devices* and pin specific devices." That stops screenpipe from *wandering* onto the BT mic / stealing the default slot, but it does **not** solve the problem — it trades random hijacking for a permanent manual-routing burden:

- I now have to decide *which* virtual device variant screenpipe listens on (docked vs undocked), and re-decide it every time I dock/undock.
- I have to hand-manage the monitor / self-passthrough so I still *hear* app audio without doubling or echo, while making sure my own mic is monitored nowhere.
- All of this has to be re-jiggered on every hardware-stack change.

In other words, device-selection is the wrong abstraction. Forcing the user to express capture intent as "pick these exact devices" pushes a docked/undocked + monitor-routing decision matrix onto them that has to be re-managed continuously. The correct abstraction is *intent*: **"listen to whatever I am already hearing, downstream of my routing, passively"** — no device modelling required. A passive post-chain listener is the only thing that removes the burden entirely; pinning just freezes one fragile snapshot of it.

## Reproduction / testing

Steps to reproduce the hijack on macOS (Apple Silicon, BT headset as default output):

1. Set a Bluetooth headset (e.g. Sony WH-series) as the default output; confirm it is in A2DP (stereo, full-quality).
2. In screenpipe settings, enable **Auto-select audio devices**.
3. Let screenpipe run; trigger a graph rebuild (toggle output device, sleep/wake, or `sudo killall coreaudiod`).
4. **Observe:** the BT headset drops A2DP → HFP/SCO (audio goes mono/muffled) the moment screenpipe re-acquires inputs, *without* the user starting any call. And/or the screenpipe virtual capture device grabs the Default-Output slot, looping mic into the headphones.
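To record which device holds the default slots before and after step 3, the text report from `system_profiler SPAudioDataType` can be scanned for its default-device flags. This is a rough sketch that assumes the report lists each device as a bare `Name:` header followed by `Key: Value` lines, with flags such as `Default Output Device: Yes`; that layout is an assumption and may differ between macOS versions. The function names are illustrative.

```python
import subprocess

DEFAULT_FLAGS = (
    "Default Input Device",
    "Default Output Device",
    "Default System Output Device",
)


def find_default_devices(report: str) -> dict:
    """Map each default-device flag to the device that currently holds it."""
    defaults = {}
    current = None
    for raw in report.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.endswith(":"):
            # A bare "Name:" line is a section or device header.
            current = line[:-1]
            continue
        key, _, value = line.partition(": ")
        if key in DEFAULT_FLAGS and value == "Yes":
            defaults[key] = current
    return defaults


def read_audio_report() -> str:
    """Run system_profiler (macOS only) and return its text report."""
    result = subprocess.run(
        ["system_profiler", "SPAudioDataType"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


# On macOS: print(find_default_devices(read_audio_report()))
```

Running it once before and once after the graph rebuild gives two small dicts to diff; the hijack shows up as the capture device's name appearing under the output flags.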

Health-endpoint probe to confirm the timeout/re-acquire churn (no call active):

```bash
# watch the tap re-establishing itself over time — stream_timeouts climbs,
# System Audio goes 'inactive' for long stretches while only the virtual cap device stays live
for i in 1 2 3 4 5 6; do
  curl -s http://localhost:3030/health -H "Authorization: Bearer $SCREENPIPE_LOCAL_API_KEY" \
  | python3 -c "import sys,json;d=json.load(sys.stdin);ap=d['audio_pipeline'];print(d.get('audio_status'),'| stream_timeouts=',ap.get('stream_timeouts'),'|',d.get('device_status_details'))"
  sleep 30
done
```

What a fixed build should show under the same probe: `audio_status` stays `active`, `stream_timeouts` does **not** climb, the System Audio tap never goes `inactive` while audio is playing, the BT headset stays in A2DP across a graph rebuild, and the screenpipe virtual device never appears as Default Output. Bonus check: `superwhisper` / Discord can hold the mic input *concurrently* with screenpipe capture without either crashing or going silent.
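The pass criteria above can be written down as a small check over a series of probe samples. This is a sketch under assumptions: the sample dict keys (`audio_status`, `stream_timeouts`, `system_audio_state`) are hypothetical names for values pulled out of `/health`, and treating both `active` and `ok` as healthy is a guess based on the two values that appear in this report.

```python
HEALTHY_STATUSES = {"active", "ok"}  # assumption: both values seen in this report count as healthy


def fixed_build_violations(samples):
    """Return human-readable violations for a series of /health samples.

    Each sample is a dict with the hypothetical keys:
      audio_status (str), stream_timeouts (int), system_audio_state (str).
    An empty list means the run met the criteria.
    """
    violations = []
    for i, s in enumerate(samples):
        if s["audio_status"] not in HEALTHY_STATUSES:
            violations.append(f"sample {i}: audio_status={s['audio_status']}")
        if s["system_audio_state"] != "active":
            violations.append(f"sample {i}: System Audio is {s['system_audio_state']}")
    if samples and samples[-1]["stream_timeouts"] > samples[0]["stream_timeouts"]:
        violations.append(
            f"stream_timeouts climbed from {samples[0]['stream_timeouts']} "
            f"to {samples[-1]['stream_timeouts']}"
        )
    return violations


run = [
    {"audio_status": "active", "stream_timeouts": 20, "system_audio_state": "active"},
    {"audio_status": "active_no_data", "stream_timeouts": 21, "system_audio_state": "inactive"},
]
for problem in fixed_build_violations(run):
    print(problem)
```

The A2DP and default-output checks are not covered here, since `/health` does not appear to expose them; they still need to be verified by ear or in the system audio settings.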

### Second, distinct failure mode: silent dead tap after a graph rebuild (no reconnect, no timeout increment)

The repro above describes the *churn* case (re-acquire / HFP drop / `stream_timeouts` climbing). But with the **clean no-Loopback, system-audio-only config** (all mics unchecked, only `System Audio` ticked), a graph rebuild produces a quieter and arguably worse failure: screenpipe keeps a **dead System Audio tap forever and never reconnects**. The liveness watchdog does not fire — `stream_timeouts` does not even increment, so nothing notices the tap died.

**Captured test run** (screenpipe v0.3.351, macOS Apple Silicon, config = `audio_devices: ['System Audio (output)']`, no mic in pipeline):

1. Confirm baseline: `audio_status: stale`, `System Audio (output): inactive`, `stream_timeouts: 21`, `audio_level_rms: 0.0`.
2. Run `sudo killall coreaudiod` to force a graph rebuild (step 3 above).
3. Poll `/health` every 10s. The tap never re-attaches; "last activity" just keeps climbing while `stream_timeouts` stays frozen:

```
stale | stream_timeouts=21 | rms=0.0 | System Audio (output): inactive (last activity: 171s ago)
stale | stream_timeouts=21 | rms=0.0 | System Audio (output): inactive (last activity: 181s ago)
stale | stream_timeouts=21 | rms=0.0 | System Audio (output): inactive (last activity: 191s ago)
stale | stream_timeouts=21 | rms=0.0 | System Audio (output): inactive (last activity: 201s ago)
stale | stream_timeouts=21 | rms=0.0 | System Audio (output): inactive (last activity: 211s ago)
```

A follow-up poll a few minutes later, roughly 8 minutes after the last recorded activity, still showed `inactive (last activity: 471s ago)`, `stream_timeouts: 21`, `rms: 0.0`. In other words, **capture was effectively off until screenpipe was restarted**; it never self-healed.

**Recovery confirmed only via full restart.** After a manual restart of screenpipe, `/health` returned to `status: healthy`, `audio_status: ok`, `System Audio (output): active`, `stream_timeouts: 0`. Capture was then verified to be genuinely live (not just reporting `active` with silence): with a YouTube video playing, `per_device_audio_level_rms` for `System Audio (output)` read `0.0105` — a clear non-zero signal, vs. the flat `0.0` held the entire 8 minutes the tap was dead. So the only recovery path that worked was a full app restart; the watchdog never performed the equivalent reconnect on its own.

**Interpretation:** this is not "the tap re-acquires and trips." It's "after a CoreAudio graph rebuild, screenpipe sits on a stale/dead tap handle, never detects it's dead, never retries, and never increments its own timeout counter." The watchdog/liveness check does not catch a post-rebuild dead handle. A fixed build must detect the dead tap, attempt reconnection to the *current* graph, and recover to `active` within seconds — without a manual restart.
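The difference between the two failure modes can be expressed as a classification over consecutive polls. The sketch below is illustrative only: `classify_tap` and the `stale_after_s` threshold are hypothetical, and it works purely from the two numbers visible in `/health` (`stream_timeouts` and the "last activity" age), not from any screenpipe internals.

```python
def classify_tap(polls, stale_after_s=30):
    """Classify a series of polls, given in time order.

    polls: list of (stream_timeouts, last_activity_age_s) tuples.
    stale_after_s: assumed idle age beyond which the tap counts as dead.
    """
    if len(polls) < 2:
        return "unknown"
    timeouts = [t for t, _ in polls]
    ages = [a for _, a in polls]
    if timeouts[-1] > timeouts[0]:
        # The watchdog fired at least once: the re-acquire/churn case.
        return "churn"
    ages_only_grow = all(later > earlier for earlier, later in zip(ages, ages[1:]))
    if ages_only_grow and ages[-1] >= stale_after_s:
        # No data and no timeout increment: nothing noticed the tap died.
        return "silent_dead_tap"
    return "healthy"


# The five polls captured above: timeouts frozen at 21, idle age climbing.
captured = [(21, 171), (21, 181), (21, 191), (21, 201), (21, 211)]
print(classify_tap(captured))
```

A watchdog built on the same signal would treat a growing idle age with a frozen timeout counter as a reason to reconnect, rather than waiting for a timeout that never arrives.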

## Related

- #3144 — Audio recording silently stops on BT device handoff (closed, AirPods) — same root: capture loses/re-grabs a BT device.
- #3020 — Windows BT stream build failure on device re-acquire.
- #3228 — Acoustic loopback duplicates transcripts (capture entangled with the monitoring path).
- #3648 — `--audio-device` overridden by persisted store (related capture-config control gap).

## Proposed integration (concrete mechanisms — not yet decided which)

The "first-class integration" nice-to-have above can be made concrete. Any of the following would let screenpipe be fed *passively* by an existing audio-routing layer instead of acquiring devices itself. Listing options without committing to one — the maintainers are best placed to judge feasibility:

1. **Read from a user-designated virtual capture device (SoundSource + Loopback).** The least-new-code path: keep capture as "listen on a chosen input," but make it rock-solid (never re-assert default, never re-enumerate, reconnect to the *current* device on timeout). The user routes per-app output via **SoundSource**, mixes into a **Loopback** virtual device, and screenpipe just reads that one device. This already works today as a workaround; the ask is that it stay stable. Note this is my preferred minimal stack: **no Audio Hijack required**, and nothing screenpipe needs to run itself.

2. **Register as an ARK capture target.** Rogue Amoeba's shared engine (`com.rogueamoeba.ARK.driver`) underlies SoundSource, Loopback, and Audio Hijack. If screenpipe could appear as an ARK capture endpoint, the Rogue Amoeba engine would do the tapping and stream audio in — zero device hijacking on screenpipe's side. This is the cleanest "feeds in naturally" option.

3. **Be an Audio Hijack output target.** For users who run Audio Hijack, screenpipe could appear as an output block so a session pipes audio straight to it. (Optional path — should not be a *requirement*; most users won't and shouldn't have to run a third app just to capture.)

All three converge on the same principle: **screenpipe consumes a stream handed to it, downstream of the user's routing, and never grabs or re-asserts physical devices.** The default experience should require none of these — but supporting one or more would make capture robust for power users with existing audio stacks.
