
[bug] Audio capture acts as an active participant in the CoreAudio graph (hijacks default device, re-opens inputs on tap-timeout, forces BT headsets out of A2DP) — should be a passive post-virtual-device listener #3750

@pleasedodisturb


Summary

On macOS, screenpipe's audio capture does not behave like a passive listener. It re-opens devices and re-asserts itself in the CoreAudio graph on tap-timeouts and device changes, which causes three observable problems:

  1. It steals the default output/system-output slot. After a coreaudiod restart (or graph rebuild), Screenpipe Capture (Undocked) ends up flagged as Default Output + Default System Output. Because that virtual device is 2-in/2-out, system audio gets routed into it and monitored back out — the user hears their own mic looped into their headphones until they manually re-select a real output device.

  2. It forces Bluetooth headsets out of A2DP into HFP/SCO. When screenpipe (re-)opens a device for input that references the BT headset's microphone, macOS drops the headset from A2DP (stereo) to HFP (mono call-mode). This happens without the user starting a call, purely from screenpipe re-acquiring inputs. Cf. #3144 (Audio recording silently stops mid-session when Bluetooth device (AirPods) is reused by another app, despite "ON" status; closed, AirPods-specific) and #3020 (Windows: audio stream build fails with E_INVALIDARG (0x80070057) on Bluetooth headsets; Windows BT). Those are symptoms of the same root behavior.

  3. The capture tap silently times out and re-establishes. /health shows stream_timeouts climbing and audio_status: active_no_data with the System Audio device going inactive for long stretches while only the virtual capture device stays live. Each re-establish is another device re-open — i.e. another chance to hijack the default slot / re-trigger the HFP switch.

Evidence from /health (this machine)

version: 0.3.351
status: degraded (503)
audio_status: active_no_data
device_status_details: System Audio (output): inactive (last activity: 1858s ago), Screenpipe Capture (Undocked) (input): active (last activity: 0s ago)
stream_timeouts: 20
audio_devices: ['System Audio (output)', 'Screenpipe Capture (Undocked) (input)']
per_device_audio_level_rms:
  Screenpipe Capture (Undocked) (input): 0.0016
  Screenpipe Capture (Docked) (input): 0.0
  System Audio (output): 0.0
  <BT headset> (input): 0.0

Note on the Docked vs Undocked variants: to be clear, those two are not screenpipe internals. They are devices I built in Loopback as an attempt to auto-identify my hardware stack (e.g. dock attached vs not) and route screenpipe to the right sources/outputs depending on context. That self-built routing logic only made things worse: it multiplied the corner cases and created its own dilemmas (which variant is "current"? what happens on a dock hot-plug mid-capture? which one does screenpipe latch onto?), on top of the underlying hijack behavior. The variants are not a screenpipe feature; they are further evidence that users are being forced to hand-build brittle routing schemes to compensate for capture not being a passive listener.
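For anyone triaging similar reports, the device_status_details string from /health can be split into per-device state and idle time. This is a minimal sketch that assumes the string keeps the exact shape shown in the evidence above; the function names parse_device_status and stale_devices, and the 60-second threshold, are illustrative and not part of screenpipe.

```python
import re

# Matches one entry such as
#   "System Audio (output): inactive (last activity: 1858s ago)"
# The format is inferred from the /health output quoted in this issue.
_ENTRY = re.compile(
    r"(?P<name>.+?): (?P<state>active|inactive) "
    r"\(last activity: (?P<secs>\d+)s ago\)"
)


def parse_device_status(details: str) -> dict:
    """Map device name -> (state, seconds since last activity)."""
    result = {}
    for match in _ENTRY.finditer(details):
        # Entries after the first start with the ", " separator.
        name = match.group("name").lstrip(", ")
        result[name] = (match.group("state"), int(match.group("secs")))
    return result


def stale_devices(details: str, max_idle_s: int = 60) -> list:
    """Names of devices that are inactive or idle longer than max_idle_s."""
    return [
        name
        for name, (state, secs) in parse_device_status(details).items()
        if state == "inactive" or secs > max_idle_s
    ]
```

Fed the string from the evidence block, stale_devices reports only System Audio (output), which matches the symptom: the virtual capture device stays live while the system tap is idle.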

Environment

  • screenpipe 0.3.351
  • macOS (Apple Silicon)
  • Output device: Sony BT headphones (A2DP)
  • Wired mic available (Shure MV7+) + MacBook mic
  • Also running SoundSource (per-app routing) and Loopback (Rogue Amoeba)

Why this matters / the real-world struggle

To even diagnose this, I had to build a Loopback virtual device as a stable buffer in front of screenpipe — mic + app sources mixed, monitored out to the headphones, with screenpipe listening on the Loopback device. I want to be explicit: Loopback here is a test probe, not a solution. It was a way to isolate and confirm the root cause (screenpipe stops hijacking the real devices once it's pointed at a virtual one it can't fight over), not something I want in my setup. It is not an answer to the actual bug — it only papers over it by adding a whole extra layer of routing complexity. For normal daily work I should not need Loopback at all, and the goal of this issue is to make it deletable.

As a workaround it's also genuinely fragile:

  • It's extremely easy to accidentally include the BT headset's mic as a Loopback source (especially when the Loopback device is named after and built on the headset), which guarantees the A2DP→HFP drop the moment anything opens it.
  • The monitor/mute-when-capturing matrix is error-prone and produces echo/double-audio.
  • It fights with SoundSource: I want screenpipe to sit downstream of SoundSource's per-app routing on the output stream, not to grab inputs ahead of it and take control away.
  • It will also collide with superwhisper and chat apps (Discord/Meet/etc.) that need the mic input. When screenpipe takes over the input device, those either crash or capture nothing.

In short: Loopback was the diagnostic that proved the root cause; it should not be the long-term fix, and a correct passive-listener implementation would let me remove it entirely.

Requested behavior

  1. Be a passive listener. Capture system audio via a process/output tap that sits downstream of the user's existing chain (post-SoundSource) and never sets itself as the default output / system-output device.
  2. Never re-assert default device on tap-timeout or device-change. On a CoreAudio tap timeout, reconnect to the current default output without mutating the graph or claiming the default slot.
  3. Never open a physical/BT input that forces A2DP→HFP. Don't acquire the BT headset's microphone for system-audio capture. Mic capture should be an explicitly chosen input only, and ideally avoid HFP-forcing devices unless the user picks them.
  4. Co-exist with other audio apps. Capture should not exclusively claim input/output devices so that superwhisper, Discord, Meet, SoundSource, and Loopback can all run concurrently.
  5. Nice-to-have: first-class SoundSource / Loopback integration / docs so users don't have to hand-build a virtual device just to get a stable capture point. A documented "capture from this virtual device, output stays on your real device" recipe would remove most of this pain — though the priority is that capture works without any such workaround.
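Point 3 above can be read as a selection policy. The following is a rough sketch of that policy only, not screenpipe's actual device logic: the InputDevice type, the transport labels, and the inputs_to_open function are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InputDevice:
    name: str
    # Hypothetical transport label, e.g. "bluetooth", "usb", "built-in", "virtual".
    transport: str


def inputs_to_open(available, explicit_names, auto_select=False):
    """Return the inputs a passive capture would open.

    - Inputs the user explicitly picked are always honored, Bluetooth included.
    - With no explicit pick and auto_select off, no input is opened at all.
    - With auto_select on, Bluetooth inputs are skipped so that opening the
      mic cannot push a headset from A2DP to HFP behind the user's back.
    """
    explicit = [d for d in available if d.name in explicit_names]
    if explicit or not auto_select:
        return explicit
    safe = [d for d in available if d.transport != "bluetooth"]
    return safe[:1]  # at most one auto-selected, non-Bluetooth input
```

The design choice is that a Bluetooth mic is only ever opened as a consequence of an explicit user decision, never as a side effect of auto-selection or a reconnect.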

If capture were a well-behaved passive listener, the entire Loopback workaround could be deleted — which is the point: it was only ever a test rig to find this bug, never a setup I want to keep.

The manual-pin path is not a resolution either

The obvious suggestion is "just turn off Auto-select audio devices and pin specific devices." That stops screenpipe from wandering onto the BT mic / stealing the default slot, but it does not solve the problem — it trades random hijacking for a permanent manual-routing burden:

  • I now have to decide which virtual device variant screenpipe listens on (docked vs undocked), and re-decide it every time I dock/undock.
  • I have to hand-manage the monitor / self-passthrough so I still hear app audio without doubling or echo, while making sure my own mic is monitored nowhere.
  • All of this has to be re-jiggered on every hardware-stack change.

In other words, device-selection is the wrong abstraction. Forcing the user to express capture intent as "pick these exact devices" pushes a docked/undocked + monitor-routing decision matrix onto them that has to be re-managed continuously. The correct abstraction is intent: "listen to whatever I am already hearing, downstream of my routing, passively" — no device modelling required. A passive post-chain listener is the only thing that removes the burden entirely; pinning just freezes one fragile snapshot of it.

Reproduction / testing

Steps to reproduce the hijack on macOS (Apple Silicon, BT headset as default output):

  1. Set a Bluetooth headset (e.g. Sony WH-series) as the default output; confirm it is in A2DP (stereo, full-quality).
  2. In screenpipe settings, enable Auto-select audio devices.
  3. Let screenpipe run; trigger a graph rebuild (toggle output device, sleep/wake, or sudo killall coreaudiod).
  4. Observe: the BT headset drops A2DP → HFP/SCO (audio goes mono/muffled) the moment screenpipe re-acquires inputs, without the user starting any call. And/or the screenpipe virtual capture device grabs the Default-Output slot, looping mic into the headphones.
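To check step 4 without clicking through Sound settings, the default-output flags can be read from the plain-text output of system_profiler SPAudioDataType. This is a sketch under an assumption about that output's layout (each device name on its own line ending in a colon, followed by "Key: Value" lines such as "Default Output Device: Yes"); verify it against your macOS version. The function names are illustrative.

```python
import subprocess


def default_output_devices(profiler_text: str) -> dict:
    """Map each default-output flag to the device names that carry it."""
    flags = {"Default Output Device": [], "Default System Output Device": []}
    current = None
    for raw in profiler_text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.endswith(":"):
            # Assumed layout: a header line such as "Some Device:".
            current = line[:-1]
            continue
        key, _, value = line.partition(": ")
        if key in flags and value == "Yes" and current is not None:
            flags[key].append(current)
    return flags


def virtual_device_is_default(flags: dict, needle: str = "Screenpipe Capture") -> bool:
    """True if any default-output slot is held by a device whose name contains needle."""
    return any(needle in name for names in flags.values() for name in names)


def read_profiler() -> str:
    """Run system_profiler on macOS and return its text output."""
    return subprocess.run(
        ["system_profiler", "SPAudioDataType"],
        capture_output=True, text=True, check=True,
    ).stdout
```

Running virtual_device_is_default(default_output_devices(read_profiler())) before and after the graph rebuild in step 3 shows whether the capture device took the slot.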

Health-endpoint probe to confirm the timeout/re-acquire churn (no call active):

# Watch the tap re-establish itself over time: stream_timeouts climbs and
# System Audio goes 'inactive' for long stretches while only the virtual capture device stays live.
for i in 1 2 3 4 5 6; do
  # The $ is left unescaped so the shell expands the API key into the header.
  curl -s http://localhost:3030/health -H "Authorization: Bearer $SCREENPIPE_LOCAL_API_KEY" \
  | python3 -c "import sys,json;d=json.load(sys.stdin);ap=d.get('audio_pipeline') or {};print(d.get('audio_status'),'| stream_timeouts=',ap.get('stream_timeouts'),'|',d.get('device_status_details'))"
  sleep 30
done

What a fixed build should show under the same probe: audio_status stays active, stream_timeouts does not climb, the System Audio tap never goes inactive while audio is playing, the BT headset stays in A2DP across a graph rebuild, and the screenpipe virtual device never appears as Default Output. Bonus check: superwhisper / Discord can hold the mic input concurrently with screenpipe capture without either crashing or going silent.
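The /health part of those acceptance criteria can be expressed as a check over a series of polled samples. This is a sketch, assuming each sample is the parsed /health JSON with the fields used by the probe above (audio_status, audio_pipeline.stream_timeouts, device_status_details). Treating both "active" and "ok" as healthy is an assumption based on the values quoted in this issue, and fixed_build_violations is an illustrative name.

```python
# Assumption: both values seen in this issue for a working pipeline count as healthy.
HEALTHY_AUDIO_STATUSES = {"active", "ok"}


def fixed_build_violations(samples: list) -> list:
    """Return human-readable violations found across /health samples, in poll order."""
    problems = []
    prev_timeouts = None
    for i, sample in enumerate(samples):
        status = sample.get("audio_status")
        if status not in HEALTHY_AUDIO_STATUSES:
            problems.append(f"sample {i}: audio_status={status}")

        timeouts = (sample.get("audio_pipeline") or {}).get("stream_timeouts")
        if timeouts is not None:
            if prev_timeouts is not None and timeouts > prev_timeouts:
                problems.append(
                    f"sample {i}: stream_timeouts climbed {prev_timeouts} -> {timeouts}"
                )
            prev_timeouts = timeouts

        details = sample.get("device_status_details") or ""
        if "System Audio (output): inactive" in details:
            problems.append(f"sample {i}: System Audio tap inactive")
    return problems
```

An empty list across a graph rebuild, with audio playing, is what a fixed build should produce. The A2DP and default-output checks are outside /health and need to be verified separately.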

Second, distinct failure mode: silent dead tap after a graph rebuild (no reconnect, no timeout increment)

The repro above describes the churn case (re-acquire / HFP drop / stream_timeouts climbing). But with the clean no-Loopback, system-audio-only config (all mics unchecked, only System Audio ticked), a graph rebuild produces a quieter and arguably worse failure: screenpipe keeps a dead System Audio tap forever and never reconnects. The liveness watchdog does not fire — stream_timeouts does not even increment, so nothing notices the tap died.

Captured test run (screenpipe v0.3.351, macOS Apple Silicon, config = audio_devices: ['System Audio (output)'], no mic in pipeline):

  1. Confirm baseline: audio_status: stale, System Audio (output): inactive, stream_timeouts: 21, audio_level_rms: 0.0.
  2. Run sudo killall coreaudiod to force a graph rebuild (step 3 above).
  3. Poll /health every 10s. The tap never re-attaches; "last activity" just keeps climbing while stream_timeouts stays frozen:
stale | stream_timeouts=21 | rms=0.0 | System Audio (output): inactive (last activity: 171s ago)
stale | stream_timeouts=21 | rms=0.0 | System Audio (output): inactive (last activity: 181s ago)
stale | stream_timeouts=21 | rms=0.0 | System Audio (output): inactive (last activity: 191s ago)
stale | stream_timeouts=21 | rms=0.0 | System Audio (output): inactive (last activity: 201s ago)
stale | stream_timeouts=21 | rms=0.0 | System Audio (output): inactive (last activity: 211s ago)

A follow-up poll, roughly 8 minutes after the last recorded activity, still showed inactive (last activity: 471s ago), stream_timeouts: 21, rms: 0.0. In other words, capture was effectively off until screenpipe was restarted; it never self-healed.
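The signature of this failure mode (idle time growing on every poll while the timeout counter stays frozen) is easy to detect from the outside. This is a small sketch that works on poll lines in the format shown above; is_silent_dead_tap and the min_polls threshold are illustrative choices.

```python
import re

# Pulls the timeout counter and the idle time out of one poll line, e.g.
# "stale | stream_timeouts=21 | rms=0.0 | System Audio (output): inactive (last activity: 171s ago)"
_LINE = re.compile(r"stream_timeouts=(\d+).*\(last activity: (\d+)s ago\)")


def is_silent_dead_tap(lines, min_polls: int = 3) -> bool:
    """True if idle time strictly grows across polls while stream_timeouts never moves."""
    samples = []
    for line in lines:
        match = _LINE.search(line)
        if match:
            samples.append((int(match.group(1)), int(match.group(2))))
    if len(samples) < min_polls:
        return False  # not enough evidence either way
    timeouts = {t for t, _ in samples}
    idle = [secs for _, secs in samples]
    idle_growing = all(later > earlier for earlier, later in zip(idle, idle[1:]))
    return len(timeouts) == 1 and idle_growing
```

The five poll lines captured above satisfy this check, whereas the churn case from the first repro does not, because there stream_timeouts keeps climbing.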

Recovery confirmed only via full restart. After a manual restart of screenpipe, /health returned to status: healthy, audio_status: ok, System Audio (output): active, stream_timeouts: 0. Capture was then verified to be genuinely live (not just reporting active with silence): with a YouTube video playing, per_device_audio_level_rms for System Audio (output) read 0.0105 — a clear non-zero signal, vs. the flat 0.0 held the entire 8 minutes the tap was dead. So the only recovery path that worked was a full app restart; the watchdog never performed the equivalent reconnect on its own.

Interpretation: this is not "the tap re-acquires and trips." It's "after a CoreAudio graph rebuild, screenpipe sits on a stale/dead tap handle, never detects it's dead, never retries, and never increments its own timeout counter." The watchdog/liveness check does not catch a post-rebuild dead handle. A fixed build must detect the dead tap, attempt reconnection to the current graph, and recover to active within seconds — without a manual restart.
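One way to catch a post-rebuild dead handle is to base liveness on whether the tap's callback is still being invoked at all, rather than on a timeout path that may never run. The sketch below illustrates that idea only and makes no claim about how screenpipe's watchdog is built: TapWatchdog, on_audio_callback, poll, and the injected reconnect callable are all hypothetical, and it assumes a live tap keeps delivering buffers even when they contain silence.

```python
import time


class TapWatchdog:
    """Triggers a reconnect when the tap callback has not fired for max_silence_s."""

    def __init__(self, reconnect, max_silence_s: float = 5.0, clock=time.monotonic):
        self._reconnect = reconnect          # hypothetical: re-attach to the current graph
        self._max_silence_s = max_silence_s
        self._clock = clock                  # injectable for testing
        self._last_callback = clock()
        self.reconnects = 0

    def on_audio_callback(self) -> None:
        """Call whenever the tap delivers a buffer, silent or not."""
        self._last_callback = self._clock()

    def poll(self) -> bool:
        """Call periodically; returns True if a reconnect was triggered."""
        now = self._clock()
        if now - self._last_callback <= self._max_silence_s:
            return False
        self.reconnects += 1                 # counted, so the failure is visible in health data
        self._reconnect()
        self._last_callback = now            # grace period before the next attempt
        return True
```

Because the check depends only on callback arrival time, a handle orphaned by a coreaudiod restart would be noticed within max_silence_s, and the counter would move instead of staying frozen.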

Proposed integration (concrete mechanisms — not yet decided which)

The "first-class integration" nice-to-have above can be made concrete. Any of the following would let screenpipe be fed passively by an existing audio-routing layer instead of acquiring devices itself. Listing options without committing to one — the maintainers are best placed to judge feasibility:

  1. Read from a user-designated virtual capture device (SoundSource + Loopback). The least-new-code path: keep capture as "listen on a chosen input," but make it rock-solid (never re-assert default, never re-enumerate, reconnect to the current device on timeout). The user routes per-app output via SoundSource, mixes into a Loopback virtual device, and screenpipe just reads that one device. This already works today as a workaround; the ask is that it stay stable. Note this is the user's preferred minimal stack — no Audio Hijack required, and nothing screenpipe needs to run itself.

  2. Register as an ARK capture target. Rogue Amoeba's shared engine (com.rogueamoeba.ARK.driver) underlies SoundSource, Loopback, and Audio Hijack. If screenpipe could appear as an ARK capture endpoint, the Rogue Amoeba engine would do the tapping and stream audio in — zero device hijacking on screenpipe's side. This is the cleanest "feeds in naturally" option.

  3. Be an Audio Hijack output target. For users who run Audio Hijack, screenpipe could appear as an output block so a session pipes audio straight to it. (Optional path — should not be a requirement; most users won't and shouldn't have to run a third app just to capture.)

All three converge on the same principle: screenpipe consumes a stream handed to it, downstream of the user's routing, and never grabs or re-asserts physical devices. The default experience should require none of these — but supporting one or more would make capture robust for power users with existing audio stacks.
